The How-To Guide: Step-by-Step EDA in Python

A step-by-step EDA guide shows how to load data, inspect structure, check types, summarize statistics, visualize distributions, detect outliers, explore correlations, and document findings before building a model.

Figure: EDA step-by-step guide. Illustration generated with Nano Banana 2 Pro.

Let us use a simple house pricing dataset. Assume we have a CSV file called:

house_prices.csv

With columns like:

price
size_sqft
bedrooms
bathrooms
house_age
distance_to_center_km
neighborhood

We will use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Load the data

# Import pandas for loading and inspecting tabular data
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("house_prices.csv")

# Show the first five rows
df.head()

You are asking:

What does the data look like?

Step 2: Check the shape

# Returns a (rows, columns) tuple
df.shape

# Example output:
# (10000, 7)

This means:

10,000 rows
7 columns

Ask:

Do I have enough rows?
Are there fewer rows than expected?
Are there more columns than expected?
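The questions above can be turned into quick checks. The following is a minimal sketch, assuming the seven columns listed earlier are the expected schema. The small inline DataFrame is a hypothetical stand-in for house_prices.csv, and its values (including the listing_id column) are made up for illustration.

```python
import pandas as pd

# Expected schema, taken from the column list at the top of this guide
expected_columns = {
    "price", "size_sqft", "bedrooms", "bathrooms",
    "house_age", "distance_to_center_km", "neighborhood",
}

# Hypothetical stand-in for the real file: one column is missing, one is extra
df = pd.DataFrame({
    "price": [350000, 410000, 410000],
    "size_sqft": [1400, 1650, 1650],
    "bedrooms": [3, 3, 3],
    "bathrooms": [2, 2, 2],
    "house_age": [12, 30, 30],
    "neighborhood": ["North", "Center", "Center"],
    "listing_id": [101, 102, 103],
})

n_rows, n_cols = df.shape
missing_columns = expected_columns - set(df.columns)
unexpected_columns = set(df.columns) - expected_columns

print(f"{n_rows} rows, {n_cols} columns")
print("Missing columns:", sorted(missing_columns))
print("Unexpected columns:", sorted(unexpected_columns))
```

Note that the column count alone can look right (7 here) even when the columns themselves are not the ones you expected, which is why comparing names is worth the extra two lines.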

Step 3: Inspect column types

df.info()

This tells you:

Column names
Data types
Number of non-null values per column
Missing values, inferred by comparing non-null counts with the total row count

Important questions:

Is price numeric?
Is neighborhood categorical?
Are dates stored as strings?
Are numbers accidentally stored as text?

Example issue:

price = "$350,000"

This may be read as text, not a number. You may need to clean it.
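One possible way to clean such a column is sketched below. It assumes the only formatting problems are dollar signs and thousands separators; the sample values and the price_clean column name are made up for illustration, and real data may need more rules.

```python
import pandas as pd

# Made-up values illustrating common formatting problems
raw = pd.DataFrame({"price": ["$350,000", "$1,200,000", "410000", "N/A"]})

# Remove the dollar sign and thousands separators, then convert to numbers.
# errors="coerce" turns anything unparseable (here "N/A") into NaN
# instead of raising an exception.
stripped = raw["price"].str.replace(r"[$,]", "", regex=True)
raw["price_clean"] = pd.to_numeric(stripped, errors="coerce")

print(raw.dtypes)
print(raw)
```

Coercing to NaN keeps the bad rows visible as missing values, so they show up later when you check missingness rather than disappearing silently.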

Step 4: Generate descriptive statistics

df.describe()

This gives summary statistics for numerical columns:

Count
Mean
Standard deviation
Minimum
25th percentile
Median
75th percentile
Maximum

Example:

price mean: 520,000
price median: 410,000
price max: 9,800,000

This tells a story. If the mean is much higher than the median, the data may be right-skewed. For house prices, this is common.

Step 5: Calculate mean, median, variance, and standard deviation individually

price_mean = df["price"].mean()
price_median = df["price"].median()
price_variance = df["price"].var()  # sample variance (pandas divides by n - 1)
price_std = df["price"].std()  # sample standard deviation

print("Mean:", price_mean)
print("Median:", price_median)
print("Variance:", price_variance)
print("Standard deviation:", price_std)

Interpretation:

Mean: What is the average price?
Median: What is the middle price?
Variance: How spread out are prices?
Standard deviation: How far from the mean are prices typically?

If:

Mean = 520,000
Median = 410,000

Then expensive houses are pulling the mean upward. That is a clue.
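To see what these methods compute, here is a small sketch that reproduces them by hand. The five prices are made up and chosen so that the mean is 520,000 and the median is 410,000, matching the example above. It also shows that pandas uses the sample formulas by default, while NumPy's np.var defaults to the population formula.

```python
import numpy as np
import pandas as pd

# Five made-up prices chosen so that mean = 520,000 and median = 410,000
prices = pd.Series([250_000, 300_000, 410_000, 520_000, 1_120_000], dtype="float64")

n = len(prices)
mean = prices.sum() / n
squared_deviations = (prices - mean) ** 2

# pandas .var() and .std() use the sample formulas (divide by n - 1)
sample_variance = squared_deviations.sum() / (n - 1)
sample_std = sample_variance ** 0.5

# NumPy's np.var() defaults to the population formula (divide by n)
population_variance = squared_deviations.sum() / n

print("Mean:", mean, "Median:", prices.median())
print("Matches pandas var():", np.isclose(sample_variance, prices.var()))
print("Matches pandas std():", np.isclose(sample_std, prices.std()))
print("Matches NumPy var():", np.isclose(population_variance, np.var(prices.to_numpy())))
```

The single 1,120,000 listing is enough to lift the mean well above the median, which is the same effect the real dataset shows at a larger scale.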

Step 6: Compare mean and median

df[["price", "size_sqft", "bedrooms", "house_age"]].agg(["mean", "median"])

This helps you detect skew.

If mean and median are close, the distribution may be fairly balanced.

If mean and median are far apart, investigate the distribution.
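"Close" and "far apart" can be made more concrete with a relative gap and a skewness value. The sketch below uses synthetic data (log-normal prices and uniform house ages, both assumptions made only for illustration); the relative_gap column is a hypothetical helper, not a standard statistic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic data: log-normal prices (right-skewed), uniform ages (roughly symmetric)
df = pd.DataFrame({
    "price": rng.lognormal(mean=13.0, sigma=0.6, size=5000),
    "house_age": rng.uniform(0, 80, size=5000),
})

summary = df.agg(["mean", "median", "skew"]).T
# Relative gap: how far the mean sits above (+) or below (-) the median
summary["relative_gap"] = (summary["mean"] - summary["median"]) / summary["median"]

print(summary.round(3))
```

A clearly positive gap and skewness point to a right tail, while values near zero suggest a roughly symmetric column. There is no universal cutoff, so treat the numbers as a prompt to look at the histogram.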

Step 7: Plot distributions

Use histograms to understand the shape.

plt.figure(figsize=(8, 5))
sns.histplot(df["price"], bins=50, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

Ask:

Is the distribution symmetric?
Is it skewed?
Are there multiple peaks?
Are there extreme values?
Does the distribution match business expectations?

For house prices, you may see a long right tail. That means most homes fall in a typical price range, but a few are very expensive.

Step 8: Use box plots to detect outliers

plt.figure(figsize=(8, 5))
sns.boxplot(x=df["price"])
plt.title("Box Plot of House Prices")
plt.xlabel("Price")
plt.show()

A box plot shows:

Median
Middle 50 percent of values
Potential outliers

Outliers often appear as points beyond the whiskers.

You can compare prices by neighborhood:

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x="neighborhood", y="price")
plt.title("House Prices by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.show()

This helps answer: Are some neighborhoods more expensive than others?

Step 9: Detect outliers using the IQR method

IQR means Interquartile Range. It measures the spread of the middle 50 percent of the data.

q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df["price"] < lower_bound) | (df["price"] > upper_bound)]
outliers.shape

Interpretation:

Values below lower_bound are unusually low
Values above upper_bound are unusually high

But remember:
The IQR method flags unusual values. It does not prove they are wrong.

Inspect them:

outliers.sort_values("price", ascending=False).head(10)
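If you want to apply the same rule to several columns, it can help to wrap it in a function. The following is a sketch under that assumption; iqr_bounds is a hypothetical helper name, the multiplier k=1.5 is the conventional default used above, and the prices are made up.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up prices in thousands; the last value is deliberately extreme
prices = pd.Series([200, 220, 250, 270, 300, 320, 350, 2000])

lower, upper = iqr_bounds(prices)
flagged = prices[(prices < lower) | (prices > upper)]

print(f"Fences: {lower} to {upper}")
print("Flagged values:", flagged.tolist())
```

Raising k makes the rule stricter about what counts as unusual, which can be useful for heavily skewed columns such as price.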

Step 10: Create scatter plots

Now let us examine relationships.

Size vs price:

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="size_sqft", y="price")
plt.title("House Size vs Price")
plt.xlabel("Size in Square Feet")
plt.ylabel("Price")
plt.show()

Ask:

Do larger houses generally cost more?
Is the relationship linear?
Are there clusters?
Are there large houses with surprisingly low prices?
Are there small houses with surprisingly high prices?

Distance vs price:

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="distance_to_center_km", y="price")
plt.title("Distance to City Center vs Price")
plt.xlabel("Distance to City Center in km")
plt.ylabel("Price")
plt.show()

Ask:

Do prices decrease as distance increases?
Is the pattern strong or weak?
Are some distant houses still expensive?
Could neighborhood explain those exceptions?

Step 11: Calculate correlations

numeric_df = df.select_dtypes(include=["number"])
correlations = numeric_df.corr()
correlations["price"].sort_values(ascending=False)

Example output:

Feature	Correlation
price	1.00
size_sqft	0.82
bathrooms	0.65
bedrooms	0.58
house_age	-0.31
distance_to_center_km	-0.55

Interpretation:

size_sqft has a strong positive relationship with price
distance_to_center_km has a moderate negative relationship with price
house_age has a weak negative relationship with price

But do not stop here.

Correlation is a clue, not a final answer.
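One reason to treat the number with care is that the default Pearson coefficient only measures linear association. The sketch below uses a synthetic curved relationship (not the house data) to compare it with the rank-based Spearman coefficient, which pandas also supports through the method parameter of corr.

```python
import numpy as np
import pandas as pd

# Synthetic, perfectly monotonic but curved relationship (not from the house data)
x = np.arange(1, 51)
demo = pd.DataFrame({"x": x, "y": np.exp(x / 5)})

pearson = demo.corr(method="pearson").loc["x", "y"]
spearman = demo.corr(method="spearman").loc["x", "y"]

print(f"Pearson:  {pearson:.2f}")
print(f"Spearman: {spearman:.2f}")
```

Here y always increases with x, so the rank-based coefficient is at its maximum while the linear one is noticeably lower. A gap like this in real data is a hint to check the scatter plot for curvature.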

Step 12: Create a correlation heatmap

plt.figure(figsize=(10, 8))
sns.heatmap(correlations, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()

Look for:

Strong correlations with price
Strong correlations between features
Variables that may duplicate each other
Unexpected relationships

Example:

If bedrooms and size_sqft are highly correlated, that makes sense.
Larger houses usually have more bedrooms.
But if house_age and price are strongly positive, that may require investigation. Maybe older houses are in premium historical neighborhoods.
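On a wide dataset, reading a heatmap cell by cell gets tedious. As a rough sketch, the strongly correlated feature pairs can also be listed programmatically. The data below is synthetic, and the 0.8 threshold is an assumption chosen for illustration, not a universal rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic features: bedrooms is derived from size, house_age is independent
size_sqft = rng.uniform(600, 4000, size=n)
bedrooms = np.clip(np.round(size_sqft / 700 + rng.normal(0, 0.4, size=n)), 1, None)
house_age = rng.uniform(0, 80, size=n)
features = pd.DataFrame(
    {"size_sqft": size_sqft, "bedrooms": bedrooms, "house_age": house_age}
)

abs_corr = features.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped
upper = np.triu(np.ones(abs_corr.shape, dtype=bool), k=1)
pairs = abs_corr.where(upper).stack().dropna().sort_values(ascending=False)

threshold = 0.8  # assumed cutoff for this sketch
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

Pairs that show up here are candidates for variables that duplicate each other; whether to drop or combine them is a modeling decision.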

Step 13: Group by categories

EDA is not only numerical. You should also inspect categorical variables.

Example:

(
    df.groupby("neighborhood")["price"]
    .agg(["count", "mean", "median", "std"])
    .sort_values("median", ascending=False)
)

This answers:

Which neighborhoods are most expensive?
Which neighborhoods have the most listings?
Which neighborhoods have the most price variation?

A useful visualization:

plt.figure(figsize=(12, 6))
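# Passing the estimator as a string needs seaborn 0.12+; older versions take a callable such as np.median
sns.barplot(data=df, x="neighborhood", y="price", estimator="median")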
plt.title("Median House Price by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Median Price")
plt.xticks(rotation=45)
plt.show()

Why median instead of mean?
House prices often contain outliers, so the median gives a more stable comparison across neighborhoods.
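A tiny made-up example shows how much a single listing can change the ranking. The neighborhood names and prices (in thousands) below are invented for illustration.

```python
import pandas as pd

# Made-up prices in thousands; "Harbor" contains one extreme listing
df = pd.DataFrame({
    "neighborhood": ["Harbor"] * 4 + ["Hillside"] * 4,
    "price": [300, 310, 320, 5000, 400, 410, 420, 430],
})

summary = df.groupby("neighborhood")["price"].agg(["count", "mean", "median"])

print(summary)
print("Most expensive by mean:  ", summary["mean"].idxmax())
print("Most expensive by median:", summary["median"].idxmax())
```

The mean ranks Harbor first because of one listing, while the median ranks Hillside first because its typical home costs more.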

Step 14: Check missing values

df.isna().sum().sort_values(ascending=False)

Missing values can reveal important issues.

Example:

Feature	Missing Values
bathrooms	400
house_age	120
neighborhood	5

Ask:

Why is this data missing?
Is it missing randomly?
Is it missing more often for certain neighborhoods?
Should I impute, remove, or investigate?

Missingness itself can be informative.

For example, listings for luxury properties may omit the exact location or size.
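To check whether missingness is concentrated in certain groups, you can compute the share of missing values per column and per category. This is a minimal sketch on made-up data in which bathrooms is missing only in one neighborhood.

```python
import numpy as np
import pandas as pd

# Made-up data: bathrooms is missing only in the "Center" neighborhood
df = pd.DataFrame({
    "neighborhood": ["Center"] * 4 + ["North"] * 4,
    "bathrooms": [np.nan, 2, np.nan, 1, 2, 3, 1, 2],
    "house_age": [10, 25, 40, np.nan, 5, 12, 30, 8],
})

# Share of missing values per column (the mean of a boolean mask is the fraction of True)
missing_share = df.isna().mean().sort_values(ascending=False)

# Share of missing bathrooms within each neighborhood
missing_by_group = df["bathrooms"].isna().groupby(df["neighborhood"]).mean()

print(missing_share)
print(missing_by_group)
```

A large difference between groups suggests the values are not missing at random, which matters when deciding between imputing and removing rows.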

Step 15: Document your findings

EDA is not just making charts. You need to write down what you learned.

Example notes:

House prices are right-skewed. Median is lower than mean.
Several luxury properties above $5M strongly affect the mean.
Size has a strong positive correlation with price.
Distance to city center has a moderate negative correlation with price.
Neighborhood explains large price differences.
Some records have suspiciously low house size values.
Price may need log transformation before modeling.

This step is important because EDA should guide modeling decisions.
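As an illustration of how a note can feed into modeling, the log transformation mentioned above can be tried in a couple of lines. This sketch uses synthetic log-normal prices as a stand-in for df["price"], and the choice of np.log1p rather than np.log is an assumption of the sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic right-skewed prices standing in for df["price"]
price = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=10_000), name="price")

# log1p computes log(1 + x); it requires x > -1, which holds for positive prices
log_price = np.log1p(price)

print("Skewness before:", round(price.skew(), 2))
print("Skewness after: ", round(log_price.skew(), 2))
```

If the skewness drops toward zero on your real data as well, that supports recording the transformation as a candidate preprocessing step.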
