The How-To Guide: Step-by-Step EDA in Python
A step-by-step EDA guide shows how to load data, inspect structure, check types, summarize statistics, visualize distributions, detect outliers, explore correlations, and document findings before building a model.
Let us use a simple house pricing dataset. Assume we have a CSV file called:
house_prices.csv
With columns like:
- price
- size_sqft
- bedrooms
- bathrooms
- house_age
- distance_to_center_km
- neighborhood
We will use:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Step 1: Load the data
# Import pandas
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv("house_prices.csv")
# Show the first few rows
df.head()
You are asking:
What does the data look like?
Step 2: Check the shape
# Number of rows and columns
df.shape
# Example output:
(10000, 7)
This means:
- 10,000 rows
- 7 columns
Ask:
- Do I have enough rows?
- Are there fewer rows than expected?
- Are there more columns than expected?
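One way to turn the last two questions into a quick check is to compare the columns you expected with the columns that were actually loaded. The sketch below uses a tiny hand-made DataFrame in place of house_prices.csv. The `df_check` data, the stray `Unnamed: 0` column, and the neighborhood names are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the loaded file: "bathrooms" is missing
# and a stray index column ("Unnamed: 0") has crept in.
df_check = pd.DataFrame(
    {
        "Unnamed: 0": [0, 1],
        "price": [350_000, 410_000],
        "size_sqft": [1_400, 1_750],
        "bedrooms": [3, 4],
        "house_age": [12, 30],
        "distance_to_center_km": [4.2, 9.8],
        "neighborhood": ["Riverside", "Old Town"],
    }
)

# The seven columns listed at the start of this guide.
expected_columns = {
    "price",
    "size_sqft",
    "bedrooms",
    "bathrooms",
    "house_age",
    "distance_to_center_km",
    "neighborhood",
}

actual_columns = set(df_check.columns)
missing_columns = expected_columns - actual_columns
unexpected_columns = actual_columns - expected_columns

print("Rows, columns:", df_check.shape)
print("Missing columns:", missing_columns)
print("Unexpected columns:", unexpected_columns)
```

Note that the column count alone (7) looks correct here, which is why comparing names is more reliable than comparing counts.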
Step 3: Inspect column types
df.info()
This tells you:
- Column names
- Data types
- Number of non-null values per column
- Missing values (inferred by comparing each non-null count with the total number of rows)
Important questions:
- Is price numeric?
- Is neighborhood categorical?
- Are dates stored as strings?
- Are numbers accidentally stored as text?
Example issue:
price = "$350,000"
This may be read as text, not a number. You may need to clean it.
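A minimal sketch of one possible cleanup, assuming the only problems are dollar signs, thousands separators, and the occasional non-numeric entry. The `raw_prices` values are invented for illustration; real data may need more careful handling.

```python
import pandas as pd

# Hypothetical raw values as they might appear after reading the CSV.
raw_prices = pd.Series(["$350,000", "$1,200,000", "410000", "unknown"], name="price")

# Remove dollar signs and commas, then convert to numbers.
# errors="coerce" turns anything that is still not numeric into NaN
# instead of raising an exception.
clean_prices = pd.to_numeric(
    raw_prices.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)

print(clean_prices)
print("Values that could not be parsed:", clean_prices.isna().sum())
```

Counting the coerced NaN values afterwards is a useful habit: it tells you how many entries the cleanup silently failed to parse.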
Step 4: Generate descriptive statistics
df.describe()
This gives summary statistics for numerical columns:
- Count
- Mean
- Standard deviation
- Minimum
- 25th percentile
- Median
- 75th percentile
- Maximum
Example:
- price mean: 520,000
- price median: 410,000
- price max: 9,800,000
This tells a story. If the mean is much higher than the median, the data may be right-skewed. For house prices, this is common.
Step 5: Calculate mean, median, variance, and standard deviation manually
price_mean = df["price"].mean()
price_median = df["price"].median()
price_variance = df["price"].var()
price_std = df["price"].std()
print("Mean:", price_mean)
print("Median:", price_median)
print("Variance:", price_variance)
print("Standard deviation:", price_std)
Interpretation:
- Mean: What is the average price?
- Median: What is the middle price?
- Variance: How spread out are prices?
- Standard deviation: How far from the mean are prices typically?
If:
- Mean = 520,000
- Median = 410,000
Then expensive houses are pulling the mean upward. That is a clue.
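To see what the pandas methods above compute, the sketch below works through the arithmetic by hand on five made-up prices chosen so that the mean is 520,000 and the median is 410,000. The values are assumptions for illustration. By default, pandas `.var()` and `.std()` use the sample formulas, which divide by n - 1 rather than n.

```python
import numpy as np
import pandas as pd

# Five hypothetical prices; the last one is a luxury property.
prices = pd.Series([250_000, 310_000, 410_000, 480_000, 1_150_000], dtype="float64")

n = len(prices)
mean = prices.sum() / n                     # 520,000
median = prices.sort_values().iloc[n // 2]  # middle value of an odd-length series: 410,000

# Sample variance: average squared distance from the mean, dividing by n - 1.
squared_deviations = (prices - mean) ** 2
sample_variance = squared_deviations.sum() / (n - 1)

# Standard deviation is the square root of the variance,
# which brings the spread back into the original units.
sample_std = np.sqrt(sample_variance)

print("Mean:", mean)
print("Median:", median)
print("Matches pandas var():", np.isclose(sample_variance, prices.var()))
print("Matches pandas std():", np.isclose(sample_std, prices.std()))
```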
Step 6: Compare mean and median
df[["price", "size_sqft", "bedrooms", "house_age"]].agg(["mean", "median"])
This helps you detect skew.
If mean and median are close, the distribution may be fairly balanced.
If mean and median are far apart, investigate the distribution.
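If you want a single number to go with the comparison, pandas can also report sample skewness through the same `agg` call. This is a sketch on made-up data, not part of the original workflow: a positive skew value indicates a longer right tail, and a value near zero indicates a roughly symmetric column.

```python
import pandas as pd

# Hypothetical data: prices have one very expensive house,
# sizes are evenly spaced and therefore symmetric.
sample = pd.DataFrame(
    {
        "price": [250_000, 310_000, 410_000, 480_000, 1_150_000],
        "size_sqft": [1_000, 1_500, 2_000, 2_500, 3_000],
    }
)

# "skew" is accepted by agg alongside "mean" and "median".
summary = sample.agg(["mean", "median", "skew"])
print(summary)
```

For `price`, the mean sits well above the median and the skew is positive. For `size_sqft`, the mean equals the median and the skew is zero.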
Step 7: Plot distributions
Use histograms to understand the shape.
plt.figure(figsize=(8, 5))
sns.histplot(df["price"], bins=50, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()
Ask:
- Is the distribution symmetric?
- Is it skewed?
- Are there multiple peaks?
- Are there extreme values?
- Does the distribution match business expectations?
For house prices, you may see a long right tail. That means most homes fall within a typical price range, but a few are very expensive.
Step 8: Use box plots to detect outliers
plt.figure(figsize=(8, 5))
sns.boxplot(x=df["price"])
plt.title("Box Plot of House Prices")
plt.xlabel("Price")
plt.show()
A box plot shows:
- Median
- Middle 50 percent of values
- Potential outliers
Outliers often appear as points beyond the whiskers.
You can compare prices by neighborhood:
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x="neighborhood", y="price")
plt.title("House Prices by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.show()
This helps answer: Are some neighborhoods more expensive than others?
Step 9: Detect outliers using the IQR method
IQR means Interquartile Range. It measures the spread of the middle 50 percent of the data.
q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = df[(df["price"] < lower_bound) | (df["price"] > upper_bound)]
outliers.shape
Interpretation:
- Values below lower_bound are unusually low
- Values above upper_bound are unusually high
The IQR method flags unusual values. It does not prove they are wrong.
Inspect them:
outliers.sort_values("price", ascending=False).head(10)
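The same calculation can be wrapped in a small helper so it is easy to reuse on other columns. The function name `iqr_bounds` and the toy series are assumptions for illustration. The multiplier 1.5 is the conventional default and can be adjusted.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences used by the IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy data with one obviously extreme value.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Here q1 = 3.25 and q3 = 7.75, so iqr = 4.5,
# giving fences at -3.5 and 14.5.
lower, upper = iqr_bounds(toy)
flagged = toy[(toy < lower) | (toy > upper)]

print("Lower fence:", lower)
print("Upper fence:", upper)
print("Flagged values:", flagged.tolist())
```

Only the value 100 falls outside the fences. Whether it is an error or a genuine extreme still has to be decided by looking at the record.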
Step 10: Create scatter plots
Now let us examine relationships.
Size vs price:
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="size_sqft", y="price")
plt.title("House Size vs Price")
plt.xlabel("Size in Square Feet")
plt.ylabel("Price")
plt.show()
Ask:
- Do larger houses generally cost more?
- Is the relationship linear?
- Are there clusters?
- Are there large houses with surprisingly low prices?
- Are there small houses with surprisingly high prices?
Distance vs price:
plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="distance_to_center_km", y="price")
plt.title("Distance to City Center vs Price")
plt.xlabel("Distance to City Center in km")
plt.ylabel("Price")
plt.show()
Ask:
- Do prices decrease as distance increases?
- Is the pattern strong or weak?
- Are some distant houses still expensive?
- Could neighborhood explain those exceptions?
Step 11: Calculate correlations
numeric_df = df.select_dtypes(include=["number"])
correlations = numeric_df.corr()
correlations["price"].sort_values(ascending=False)
Example output:
| Feature | Correlation |
|---|---|
| price | 1.00 |
| size_sqft | 0.82 |
| bathrooms | 0.65 |
| bedrooms | 0.58 |
| house_age | -0.31 |
| distance_to_center_km | -0.55 |
Interpretation:
- size_sqft has a strong positive relationship with price
- distance_to_center_km has a negative relationship with price
- house_age may have a weak negative relationship
But do not stop here.
Correlation is a clue, not a final answer.
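One reason to treat the number with caution: `corr()` computes Pearson correlation by default, which measures linear association only. The sketch below uses a made-up curved relationship and compares Pearson with Spearman rank correlation. Spearman is not used elsewhere in this guide; it is brought in here only to illustrate the point.

```python
import numpy as np
import pandas as pd

# Hypothetical data: y always increases with x, but along a curve, not a line.
x = np.linspace(0, 5, 50)
curve = pd.DataFrame({"x": x, "y": np.exp(x)})

# Pearson (the default) measures how well a straight line fits.
pearson = curve.corr().loc["x", "y"]

# Spearman works on ranks, so it measures whether y consistently rises with x.
spearman = curve.corr(method="spearman").loc["x", "y"]

print("Pearson:", round(pearson, 2))
print("Spearman:", round(spearman, 2))
```

Here y is completely determined by x, yet the Pearson value is noticeably below 1 while the Spearman value is 1. A scatter plot would reveal the curve immediately, which is why correlations and plots belong together.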
Step 12: Create a correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlations, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()
Look for:
- Strong correlations with price
- Strong correlations between features
- Variables that may duplicate each other
- Unexpected relationships
Example:
- If bedrooms and size_sqft are highly correlated, that makes sense. Larger houses usually have more bedrooms.
- But if house_age and price are strongly positively correlated, that may require investigation. Maybe older houses are in premium historical neighborhoods.
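When there are many features, reading the heatmap by eye gets tedious. A rough sketch of listing strongly correlated feature pairs programmatically is shown below. The synthetic data, the random seed, and the 0.7 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Synthetic features: bedrooms is built from size_sqft plus noise,
# house_age is generated independently.
rng = np.random.default_rng(42)
n_rows = 500
size_sqft = rng.normal(1_800, 500, n_rows)
features = pd.DataFrame(
    {
        "size_sqft": size_sqft,
        "bedrooms": np.round(size_sqft / 600 + rng.normal(0, 0.3, n_rows)),
        "house_age": rng.integers(0, 80, n_rows),
    }
)

corr = features.corr()
threshold = 0.7  # arbitrary cutoff for "strongly correlated"

# Visit each pair of columns once and keep the strongly correlated ones.
strong_pairs = [
    (a, b, round(corr.loc[a, b], 2))
    for a, b in combinations(corr.columns, 2)
    if abs(corr.loc[a, b]) >= threshold
]

for a, b, value in strong_pairs:
    print(f"{a} and {b}: {value}")
```

Pairs that show up here are candidates for features that partly duplicate each other.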
Step 13: Group by categories
EDA is not only numerical. You should also inspect categorical variables.
Example:
df.groupby("neighborhood")["price"].agg(["count", "mean", "median", "std"]).sort_values("median", ascending=False)
This answers:
- Which neighborhoods are most expensive?
- Which neighborhoods have the most listings?
- Which neighborhoods have the most price variation?
A useful visualization:
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x="neighborhood", y="price", estimator="median")
plt.title("Median House Price by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Median Price")
plt.xticks(rotation=45)
plt.show()
Why median instead of mean?
Answer: "Because house prices often have outliers. Median gives a more stable comparison."
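A small made-up example shows how much the choice matters. Neighborhood "A" contains one luxury listing, and that single record is enough to flip the ranking. The neighborhood labels and prices are hypothetical.

```python
import pandas as pd

# Hypothetical listings: A has three modest homes and one luxury property.
listings = pd.DataFrame(
    {
        "neighborhood": ["A"] * 4 + ["B"] * 4,
        "price": [
            300_000, 320_000, 340_000, 5_000_000,  # A
            400_000, 410_000, 420_000, 430_000,    # B
        ],
    }
)

by_area = listings.groupby("neighborhood")["price"].agg(["mean", "median"])
print(by_area)

print("Most expensive by mean:", by_area["mean"].idxmax())
print("Most expensive by median:", by_area["median"].idxmax())
```

By mean, A looks far more expensive (1,490,000 versus 415,000). By median, B is the more expensive neighborhood (415,000 versus 330,000), which better reflects what a typical home costs.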
Step 14: Check missing values
df.isna().sum().sort_values(ascending=False)
Missing values can reveal important issues.
Example:
| Feature | Missing Values |
|---|---|
| bathrooms | 400 |
| house_age | 120 |
| neighborhood | 5 |
Ask:
- Why is this data missing?
- Is it missing randomly?
- Is it missing more often for certain neighborhoods?
- Should I impute, remove, or investigate?
Missingness itself can be informative.
For example, luxury properties may hide exact location or size.
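To check whether values are missing more often for certain neighborhoods, one option is to compute the share of missing values per group. The sketch below uses invented data, and the neighborhood names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: bathrooms is missing mostly in one neighborhood.
homes = pd.DataFrame(
    {
        "neighborhood": ["Hillcrest"] * 4 + ["Old Town"] * 4,
        "bathrooms": [np.nan, np.nan, 3, np.nan, 1, 2, np.nan, 2],
    }
)

# isna() gives True/False per row; the mean of booleans is the share of True.
missing_share = (
    homes["bathrooms"]
    .isna()
    .groupby(homes["neighborhood"])
    .mean()
    .sort_values(ascending=False)
)

print(missing_share)
```

A large gap between groups (0.75 versus 0.25 here) suggests the values are not missing at random, so dropping or imputing them blindly could bias the analysis.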
Step 15: Document your findings
EDA is not just making charts. You need to write down what you learned.
Example notes:
- House prices are right-skewed. Median is lower than mean.
- Several luxury properties above $5M strongly affect the mean.
- Size has a strong positive correlation with price.
- Distance to city center has a moderate negative correlation with price.
- Neighborhood explains large price differences.
- Some records have suspiciously low house size values.
- Price may need log transformation before modeling.
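The last note can be checked quickly before committing to it. The sketch below draws synthetic right-skewed prices from a log-normal distribution (an assumption chosen because it produces a long right tail) and compares skewness before and after a log transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed prices; parameters and seed are arbitrary.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=13.0, sigma=0.6, size=5_000), name="price")

# log1p computes log(1 + x), which also behaves sensibly if a value is 0.
log_prices = np.log1p(prices)

skew_raw = prices.skew()
skew_log = log_prices.skew()

print("Skew before transform:", round(skew_raw, 2))
print("Skew after transform:", round(skew_log, 2))
```

If skewness drops close to zero after the transform, modeling log price instead of raw price is worth considering. Predictions can be mapped back to the original scale with `np.expm1`.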