The How-To Guide: Step-by-Step EDA in Python

A step-by-step EDA guide shows how to load data, inspect structure, check types, summarize statistics, visualize distributions, detect outliers, explore correlations, and document findings before building a model.

Figure: EDA step-by-step guide. Illustration generated with Nano Banana 2 Pro.

Let us use a simple house pricing dataset. Assume we have a CSV file called:

house_prices.csv

With columns like:

  • price
  • size_sqft
  • bedrooms
  • bathrooms
  • house_age
  • distance_to_center_km
  • neighborhood

We will use:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Step 1: Load the data

# Import pandas for loading and inspecting tabular data
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("house_prices.csv")

# Show the first five rows
df.head()

You are asking:

What does the data look like?

Step 2: Check the shape

# Number of rows and columns, returned as a (rows, columns) tuple
df.shape

# Example output:
(10000, 7)

This means:

  • 10,000 rows
  • 7 columns

Ask:

  • Do I have enough rows?
  • Are there fewer rows than expected?
  • Are there more columns than expected?
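These questions can be turned into a quick programmatic check. The sketch below is one possible approach, not part of the original workflow: the helper name check_structure and the EXPECTED_COLUMNS list are assumptions introduced here, and the small DataFrame is made-up data used only to show the report format.

```python
import pandas as pd

# Assumption: the expected columns are the ones listed at the top of this guide.
EXPECTED_COLUMNS = [
    "price", "size_sqft", "bedrooms", "bathrooms",
    "house_age", "distance_to_center_km", "neighborhood",
]

def check_structure(df, expected_columns):
    """Compare a DataFrame with the expected columns and count duplicate rows."""
    actual = set(df.columns)
    expected = set(expected_columns)
    return {
        "n_rows": len(df),
        "missing_columns": sorted(expected - actual),
        "unexpected_columns": sorted(actual - expected),
        "duplicate_rows": int(df.duplicated().sum()),
    }

# Made-up sample: two identical rows and one column that is not expected.
sample = pd.DataFrame({
    "price": [300000, 300000, 450000],
    "size_sqft": [1200, 1200, 1800],
    "listing_id": [1, 1, 2],
})

report = check_structure(sample, EXPECTED_COLUMNS)
print(report)
```

With the real data, you would call check_structure(df, EXPECTED_COLUMNS) right after loading the CSV.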

Step 3: Inspect column types

df.info()

This tells you:

  • Column names
  • Data types
  • Missing values
  • Number of non-null values

Important questions:

  • Is price numeric?
  • Is neighborhood categorical?
  • Are dates stored as strings?
  • Are numbers accidentally stored as text?

Example issue:

price = "$350,000"

This may be read as text, not a number. You may need to clean it.
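A minimal cleaning sketch, assuming the only non-numeric characters are dollar signs, thousands separators, and stray spaces. The helper name clean_price and the sample values are made up for illustration; real data may need extra handling, for example for other currencies or for values like "350k".

```python
import pandas as pd

def clean_price(series):
    """Convert strings such as "$350,000" to numbers; unparseable values become NaN."""
    # Remove dollar signs, commas, and whitespace before converting.
    stripped = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(stripped, errors="coerce")

# Made-up examples of how a price column might arrive as text.
raw = pd.Series(["$350,000", "420000", " $1,250,000 ", "unknown"])
cleaned = clean_price(raw)
print(cleaned)
```

Using errors="coerce" turns anything that still cannot be parsed into NaN, so those rows show up later in the missing-value check instead of raising an error here.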

Step 4: Generate descriptive statistics

df.describe()

This gives summary statistics for numerical columns:

  • Count
  • Mean
  • Standard deviation
  • Minimum
  • 25th percentile
  • Median
  • 75th percentile
  • Maximum

Example:

  • price mean: 520,000
  • price median: 410,000
  • price max: 9,800,000

This tells a story. If the mean is much higher than the median, the data may be right-skewed. For house prices, this is common.

Step 5: Calculate mean, median, variance, and standard deviation directly

price_mean = df["price"].mean()
price_median = df["price"].median()
price_variance = df["price"].var()
price_std = df["price"].std()
print("Mean:", price_mean)
print("Median:", price_median)
print("Variance:", price_variance)
print("Standard deviation:", price_std)

Interpretation:

  • Mean: What is the average price?
  • Median: What is the middle price?
  • Variance: How spread out are prices?
  • Standard deviation: How far from the mean are prices typically?

If:

  • Mean = 520,000
  • Median = 410,000

Then expensive houses are pulling the mean upward. That is a clue.
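One detail worth knowing when reading the variance and standard deviation: pandas computes the sample versions by default (dividing by n - 1), while NumPy's np.var divides by n unless told otherwise. The sketch below uses five made-up prices to show the difference.

```python
import numpy as np
import pandas as pd

# Made-up prices, chosen so the arithmetic is easy to follow.
prices = pd.Series([200000, 300000, 400000, 500000, 600000])

sample_var = prices.var()                # pandas default: ddof=1, divides by n - 1
population_var = prices.var(ddof=0)      # divides by n
numpy_var = np.var(prices.to_numpy())    # NumPy default: ddof=0, divides by n

print("Sample variance:", sample_var)
print("Population variance:", population_var)
print("NumPy variance:", numpy_var)
print("Sample standard deviation:", prices.std())
```

On 10,000 rows the two versions are nearly identical, but the gap matters on small groups, such as a neighborhood with only a handful of listings.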

Step 6: Compare mean and median

df[["price", "size_sqft", "bedrooms", "house_age"]].agg(["mean", "median"])

This helps you detect skew.

If mean and median are close, the distribution may be roughly symmetric.

If mean and median are far apart, investigate the distribution.
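To put a number on "far apart", one option is to look at the gap relative to the median together with the sample skewness. This is a sketch under assumptions: the helper name skew_summary, the relative_gap measure, and the sample prices are introduced here for illustration, and the helper assumes a non-zero median.

```python
import pandas as pd

def skew_summary(series):
    """Summarize how far the mean sits from the median, plus sample skewness."""
    mean = series.mean()
    median = series.median()
    return {
        "mean": mean,
        "median": median,
        # Positive when the mean is above the median (a hint of right skew).
        "relative_gap": (mean - median) / median,
        "skewness": series.skew(),
    }

# Made-up prices with one very expensive house.
prices = pd.Series([250000, 300000, 350000, 400000, 450000, 2500000])
summary = skew_summary(prices)
print(summary)
```

A positive relative_gap and a positive skewness both point to a right tail. With the real data you could apply the helper to each numeric column in turn.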

Step 7: Plot distributions

Use histograms to understand the shape.

plt.figure(figsize=(8, 5))
sns.histplot(df["price"], bins=50, kde=True)
plt.title("Distribution of House Prices")
plt.xlabel("Price")
plt.ylabel("Count")
plt.show()

Ask:

  • Is the distribution symmetric?
  • Is it skewed?
  • Are there multiple peaks?
  • Are there extreme values?
  • Does the distribution match business expectations?

For house prices, you may see a long right tail. That means most homes fall in a typical price range, but a few are very expensive.

Step 8: Use box plots to detect outliers

plt.figure(figsize=(8, 5))
sns.boxplot(x=df["price"])
plt.title("Box Plot of House Prices")
plt.xlabel("Price")
plt.show()

A box plot shows:

  • Median
  • Middle 50 percent of values
  • Potential outliers

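Outliers usually appear as individual points beyond the whiskers. By default, the whiskers extend to the most extreme values within 1.5 times the IQR of the box edges.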

You can compare prices by neighborhood:

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x="neighborhood", y="price")
plt.title("House Prices by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Price")
plt.xticks(rotation=45)
plt.show()

This helps answer: Are some neighborhoods more expensive than others?

Step 9: Detect outliers using the IQR method

IQR means Interquartile Range. It measures the spread of the middle 50 percent of the data.

q1 = df["price"].quantile(0.25)
q3 = df["price"].quantile(0.75)
iqr = q3 - q1

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = df[(df["price"] < lower_bound) | (df["price"] > upper_bound)]
outliers.shape

Interpretation:

  • Values below lower_bound are unusually low
  • Values above upper_bound are unusually high

But remember:

The IQR method flags unusual values. It does not prove they are wrong.

Inspect them:

outliers.sort_values("price", ascending=False).head(10)
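If you want to repeat this check for several columns, the same logic can be wrapped in small helpers. The function names iqr_bounds and flag_outliers are assumptions introduced for this sketch, and the sample prices are made up so the bounds are easy to verify by hand.

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return (lower, upper) bounds: q1 - k * IQR and q3 + k * IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_outliers(df, column, k=1.5):
    """Return the rows whose value in `column` falls outside the IQR bounds."""
    lower, upper = iqr_bounds(df[column], k)
    return df[(df[column] < lower) | (df[column] > upper)]

# Made-up prices: eight ordinary values and one extreme value.
sample = pd.DataFrame({
    "price": [100000, 200000, 300000, 400000, 500000,
              600000, 700000, 800000, 5000000],
})

lower, upper = iqr_bounds(sample["price"])
flagged = flag_outliers(sample, "price")
print("Bounds:", lower, upper)
print(flagged)
```

Here q1 is 300,000 and q3 is 700,000, so the bounds are -300,000 and 1,300,000 and only the 5,000,000 listing is flagged. A negative lower bound is common for right-skewed data and simply means no low-side outliers can be flagged.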

Step 10: Create scatter plots

Now let us examine relationships.

Size vs price:

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="size_sqft", y="price")
plt.title("House Size vs Price")
plt.xlabel("Size in Square Feet")
plt.ylabel("Price")
plt.show()

Ask:

  • Do larger houses generally cost more?
  • Is the relationship linear?
  • Are there clusters?
  • Are there large houses with surprisingly low prices?
  • Are there small houses with surprisingly high prices?

Distance vs price:

plt.figure(figsize=(8, 5))
sns.scatterplot(data=df, x="distance_to_center_km", y="price")
plt.title("Distance to City Center vs Price")
plt.xlabel("Distance to City Center in km")
plt.ylabel("Price")
plt.show()

Ask:

  • Do prices decrease as distance increases?
  • Is the pattern strong or weak?
  • Are some distant houses still expensive?
  • Could neighborhood explain those exceptions?

Step 11: Calculate correlations

numeric_df = df.select_dtypes(include=["number"])
correlations = numeric_df.corr()
correlations["price"].sort_values(ascending=False)

Example output:

Feature                  Correlation
price                     1.00
size_sqft                 0.82
bathrooms                 0.65
bedrooms                  0.58
house_age                -0.31
distance_to_center_km    -0.55

Interpretation:

  • size_sqft has a strong positive relationship with price (0.82)
  • distance_to_center_km has a moderate negative relationship with price (-0.55)
  • house_age has a weak negative relationship with price (-0.31)

But do not stop here.

Correlation is a clue, not a final answer.
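One reason to treat it as a clue: the default Pearson correlation measures linear association only. The sketch below uses made-up data where y always increases with x, but not along a straight line. It compares Pearson with a rank-based correlation, computed here as Pearson on the ranks, which is how Spearman correlation is defined.

```python
import pandas as pd

# Made-up data: y rises with x every time, but the relationship is curved.
x = pd.Series(range(1, 11), dtype="float64")
y = x ** 3

pearson = x.corr(y)                   # linear association
rank_corr = x.rank().corr(y.rank())   # Spearman: Pearson applied to the ranks

print("Pearson:", round(pearson, 3))
print("Rank correlation:", round(rank_corr, 3))
```

The rank correlation is 1.0 because the ordering is perfectly preserved, while Pearson is lower because the points do not lie on a line. A large gap between the two is a hint to look at the scatter plot again.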

Step 12: Create a correlation heatmap

plt.figure(figsize=(10, 8))
sns.heatmap(correlations, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation Heatmap")
plt.show()

Look for:

  • Strong correlations with price
  • Strong correlations between features
  • Variables that may duplicate each other
  • Unexpected relationships

Example:

  • If bedrooms and size_sqft are highly correlated, that makes sense: larger houses usually have more bedrooms.
  • But if house_age and price are strongly positively correlated, that may require investigation. Maybe older houses are in premium historical neighborhoods.
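On a large heatmap, strongly related feature pairs can be hard to spot by eye. A possible way to list them is sketched below. The helper name strong_pairs and the 0.8 threshold are assumptions chosen for illustration, and the sample values are made up.

```python
from itertools import combinations

import pandas as pd

def strong_pairs(df, threshold=0.8):
    """List (feature_a, feature_b, correlation) for pairs with |r| >= threshold."""
    corr = df.select_dtypes(include=["number"]).corr()
    pairs = []
    # combinations() visits each unordered pair once and skips the diagonal.
    for a, b in combinations(corr.columns, 2):
        r = corr.loc[a, b]
        if abs(r) >= threshold:
            pairs.append((a, b, float(r)))
    # Strongest relationships first, regardless of sign.
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)

# Made-up sample in which size and bedrooms move together.
sample = pd.DataFrame({
    "size_sqft": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 5],
    "house_age": [30, 5, 22, 12, 40],
})

pairs = strong_pairs(sample)
print(pairs)
```

Only the size_sqft and bedrooms pair passes the threshold in this sample. Pairs like this may carry overlapping information, which is worth noting before modeling.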

Step 13: Group by categories

EDA is not only numerical. You should also inspect categorical variables.

Example:

df.groupby("neighborhood")["price"].agg(["count", "mean", "median", "std"]).sort_values("median", ascending=False)

This answers:

  • Which neighborhoods are most expensive?
  • Which neighborhoods have the most listings?
  • Which neighborhoods have the most price variation?

A useful visualization:

plt.figure(figsize=(12, 6))
# Passing the estimator as a string requires seaborn 0.12 or newer
sns.barplot(data=df, x="neighborhood", y="price", estimator="median")
plt.title("Median House Price by Neighborhood")
plt.xlabel("Neighborhood")
plt.ylabel("Median Price")
plt.xticks(rotation=45)
plt.show()

Why median instead of mean? Because house prices often have outliers, and the median is less sensitive to them, so it gives a more stable comparison.

Step 14: Check missing values

df.isna().sum().sort_values(ascending=False)

Missing values can reveal important issues.

Example:

Feature         Missing Values
bathrooms       400
house_age       120
neighborhood    5

Ask:

  • Why is this data missing?
  • Is it missing randomly?
  • Is it missing more often for certain neighborhoods?
  • Should I impute, remove, or investigate?

Missingness itself can be informative.

For example, luxury properties may hide exact location or size.
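To check whether values are missing more often in certain groups, you can compute the share of missing values per category. A small sketch with made-up data follows. The neighborhood names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up sample: bathrooms is missing more often in one neighborhood.
sample = pd.DataFrame({
    "neighborhood": ["Center", "Center", "Center", "Center",
                     "Suburb", "Suburb", "Suburb", "Suburb"],
    "bathrooms": [2, np.nan, np.nan, 3, 1, 2, 2, np.nan],
})

# isna() gives True/False, so the group mean is the share of missing values.
missing_share = (
    sample["bathrooms"].isna()
    .groupby(sample["neighborhood"])
    .mean()
    .sort_values(ascending=False)
)
print(missing_share)
```

In this sample, half of the "Center" rows lack a bathrooms value compared with a quarter of the "Suburb" rows. A difference like that suggests the data is not missing at random and deserves a closer look before imputing.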

Step 15: Document your findings

EDA is not just making charts. You need to write down what you learned.

Example notes:

  • House prices are right-skewed: the median is lower than the mean.
  • Several luxury properties above $5M strongly affect the mean.
  • Size has a strong positive correlation with price.
  • Distance to city center has a moderate negative correlation with price.
  • Neighborhood explains large price differences.
  • Some records have suspiciously low house size values.
  • Price may need a log transformation before modeling.

This step is important because EDA should guide modeling decisions.
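As one example of a finding turning into a modeling decision, the log transformation noted above can be tried in a couple of lines. This is a sketch with made-up prices. Whether the transformation helps on real data depends on the distribution and on the model you plan to use.

```python
import numpy as np
import pandas as pd

# Made-up prices with one very expensive house.
prices = pd.Series([250000, 300000, 350000, 400000, 450000, 2500000])

# log1p computes log(1 + x), which also handles zero values safely.
log_prices = np.log1p(prices)

print("Skewness before:", prices.skew())
print("Skewness after:", log_prices.skew())

# expm1 reverses log1p, so predictions can be mapped back to prices.
recovered = np.expm1(log_prices)
```

The skewness drops after the transformation, although with an outlier this extreme the data is still right-skewed. If you model the log of price, remember to convert predictions back with np.expm1 before reporting them.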

Series Parts

Managing Data Science – From Concept to Governance

  1. The Analytics Continuum
  2. Exploratory Data Analysis (EDA) & Statistics
  3. A Practical EDA Checklist
  4. The How-To Guide: Step-by-Step EDA in Python
  5. From Big Data to Smart Data - The Art of Data Engineering & Data Management (next)