Exploratory Data Analysis and Statistics: Learning to Listen to Data Before Modeling

Learn how Exploratory Data Analysis helps you listen to data before modeling. This beginner-friendly guide explains mean, median, variance, distributions, correlation, scatter plots, heatmaps, and outliers using practical house pricing examples.

Illustration generated with Nano Banana 2 Pro. Exploratory data analysis (EDA)

Before you train a model, tune hyperparameters, or build a dashboard, you need to do something more basic and more important:

You need to listen to the data.

Exploratory Data Analysis, usually called EDA, is the practice of examining a dataset before modeling so you can understand what is typical, what is unusual, what variables move together, and what problems might distort your conclusions.

A good data scientist does not jump straight into machine learning. A data scientist first asks:

What does this dataset look like?
What is normal here?
What is strange?
Which variables seem related?
Are there missing values, outliers, or suspicious patterns?
Could the data be misleading me?

EDA is where you build your first mental model of the data.

In this post, we will learn the core ideas behind EDA and descriptive statistics using a real-world example: house prices.

We will cover:

Descriptive statistics: mean, median, variance, standard deviation
Data distributions
Outliers
Correlation
Scatter plots
Heatmaps
Practical EDA workflow in Python
Common mistakes and anti-patterns

1 The Real World Why

Imagine you work for a real estate company.

Business problem: Why do we care about outliers in house pricing?

You want to build a model that predicts house prices. The dataset contains information like:

House size in square feet
Number of bedrooms
Neighborhood
Age of the house
Distance to city center
Sale price

At first, this sounds straightforward. Bigger houses probably cost more. Houses in expensive neighborhoods probably cost more. Older houses may cost less, unless they are historic or renovated. But then you inspect the data and see this:

House	Size	Bedrooms	Price
A	1,400 sqft	3	$350,000
B	1,600 sqft	3	$390,000
C	1,800 sqft	4	$450,000
D	1,500 sqft	3	$375,000
E	1,700 sqft	3	$410,000
F	1,550 sqft	3	$4,200,000

House F is unusual.

Maybe it is a mansion incorrectly recorded as 1,550 sqft. Maybe it is in a luxury neighborhood. Maybe the price has an extra zero. Maybe it includes a large amount of land. Maybe it is simply a valid luxury property.

If you ignore this outlier and train a model immediately, your model may learn a distorted relationship between size and price.

The trained model may think:
"Some average-sized houses cost millions, so I should predict higher prices for many similar houses."

That is dangerous. This is why EDA matters. EDA helps us avoid blindly trusting the dataset. It helps us ask better questions before we build models.

In business terms, EDA protects us from:

Bad pricing decisions
Misleading dashboards
Broken forecasts
Poor model performance
Wrong strategic conclusions
Expensive mistakes caused by bad data assumptions

In simple terms:
EDA is the conversation you have with the data before asking it to make predictions.

2 Intuition: How to Think About EDA

EDA is like inspecting a used car before buying it. Imagine you want to buy a used car. You would not just look at the price and immediately pay.

You would inspect it first:

How many miles does it have?
Has it been in an accident?
Does the engine sound normal?
Are the tires worn out?
Is the price suspiciously low?
Is the seller hiding something?

EDA is the same idea, but for data. Before trusting a dataset, you inspect it.

You look for:

Typical values
Extreme values
Strange patterns
Missing values
Relationships between variables
Distributions
Possible errors

A model is only as good as the data and assumptions behind it. EDA helps you check both. Descriptive statistics are your data's first summary.

Imagine you walk into a classroom of students and want to understand their test scores. You could read every single score one by one, but that would take time.

Instead, you might ask:

What is the average score?
What is the middle score?
What is the highest score?
What is the lowest score?
Are most students close together, or are scores spread out?

These are descriptive statistics.

Descriptive statistics:
They help you summarize a dataset without looking at every single row.

3 Core Concepts

You'll learn about 4 core concepts:

Mean: balance point
Median: middle value
Variance: how spread out data is
Standard deviation: spread in original unit

3.1 Mean: The balance point

The mean is what most people call the average.

If house prices are:

300,000, 350,000, 400,000, 450,000, 500,000

The mean is:

(300,000 + 350,000 + 400,000 + 450,000 + 500,000) / 5 = 400,000

So the average house price is $400,000.

Intuition: Think of the mean as the balance point of a seesaw.

If every data point has weight, the mean is the point where the seesaw balances.

Formula

For values:

x1, x2, x3, ..., xn

The mean is:

mean = sum of all values / number of values

mean = (x1 + x2 + ... + xn) / n
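As a quick check, the formula can be written out in plain Python. The variable names are illustrative; the values are the five prices from the calculation above.

```python
# Five house prices from the worked example (in dollars).
prices = [300_000, 350_000, 400_000, 450_000, 500_000]

# mean = (x1 + x2 + ... + xn) / n
mean_price = sum(prices) / len(prices)

print(mean_price)  # 400000.0
```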

The mean is useful when:

The data is roughly balanced
There are no extreme outliers
You want a general central value
The variable is numerical

When the mean can mislead you:

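The mean is sensitive to outliers. A single extreme value can pull it far away from where most of the data sits. For example, if the 500,000 house in the calculation above had instead sold for 4,200,000, the mean would jump from 400,000 to 1,140,000, even though the other four prices are unchanged.

The mean is powerful, but dangerous.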

3.2 Median: The middle value

The median is the middle value after sorting the data.

Example:

300,000, 350,000, 400,000, 450,000, 500,000

The middle value is: 400,000

So the median is $400,000.

Intuition:
If the mean is the balance point, the median is the person standing in the middle of a line.

Imagine everyone lines up from cheapest house to most expensive house. The median is the house exactly in the middle.

Why the median is useful:

The median is more resistant to outliers.

In our example:

Mean = 1,140,000
Median = 400,000

Which one better describes a typical house?
Answer: "The median"
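A minimal sketch of this comparison, using Python's standard statistics module. The five prices are an assumption: they are chosen so that they reproduce the mean and median quoted above, with one 4,200,000 sale standing in for the extreme value.

```python
from statistics import mean, median

# Assumed prices: four ordinary sales plus one extreme value.
prices_with_outlier = [300_000, 350_000, 400_000, 450_000, 4_200_000]

mean_price = mean(prices_with_outlier)      # pulled up by the extreme sale
median_price = median(prices_with_outlier)  # middle of the sorted list

print(f"Mean:   {mean_price:,.0f}")    # Mean:   1,140,000
print(f"Median: {median_price:,.0f}")  # Median: 400,000
```

Four of the five houses cost well under half of the mean, which is why the median reads as the more typical figure here.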

Use the median when:

Your data has extreme values
Your distribution is skewed
You want a robust idea of typical
You are working with income, prices, wealth, rent, or other variables with long tails

💡

House prices, salaries, startup valuations, and customer spending often have outliers. The median is often more honest than the mean.

3.3 Variance: How spread out the data is

Mean and median tell us where the center is. But they do not tell us how spread out the data is.

Consider 2 neighborhoods:

Neighborhood A:

Neighborhood B:

Both have the same mean:

400,000

But they feel very different. Neighborhood A has stable prices. Neighborhood B has much more variation.

This is where variance helps.

Intuition:
Variance measures how much values wander away from the average.
Low variance means → values stay close to the mean.
High variance means → values are spread far away from the mean.
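To make this concrete, here is a small sketch with two hypothetical neighborhoods. The prices are invented for illustration (they are not figures from a real dataset); both lists average 400,000, but their spread differs by orders of magnitude.

```python
from statistics import mean, pvariance

# Hypothetical prices, chosen so both neighborhoods share the same mean.
neighborhood_a = [390_000, 395_000, 400_000, 405_000, 410_000]  # stable
neighborhood_b = [200_000, 300_000, 400_000, 500_000, 600_000]  # spread out

for name, prices in [("A", neighborhood_a), ("B", neighborhood_b)]:
    # pvariance is the population variance (divide by n).
    print(name, mean(prices), pvariance(prices))
```

In this toy data the variance of B is 400 times that of A, even though the means are identical.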

Simple explanation:

To calculate variance, ask:
1. How far is each value from the mean?
2. Square those differences so negative and positive differences do not cancel out.
3. Average those squared differences.

Why square the differences?
Answer: "If we did not square the differences, values below the mean and above the mean would cancel each other."

Example:

-10 + 10 = 0

That would incorrectly suggest no spread.

Squaring fixes this:

(-10)^2 + 10^2 = 100 + 100 = 200

Formula:

Population variance:

variance = sum((x - mean)^2) / n

Sample variance:

variance = sum((x - mean)^2) / (n - 1)

In practice, data scientists often use the sample version when working with a sample from a larger population.
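The three steps and both formulas can be sketched directly. The prices reuse the five example values from the mean section, and the variable names are illustrative. The standard library's statistics.pvariance and statistics.variance are used only as a cross-check.

```python
from statistics import pvariance, variance

prices = [300_000, 350_000, 400_000, 450_000, 500_000]
n = len(prices)
mean_price = sum(prices) / n

# Steps 1 and 2: distance of each value from the mean, squared.
squared_diffs = [(x - mean_price) ** 2 for x in prices]

# Step 3: average the squared differences.
population_variance = sum(squared_diffs) / n    # divide by n
sample_variance = sum(squared_diffs) / (n - 1)  # divide by n - 1

print(population_variance)  # 5000000000.0
print(sample_variance)      # 6250000000.0

# Cross-check against the standard library.
assert population_variance == pvariance(prices)
assert sample_variance == variance(prices)
```

The sample version is always a little larger, and the gap shrinks as n grows.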

3.4 Standard deviation: Spread in the original unit

Variance is useful, but it has one problem. If house prices are measured in dollars, variance is measured in squared dollars. That is hard to interpret. Standard deviation solves this by taking the square root of variance.

standard deviation = sqrt(variance)

If the standard deviation of house prices is $50,000, that is easier to understand than saying the variance is 2,500,000,000 squared dollars.

Intuition:
Standard deviation tells you the typical distance from the mean.

A small standard deviation means → values are packed closely together.
A large standard deviation means → values are spread out.
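A short sketch of the relationship, first with the numbers quoted above and then with the five example prices from the mean section. The variable names are illustrative.

```python
import math
from statistics import pstdev

# The variance quoted above, in squared dollars.
variance_sq_dollars = 2_500_000_000
std_dollars = math.sqrt(variance_sq_dollars)  # back in plain dollars
print(std_dollars)  # 50000.0

# Population standard deviation of the five example prices.
prices = [300_000, 350_000, 400_000, 450_000, 500_000]
print(round(pstdev(prices)))  # 70711
```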

4 Data Distributions: The Shape of the Story

A distribution shows how values are arranged.

Instead of only asking "What is the average?", we ask: "What does the full shape look like?"

For house prices, a distribution can tell us:

Are most houses affordable?
Are there a few extremely expensive houses?
Are there two separate markets, such as apartments and luxury homes?
Are prices roughly symmetric or heavily skewed?

Common distribution shapes are:

Normal distribution
Right-skewed distribution
Left-skewed distribution
Bimodal distribution

Normal distribution

A normal distribution looks like a bell curve. Most values are near the center, and fewer values appear at the extremes. Many natural measurements roughly follow this shape, such as height.

Right-skewed distribution

A right-skewed distribution has a long tail on the right. House prices and income often look like this. Most values are moderate, but a few are extremely high. The mean is usually higher than the median because extreme high values pull the mean upward.
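A small simulation illustrates the mean-above-median pattern. The prices here are synthetic, drawn from a log-normal distribution with arbitrary parameters; that choice is an assumption for illustration, not a claim about how real house prices are distributed.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic right-skewed "prices": most are moderate, a few are very high.
prices = rng.lognormal(mean=13.0, sigma=0.5, size=10_000)

mean_price = prices.mean()
median_price = np.median(prices)

print(f"mean   = {mean_price:,.0f}")
print(f"median = {median_price:,.0f}")
print("mean > median:", mean_price > median_price)
```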

Left-skewed distribution

A left-skewed distribution has a long tail on the left.

Example: test scores on an easy exam where most students score high but a few score low.

Bimodal distribution

A bimodal distribution has two peaks. This may indicate two different groups mixed together.

For example:

Small apartments and large houses
New customers and loyal customers
Economy cars and luxury cars

If your data has two peaks, do not rush to average everything together. You may need to analyze the groups separately.

5 Correlation: Do Variables Move Together?

Correlation measures how two numerical variables move together.

For example:

Does price increase with house size?
Does distance from city center decrease price?
Does number of bedrooms relate to price?
Does advertising spend relate to sales?

The most common correlation measure is Pearson correlation. It ranges from:

-1 to +1

Positive correlation

If one variable increases and the other also tends to increase, correlation is positive.

Example: larger house size means higher price

A correlation near +1 means a strong positive linear relationship.

Negative correlation

If one variable increases and the other tends to decrease, correlation is negative.

Example: greater distance from city center means lower price

A correlation near -1 means a strong negative linear relationship.
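Here is a sketch of both cases using NumPy's np.corrcoef, which returns the Pearson correlation matrix. The six houses are hypothetical values invented for this example.

```python
import numpy as np

# Hypothetical houses: size (sqft), distance to center (km), price ($).
size = np.array([1200, 1400, 1600, 1800, 2000, 2200])
distance = np.array([15, 12, 13, 8, 6, 3])
price = np.array([310_000, 350_000, 380_000, 430_000, 450_000, 500_000])

# corrcoef returns a 2x2 matrix; the off-diagonal entry is the correlation.
r_size_price = np.corrcoef(size, price)[0, 1]
r_distance_price = np.corrcoef(distance, price)[0, 1]

print(f"size vs price:     {r_size_price:+.2f}")      # strongly positive
print(f"distance vs price: {r_distance_price:+.2f}")  # strongly negative
```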

No linear correlation

If there is no clear straight-line relationship, correlation is close to '0' (zero).

Be aware:
correlation close to '0' does not always mean no relationship. It may mean no linear relationship.

There could still be a curved pattern.
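A tiny example of this trap: y below depends entirely on x, yet the Pearson correlation comes out at zero because the relationship is a symmetric curve rather than a line. The data is synthetic.

```python
import numpy as np

x = np.arange(-5, 6)  # -5, -4, ..., 5
y = x ** 2            # perfect U-shaped dependence on x

r = np.corrcoef(x, y)[0, 1]

print(abs(r) < 1e-9)  # True
```

A scatter plot of x against y would reveal the curve immediately, which is one reason to plot before trusting a single number.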

Correlation intuition:
Imagine two people walking.

If they walk in the same direction → they have positive correlation.
If one walks forward while the other walks backward → they have negative correlation.
If they move randomly with no shared pattern → their correlation is near zero.

Correlation examples:

Relationship	Correlation
House size and house price	Positive
Distance from city center and price	Often negative
Number of bathrooms and price	Often positive
Random ID number and price	Usually near zero
Ice cream sales and summer temperature	Positive

6 Scatter Plots: Seeing Relationships

A scatter plot shows the relationship between two numerical variables.

For house prices:

x-axis: house size
y-axis: house price

A scatter plot helps you see:

Positive relationships
Negative relationships
Clusters
Outliers
Curved relationships
Unequal spread
Strange data errors

Example interpretation:

If points rise from bottom-left to top-right, that suggests a positive relationship between size and price.
If points fall from top-left to bottom-right, that suggests a negative relationship.
If points form a cloud with no direction, there may be no strong linear relationship.
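A minimal sketch of such a plot with Matplotlib. It reuses the size and price columns from the exercise dataset in section 14, which includes one small but very expensive house; the output file name is arbitrary.

```python
import matplotlib.pyplot as plt

# Size and price columns from the exercise dataset in section 14.
size_sqft = [1200, 1500, 1700, 2000, 2200, 2500, 1600, 1800, 1400, 1550]
price = [300_000, 360_000, 390_000, 450_000, 480_000, 520_000,
         370_000, 410_000, 340_000, 4_200_000]

fig, ax = plt.subplots()
ax.scatter(size_sqft, price)
ax.set_xlabel("House size (sqft)")
ax.set_ylabel("Sale price ($)")
ax.set_title("Size vs. price")

# Save to disk; in an interactive session plt.show() works as well.
fig.savefig("size_vs_price.png")
```

In this data, nine points form a rising band near the bottom of the chart and one point sits far above them, which is the kind of record worth inspecting by hand.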

7 Heatmaps: Seeing Many Relationships at Once

A heatmap uses color to show values. In EDA, heatmaps are commonly used for correlation matrices. A correlation matrix shows correlations between many variables.

Example:

Variable	Price	Size	Bedrooms	Age	Distance
Price	1.00	0.82	0.61	-0.35	-0.58
Size	0.82	1.00	0.70	-0.20	-0.30
Bedrooms	0.61	0.70	1.00	-0.10	-0.22
Age	-0.35	-0.20	-0.10	1.00	0.25
Distance	-0.58	-0.30	-0.22	0.25	1.00

A heatmap makes this easier to read visually. Strong positive correlations may appear in one color. Strong negative correlations may appear in another color. Weak correlations may appear neutral.

Heatmaps are useful. They help you quickly identify:

Variables strongly related to the target
Variables strongly related to each other
Redundant features
Possible multicollinearity
Patterns worth investigating further

For example, if size and bedrooms are highly correlated, they may carry overlapping information.

That does not mean you must remove one immediately, but it tells you to pay attention.
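The same information can be pulled out of a correlation matrix programmatically. The sketch below rebuilds the example matrix from this section as a pandas DataFrame and lists the pairs whose absolute correlation reaches a threshold. The 0.7 cut-off is an arbitrary assumption, not a rule.

```python
import pandas as pd

cols = ["Price", "Size", "Bedrooms", "Age", "Distance"]
corr = pd.DataFrame(
    [
        [1.00, 0.82, 0.61, -0.35, -0.58],
        [0.82, 1.00, 0.70, -0.20, -0.30],
        [0.61, 0.70, 1.00, -0.10, -0.22],
        [-0.35, -0.20, -0.10, 1.00, 0.25],
        [-0.58, -0.30, -0.22, 0.25, 1.00],
    ],
    index=cols,
    columns=cols,
)

threshold = 0.7  # arbitrary cut-off for "strongly related"

# Walk the upper triangle so each pair is reported once.
strong_pairs = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) >= threshold
]

for a, b, r in strong_pairs:
    print(f"{a} / {b}: {r:+.2f}")
```

With this matrix the list contains Price / Size and Size / Bedrooms: one pair involving the target and one pair of overlapping features.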

8 Outlier Handling: Do Not Delete Strange Data Too Quickly

An outlier is a value that is far away from most other values.

In house pricing:

Most houses: $300,000 to $700,000
One house: $8,500,000

The $8.5 million house is an outlier. But outliers are not always errors.

They can be:

Data entry mistakes
Rare but valid cases
Fraud
Luxury products
Important edge cases
Measurement errors
Different population groups
Business-critical exceptions

The key rule:
Do not automatically delete outliers.

First ask: Is this outlier wrong, or is it telling me something important?

Common outlier handling strategies are:

Investigating
Keeping outlier
Removing outlier
Capping / winsorizing
Transforming variable
Segmenting data

Investigating

Before changing anything, inspect the record.

Ask:

Is the value possible?
Does it violate business rules?
Is it a typo?
Is it from a different segment?
Does it have missing related values?
Was it measured differently?

Example:

A house price of $99 might be a data error. A house price of $9,900,000 might be a luxury property.

These require different treatment.

Keeping outlier

Keep it if it is valid and relevant to the problem. If your model must predict luxury homes, you should not delete luxury homes.

Removing outlier

Remove it if it is clearly incorrect.

Example:

Negative house price
House size of 0 sqft
Age of house is 999 years due to placeholder value
Price entered in cents instead of dollars
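Rules like these can be written as explicit filters. This is a sketch on a made-up DataFrame; the column names and the 999 placeholder convention are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [350_000, -1, 410_000, 390_000],
    "size_sqft": [1400, 1600, 0, 1500],
    "age_years": [12, 30, 8, 999],
})

# Flag rows that break a basic business rule.
invalid = (
    (df["price"] <= 0)
    | (df["size_sqft"] <= 0)
    | (df["age_years"] == 999)  # assumed placeholder for "unknown"
)

print(df[invalid])  # inspect the flagged rows before dropping anything
clean = df[~invalid]
print(len(clean))   # 1
```

Keeping the rule in a named mask makes it easy to review the flagged rows and to document why they were removed.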

Capping / winsorizing

Capping means limiting extreme values to a chosen threshold.

Example:

Any price above the 99th percentile becomes the 99th percentile value.

This reduces the influence of extreme values while keeping the row.

Transforming variable

For skewed variables like price or income, a log transformation can help. Instead of modeling price, you model log(price). This compresses extreme high values and often makes patterns easier to model.
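A short sketch of the compression effect, using three illustrative prices. On the raw scale the most expensive house is 14 times the cheapest; on the natural-log scale the gap is under 3 units.

```python
import numpy as np

prices = np.array([300_000, 450_000, 4_200_000])
log_prices = np.log(prices)

print(prices.max() / prices.min())                     # 14.0
print(round(log_prices.max() - log_prices.min(), 2))   # 2.64

# The transformation is reversible: exp() recovers the original prices.
print(np.allclose(np.exp(log_prices), prices))         # True
```

If a model is trained on log(price), its predictions need to be passed through exp() before they are reported in dollars.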

Segmenting data

Sometimes an outlier is not wrong. It belongs to a different group.

For example:

Regular homes
Luxury homes
Commercial properties
Rural land
Apartments

If these are mixed together, your model may struggle. The solution may be to analyze or model segments separately.
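A sketch of a per-segment summary with pandas groupby. The segment labels and prices are hypothetical; in practice the segment might come from a property-type or neighborhood column.

```python
import pandas as pd

df = pd.DataFrame({
    "segment": ["regular", "regular", "regular", "luxury", "luxury"],
    "price": [350_000, 400_000, 450_000, 4_000_000, 5_000_000],
})

overall_median = df["price"].median()
median_by_segment = df.groupby("segment")["price"].median()

print(overall_median)      # 450000.0
print(median_by_segment)   # luxury 4,500,000 and regular 400,000
```

The pooled median describes neither group well, while the per-segment medians do.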

9 The How-To Guide: Step-by-Step EDA in Python

The step-by-step EDA how-to guide is here

10 A Practical EDA Checklist

You can find the EDA checklist here

11 Pitfalls and Anti-Patterns

Pitfall 1: Correlation does not imply causation

This is one of the most important rules in data science. If two variables move together, that does not mean one causes the other.

Example:

Ice cream sales and drowning incidents may both increase in summer. Does ice cream cause drowning?
Answer: "No"

A third variable, such as "hot weather", affects both.

In house pricing, suppose we find: number of bathrooms is strongly correlated with price.
Does adding a bathroom automatically raise the house value by the amount the correlation seems to suggest?
Answer: "Not necessarily. Larger houses tend to have more bathrooms, and larger houses also cost more."

Size may be the deeper explanation. Correlation gives us a clue. It does not prove causality.

Pitfall 2: Trusting the mean when the data is skewed

If data has extreme values, the mean may be misleading. For house prices, salaries, revenue, and customer spending, always compare:

mean vs median

If the mean is much larger than the median, you probably have a right-skewed distribution. In that case, use median for typical value and investigate outliers.

Pitfall 3: Deleting outliers automatically

Outliers can be annoying, but they can also be valuable. For example:

A fraud detection model needs fraud cases.
A luxury pricing model needs luxury properties.
A medical risk model needs rare but serious cases.

If you delete outliers without understanding them, you may delete the most important part of the dataset.

Better process:

Detect outliers
Inspect them
Understand the business context
Decide what to do
Document the decision

Pitfall 4: Only looking at summary statistics

Summary statistics are helpful, but they can hide patterns. Two datasets can have the same mean and variance but very different shapes. That is why visualization matters.

Always combine: summary statistics + visualizations

Use:

Histograms for distributions
Box plots for outliers
Scatter plots for relationships
Heatmaps for correlation patterns
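To see how summaries can hide shape, here are two hypothetical price lists (in thousands of dollars) constructed to have exactly the same mean and population variance. One is two separate clusters; the other is a single peak with two extreme values.

```python
from statistics import mean, pvariance

# Hypothetical prices in thousands of dollars.
two_clusters = [300, 300, 300, 300, 500, 500, 500, 500]
one_peak_two_extremes = [200, 400, 400, 400, 400, 400, 400, 600]

for name, data in [("two clusters", two_clusters),
                   ("one peak, two extremes", one_peak_two_extremes)]:
    print(name, mean(data), pvariance(data))
```

Both lines report a mean of 400 and a variance of 10000, yet a histogram of each list would look nothing alike.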

Pitfall 5: Ignoring data types

Not every number should be treated as a number.

Example:

zipcode = 90210

A zip code is numeric-looking, but it is really categorical. It does not make sense to calculate:

average zipcode

Similarly:

customer_id
product_id
house_id

These are identifiers, not mathematical quantities.

Always ask:
Does arithmetic make sense for this column?
If not, treat it as categorical or identifier data.
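In pandas, one way to act on this is to convert such columns to a non-numeric dtype so they drop out of numeric summaries. The DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "house_id": [101, 102, 103],
    "zipcode": [90210, 10001, 60614],
    "price": [1_200_000, 850_000, 430_000],
})

# Treat identifiers and codes as labels, not quantities.
df["house_id"] = df["house_id"].astype("string")
df["zipcode"] = df["zipcode"].astype("category")

numeric_cols = df.select_dtypes(include="number").columns.tolist()
print(numeric_cols)  # ['price']
```

After the conversion, calls such as df.corr(numeric_only=True) no longer treat the zip code or the ID as a measurement.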

Pitfall 6: Confusing visual patterns with proof

A scatter plot may show a trend, but your eyes can overinterpret. You might see a pattern where there is only noise. Use visualizations to generate hypotheses, then use statistical methods or validation to test them. EDA is exploratory. It helps you ask better questions. It does not automatically provide final proof.

Pitfall 7: Forgetting the business question

EDA can become endless. You can make hundreds of plots and still not answer the actual question. Always connect EDA back to the business problem.

For house pricing, useful questions include:

What factors are most associated with price?
Which records look suspicious?
Are there different market segments?
Should luxury homes be modeled separately?
What variables need cleaning before modeling?

EDA should serve decision-making.

12 Mini Case Study: Listening to House Price Data

Imagine we run EDA and discover:

Mean price: $620,000
Median price: $430,000
Max price: $12,000,000

What do we learn? The mean is much higher than the median. This suggests a right-skewed distribution with expensive outliers.

Then we plot the price distribution and see a long right tail.

Next, we create a scatter plot of size vs price. We see that larger houses generally cost more, but a few small houses are extremely expensive. We inspect those small expensive houses and find they are all in one luxury waterfront neighborhood.

Now we have learned something important: Price is not only about size. Location strongly matters.

Then we create a correlation heatmap. We find:

size_sqft and price: 0.78
distance_to_center_km and price: -0.52
bedrooms and size_sqft: 0.73

Interpretation:

Larger homes tend to cost more
Homes farther from the city center tend to cost less
Bedrooms and size overlap because larger homes usually have more bedrooms

Now we know what to consider before modeling:

Use median when summarizing prices
Consider log-transforming price
Investigate luxury outliers
Include neighborhood as an important feature
Watch for correlated features like bedrooms and size
Do not assume correlation means causation

This is what it means to listen to data.

13 From EDA to Modeling

EDA helps you make better modeling choices. For example, if the target is skewed, use:

import numpy as np
df["log_price"] = np.log1p(df["price"])

This may make the modeling problem easier.

If outliers are valid but extreme, consider:

Robust models
Log transformation
Segmented models
Tree-based models
Median-based evaluation metrics

If features are highly correlated, consider:

Removing redundant variables
Regularization
Feature selection
Interpreting model coefficients carefully

If categories matter, encode categorical variables properly. For example:

pd.get_dummies(df, columns=["neighborhood"], drop_first=True)

Or use models that handle categorical variables directly. If relationships are nonlinear, linear regression may not be enough.

Consider:

Feature engineering
Polynomial features
Decision trees
Random forests
Gradient boosting

EDA does not replace modeling.
EDA prepares you to model intelligently.

14 Exercise

Use this small dataset:

import pandas as pd

df = pd.DataFrame({
    "size_sqft": [1200, 1500, 1700, 2000, 2200, 2500, 1600, 1800, 1400, 1550],
    "bedrooms": [2, 3, 3, 4, 4, 4, 3, 3, 2, 3],
    "distance_to_center_km": [5, 7, 8, 10, 12, 15, 6, 9, 4, 5],
    "price": [300000, 360000, 390000, 450000, 480000, 520000, 370000, 410000, 340000, 4200000]
})

Now answer:

What is the mean price?
What is the median price?
Why are they so different?
Which house is the outlier?
What happens if you remove the outlier?
Is size correlated with price?
Does the scatter plot look reasonable?
What business question would you ask next?

Try this code:

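import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics for every numeric column
df.describe()

# Center and spread of the target
df["price"].mean()
df["price"].median()
df["price"].std()

# Pairwise Pearson correlations
df.corr(numeric_only=True)

# Size vs. price: look for the trend and for odd points
sns.scatterplot(data=df, x="size_sqft", y="price")
plt.show()

# Correlation heatmap, with 0 at the center of the color scale
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", center=0)
plt.show()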

Expected lesson:
The outlier strongly affects the mean and correlation. Before modeling, you need to understand whether that $4.2 million price is an error, a luxury property, or a different market segment.
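If you want to check your answers, the sketch below compares the summaries with and without the extreme row. The 1,000,000 cut-off is an ad hoc choice that happens to isolate the outlier in this toy dataset; it is not a general rule.

```python
import pandas as pd

df = pd.DataFrame({
    "size_sqft": [1200, 1500, 1700, 2000, 2200, 2500, 1600, 1800, 1400, 1550],
    "price": [300000, 360000, 390000, 450000, 480000, 520000,
              370000, 410000, 340000, 4200000],
})

# Ad hoc filter that drops only the 4.2 million row in this dataset.
without_outlier = df[df["price"] < 1_000_000]

for label, data in [("all rows", df), ("without outlier", without_outlier)]:
    r = data["size_sqft"].corr(data["price"])  # Pearson correlation
    print(f"{label}: mean={data['price'].mean():,.0f}, "
          f"median={data['price'].median():,.0f}, corr={r:.2f}")
```

With all rows the mean is 782,000 against a median of 400,000, and the size-price correlation is weak. Without the outlier the mean falls to about 402,000, close to the median of 390,000, and the correlation becomes strongly positive.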

15 Final Mental Model

EDA is not just a technical step. It is a way of thinking. When you do EDA, you are asking the dataset:

What is typical?
What is unusual?
What is related?
What is missing?
What could mislead me?
What should I investigate next?

Descriptive statistics give you summaries. Distributions show shape. Scatter plots show relationships. Heatmaps show many relationships at once. Outlier analysis protects you from distorted conclusions. Correlation helps you find patterns, but it does not prove cause and effect. The goal is not to make beautiful charts. The goal is to understand the data well enough to make better decisions. Before modeling, listen. The data usually has something to say.
