Exploratory Data Analysis 🔍🏠📋
Know your data before you trust your data. EDA is the house inspection — find every crack before it costs you later!
Day 59: Exploratory Data Analysis — Inspect Before You Build!
Why Should I Care?
Would you buy a house without inspecting it first? No! Hidden cracks, broken pipes, faulty wiring: you would want to find them during inspection, not after you moved in. Data works the same way. Hidden nulls, wrong data types and absurd values can quietly distort your analysis if you skip EDA. Many data science failures start before a single model is trained, when someone skips the inspection and builds on broken ground!
The Four-Room EDA Inspection
- Room 1 (Entrance): df.shape. How big is this dataset? How many rows and columns am I dealing with? Never skip the walkthrough.
- Room 2 (Living Room): df.info(). Are the data types correct? Is score stored as float when it should be integer?
- Room 3 (Bathroom): df.isnull().sum(). Find every hidden null. One missing value in the wrong column can skew every calculation that touches it.
- Room 4 (Bedroom): df.describe(). What is the range? Are there absurd values, such as negative ages or scores above 100?
The Complete EDA Pipeline
import pandas as pd

data = {
    "name": ["Rohith", "Sneha", "Arjun", "Priya", "Kiran"],
    "score": [87, 92, 45, 76, None],  # one missing score, planted on purpose
    "city": ["Hyderabad", "Mumbai", "Delhi", "Chennai", "Pune"],
}
df = pd.DataFrame(data)

print(f"Shape : {df.shape}")
print(f"Nulls :\n{df.isnull().sum()}")
print(f"Stats :\n{df['score'].describe()}")
Output: Shape (5, 3). Nulls: score has 1 null. Stats: count 4, mean 75, min 45, max 92. Notice that count is 4 even though the DataFrame has 5 rows. pandas skipped the null without any warning, so the mean describes only four students. EDA caught that before it could distort a later calculation!
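To see what that single null actually does, here is a small sketch that reuses the same scores. It relies only on standard pandas behaviour; the variable names mean_skipping_null and mean_with_null are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Rohith", "Sneha", "Arjun", "Priya", "Kiran"],
    "score": [87, 92, 45, 76, None],
})

# The missing value forces the whole-number scores into a float column.
print(df["score"].dtype)

# By default pandas skips the null, so this mean covers only 4 students.
mean_skipping_null = df["score"].mean()
print(mean_skipping_null)

# With skipna=False the null propagates and the result is NaN.
mean_with_null = df["score"].mean(skipna=False)
print(mean_with_null)
```

Neither result raises an error, which is why the problem stays hidden unless you look for it.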
All Four Commands Together
print(df.shape) # rows and columns — how big?
df.info() # column names, types, non-null counts (prints on its own, so no print() needed)
print(df.isnull().sum()) # exact null count per column
print(df.describe()) # min, max, mean, std — statistical snapshot
These four commands take 30 seconds and save hours of debugging later. Run them on every single dataset before touching anything. No exceptions. EDA is not optional!
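If you run the same four checks on every dataset, one option is to wrap them in a small helper. The sketch below is one possible way to do it, not a standard pandas function: the name quick_eda and the shape of the returned dictionary are assumptions made for this example.

```python
import pandas as pd

def quick_eda(df: pd.DataFrame) -> dict:
    """Print the four inspection checks and return the key numbers."""
    print(f"Shape: {df.shape}")
    df.info()  # prints directly and returns None
    null_counts = df.isnull().sum()
    print(null_counts)
    print(df.describe())
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "null_counts": null_counts.to_dict(),
    }

df = pd.DataFrame({
    "name": ["Rohith", "Sneha", "Arjun", "Priya", "Kiran"],
    "score": [87, 92, 45, 76, None],
    "city": ["Hyderabad", "Mumbai", "Delhi", "Chennai", "Pune"],
})

summary = quick_eda(df)
print(summary["null_counts"])
```

Returning the numbers as well as printing them makes it easy to compare the summary before and after cleaning.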
Real World Connection
When a Zomato data scientist receives order data from a new city partner, they run EDA first. Are delivery times stored as text or numbers? Are there null restaurant IDs? Are there orders with negative prices? When an IPL analyst gets match data, they check — are there matches with impossible scores? Are player names inconsistent? Is any column all nulls? EDA is literally the first thing every professional data scientist does every single time they open a new dataset!
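The order-data questions above can be sketched in pandas as follows. The orders table, its column names and its values are all made up for illustration, with each problem planted on purpose. Real partner data will look different.

```python
import pandas as pd

# Hypothetical order data with the problems planted deliberately.
orders = pd.DataFrame({
    "restaurant_id": [101, None, 103, 104],
    "delivery_minutes": ["32", "45", "28", "51"],  # numbers stored as text
    "price": [250.0, 180.0, -40.0, 320.0],
    "coupon_code": [None, None, None, None],       # entirely empty column
})

# Are delivery times stored as text or numbers?
is_numeric = pd.api.types.is_numeric_dtype(orders["delivery_minutes"])
print(f"delivery_minutes numeric? {is_numeric}")

# Convert text to numbers; errors="coerce" turns unparseable entries into NaN.
orders["delivery_minutes"] = pd.to_numeric(orders["delivery_minutes"], errors="coerce")

# Are there null restaurant IDs?
null_restaurant_ids = int(orders["restaurant_id"].isnull().sum())
print(f"Null restaurant IDs: {null_restaurant_ids}")

# Are there orders with negative prices?
negative_price_rows = orders[orders["price"] < 0]
print(negative_price_rows)

# Is any column all nulls?
all_null_columns = [col for col in orders.columns if orders[col].isnull().all()]
print(f"All-null columns: {all_null_columns}")
```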
Common Mistakes
Mistake 1 — Skipping EDA entirely.
# WRONG — building on unknown broken ground!
model.fit(df)
# CORRECT — inspect first, always!
print(df.shape)
df.info()
print(df.isnull().sum())
print(df.describe())
# then build!
Mistake 2 — Trusting describe() blindly without questioning the numbers.
# WRONG — describe() looks fine so data must be valid!
df.describe()
# CORRECT — describe() shows numbers, YOU must judge if they make sense!
# After describe() always ask:
# Any negative ages?
# Any scores above 100?
# Do min and max make real-world sense?
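Those questions can also be turned into explicit checks. The sketch below is one way to do it: the students table is invented for this example, and the valid ranges (age 0 to 120, score 0 to 100) are assumptions that you would replace with limits that fit your own data.

```python
import pandas as pd

# Made-up data with one impossible age and one impossible score.
students = pd.DataFrame({
    "name": ["Rohith", "Sneha", "Arjun", "Priya"],
    "age": [21, -3, 19, 22],
    "score": [88, 76, 140, 59],
})

# describe() reports the numbers but does not judge them.
print(students.describe())

# Assumed real-world limits for each column (inclusive on both ends).
valid_ranges = {"age": (0, 120), "score": (0, 100)}

suspicious = {}
for column, (low, high) in valid_ranges.items():
    # between() is inclusive by default; ~ selects rows outside the range.
    out_of_range = students[~students[column].between(low, high)]
    suspicious[column] = out_of_range["name"].tolist()

print(suspicious)
```

Keep in mind that between() returns False for missing values, so rows with nulls would be flagged here too.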
Mini Challenge
Create a DataFrame of 6 students with name, score and city. Deliberately add two None values in different columns. Run all four EDA commands: shape, info(), isnull().sum() and describe(). Find both nulls. Then remove the affected rows with df = df.dropna() (dropna() returns a new DataFrame, so assign the result). Run EDA again and confirm the nulls are gone. You just ran the same inspection pipeline that professional data scientists run before touching a new dataset!
Quick Quiz
Q: Which EDA command finds hidden null values in every column? A: df.isnull().sum() — counts every missing value column by column!
Q: What does df.describe() show you? A: A statistical snapshot: count, mean, standard deviation, min, quartiles and max for every numeric column!
Q: Is EDA optional if the dataset looks clean? A: Never! Hidden nulls and wrong types are invisible until you inspect — EDA is always mandatory!
Key Takeaways
- EDA is the house inspection — find every crack before it costs you later.
- Four commands every time: shape, info(), isnull().sum(), describe().
- One hidden null can silently skew your analysis, so always catch it first.
- describe() shows numbers — YOU must judge whether they make real-world sense.
- EDA is not optional. Inspect before you build. Every single time!