DAY 59

Exploratory Data Analysis 🔍🏠📋

Know your data before you trust your data. EDA is the house inspection — find every crack before it costs you later!

⏱ 15 mins
⚡ +50 XP

Day 59: Exploratory Data Analysis — Inspect Before You Build!

Why Should I Care?

Would you buy a house without inspecting it first? No! Hidden cracks, broken pipes, faulty wiring: you would want to find them during inspection, not after you moved in. Data works the same way. Hidden nulls, wrong data types and absurd values silently distort your analysis if you skip EDA. Many data science failures happen before a single model is trained. They happen when someone skipped the inspection and built on broken ground!

The Four-Room EDA Inspection

  • Room 1 (Entrance): df.shape. How big is this dataset? How many rows and columns am I dealing with? Never skip the walkthrough.
  • Room 2 (Living Room): df.info(). Are the data types correct? Is score stored as float when it should be integer?
  • Room 3 (Bathroom): df.isnull().sum(). Find every hidden null. Even one missing value in the wrong column can skew your analysis.
  • Room 4 (Bedroom): df.describe(). What is the range? Are there absurd values, such as negative ages or scores above 100?

The Complete EDA Pipeline


import pandas as pd

data = {
    "name"  : ["Rohith", "Sneha", "Arjun", "Priya", "Kiran"],
    "score" : [87, 92, 45, 76, None],
    "city"  : ["Hyderabad", "Mumbai", "Delhi", "Chennai", "Pune"]
}

df = pd.DataFrame(data)

print(f"Shape  : {df.shape}")
print(f"Nulls  :\n{df.isnull().sum()}")
print(f"Stats  :\n{df['score'].describe()}")

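Output: Shape is (5, 3). Nulls: score has 1 null, name and city have 0. Stats: count 4, mean 75, min 45, max 92. Notice that count is 4, not 5. pandas skips the missing score by default, so every statistic is quietly computed over four students instead of five. EDA caught that null before it could skew a result without anyone noticing!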
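
To see what pandas actually does with that missing score, here is a minimal sketch using the same five values. It relies on the default behaviour of Series.mean(), which skips NaN unless skipna=False is passed. The variable names are only for illustration.

```python
import pandas as pd

scores = pd.Series([87, 92, 45, 76, None], name="score")

# None is stored as NaN, which forces the column to a float dtype.
print(scores.dtype)

# Default behaviour: NaN is skipped, so the mean covers 4 values, not 5.
mean_skipping_nulls = scores.mean()
print(mean_skipping_nulls)

# With skipna=False the missing value propagates and the result is NaN.
mean_with_nulls = scores.mean(skipna=False)
print(mean_with_nulls)

# Treating the missing score as 0 gives a very different answer.
mean_if_zero = scores.fillna(0).mean()
print(mean_if_zero)
```

The same column gives 75.0, NaN or 60.0 depending on how the null is handled, which is why you want to know about it before you calculate anything.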

All Four Commands Together


print(df.shape)           # rows and columns — how big?
df.info()                 # column names, types, non-null counts (prints by itself and returns None)
print(df.isnull().sum())  # exact null count per column
print(df.describe())      # min, max, mean, std — statistical snapshot

These four commands take 30 seconds and save hours of debugging later. Run them on every single dataset before touching anything. No exceptions. EDA is not optional!
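
If you want the whole inspection in one call, one possible approach is to wrap the four commands in a small function. The sketch below is only an illustration: quick_eda is a hypothetical helper name, not a pandas function. It uses the buf parameter of df.info() to capture the text that info() would normally print.

```python
import io

import pandas as pd


def quick_eda(df: pd.DataFrame) -> dict:
    """Hypothetical helper: collect the four inspection results in one dict."""
    buffer = io.StringIO()
    df.info(buf=buffer)  # info() writes text, so capture it in a buffer
    return {
        "shape": df.shape,
        "info": buffer.getvalue(),
        "nulls": df.isnull().sum(),
        "stats": df.describe(),
    }


df = pd.DataFrame({
    "name": ["Rohith", "Sneha", "Arjun", "Priya", "Kiran"],
    "score": [87, 92, 45, 76, None],
    "city": ["Hyderabad", "Mumbai", "Delhi", "Chennai", "Pune"],
})

report = quick_eda(df)
print(report["shape"])
print(report["info"])
print(report["nulls"])
print(report["stats"])
```

Returning a dict instead of printing directly means you can also check the results in code, for example by testing whether report["nulls"].sum() is zero before moving on.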

Real World Connection

When a Zomato data scientist receives order data from a new city partner, they run EDA first. Are delivery times stored as text or numbers? Are there null restaurant IDs? Are there orders with negative prices? When an IPL analyst gets match data, they check — are there matches with impossible scores? Are player names inconsistent? Is any column all nulls? EDA is literally the first thing every professional data scientist does every single time they open a new dataset!
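
The questions above translate into a few lines of pandas. This is a rough sketch on invented order data: the column names (restaurant_id, delivery_mins, price) and all values are assumptions for illustration, not a real Zomato schema.

```python
import pandas as pd

# Hypothetical order data; column names and values are invented.
orders = pd.DataFrame({
    "restaurant_id": [101, None, 103, 104],
    "delivery_mins": ["32", "45", "unknown", "28"],  # numbers stored as text
    "price": [250.0, 399.0, -120.0, 180.0],
})

# Text columns show up with a non-numeric dtype here.
print(orders.dtypes)

# Convert text to numbers; values that cannot be parsed become NaN.
orders["delivery_mins"] = pd.to_numeric(orders["delivery_mins"], errors="coerce")

# Are there null restaurant IDs?
null_restaurant_ids = int(orders["restaurant_id"].isnull().sum())

# How many delivery times could not be read as numbers?
unparseable_times = int(orders["delivery_mins"].isnull().sum())

# Are there orders with negative prices?
negative_prices = orders[orders["price"] < 0]

# Is any column all nulls?
all_null_columns = orders.columns[orders.isnull().all()].tolist()

print(null_restaurant_ids, unparseable_times, len(negative_prices), all_null_columns)
```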

Common Mistakes

Mistake 1 — Skipping EDA entirely.


# WRONG: building on unknown broken ground!
# (model stands for any estimator you have already created)
model.fit(df)

# CORRECT: inspect first, always!
print(df.shape)
df.info()                 # prints by itself
print(df.isnull().sum())
print(df.describe())
# then build!

Mistake 2 — Trusting describe() blindly without questioning the numbers.


# WRONG — describe() looks fine so data must be valid!
df.describe()

# CORRECT — describe() shows numbers, YOU must judge if they make sense!
# After describe() always ask:
# Any negative ages?
# Any scores above 100?
# Do the min and max make real-world sense?
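
Those questions can be turned into explicit checks. Here is a minimal sketch, assuming ages must be non-negative and scores must lie between 0 and 100. The students DataFrame and its two absurd values are invented for illustration.

```python
import pandas as pd

# Hypothetical data with two deliberately absurd values.
students = pd.DataFrame({
    "name": ["Rohith", "Sneha", "Arjun", "Priya"],
    "age": [21, -3, 22, 20],
    "score": [87, 92, 145, 76],
})

# Assumed valid ranges: age >= 0, score between 0 and 100 inclusive.
bad_age = students["age"] < 0
bad_score = ~students["score"].between(0, 100)

# Keep every row that breaks at least one rule.
suspicious = students[bad_age | bad_score]
print(suspicious)
```

describe() would report min age -3 and max score 145 without complaint. The boolean masks are what tell you which rows to question.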

Mini Challenge

Create a DataFrame of 6 students with name, score and city. Deliberately add two None values in different columns. Run all four EDA commands — shape, info, isnull().sum() and describe(). Find both nulls. Then clean them with dropna(). Run EDA again and confirm the nulls are gone. You just ran the same inspection pipeline that every professional data scientist runs before touching a new dataset!

Quick Quiz

Q: Which EDA command finds hidden null values in every column? A: df.isnull().sum() — counts every missing value column by column!

Q: What does df.describe() show you? A: A statistical snapshot: count, mean, standard deviation, min, quartiles (25%, 50%, 75%) and max for every numeric column!

Q: Is EDA optional if the dataset looks clean? A: Never! Hidden nulls and wrong types are invisible until you inspect — EDA is always mandatory!

Key Takeaways

  • EDA is the house inspection — find every crack before it costs you later.
  • Four commands every time: shape, info(), isnull().sum(), describe().
  • One hidden null can silently skew your analysis, so always catch it first.
  • describe() shows numbers — YOU must judge whether they make real-world sense.
  • EDA is not optional. Inspect before you build. Every single time!
