DAY 69

Overfitting 📉📈⚖️

Diagnose the two biggest ML problems — overfitting and underfitting. Learn to detect them using the train vs test R² gap and build models that generalise perfectly.

⏱ 15 mins

⚡ +50 XP

Day 69: Overfitting and Underfitting

Why Should I Care?

Imagine three students preparing for an exam. Arjun barely studied — he scores 52% in practice and 50% in the real exam. Learned nothing. Sneha memorised every single answer from the practice paper — she scores 99% in practice but only 61% in the real exam when the questions changed slightly. Rohith actually understood the concepts — 93% in practice and 91% in the real exam. That is a good model. Your ML model is one of these three students every time you train it. Your job is to make it Rohith — every single time.

Core Concept

Overfitting means your model memorised the training data. It performs brilliantly on data it has seen and fails the moment anything changes. Underfitting means your model is too simple — it learned almost nothing and performs badly everywhere, training and testing both. A good fit means the model understood the real pattern — it generalises to new data and keeps its performance close on both sets. The train versus test gap tells you exactly which zone you are in. Gap above 0.10 means overfitting. Both scores low means underfitting. Small gap, high scores — that is the goal.

How It Works

After training, calculate R-squared on both the training set and the test set separately. Subtract to get the gap. Overfit looks like: Train 99%, Test 61% — large gap, red flag. Good fit looks like: Train 93%, Test 91% — small gap, both high. Underfit looks like: Train 52%, Test 50% — both low, model learned nothing. The goal is always small gap and high scores on both sides. That is the Rohith zone.

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

X = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
y = np.array([20, 30, 35, 45, 55, 60, 68, 75, 80, 90])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

train_score = r2_score(y_train, model.predict(X_train))
test_score  = r2_score(y_test,  model.predict(X_test))
gap         = train_score - test_score

print(f"Train R2 : {train_score:.2f}")
print(f"Test  R2 : {test_score:.2f}")
print(f"Gap      : {gap:.2f} -> {'Overfit' if gap > 0.1 else 'Good Fit'}")

Real World Connection

Think about how Netflix trains its recommendation model. If it overfits, it memorises your exact past watch history and recommends the same shows you already watched — useless. If it underfits, it recommends completely random shows with no logic — equally useless. The good fit finds the real pattern — you like thrillers, evening watch sessions, short episodes — and recommends something new you have never seen but will love. PUBG anti-cheat works the same way. Overfit to known cheats and it misses new ones. Underfit and it flags every good player. The sweet spot — small gap, high scores — is what keeps every real ML system working in production.

Examples

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import numpy as np

X = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
y = np.array([20, 30, 35, 45, 55, 60, 68, 75, 80, 90])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

train_score = r2_score(y_train, model.predict(X_train))
test_score  = r2_score(y_test,  model.predict(X_test))
gap         = train_score - test_score

print(f"Train R2 : {train_score:.2f}")
print(f"Test  R2 : {test_score:.2f}")
print(f"Gap      : {gap:.2f}")

if gap > 0.1:
    print("Status   : Overfit — model memorised, not learned")
elif train_score < 0.7 and test_score < 0.7:
    print("Status   : Underfit — model learned nothing useful")
else:
    print("Status   : Good Fit — model generalises well")

# Output:
# Train R2 : 0.98
# Test  R2 : 0.97
# Gap      : 0.01
# Status   : Good Fit — model generalises well

Common Mistakes

Two dangerous assumptions kill models before they ever go live. First — thinking a high training score means a great model. It means nothing on its own. Models that memorise become useless the moment real data arrives. Second — thinking more features always make the model better. Too many features on too little data gives the model more ways to memorise noise instead of learning signal. More complexity is not always better. Always compare train score versus test score. The gap is everything.

# WRONG - High train score assumed to mean great model
train_score = r2_score(y_train, model.predict(X_train))
print(f"Train R2: {train_score:.2f}")
# Train R2: 0.99 -- looks amazing
# But test score could be 0.55 -- complete overfit
# Models that memorise become useless when real data arrives

# CORRECT - Always compare train vs test score
train_score = r2_score(y_train, model.predict(X_train))
test_score  = r2_score(y_test,  model.predict(X_test))
gap         = train_score - test_score
print(f"Train: {train_score:.2f} | Test: {test_score:.2f} | Gap: {gap:.2f}")
# Gap > 0.10 = overfitting. Fix it before shipping anything.

# ------------------------------------------

# WRONG - Adding more features assuming it always helps
# X with 50 features on 30 data points
# Model memorises every noise pattern -- gap explodes

# CORRECT - Match model complexity to your data size
# More data OR simpler model OR fewer features
# Extra features give the model more ways to memorise noise
# More complexity is not always better

Mini Challenge

You are the ML engineer at a food delivery startup. Build a model that predicts delivery time from distance. After training, calculate both train R-squared and test R-squared separately. Compute the gap and print a diagnosis — Overfit, Underfit, or Good Fit — using the 0.10 gap rule. Try to intentionally create an overfit scenario by using a very small dataset of only 5 points with a complex pattern. Then fix it by adding more data points and observe how the gap shrinks. Print both reports side by side.

Quick Quiz

Q: Your model scores Train R2 of 0.99 and Test R2 of 0.61. What is the problem and what does it mean?
A: That is overfitting — the gap is 0.38 which is far above the 0.10 threshold. The model memorised training answers instead of learning the real pattern. It will fail on real data.

Q: Your model scores Train R2 of 0.52 and Test R2 of 0.50. What is the problem?
A: That is underfitting — both scores are low. The model is too simple and learned almost nothing useful from the training data either.

Q: What is the ideal result you are aiming for when you compare train and test scores?
A: Small gap — ideally below 0.10 — and both scores high. That means the model understood the real pattern and generalises well to new data.

Bonus Knowledge

When you detect overfitting, three fixes work most reliably. First — get more training data. More examples make it harder for the model to memorise. Second — simplify the model. A LinearRegression overfits less than a complex DecisionTree on small data. Third — use regularisation. Techniques like Ridge and Lasso add a penalty that punishes the model for getting too complex — built right into sklearn. When you detect underfitting, the fixes go the other direction — try a more complex model, add more features, or train longer. The train versus test gap is your compass. It always tells you which direction to move.

Key Takeaways

Overfitting means the model memorised training data — great on training, fails on new data.
Underfitting means the model is too simple — performs badly on both training and test data.
A good fit has a small gap and high scores on both sides — that is the Rohith zone.
Calculate gap as Train R2 minus Test R2. Gap above 0.10 is an overfitting red flag.
Both scores low means underfitting — the model learned nothing worth keeping.
High training accuracy alone means nothing — always compare it against test score.
More features do not always help — too many features on small data creates more ways to memorise noise.
Fixes for overfit: more data, simpler model, regularisation. Fixes for underfit: more complexity, more features.

Overfitting 📉📈⚖️

Day 69: Overfitting and Underfitting

Why Should I Care?

Core Concept

How It Works

Real World Connection

Examples

Common Mistakes

Mini Challenge

Mini Challenge

Quick Quiz

Bonus Knowledge

Key Takeaways

Key Takeaways

Continue Learning with Rohi