Evaluation 📊🎯🏆
Learn to evaluate your ML model using MAE and R² on unseen test data. Training score means nothing — only the test score is the truth.
Day 68: Model Evaluation
Why Should I Care?
Imagine you practice driving for weeks in an empty parking lot. You get 99% accuracy there. Great. But when the examiner puts you on real city roads with unknown turns and real traffic — that score is the only one that counts. Your ML model is the same. Training accuracy is always high because the model already saw that data. The only score that matters is on data it has never seen. That is model evaluation. That is the truth.
Core Concept
Two metrics measure your model on real conditions. MAE — Mean Absolute Error — tells you the average gap between what your model predicted and the actual correct value. A MAE of 4.10 means your predictions are off by about 4 marks on average. Lower is always better. R-squared tells you how much of the pattern in the data your model actually understood. It goes from 0.0 to 1.0. A score of 0.97 means the model explained 97 percent of the pattern — that is a strong model. Always compare both your training score and your testing score. Training score alone proves nothing.
How It Works
After training your model on the training set, you run predictions on the test set — the locked data it has never seen. You then pass the actual test labels and the predictions into mean_absolute_error and r2_score. Both come from sklearn.metrics. MAE gives you the error in the same units as your target — marks, rupees, minutes. R-squared gives you a proportion from 0 to 1. Always evaluate on X_test and y_test only — never on training data.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
X = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
y = np.array([20, 30, 35, 45, 55, 60, 68, 75, 80, 90])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
print(f"MAE : {mae:.2f}")
print(f"R2 : {r2:.2f}")
print(f"Pred: {predictions.round(1)}")
print(f"Real: {y_test}")Real World Connection
Think about how Zomato rates their delivery time predictor. They do not check how accurate it is on deliveries they already know — that is the parking lot. They check it on brand new orders happening right now — real city roads. That test score is what decides if the model goes live. CRED evaluates their credit score model the same way — train on old customer data, test on new customers the model has never seen. If the test MAE is acceptable, the model ships. If not, back to training. Every ML model in production passed a real test set first. Every single one.
Examples
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, r2_score
import numpy as np
X = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
y = np.array([20, 30, 35, 45, 55, 60, 68, 75, 80, 90])
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print("== Model Evaluation Report ==")
print(f"MAE : {mean_absolute_error(y_test, predictions):.2f}")
print(f"R2 : {r2_score(y_test, predictions):.2f}")
print(f"Pred : {predictions.round(1)}")
print(f"Real : {y_test}")
print(f"Predict 11 hours -> {model.predict([[11]])[0]:.1f}")
# Output:
# == Model Evaluation Report ==
# MAE : 4.10
# R2 : 0.97
# Pred : [49.2 74.8]
# Real : [45 71]
# Predict 11 hours -> 117.3Common Mistakes
Two mistakes make your evaluation completely meaningless. First — evaluating on training data instead of test data. Training accuracy is always artificially high because the model already memorised those answers. It proves nothing about real performance. Second — seeing R-squared of 1.0 on training data and thinking your model is perfect. That is overfitting — the model memorised, not learned. Always check R-squared on the test set. Perfect fit on training equals memorised, not understood.
# WRONG - Evaluating on training data
predictions_train = model.predict(X_train)
mae = mean_absolute_error(y_train, predictions_train)
# Score is always artificially high — model already saw this data
# This proves nothing about real-world performance
# CORRECT - Always evaluate on the test set
predictions_test = model.predict(X_test)
mae = mean_absolute_error(y_test, predictions_test)
# This is the only score that tells you how the model performs on new data
# ------------------------------------------
# WRONG - Trusting R2 = 1.0 on training data
train_r2 = r2_score(y_train, model.predict(X_train))
# R2 = 1.0 on training = overfitting warning
# The model memorised every answer — it learned nothing
# CORRECT - Check R2 on the test set
test_r2 = r2_score(y_test, model.predict(X_test))
# R2 = 0.97 on test data = strong model
# It understood the pattern, not just the answersMini Challenge
Mini Challenge
You are evaluating a salary predictor for a job platform like LinkedIn. Create a dataset with 10 employees — years of experience as the feature and monthly salary in thousands as the target. Build the full pipeline: split, train, predict, then print a complete evaluation report showing MAE and R-squared on the test set. Also compare your training R-squared with your testing R-squared and note whether the gap looks suspicious. Predict the salary for someone with 12 years of experience and print the result cleanly.
Quick Quiz
Q: Why is training accuracy always artificially high and meaningless for evaluation?
A: Because the model already saw and memorised the training data. It is like scoring 99% on questions you already know the answers to — it proves nothing about real performance on new data.
Q: What does an R-squared score of 0.97 on the test set mean?
A: The model explained 97 percent of the pattern in the unseen test data — that is a strong model. R-squared ranges from 0.0 (random guessing) to 1.0 (perfect fit).
Q: What does it mean if your training R-squared is 1.0 but your test R-squared is 0.60?
A: That is overfitting — the model memorised the training data perfectly but failed to generalise to new data. Perfect fit on training equals memorised, not learned.
Bonus Knowledge
MAE and R-squared are for regression models that predict numbers. For classification models that predict categories, you use different metrics — accuracy score, precision, recall, and F1 score. But the principle is identical: always measure on the test set only. Also — a small gap between training score and testing score is healthy and normal. A large gap is a red flag called overfitting. When you see Training R2 of 0.99 and Testing R2 of 0.55, your model is too complex. It learned noise, not signal. The fix is to simplify the model, get more training data, or use regularisation — all topics coming up in the next few days.
Key Takeaways
Key Takeaways
- Model evaluation = test score on unseen data — that is the only truth that matters.
- MAE measures the average gap between predictions and actual values — lower is better, same units as your target.
- R-squared measures how much of the pattern the model understood — closer to 1.0 is stronger.
- Always evaluate on X_test and y_test — never on training data, ever.
- Training score alone proves nothing — it is always artificially high because the model saw that data.
- R-squared of 1.0 on training data is an overfitting warning — the model memorised, not learned.
- Compare training score vs testing score — a large gap is a red flag for overfitting.
- Both MAE and R-squared come from sklearn.metrics and plug into the same six-step pipeline from Day 67.
Continue Learning with Rohi
You've used your 3 free Rohi questions. Create a free account to continue learning.