Prediction App 🚀🤖📊
Phase 4 complete — combine every ML skill into one real working prediction system. Data, split, train, evaluate, predict. A full pipeline any data team would recognise.
Day 70: Mini Project — Prediction System
Why Should I Care?
70 days ago you printed Hello World. Today you are building a machine that learns from data, evaluates its own performance, and predicts the future. That is not a student project anymore — that is a working AI system. Every skill from Phase 4 — NumPy arrays, train-test split, LinearRegression, MAE, R-squared, overfitting detection — snaps together here into one complete pipeline. This is the exact structure that data teams at real companies build, review, and ship to production. You built it. From scratch. In 70 days.
Core Concept
A real ML system has five stages — every single time, for every single model, at every company in the world. Stage 1 is Data — collect your features and labels as clean arrays. Stage 2 is Split — divide honestly into 80% train and 20% test. Stage 3 is Train — let the model learn the pattern from training data only. Stage 4 is Evaluate — measure R-squared and MAE on the test set, check the gap for overfitting. Stage 5 is Predict — give it brand new input it has never seen and get a real answer. That is the whole system. Five stages. One pipeline. Infinite use cases.
How It Works
The complete project code is structured in four clean sections marked with comments — DATA, SPLIT, TRAIN, and EVALUATE. This is professional structure. Real codebases look exactly like this. Each section does one job. Nothing bleeds into another. You can read it top to bottom and know exactly what is happening at every line. The output is a full evaluation report — Train R2, Test R2, MAE, Fit Status, and a live prediction on new data. Not a snippet. A real working system.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# --- DATA ---
hours = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
scores = np.array([45, 55, 62, 71, 80, 85, 91, 97, 103, 110])
# --- SPLIT ---
X_train, X_test, y_train, y_test = train_test_split(
hours, scores, test_size=0.2, random_state=42
)
# --- TRAIN ---
model = LinearRegression()
model.fit(X_train, y_train)
# --- EVALUATE ---
train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
mae = mean_absolute_error(y_test, model.predict(X_test))
gap = train_score - test_score
prediction = model.predict([[11]])[0]
print("=" * 40)
print(" RohithBuilds -- Score Predictor")
print("=" * 40)
print(f"Train R2 : {train_score:.2f}")
print(f"Test R2 : {test_score:.2f}")
print(f"MAE : {mae:.2f}")
print(f"Fit Status : {'Good Fit' if gap <= 0.1 else 'Overfit'}")
print(f"Predict(11h) : {prediction:.1f} marks")
print("=" * 40)Real World Connection
This exact five-stage pipeline is what a data scientist at Swiggy runs to predict delivery time. What a finance engineer at CRED runs to predict credit risk. What an ML engineer at YouTube runs to predict which video you will watch next. The features change — distance, income, watch history — but the structure never does. Data. Split. Train. Evaluate. Predict. Every ML system ever shipped to production started with these five stages in exactly this order. You now speak the same language as every data team in the world.
Examples
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error
# --- DATA ---
hours = np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10]])
scores = np.array([45, 55, 62, 71, 80, 85, 91, 97, 103, 110])
# --- SPLIT ---
X_train, X_test, y_train, y_test = train_test_split(
hours, scores, test_size=0.2, random_state=42
)
# --- TRAIN ---
model = LinearRegression()
model.fit(X_train, y_train)
# --- EVALUATE ---
train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
mae = mean_absolute_error(y_test, model.predict(X_test))
gap = train_score - test_score
prediction = model.predict([[11]])[0]
print("=" * 40)
print(" RohithBuilds -- Score Predictor")
print("=" * 40)
print(f"Train R2 : {train_score:.2f}")
print(f"Test R2 : {test_score:.2f}")
print(f"MAE : {mae:.2f}")
fit_status = "Good Fit" if gap <= 0.1 else "Overfit"
print(f"Fit Status : {fit_status}")
print(f"Predict(11h) : {prediction:.1f} marks")
print("=" * 40)
# Output:
# ========================================
# RohithBuilds -- Score Predictor
# ========================================
# Train R2 : 0.98
# Test R2 : 0.97
# MAE : 4.10
# Fit Status : Good Fit
# Predict(11h) : 117.3 marks
# ========================================Common Mistakes
Two mistakes that break the entire pipeline even when every individual piece looks correct. First — mixing up the order and evaluating on training data instead of test data, making the results look perfect when they are not. Second — skipping the fit status check entirely and shipping a model that is quietly overfitting in production. Always check the gap. Always print the fit status. A system that cannot diagnose itself is not a real system.
# WRONG - Evaluating on training data by mistake
mae = mean_absolute_error(y_train, model.predict(X_train))
print(f"MAE: {mae:.2f}")
# MAE looks tiny — model already saw this data
# This score is fake. Real MAE on test set could be 3x higher.
# CORRECT - Always evaluate on the locked test set
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: {mae:.2f}")
# This is the real number. This is what goes in your report.
# ------------------------------------------
# WRONG - Shipping without checking fit status
model.fit(X_train, y_train)
prediction = model.predict([[11]])[0]
print(f"Prediction: {prediction:.1f}")
# No evaluation. No gap check. Overfit model ships silently.
# CORRECT - Always print the full evaluation report first
train_score = r2_score(y_train, model.predict(X_train))
test_score = r2_score(y_test, model.predict(X_test))
gap = train_score - test_score
print(f"Gap: {gap:.2f} -> {'Good Fit' if gap <= 0.1 else 'Overfit — fix before shipping'}")
# Never skip the diagnosis. A system that cannot check itself is broken.Mini Challenge
Mini Challenge
Build your own named prediction system — call it something real like PhonePe Fraud Scorer, Zomato Delivery Timer, or CRED Credit Ranker. Create a dataset with at least 10 data points relevant to your chosen theme. Run all five stages — DATA, SPLIT, TRAIN, EVALUATE, PREDICT. Print a full formatted report with your system name at the top, Train R2, Test R2, MAE, Fit Status, and a live prediction on one new input. The output should look like a real ML report — clean, structured, professional. Make it something you would be proud to show.
Quick Quiz
Q: What are the five stages of a complete ML prediction system in the correct order?
A: Data, Split, Train, Evaluate, Predict — always in this order, every model, every time, no exceptions.
Q: What does a Fit Status of Good Fit mean in the evaluation report and how is it calculated?
A: It means the gap between Train R2 and Test R2 is 0.10 or below — the model learned the real pattern and generalises well to new data without memorising.
Q: What is the difference between the prediction at Stage 3 and the prediction at Stage 5?
A: At Stage 3 the model is just learning from training data — there is no prediction yet. Stage 5 prediction is on brand new input the model has never seen — that is the real test of whether the system works.
Bonus Knowledge
You just completed Phase 4 — Machine Learning, Days 61 to 70. Look at how far the journey has come. Phase 1 was Python and OOP, Days 1 to 30. Phase 2 was Web and Databases, Days 31 to 49. Phase 3 was Data Science, Days 51 to 60. Phase 4 was Machine Learning, Days 61 to 70. Every phase built on the last one. Every concept unlocked the next. Phase 5 starts on Day 71 with Deep Learning — neural networks, layers, activation functions, and models that can see images and understand language. The pipeline you built today is the same foundation deep learning models run on. The stages never change. Only the algorithms get more powerful. Day 71 is going to change everything again.
Key Takeaways
Key Takeaways
- A complete ML prediction system has five stages — Data, Split, Train, Evaluate, Predict — always in this order.
- Structure your code in four clean sections — DATA, SPLIT, TRAIN, EVALUATE — just like professional ML engineers do.
- Always evaluate on the test set — Train R2, Test R2, MAE, and gap — never skip the diagnosis.
- Gap at or below 0.10 means Good Fit. Gap above 0.10 means overfitting — fix before shipping anything.
- Stage 5 prediction on new unseen input is the proof that your system actually works in the real world.
- Always evaluate before you predict — a model that cannot check itself is not a real system.
- This exact five-stage pipeline is what every data team at Swiggy, CRED, YouTube, and Amazon builds and ships.
- Phase 4 complete — 70 days down, Deep Learning awaits on Day 71.
Continue Learning with Rohi
You've used your 3 free Rohi questions. Create a free account to continue learning.