Features & Labels 📋🔢🎯
What goes in. What comes out. The whole game. Features are the inputs the model sees. Labels are the answers it must predict!
Day 62: Features and Labels — The Whole ML Game!
Why Should I Care?
Every prediction a machine makes starts with features someone chose to measure. A doctor fills a patient form — Age, Weight, Blood Pressure, Smoker, Sugar level. Those measurements are the Features. The Diagnosis at the bottom is the Label. The model studies thousands of filled forms (Features) and learns to predict the diagnosis (Label) for new patients. Choose the right features and the model finds truth almost by itself!
Features vs Labels — Crystal Clear
Features (X) are everything the model sees — the inputs, the measurements, the clues. Labels (y) are what the model must predict — the answer, the outcome, the diagnosis. X is a table. y is a column. Keep them completely separate. Never mix them up. Never leak y into X — if the model sees the answer during training it learns nothing and becomes useless!
Splitting Features and Labels in Python
import pandas as pd
df = pd.DataFrame({
"hours_studied" : [1, 2, 3, 4, 5],
"attendance" : [60, 70, 80, 90, 95],
"score" : [45, 55, 65, 75, 85]
})
X = df[["hours_studied", "attendance"]]
y = df["score"]
print("Features (X):")
print(X)
print(f"\nLabels (y):\n{y.values}")
Output: X is a table with hours_studied and attendance. y is a column of scores. Double brackets [[ ]] on X gives a DataFrame — sklearn needs a DataFrame for features. Single bracket [ ] on y gives a Series — that is correct for labels. X is what the model studies. y is what it must predict!
Real World Feature and Label Examples
Gmail spam detection — Features: sender address, subject words, number of links. Label: spam or not spam. Zomato delivery time — Features: distance, restaurant prep time, traffic level, time of day. Label: delivery minutes. IPL match winner — Features: team rankings, player form, home advantage, weather. Label: win or lose. Loan approval — Features: income, age, credit score, employment years. Label: approved or rejected. The features are always the measurements. The label is always the prediction!
Feeding X and y to a Model
from sklearn.linear_model import LinearRegression
import pandas as pd
df = pd.DataFrame({
"hours_studied" : [1, 2, 3, 4, 5],
"attendance" : [60, 70, 80, 90, 95],
"score" : [45, 55, 65, 75, 85]
})
X = df[["hours_studied", "attendance"]]
y = df["score"]
model = LinearRegression()
model.fit(X, y)
new_student = [[4, 88]]
print(f"Predicted score: {model.predict(new_student)[0]:.1f}")
model.fit(X, y) — the model studies hours and attendance to learn how to predict score. model.predict([[4, 88]]) — give a new student with 4 hours studied and 88 percent attendance. The model predicts their score. Features in, label predicted. The whole game in 10 lines!
Common Mistakes
Mistake 1 — Using single brackets for features.
X = df["hours_studied"] # WRONG — gives Series, sklearn needs DataFrame!
X = df[["hours_studied"]] # CORRECT — double brackets give DataFrame!
Mistake 2 — Including the label column inside features.
X = df[["hours_studied", "attendance", "score"]] # WRONG — score is the answer!
y = df["score"] # Model sees answer, learns nothing!
X = df[["hours_studied", "attendance"]] # CORRECT — features only!
y = df["score"] # label separate!
Mini Challenge
Mini Challenge
Create a DataFrame with 6 houses — size in square feet, number of bedrooms, and price. Split it into X (size and bedrooms) and y (price). Train a LinearRegression model on it. Predict the price of a new house with 1200 sq ft and 3 bedrooms. Print the prediction. You just built the same house price prediction model that every real estate platform like 99acres and MagicBricks uses to estimate property values!
Quick Quiz
Q: What is the difference between X and y in machine learning? A: X is the features — everything the model sees as input. y is the label — the answer the model must predict!
Q: Why use double brackets df[[''col'']] for features instead of single df[''col'']? A: Double brackets return a DataFrame which sklearn requires. Single bracket returns a Series!
Q: What happens if you accidentally include the label column inside X? A: The model sees the answer during training and learns nothing — completely useless predictions!
Key Takeaways
Key Takeaways
- Features (X) are the inputs the model studies. Labels (y) are the answers it predicts.
- X is a table — use double brackets. y is a column — use single bracket.
- Never include the label column inside X — model sees the answer and learns nothing.
- model.fit(X, y) trains the model. model.predict(new_X) predicts on new data.
- Choose the right features and the model finds truth almost by itself!
Continue Learning with Rohi
You've used your 3 free Rohi questions. Create a free account to continue learning.