NLP šš¢š¤
Learn how machines read human language by converting raw text into numbers using tokenization and vectorization, and build your first text pipeline using CountVectorizer to transform sentences into data a model can actually understand.
Day 82: Natural Language Processing ā Turn Words Into Numbers
Why Should I Care?
Every time Google autocompletes your search, WhatsApp detects spam, or YouTube generates subtitles ā NLP is working behind the scenes. Natural Language Processing is the reason machines can read, understand, and respond to human language. Without NLP there would be no ChatGPT, no Google Translate, no Siri. This is where language meets AI ā and it all starts with one simple idea: text must become numbers before any model can touch it.
Core Concept
NLP is like being an exchange student learning Hindi from scratch. On Day 1 everything is noise ā random sounds with no meaning. By Day 30 patterns start emerging ā you notice certain words appear near happy emotions, others near sad ones. By Day 90 you are fluent ā text is no longer noise, it carries meaning. A machine goes through the exact same three stages when trained on millions of sentences. Patterns emerge automatically from the data.
How It Works
The text pipeline has four steps. Raw text comes in first. Then tokenization breaks it into individual words. Then vectorization converts those words into numbers using a word count matrix. Finally those numbers go into the model. The most important step is vectorization ā models only understand numbers, never strings. CountVectorizer is the tool that does this job. It builds a vocabulary from all unique words and then represents each sentence as a row of counts showing how many times each word appears.
Text Pipeline:
Raw Text --> Tokenize --> Vectorize --> Numbers --> Model
Example sentences:
"Rohith loves Python"
"Python is powerful"
"Rohith builds AI"
After CountVectorizer ā Word Count Matrix:
ai bl is lo pw py ro to
Text 1: 0 0 0 1 0 1 1 0 ("loves Python Rohith")
Text 2: 0 0 1 0 1 1 0 0 ("Python is powerful")
Text 3: 1 1 0 0 0 0 1 1 ("Rohith builds AI")
Each sentence = one row of numbers
Each column = one unique word from the vocabulary
Model can now read and learn from this
Real World Connection
When you type a review on Zomato ā "food was amazing, delivery was fast" ā NLP converts that sentence into numbers, checks which words appear near positive ratings, and classifies it as positive sentiment. When you get a suspicious message on WhatsApp saying "you won a free iPhone, click here" ā NLP detects words that appear near spam patterns and blocks it before you even see it. Every app that reads, filters, or responds to human text is running an NLP pipeline underneath.
Examples
from sklearn.feature_extraction.text import CountVectorizer
# Three simple sentences
texts = [
"Rohith loves Python",
"Python is powerful",
"Rohith builds AI tools"
]
# CountVectorizer builds vocabulary from all words
# and converts sentences into number arrays in one call
vec = CountVectorizer()
matrix = vec.fit_transform(texts).toarray()
print("ā Vocabulary ā")
print(vec.get_feature_names_out())
print("\nā Word Count Matrix ā")
for i, row in enumerate(matrix):
print(f"Text {i+1}: {row}")
# OUTPUT:
# ā Vocabulary ā
# [''ai'' ''builds'' ''is'' ''loves'' ''powerful'' ''python'' ''rohith'' ''tools'']
#
# ā Word Count Matrix ā
# Text 1: [0 0 0 1 0 1 1 0] --> loves Python Rohith
# Text 2: [0 0 1 0 1 1 0 0] --> Python is powerful
# Text 3: [1 1 0 0 0 0 1 1] --> Rohith builds AI tools
#
# Each row is one sentence as a number array.
# Now a model can train on this data.
Common Mistakes
Two mistakes beginners make when starting with NLP. First, trying to feed raw text strings directly into a model. Second, thinking NLP is only useful for building chatbots and missing its real power.
-- MISTAKE 1: Feeding raw text into a model --
WRONG:
model.fit(["Rohith loves Python", "Python is powerful"])
Result: TypeError ā models don''t read strings
CORRECT:
vec = CountVectorizer()
X = vec.fit_transform(texts).toarray()
model.fit(X, labels)
NOTE: "ML models only understand numbers ā
text must be vectorized first."
RULE: Text must become numbers before any model can touch it.
-- MISTAKE 2: Thinking NLP is just chatbots --
WRONG:
"NLP = just chatbots"
CORRECT:
NLP powers: translation, summarization, sentiment analysis,
search, spam detection, autocomplete, and much more.
NOTE: "Limiting NLP to chatbots means
missing 90% of its real applications."
Mini Challenge
Mini Challenge
Add three more sentences to the texts list in the Examples code ā make them about things you use daily like Instagram, PUBG, or Swiggy. Run CountVectorizer again and print the new vocabulary and matrix. Notice how the vocabulary grows with every new sentence you add. Then look at the matrix and find which column represents the word "python" ā what is the count for each sentence? You just manually read a word count matrix like an NLP engineer.
Quick Quiz
Q1: Why can''t you feed raw text strings directly into a machine learning model? A1: Because ML models only understand numbers ā text must be tokenized and vectorized into a number array first. Q2: What does CountVectorizer build from the input sentences? A2: A vocabulary of all unique words, and a word count matrix where each row is a sentence and each column is a word count. Q3: Name three real-world NLP applications beyond chatbots. A3: Translation, spam detection, sentiment analysis, autocomplete, summarization, and search ā any three of these are correct.
Bonus Knowledge
CountVectorizer is powerful but has one big weakness ā it treats every word as equally important. The word "is" appears in almost every sentence but tells you almost nothing useful. There is a smarter technique called TF-IDF ā Term Frequency Inverse Document Frequency ā that gives lower scores to words that appear everywhere and higher scores to words that are rare and meaningful. This is what Google uses when ranking search results. Words that appear on every website get ignored. Words unique to a specific page get boosted. You will build on this understanding when you reach embeddings in the next lesson.
Key Takeaways
Key Takeaways
- NLP is the field of AI that teaches machines to read, understand, and generate human language.
- The text pipeline always follows: raw text, tokenize, vectorize, numbers, then model.
- CountVectorizer builds a vocabulary and converts each sentence into a row of word counts.
- Text must always be converted to numbers before any ML model can process it ā no exceptions.
- NLP powers translation, search, spam detection, sentiment analysis, autocomplete, summarization, and much more ā not just chatbots.
- The word count matrix is the foundation ā every advanced NLP technique like embeddings and LLMs builds on top of this idea.
Continue Learning with Rohi
You've used your 3 free Rohi questions. Create a free account to continue learning.