DAY 85

Attention 🔦🎭🧠

Learn how the attention mechanism gives every word a spotlight score so the model knows exactly which other words to focus on for context — and see why "bank" means something completely different depending on whether "river" is in the sentence.

⏱ 15 mins
⚡ +50 XP
Attention 🔦🎭🧠

Day 85: Attention Mechanism — Not All Words Matter Equally

Why Should I Care?

When you read "The bank by the river was steep" you instantly know bank means a riverbank — not a financial bank. You knew because your brain paid attention to the word river. Machines could not do this before 2017. The attention mechanism is what finally gave machines that same ability — to look at every word in context and decide which ones matter most for understanding right now. This is the single most important idea inside every LLM including ChatGPT, Gemini, and Claude.

Core Concept

Attention works like a spotlight operator at a theatre. All the words in a sentence are standing on stage. When the model is processing the word "bank" it shines a bright spotlight on "river" because river is the strongest context clue. Other words like "the" and "by" get dim spotlights because they give very little useful information. The brightness of each spotlight is called the attention weight. All weights add up to exactly 1.0 — the model distributes its full focus across every word on the stage.

How It Works

For each word being processed, the model calculates a raw score for every other word in the sentence. These raw scores go through a softmax function which converts them into probabilities that sum to 1.0. The result is an attention weight for every word — telling the model exactly how much to focus on each one. Words with high attention weights strongly influence the meaning of the current word. Words with low weights barely matter. This process runs for every word simultaneously — that is the parallel power of Transformers.


Sentence: "The bank by the river was steep"
Processing the word: "bank"

Attention weights calculated by softmax:

The    : 0.0456  (low  -- not relevant)
bank   : 0.0834  (self-attention)
by     : 0.0563  (low  -- preposition)
the    : 0.0456  (low  -- article)
river  : 0.4103  (HIGH -- context clue, almost half total attention)
was    : 0.0412  (low  -- verb)
steep  : 0.0620  (low  -- describes bank, some relevance)

All weights sum to 1.0 -- softmax ensures this always.
river = 0.41 -- model knows this is a natural bank, not financial.

Ambiguity example: "I saw the man with the telescope"
Pattern A: "I" attends to "telescope"   --> I had the telescope
Pattern B: "man" attends to "telescope" --> He had the telescope
Same words. Different attention. Different meaning.

Real World Connection

When you type "book a table at a good restaurant" into Google Assistant, attention helps the model focus on "restaurant" when understanding "table" — not furniture. When you search "apple store near me" on Google Maps, attention weights on "store" and "near me" are high — the word "apple" attends strongly to "store" to understand it means a tech shop, not a fruit. Every time an AI correctly understands an ambiguous word in your message, the attention mechanism is doing that disambiguation in real time.

Examples


import numpy as np

# Softmax converts raw scores to probabilities summing to 1.0
def softmax(x):
    e = np.exp(x - np.max(x))   # subtract max for numerical stability
    return e / e.sum()

words  = ["The", "bank", "by", "the", "river", "was", "steep"]

# Raw attention scores before normalisation
# river gets highest raw score -- strongest context for "bank"
scores = np.array([0.3, 0.9, 0.5, 0.3, 1.8, 0.2, 0.4])

attention_weights = softmax(scores)

print("-- Attention Weights for ''bank'' --")
for word, weight in zip(words, attention_weights):
    bar = "=" * int(weight * 40)   # visual bar = weight magnitude
    print(f"{word:<8}: {weight:.4f} {bar}")

# OUTPUT:
# -- Attention Weights for ''bank'' --
# The     : 0.0456 =
# bank    : 0.0834 ===
# by      : 0.0563 ==
# the     : 0.0456 =
# river   : 0.4103 ================
# was     : 0.0412 =
# steep   : 0.0620 ==
#
# river = 0.41 -- model knows this is a natural bank, not financial
# All weights sum to 1.0 -- softmax ensures this

Common Mistakes

Two misunderstandings about attention that beginners get wrong. First, thinking attention replaces embeddings. Second, assuming there is only one attention head doing all the work.


-- MISTAKE 1: Thinking attention replaces embeddings --

WRONG:
  "Attention replaces word embeddings —
   we don''t need both"

CORRECT:
  Attention works ON TOP of embeddings.
  Embeddings give words their meaning coordinates.
  Attention decides which meanings to focus on right now.

NOTE: "Embeddings = what words mean.
       Attention = which words matter right now."
RULE: Both are required. Neither replaces the other.


-- MISTAKE 2: Assuming one attention head captures all context --

WRONG:
  One attention head captures everything needed.

CORRECT:
  Transformers use MULTI-HEAD attention —
  multiple spotlights running simultaneously.
  Head 1 captures syntax relationships.
  Head 2 captures semantic relationships.
  Head 3 captures coreference (who refers to what).

NOTE: "Syntax, semantics, coreference —
       each needs its own attention head."
RULE: Embeddings give meaning. Attention decides focus. Both required.

Mini Challenge

Mini Challenge

Copy the attention weights code from the Examples section and change the sentence to "Virat hit the ball out of the ground" — seven words. Manually assign raw scores where "ball" and "hit" get high scores when processing "Virat", and "ground" gets a medium score. Run softmax on your scores and print the attention bar chart. Then change one score so "ground" gets the highest weight — notice how the context shifts. You just manually tuned an attention mechanism like an AI engineer.

Quick Quiz

Q1: What does a high attention weight for a word mean? A1: It means that word is strongly influencing the meaning of the word currently being processed — the model is shining a bright spotlight on it. Q2: Why must attention weights always sum to 1.0? A2: Because softmax converts raw scores into probabilities — the model distributes its full focus of 1.0 across every word, none is ignored completely. Q3: What is multi-head attention and why is it needed? A3: Multiple attention heads running simultaneously — each capturing different types of relationships like syntax, semantics, and coreference — because one head cannot capture all context at once.

Bonus Knowledge

The original attention paper published in 2017 was literally titled "Attention Is All You Need" — and the title proved prophetic. Before attention, models like LSTMs used a single fixed-size context vector to carry meaning through a sequence — it was like trying to summarise an entire movie in one sentence before watching the next scene. Attention threw that bottleneck away completely. Now every word can look directly at every other word with no compression required. GPT-4 uses 96 attention heads running in parallel. Each one learns to specialise in a different language relationship. That is why it can write code, explain philosophy, and translate Hindi all with the same weights.

Key Takeaways

Key Takeaways

  • Attention gives every word a spotlight score for every other word — showing the model exactly where to focus for context.
  • Attention weights are calculated by softmax and always sum to 1.0 — the model distributes full focus across the sentence.
  • High attention weight means that word strongly influences the meaning of the word being processed right now.
  • Embeddings give words their meaning — attention decides which meanings matter right now. Both are required.
  • Multi-head attention runs multiple spotlights simultaneously — each head specialises in syntax, semantics, or coreference.
  • Attention is the breakthrough that made Transformers possible — and Transformers made every modern LLM possible.

← Previous Lesson