Interactive Explainer

The Transformer — How AI Thinks

Self-attention, Q/K/V, multi-head attention, and next-token prediction — the architecture behind Claude, GPT, and every modern LLM.

📖 Reference ⚡ Interactive 🏦 Finance Context

🧠 The Problem: Sequential Bottleneck

Before transformers, models (RNNs/LSTMs) processed text one token at a time, squeezing all past information through a single hidden state. By token 50, information about token 1 is mostly gone.

The bottleneck problem
"The merchant who submitted the chargeback dispute last quarter ... was flagged."
↑ By the time the RNN reaches "flagged", it has mostly forgotten "merchant"
🔗

RNN: Sequential Chain

Processes tokens one by one. Information fades over distance. Can't parallelize — slow on GPUs. Token 1 is a whisper by token 50.

🌐

Transformer: Direct Access

Every token can directly look at every other token, no matter how far apart. No bottleneck. Fully parallel — fast on GPUs.

🔍

Self-Attention

The core mechanism. Each token asks: "Which other tokens are relevant to me?" and pulls information from them directly.

Parallel Processing

All tokens are processed simultaneously — not one by one. This is why transformers train orders of magnitude faster than RNNs on modern GPUs.

🔗 Where the Transformer Fits

📝Your Text
✂️Tokenizer
🔢Token IDs
📊Embeddings
🧠Transformer
(Attention)
💬Output

🔑 Query, Key, Value — The Language of Attention

Self-attention uses three concepts borrowed from information retrieval. Think of it like a library search:

🔍

Query (Q)

"What am I looking for?"
The current token asking a question

🏷️

Key (K)

"What do I contain?"
Every token advertising what it offers

📄

Value (V)

"Here's my actual content"
The information each token provides

📖 Step-by-Step: How Attention Works

Let's trace attention for the word "rate" in: "The merchant chargeback rate is high"

Step 1 — Compute similarity scores

"rate" sends out its Query and compares against every token's Key via dot product:

Dot products (raw attention scores)
q_rate · k_The = 0.1  (not relevant)
q_rate · k_merchant = 0.4  (somewhat relevant)
q_rate · k_chargeback = 0.8  (very relevant — what kind of rate?)
q_rate · k_rate = 0.9  (self-attention)
q_rate · k_is = 0.2  (low)
q_rate · k_high = 0.7  (relevant — describes the rate)

Step 2 — Normalize into probabilities (softmax)

Divide by √d_k (the key dimension), then apply softmax so the weights sum to 1:

Attention weights (after softmax; illustrative values)
The: 0.03 | merchant: 0.10 | chargeback: 0.28 | rate: 0.35 | is: 0.05 | high: 0.19

Step 3 — Weighted average of Values

Multiply each token's Value by its attention weight and sum:

New embedding for "rate"
output_rate = 0.03·v_The + 0.10·v_merchant + 0.28·v_chargeback + 0.35·v_rate + 0.05·v_is + 0.19·v_high

The result: "rate" now knows it's a chargeback rate that is high. Its embedding is enriched with context from the most relevant tokens.

💡
The matrix formula: Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
This is a single matrix multiplication — highly parallelizable on GPUs. Every token attends to every other token simultaneously.
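The whole three-step walkthrough above is a few lines of NumPy. This is a minimal sketch of the formula (random toy vectors stand in for real learned Q/K/V projections):

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating, for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ / √d_k) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # Step 1: similarity of every query to every key
    weights = softmax(scores, axis=-1)   # Step 2: each row sums to 1
    return weights @ V, weights          # Step 3: weighted average of Values

# Toy example: 6 tokens ("The merchant chargeback rate is high"), 4-dim vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 4))
out, w = attention(Q, K, V)
print(out.shape)        # (6, 4) — one context-enriched vector per token
print(w.sum(axis=-1))   # every row of attention weights sums to 1
```

Note that `attention` processes all six tokens in one matrix multiplication — no loop over positions.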

🎮 Self-Attention Heatmap

Click a token to see what it attends to
🎓
This heatmap shows attention weights — how much each token (row) pays attention to every other token (column). Brighter = stronger attention. Click any token on the left to highlight its attention pattern.
🔍
What to notice: Each row shows what one token "pays attention to". Nouns attend strongly to their adjectives. Verbs attend to their subjects. "rate" attends to "chargeback" (what kind of rate?) and "high" (what about the rate?). This is how the model understands context.

🏗️ Multi-Head Attention: Multiple Perspectives

One attention head captures one type of relationship. But language has many simultaneous relationships. The solution: run multiple attention heads in parallel, each learning different patterns.

Head 1: Syntax

"rate" → "is" (subject-verb)

Head 2: Semantics

"rate" → "chargeback" (modifier)

Head 3: Sentiment

"rate" → "high" (descriptor)

Head 4: Position

"rate" → nearby tokens

GPT-2 has 12 heads. Claude likely has 64-128 heads. Each head independently computes Q, K, V and produces its own attention pattern. The outputs are concatenated and projected back to the original dimension.
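The split-compute-concatenate pattern can be sketched as follows (a simplified NumPy version; real implementations fuse these steps and add biases, and the weight matrices here are random stand-ins for learned parameters):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (seq, d_model). Each head works in d_model // n_heads dimensions."""
    seq, d_model = X.shape
    d_head = d_model // n_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v

    def split(M):
        # Reshape (seq, d_model) -> (n_heads, seq, d_head): one slice per head.
        return M.reshape(seq, n_heads, d_head).transpose(1, 0, 2)

    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head attention scores
    heads = softmax(scores) @ Vh                           # (n_heads, seq, d_head)
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq, n_heads = 16, 6, 4
X = rng.normal(size=(seq, d_model))
W = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(X, *W, n_heads=n_heads)
print(out.shape)   # (6, 16) — same shape as the input, enriched by 4 perspectives
```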

🔄 Inside a Transformer Block

Each transformer block has two main parts, repeated N times (12 in GPT-2, 96 in GPT-3):

Input embeddings
  ↓
Multi-Head Self-Attention — tokens share information
  ↓ + residual connection + layer norm
Feed-Forward Network (MLP) — refine each token independently
  ↓ + residual connection + layer norm
Output embeddings (→ next block or final output)
💡
Residual connections add the input of each layer back to its output — like a "highway" that prevents information from fading through many layers. Without them, deep networks (96 layers!) would fail to train.
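The diagram above maps to a short forward pass. This sketch uses simple linear stand-ins for the attention and MLP sub-layers (a real block uses multi-head attention and a two-layer MLP), and the post-norm arrangement of the original 2017 paper; many modern models apply the norm before each sub-layer instead:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq = 8, 5

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

# Stand-in sub-layers with random weights (not learned).
W_attn = rng.normal(size=(d_model, d_model)) * 0.1
W_mlp = rng.normal(size=(d_model, d_model)) * 0.1
attn = lambda x: x @ W_attn                 # placeholder for multi-head attention
mlp = lambda x: np.maximum(0, x @ W_mlp)    # placeholder MLP (ReLU)

def transformer_block(x):
    x = layer_norm(x + attn(x))   # residual "highway" around attention
    x = layer_norm(x + mlp(x))    # residual "highway" around the MLP
    return x

x = rng.normal(size=(seq, d_model))
for _ in range(12):               # stack N blocks, as in GPT-2
    x = transformer_block(x)
print(x.shape)                    # (5, 8) — shape is preserved through every block
```

The `x + attn(x)` pattern is the residual connection: even if a sub-layer contributes little, the original signal passes through unchanged.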

🎭 Three Superpowers After Attention

After passing through the transformer blocks, each token's embedding is simultaneously:

📝

Token-Aware

"merchant" knows what "merchant" means

📍

Position-Aware

"merchant" at position 3 ≠ position 103

🌐

Context-Aware

"bank" in "river bank" ≠ "bank account"

No previous architecture achieved all three at once. This is why transformers dominate.

🔄 Training vs Inference: What Happens When?

The same transformer architecture is used for both training and inference — but the process is very different:

🏗️

Pre-Training (happens once)

Anthropic/OpenAI/Meta train the model on billions of sentences. Takes weeks on thousands of GPUs. Costs millions of dollars.

• All tokens processed in parallel
• Model sees the full sentence
• Masking prevents peeking ahead
• Weights are updated every batch
• Learns: embeddings, Q/K/V matrices, MLP weights

Inference (every time you prompt Claude)

You send a prompt, Claude generates a response. Takes seconds. Costs fractions of a cent.

• Tokens generated one at a time
• Each new token = full forward pass
• Uses its own predictions as input
• Weights are frozen (read-only)
• No learning — just applying what was learned
What happens when you prompt Claude: "Assess this merchant"
Step 1: Process ["Assess","this","merchant"] through all blocks → predict "risk"
Step 2: Process ["Assess","this","merchant","risk"] through ALL blocks again → predict "rating"
Step 3: Process ["Assess","this","merchant","risk","rating"] again → predict ":"
... repeat until done. Each step = full forward pass, no weights updated.
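The loop above can be sketched in a few lines. Here a tiny lookup table stands in for the full transformer forward pass (the table, vocabulary, and `<eos>` token are all made up for illustration) — the point is the feedback loop, where each prediction is appended to the input:

```python
# Hypothetical bigram table standing in for a real model's next-token prediction.
TABLE = {
    "merchant": "risk", "risk": "rating", "rating": ":",
    ":": "high", "high": "<eos>",
}

def model(tokens):
    """Pretend forward pass: look at the last token, predict the next one."""
    return TABLE.get(tokens[-1], "<eos>")

def generate(prompt, max_new_tokens=10):
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        nxt = model(tokens)        # one full "forward pass" per new token
        if nxt == "<eos>":
            break
        tokens.append(nxt)         # the prediction becomes part of the input
    return " ".join(tokens)

print(generate("Assess this merchant"))
# Assess this merchant risk rating : high
```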
| | Pre-Training | Inference |
|---|---|---|
| Goal | Learn language patterns | Generate useful output |
| Processing | All tokens in parallel | One token at a time (sequential) |
| Weights | Updated every batch | Frozen (read-only) |
| Cost | Millions of $, weeks of GPU time | Fractions of a cent per call |
| Happens | Once (by Anthropic/OpenAI/Meta) | Every time you send a prompt |
| Who does it | AI companies with GPU clusters | You, via API or Claude chat |
💡
For AnyCompany participants: When you use Claude or Bedrock, you're only doing inference — the model is frozen and just applying what it learned during pre-training. This is why it's so cheap (cents per assessment) and fast (seconds per response). The expensive training already happened.

🎯 Next Token Prediction: How AI Generates Text

After all transformer blocks process the input, the model predicts the most probable next token. This is the fundamental operation — everything Claude generates is one token at a time.

Choose an input sentence:
Input to the transformer
The merchant risk rating is
→ Transformer processes all tokens → final layer produces probabilities:
🎓 Adjust Temperature, Top-K, and Top-P to see how they filter the token pool. Then click any eligible bar to "sample" that token — just like the model does.

🎛️ Sampling Controls — The Three Knobs

These three parameters work together to control which tokens the model can pick from. Adjust them and watch the bars update in real time.

🔒 Precise
Temp 0.2 · K=3 · P=0.5
⚖️ Balanced
Temp 0.7 · K=10 · P=0.9
🎨 Creative
Temp 1.5 · K=all · P=1.0
🔧
How the three knobs work together:
① Temperature reshapes the probability curve — low = sharp peak (deterministic), high = flat (random).
② Top-K keeps only the K most probable tokens and discards the rest. K=3 means only the top 3 are eligible.
③ Top-P (nucleus sampling) keeps the smallest set of tokens whose cumulative probability ≥ P. P=0.5 means keep tokens until you cover 50% of the probability mass.
Top-K and Top-P are applied after temperature. If both are set, the stricter filter wins for each token.
🏭
Production reality — do developers actually tune these?
The short answer: most production apps only tune temperature and leave top-k/top-p at defaults. Here's why:
Anthropic (Claude) Exposes all three. top_k disabled by default. Docs say: "recommended for advanced use cases only — you usually only need temperature." Claude 4.5+ restricts: temperature or top_p, not both.
OpenAI (GPT) Does not expose top_k at all. Only temperature + top_p. Recommends changing only one. GPT-5.2 dropped top_p support entirely.
Amazon Bedrock Supports all three via Converse API. Defaults vary by model. Bedrock Agents with reasoning: temperature=0, top-k/top-p unset.
Industry trend APIs are getting simpler, not more complex. Newer models are better at self-calibrating. The knobs matter less as models improve.

Bottom line for AnyCompany: For risk assessments and compliance reports, set temperature: 0.1-0.3 and leave everything else at defaults. The model providers have already optimized the defaults for you.
Sources: AWS Bedrock Docs, Anthropic API Docs, OpenAI Community

💡
In your daily tools (Claude Cowork, Kiro), you can't change these sliders. But you can achieve the same effect through your prompts:
Want low temperature?
Add: "Be precise and factual. Follow this exact format. Use only the data provided."
Want high temperature?
Add: "Brainstorm 10 ideas. Be creative and unconventional. Suggest alternatives."

You'll practice these prompt techniques in the labs and exercises.

🔄 Autoregressive Generation: Watch AI Write

The model generates text one token at a time, feeding each prediction back as input. Click Play to watch it happen, or Step to go one token at a time.

The merchant risk rating is
🎓 Press Play to watch the model generate one token at a time, or Step to advance manually. Each new token becomes part of the input for the next prediction.
Training vs Inference: During training, all tokens are processed in parallel (the model sees the full answer and learns from it). During inference (when you use Claude), tokens are generated one at a time — each new token requires a full forward pass through all transformer blocks. This is why longer outputs take longer to generate.

🌳 The Transformer Family Tree

The original Transformer (2017) had both an encoder and decoder. Researchers discovered that using only one half works better for certain tasks:

| Architecture | What It Does | Models | Used For |
|---|---|---|---|
| Encoder-only | Reads text bidirectionally (sees left AND right) | BERT, RoBERTa | Understanding: classification, NER, search |
| Decoder-only | Reads left-to-right, generates one token at a time | GPT, Claude, LLaMA | Generation: chat, code, writing |
| Encoder-Decoder | Reads input fully, then generates output | T5, BART | Translation, summarization |

📊 Model Comparison

| Model | Type | Parameters | Context | Key Innovation |
|---|---|---|---|---|
| BERT (2018) | Encoder | 340M | 512 | Bidirectional attention, MLM pretraining |
| GPT-2 (2019) | Decoder | 1.5B | 1,024 | Showed scale improves quality |
| GPT-4 (2023) | Decoder | ~1.8T (MoE) | 128K | Mixture of Experts, multimodal |
| LLaMA 3 (2024) | Decoder | 8B–405B | 128K | Open-source, RoPE, GQA, SwiGLU |
| Claude (2024) | Decoder | Unknown | 200K | Constitutional AI, long context |
| T5 (2020) | Enc-Dec | 11B | 512 | Everything as text-to-text |

🏦 What This Means for AnyCompany

💬

Claude = Decoder-Only

When you chat with Claude, it generates one token at a time using masked self-attention. It can only look at tokens it has already generated — never peeks ahead.
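"Never peeks ahead" is enforced by a causal mask: positions to the right of the current token get their attention scores pushed to negative infinity before the softmax, so their weight becomes zero. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """True on/below the diagonal: token i may attend to tokens 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    scores = np.where(mask, scores, -1e9)   # blocked positions get ~zero weight
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))                   # uniform raw scores, for illustration
w = masked_softmax(scores, causal_mask(4))
print(np.round(w, 2))
# Row 0 attends only to itself; row 3 attends to all four tokens equally.
```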

🔍

RAG Uses Embeddings

When Claude searches your documents, it uses embedding similarity (cosine) to find relevant chunks. The transformer then processes those chunks to generate an answer.
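The retrieval step is just cosine similarity over embedding vectors. In this sketch the embeddings are random stand-ins; a real RAG pipeline would get them from an embedding model (for example, via Bedrock), not from a random generator:

```python
import numpy as np

rng = np.random.default_rng(0)
chunks = rng.normal(size=(100, 64))   # hypothetical: 100 document chunks, 64-dim embeddings
query = rng.normal(size=64)           # hypothetical embedding of the user's question

def cosine(a, b):
    # Cosine similarity: angle between vectors, ignoring their lengths.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = np.array([cosine(c, query) for c in chunks])
top3 = np.argsort(scores)[::-1][:3]   # indices of the 3 most similar chunks
print(top3)
# These chunks get stuffed into the prompt; the transformer then reads them via attention.
```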

🌡️

Temperature = Creativity

For risk assessments, use low temperature (0.1-0.3) for consistent ratings. For brainstorming, use higher temperature (0.7-1.0) for creative ideas.

In Claude Cowork & Kiro: You control this through tone in your prompt — "be precise" = low temp, "be creative" = high temp.

📏

Context Window = Memory

Claude's 200K context window means it can "see" ~150 pages at once. Every token in that window attends to every other token — that's the power of self-attention.