Vol. XII · No. 04 · Apr 2026
Jake Cuth.

A trillion words.
A handful of equations.

Eight charts of real numbers on what actually happens inside a large language model when it learns. Scaling laws. The training stack. Tokenization. The Transformer. Training dynamics. Emergence — real or measurement artifact. Alignment. Benchmark saturation.

The page assumes you've heard the words but not necessarily seen the math behind them. Equations show up where they earn their keep, then get a sentence in plain language. Where the field is genuinely divided — emergence, the data wall, next-token prediction as a sufficient objective — both sides get the floor.


Plot pretraining loss against training compute, both on log axes, and you get something almost too clean: a near-straight line that has held across four orders of magnitude. Kaplan et al. (2020) discovered it; Hoffmann et al. (2022) refined it with a 70-billion-parameter model called Chinchilla. The chart below shows where the notable models actually land.

FIG. 16.1 · Pretraining loss vs training compute (log–log)

What this means

The Chinchilla rule: for a fixed compute budget, you should train on ~20 tokens per parameter. GPT-3 had 175B parameters and 300B tokens — only 1.7 tokens per parameter, far below the optimum. Llama 3 70B trained on 15T tokens — about 214 tokens per parameter, ten times past the Chinchilla point. Why the overshoot? Because Chinchilla is training-optimal, not inference-optimal: a smaller, more-trained model is cheaper to run forever even if it cost more to train once. Most modern releases optimize for the model you'll serve, not the model you'll train.
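
The arithmetic behind those figures fits in a few lines. A minimal sketch in Python, using the standard C ≈ 6·N·D approximation for training FLOPs and the ~20 tokens-per-parameter rule of thumb; the parameter and token counts are public figures (GPT-3 and Llama 3 70B as quoted above, Chinchilla from Hoffmann et al.), and the rule of thumb is an approximation, not a law.

    # Back-of-envelope scaling-law arithmetic.
    # C ≈ 6 * N * D: training FLOPs ≈ 6 × parameters × tokens (standard approximation).
    CHINCHILLA_TOKENS_PER_PARAM = 20

    models = {
        # name: (parameters, training tokens), public figures
        "GPT-3":       (175e9, 300e9),
        "Chinchilla":  (70e9, 1.4e12),
        "Llama 3 70B": (70e9, 15e12),
    }

    for name, (n_params, n_tokens) in models.items():
        flops = 6 * n_params * n_tokens
        ratio = n_tokens / n_params
        # Compute-optimal allocation at the same budget:
        # 6 * N * (20 * N) = C  =>  N = sqrt(C / 120), D = 20 * N.
        n_opt = (flops / (6 * CHINCHILLA_TOKENS_PER_PARAM)) ** 0.5
        d_opt = CHINCHILLA_TOKENS_PER_PARAM * n_opt
        print(f"{name:12s} {flops:.1e} FLOPs  {ratio:6.1f} tokens/param  "
              f"(Chinchilla-optimal: {n_opt/1e9:.0f}B params on {d_opt/1e12:.1f}T tokens)")

Run as written, this reproduces the 1.7 and ~214 tokens-per-parameter figures and shows the overshoot from the other direction: at Llama 3 70B's compute budget, the Chinchilla-optimal model would be roughly three times larger and see roughly a third of the tokens.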


A modern frontier model goes through four training stages, and they are not equal. Roughly 96% of compute is spent on pretraining — the model reads a substantial fraction of the public internet, predicting the next token at every step. The remaining ~4% teaches it to talk like an assistant, prefer helpful answers, and — since 2024 — think before responding.

FIG. 16.2 · Training stages by compute share, dataset size, and what they teach

Before a model sees a word, the word becomes a number. Byte-pair encoding (BPE) chops text into the most-common sub-word pieces and assigns each piece an integer. Vocabulary sizes have grown roughly five-fold since GPT-2 to better handle code, multiple languages, and rare characters.

FIG. 16.3 · Vocabulary size by model and year

Why it matters

  • Math: GPT-2's tokenizer splits long numbers into irregular chunks, which makes multi-digit arithmetic harder; GPT-4's tokenizer groups digits into consistent chunks of at most three (check the splits in the sketch below).
  • Code: doubling the vocabulary roughly halves the token count for source code, lowering inference cost.
  • Multilingual fairness: a small English-leaning vocab can turn one Korean character into 4–6 tokens, making non-English inference systematically more expensive.
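
A quick way to inspect the splits yourself, sketched with OpenAI's open-source tiktoken package and its GPT-2 and cl100k_base (GPT-4) encodings. The sample strings are illustrative and the exact splits depend on the tokenizer version, so treat the output as something to look at rather than figures to quote.

    # pip install tiktoken
    import tiktoken

    gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2 vocabulary, ~50k tokens
    gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 vocabulary, ~100k tokens

    samples = [
        "The cat sat on the mat.",    # common English: roughly one token per word
        "123456789 + 987654321",      # long numbers: split into digit chunks
        "def fib(n): return n if n < 2 else fib(n - 1) + fib(n - 2)",  # code
        "안녕하세요",                   # Korean greeting: several tokens per character in small vocabs
    ]

    for text in samples:
        for name, enc in [("gpt2", gpt2), ("cl100k", gpt4)]:
            ids = enc.encode(text)
            pieces = [enc.decode([i]) for i in ids]
            print(f"{name:7s} {len(ids):3d} tokens  {pieces}")
        print()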


Vaswani et al. (2017) replaced the recurrence in RNN/LSTM language models with a single mechanism: self-attention. Every token can attend to every other token in parallel, and the network learns which connections matter. Almost every modern LLM is a stack of decoder-only Transformer blocks. The breakdown of where the parameters actually live is below.

The attention equation
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Each query in Q is scored against the keys in K to produce attention weights, which then take a weighted average over the values in V. The √dₖ divisor keeps the softmax from saturating at large dimensions.
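
The equation is short enough to implement directly. A minimal single-head sketch in NumPy: no mask, no multiple heads, no learned projections, just the core operation, with made-up dimensions for illustration.

    import numpy as np

    def attention(Q, K, V):
        """Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V for one head, no mask."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq): token-to-token relevance
        scores -= scores.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
        return weights @ V                              # weighted average of the values

    # Toy sizes: 4 tokens, head dimension 8 (illustrative, not any real model's)
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    print(attention(Q, K, V).shape)   # (4, 8): one output vector per token
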
FIG. 16.4 · Llama 3 70B parameter breakdown

What this means

Most readers picture an LLM as "a giant attention machine." It isn't. Attention takes about 14% of the parameters. The MLP — two wide linear layers per block with a non-linearity in between (three projections in Llama's gated SwiGLU variant) — takes about 83%. The MLP is where most learned facts live; attention just decides which facts are relevant to which token.
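
A back-of-envelope count makes the split concrete. The sketch below uses the publicly documented Llama 3 70B configuration (hidden size 8192, 80 layers, 64 query heads with 8 grouped-query KV heads, FFN width 28672, vocabulary 128256); it ignores norms and biases and may bucket embeddings differently from the chart, so expect the same ballpark rather than the exact percentages above.

    # Llama 3 70B shapes, from the public config
    d_model  = 8192                   # hidden size
    n_layers = 80
    n_heads  = 64                     # query heads
    n_kv     = 8                      # grouped-query KV heads
    d_head   = d_model // n_heads     # 128
    d_ffn    = 28672                  # SwiGLU intermediate size
    vocab    = 128_256

    # Attention: Q and output projections are d_model x d_model;
    # K and V are shrunk by grouped-query attention (8 KV heads instead of 64).
    attn = n_layers * (2 * d_model * d_model + 2 * d_model * (n_kv * d_head))

    # MLP: SwiGLU uses three projections (gate, up, down), each d_model x d_ffn.
    mlp = n_layers * (3 * d_model * d_ffn)

    # Embeddings: input table plus output head (halve this if the two are tied).
    embed = 2 * vocab * d_model

    total = attn + mlp + embed
    for name, p in [("attention", attn), ("mlp", mlp), ("embeddings", embed), ("total", total)]:
        print(f"{name:10s} {p / 1e9:6.2f}B  ({100 * p / total:5.1f}%)")

Run as written this lands at roughly 70B parameters in total, with attention somewhere around a sixth of them and the MLP around four fifths; where the chart's exact percentages fall depends on what gets counted under each heading.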


Pretraining loss falls along a startlingly clean line on a log-step axis. But "training" isn't a single curve; it's a landscape of named phenomena. Below: the loss curve from a real public training run (OLMo 2 / Pythia–style), and four phenomena that make naive extrapolation dangerous.

FIG. 16.5 · Public training loss curve — log-step axis

Wei et al. (2022) coined the term: certain capabilities — multi-digit arithmetic, above-chance MMLU, instruction-following on novel tasks — appeared to emerge suddenly past a specific compute threshold. Below that threshold, models scored near zero; above it, performance jumped to non-trivial levels. Schaeffer, Miranda & Koyejo (2023, a NeurIPS Outstanding Paper) argued the jumps largely come from discontinuous metrics like exact-match accuracy: switch to a continuous metric and the curves smooth out.

FIG. 16.6 · Same training run, two metrics — discrete vs continuous

Where the field has landed

Most working researchers now treat both papers as partly right. Schaeffer's mirage critique is convincing for many specific "emergence" claims — the apparent jumps were artifacts of thresholded metrics. But Olsson et al.'s induction-head work (Anthropic, 2022) found real, sharp circuit-formation events during training that aren't artifacts of measurement. Practical takeaway: do not bet a product on a benchmark with a discrete metric; do read the underlying log-prob curve.
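
The mirage mechanism itself fits in a few lines. A synthetic sketch, not data from any real run: assume per-token accuracy improves smoothly with scale, then read the same improvement through a continuous metric (per-token accuracy) and a discrete one (exact match on a 10-token answer, which needs every token right).

    import numpy as np

    # Synthetic illustration of the metric effect; not real training data.
    scale = np.logspace(0, 4, 9)                   # arbitrary "model scale" axis
    per_token = 0.5 + 0.45 * np.log10(scale) / 4   # smooth improvement: 0.50 -> 0.95

    answer_len = 10
    exact_match = per_token ** answer_len          # all 10 tokens must be right at once

    for s, pt, em in zip(scale, per_token, exact_match):
        print(f"scale {s:8.0f}   per-token acc {pt:.2f}   exact match {em:.3f}")

The per-token column climbs steadily; the exact-match column sits near zero for most of the range and then appears to switch on near the top. Same underlying curve, two very different stories, which is the whole dispute in one table.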


A fresh pretrained model knows the internet but doesn't yet know how to be an assistant. The toolkit for closing that gap has grown substantially since 2017. Each method addresses a specific problem; modern frontier models stack several of them in sequence.

FIG. 16.7 · Methods, what they solve, what they cost

For most of the field's history, a new benchmark would last 4–5 years before frontier systems caught up to humans. Since 2022 that window has closed to under 18 months. The chart below shows trajectories on six canonical benchmarks plus the new ones built since 2024 to outpace saturation.

FIG. 16.8 · Frontier scores on six capability benchmarks, 2020 → 2026

What this means

MMLU went from barely above chance (GPT-3 scored ~44%, against a 25% guessing floor) to past the expert human baseline (~89.8%) in four years. HumanEval, MATH, and the older benchmarks have all saturated. The newer ones — GPQA Diamond, SWE-Bench Verified, ARC-AGI-2/3, FrontierMath, Humanity's Last Exam — were specifically designed to resist quick saturation. As of April 2026, frontier scores are still under 1% on ARC-AGI-3 and under 17% on FrontierMath.




Where the numbers come from

Model parameter and token counts: official model cards (Llama 3, DeepSeek-V3) and Epoch AI's notable-models database. GPT-4 architecture details are press estimates, not confirmed by OpenAI — labelled as such in the chart. Loss values are pretraining cross-entropy as reported in the relevant papers.

What's a 'token'?

BPE (byte-pair encoding) is the dominant scheme. A token is usually a sub-word: common words tokenize to one token, rare words tokenize to several. The exact split depends on the model's tokenizer. Token IDs in the worked example are correct for the respective tokenizers as of the cited model versions.

The emergence demonstration

The discrete-vs-continuous chart in § VI uses a representative trajectory in the spirit of Schaeffer et al. (2023). The original paper provides the full reproduction across multiple BIG-Bench tasks. Both the metric jump and the smooth underlying curve are real phenomena; only the framing is in dispute.

Frontier benchmark scores

Scores are from official model cards and the underlying benchmark leaderboards (LMSYS, Epoch AI, SWE-Bench Verified, the ARC Prize Foundation). 2025–26 scores update as new model releases land.

Honest caveats
  • GPT-4 architecture numbers are press estimates. OpenAI has never confirmed parameter count, MoE structure, or training-token total. The chart labels them as such.
  • "Loss proxy" values for the largest models are extrapolations consistent with published scaling laws, not measured pretraining loss; lab reports of pretraining loss for GPT-5/Claude 4 don't exist publicly.
  • Compute-share percentages for the four training stages are approximate. Reasoning-RL changes the budget materially; the share quoted (1–10%) is from the DeepSeek-R1 paper and is not directly comparable across labs.
  • Tokenization examples use the GPT-2 and GPT-4 tiktoken tokenizers. Other model families use different schemes; a SentencePiece BPE tokenizer can produce different splits even for the same vocab size.