What Actually Happens Inside an LLM


A practical, end‑to‑end deep dive with concrete examples, shapes, and pseudo‑code
0) A quick story to anchor everything (real‑world scenario)
Task: “Translate this customer email from Spanish to English, then summarize it in 3 bullet points.”
What truly happens inside:
Your text is tokenized → a list of integers.
Each integer is looked up in a huge embedding table → vectors.
Those vectors flow through N transformer layers: masked self‑attention + feed‑forward MLPs, with residuals and normalization.
The last vector is projected to vocabulary logits → softmax → a probability for every next token.
The model samples one token, appends it, repeats.
It doesn’t learn from your prompt. Weights are frozen. Any persistence beyond RAM depends on product settings/logs.
We’ll now unpack every step, with precise shapes and a training walk‑through.
1) Inference: the inside view when you ask a question
1.1 Tokenization (text → token IDs)
Modern LLMs use subword/byte tokenizers (BPE, SentencePiece). This avoids “unknown words”.
Example text:
"Traduce y resume: Cliente dice que el envío llegó tarde y quiere reembolso."
A SentencePiece‑style tokenizer might emit (toy example):
[ "Tr", "adu", "ce", "▁y", "▁res", "ume", ":", "▁Cliente", "▁dice", ... ]
Each piece has an ID. So you get an integer array:
ids = [12094, 815, 302, 220, 9011, 506, 58, 33422, 7711, ...]
Important:
Vocabulary size |V| ~ 32k–128k (typical modern models).
Tokenization choices shape sequence length and compute.
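As a concrete sketch (assuming the Hugging Face transformers library; the checkpoint name below is a placeholder, not a specific model):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("some-sentencepiece-model")  # placeholder checkpoint name
text = "Traduce y resume: Cliente dice que el envío llegó tarde y quiere reembolso."
ids = tok.encode(text)                    # list[int]; its length is your sequence length L
pieces = tok.convert_ids_to_tokens(ids)   # subword pieces like "▁Cliente", "▁dice"
print(len(ids), pieces[:8])
print(tok.decode(ids))                    # round-trips back to (roughly) the original text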
1.2 Embedding lookup (IDs → vectors)
There’s an embedding matrix E with shape [|V|, d_model].
For each token ID, you fetch a row (a vector of length d_model, e.g., 4096 or 8192).
If your prompt has L tokens, you now have a matrix:
X0 shape = [L, d_model]
Add positional info (e.g., RoPE: rotary position embedding) so order matters.
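A toy sketch of the lookup, assuming PyTorch and made-up sizes:

import torch

V, d_model = 32_000, 4096
E = torch.nn.Embedding(V, d_model)          # the [|V|, d_model] embedding table
ids = torch.tensor([12094, 815, 302, 220])  # token IDs from the tokenizer (L = 4 here)
X0 = E(ids)                                 # [L, d_model]: one embedding row per token
print(X0.shape)                             # torch.Size([4, 4096])
# RoPE-style position info is applied later, inside each attention layer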
1.3 One transformer layer (the core block)
Each of the N layers applies:
LayerNorm/RMSNorm
Multi‑Head Masked Self‑Attention
Residual add
Feed‑Forward MLP
Residual add
Shapes (typical):
d_model = 4096, heads = 32 ⇒ head_dim = d_model / heads = 128.
Q/K/V projections turn [L, d_model] into [L, heads, head_dim].
Masked attention: position i can only attend to tokens ≤ i. That’s how decoding works left‑to‑right.
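A minimal single-head version of masked self-attention, assuming PyTorch (real layers split into many heads and use fused kernels; this just shows the masking and shapes):

import math
import torch

def masked_self_attention(X, Wq, Wk, Wv):
    # X: [L, d_model]; Wq/Wk/Wv: [d_model, head_dim]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # each [L, head_dim]
    scores = (Q @ K.T) / math.sqrt(Q.shape[-1])            # [L, L] similarity scores
    L = X.shape[0]
    causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))     # position i sees only positions <= i
    return torch.softmax(scores, dim=-1) @ V               # [L, head_dim]

X = torch.randn(5, 64)                                     # toy: L = 5, d_model = 64
out = masked_self_attention(X, *(torch.randn(64, 64) for _ in range(3)))
print(out.shape)                                           # torch.Size([5, 64])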
1.4 Output projection → next‑token probabilities
After N layers, the last position’s vector (shape [d_model]) goes through a linear layer to [|V|] logits, then softmax → probabilities.
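A sketch of that last step with temperature and nucleus (top-p) sampling, assuming NumPy; real decoders add top-k, repetition penalties, etc.:

import numpy as np

def sample_next_token(logits, temperature=0.8, top_p=0.95):
    logits = logits / max(temperature, 1e-6)        # temperature: lower = more deterministic
    probs = np.exp(logits - logits.max())           # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                 # tokens sorted by probability, descending
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]                           # smallest set covering >= top_p of the mass
    return int(np.random.choice(keep, p=probs[keep] / probs[keep].sum()))

next_id = sample_next_token(np.random.randn(32_000))   # toy |V| = 32k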
1.5 Decoding loop with KV cache (how it streams fast)
Prefill: process the whole prompt once; store K/V (key/value) tensors for each layer (the KV cache).
Generate: for each new token, you don’t recompute the entire prompt; you only compute attention of the new position vs stored K/V. That’s the latency win.
KV cache memory (rule of thumb):
For each layer, store K and V of shape [L, heads_kv, head_dim].
If d_model = 4096, heads = 32, head_dim = 128, L = 4,096, fp16 (2 bytes):
Per layer per token ≈ heads * head_dim * 2 (K+V) * 2 bytes = 32 * 128 * 2 * 2 = 16,384 bytes ≈ 16 KB/token/layer.
At L = 4,096: per layer ≈ 16 KB * 4,096 ≈ 64 MB.
For N = 48 layers: ~3 GB cache per sequence (sharing/reduction tricks like GQA apply).
That’s why Grouped/Multi‑Query Attention (GQA/MQA) matters: fewer K/V sets than heads → dramatic memory savings.
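The same arithmetic as a tiny helper (hypothetical function, fp16 = 2 bytes per element):

def kv_cache_bytes(layers, heads_kv, head_dim, seq_len, bytes_per_elem=2):
    per_token_per_layer = heads_kv * head_dim * 2 * bytes_per_elem   # K and V
    return layers * seq_len * per_token_per_layer

full = kv_cache_bytes(layers=48, heads_kv=32, head_dim=128, seq_len=4096)
gqa  = kv_cache_bytes(layers=48, heads_kv=8,  head_dim=128, seq_len=4096)   # e.g. GQA with 8 KV heads
print(f"full MHA: {full / 2**30:.1f} GiB, GQA: {gqa / 2**30:.2f} GiB per sequence")   # ~3.0 vs ~0.75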
1.6 Minimal inference pseudo‑code (single sample)
# Given: ids (input token IDs), model with N layers, tokenizer
# Returns: generated token IDs

# 1) Prefill
X = embed(ids)                                   # [L, d_model] from embedding table + position enc
for layer in layers:
    X, cache[layer] = layer.forward_prefill(X)   # stores K/V in cache
logits = out_proj(X[-1])                         # last position
next_id = sample(softmax(logits))
generated = [next_id]

# 2) Decode loop (stream tokens)
for step in range(max_new_tokens):
    x = embed([generated[-1]])                   # [1, d_model], position increments
    for layer in layers:
        x, cache[layer] = layer.forward_decode(x, cache[layer])
    logits = out_proj(x.squeeze(0))              # [|V|]
    next_id = sample(softmax(logits))
    generated.append(next_id)
Note: real servers batch many requests, use paged attention, quantization, etc., but this is the essence.
2) Training: how the model actually learns (clean, end‑to‑end)
We’ll do two things:
Give you the real pipeline (curation → sharding → optimization → distributed).
Walk through a tiny, concrete example so it’s not abstract.
2.1 The real pipeline (frontier‑model style)
A) Data acquisition & cleaning
Sources: curated web, books, papers, code, multilingual text.
De‑duplication: MinHash/SimHash to remove near‑duplicates (prevents overfitting).
PII filtering & safety: regex + ML filters; remove risky content.
Language ID & domain balancing: keep mixture healthy (e.g., not 90% English Reddit).
Quality scoring: remove low‑signal pages (boilerplate, spam).
Code filters: detect language, remove copied gists/keys; dedupe license issues.
B) Tokenization & packing
Run the trained tokenizer; produce token IDs.
Pack into sequences of fixed length (e.g., 2k–8k).
- Example: pack multiple short docs into one 4k sequence with separators to improve GPU utilization.
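A simplified packing sketch (the function and EOD token ID are made up for illustration):

def pack_documents(docs, seq_len=4096, eod_id=2):
    buffer, sequences = [], []
    for doc_ids in docs:                    # each doc_ids is a list[int] of token IDs
        buffer.extend(doc_ids + [eod_id])   # separator between documents
        while len(buffer) >= seq_len:
            sequences.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return sequences                        # each entry is exactly seq_len tokens

packed = pack_documents([[5, 7, 9]] * 3000, seq_len=16)
print(len(packed), len(packed[0]))          # many fully-packed 16-token sequences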
C) Objective
- Causal language modeling (CLM):
For each position i in a sequence, predict token[i] from tokens[0..i‑1].
Loss = average cross‑entropy over positions.
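In code, the objective looks roughly like this (PyTorch, toy shapes, random data):

import torch
import torch.nn.functional as F

B, L, V = 2, 8, 100
logits = torch.randn(B, L, V)               # model output for every position
tokens = torch.randint(0, V, (B, L))        # the input sequence itself provides the targets

loss = F.cross_entropy(
    logits[:, :-1, :].reshape(-1, V),       # predictions at positions 0..L-2
    tokens[:, 1:].reshape(-1),              # targets are the next tokens 1..L-1
)
print(loss.item())                          # ≈ ln(100) ≈ 4.6 for a random, untrained model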
D) Optimization & schedules
AdamW optimizer (β1,β2 tuned), weight decay on most weights (not on LayerNorm/bias).
Learning rate schedule: warmup (e.g., 1–3% of steps) → cosine decay.
Mixed precision (BF16/FP16) to speed math and reduce memory.
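A sketch of that optimizer/schedule combination, assuming PyTorch; the hyperparameters are illustrative, and real setups also exclude LayerNorm/bias from weight decay via parameter groups:

import math
import torch

model = torch.nn.Linear(10, 10)             # stand-in for the real transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

total_steps, warmup_steps = 10_000, 200     # warmup ≈ 2% of steps

def lr_lambda(step):
    if step < warmup_steps:                 # linear warmup
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay toward zero

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)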
E) Distributed training
Data parallel: each GPU sees different sequences; gradients reduced across GPUs.
Tensor model parallel: split big matmuls across GPUs.
Pipeline parallel: split layers across stages (micro‑batching to keep them busy).
ZeRO sharding (optimizer/gradients/params sharded).
Activation checkpointing: save memory by recomputing activations in backward pass.
Gradient accumulation: simulate large batch sizes across many microbatches.
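Gradient accumulation, for example, is only a few lines (toy stand-ins for the model and data):

import torch

model = torch.nn.Linear(16, 16)                     # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
microbatches = [torch.randn(4, 16) for _ in range(32)]
accumulation_steps = 8

optimizer.zero_grad()
for i, x in enumerate(microbatches):
    loss = model(x).pow(2).mean()                   # placeholder loss
    (loss / accumulation_steps).backward()          # scale so accumulated grads average, not sum
    if (i + 1) % accumulation_steps == 0:           # one "real" step per 8 microbatches
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()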
F) Monitoring & eval
Track training/val loss, learning rate, grad norms, throughput.
Periodically evaluate on held‑out corpora and task suites (MMLU, coding sets).
Early stop if diverging, adjust mixtures if skewed.
G) Post‑training (for assistants)
SFT (supervised fine‑tuning): show the model “how to chat” and use tools.
Preference optimization: RLHF / DPO / ORPO so it prefers helpful, safe answers.
Safety tuning: add refusal patterns, jailbreak resistance.
2.2 Training shapes & numbers (example plan)
Let’s say:
Model: 70B parameters, d_model = 8192, layers = 80, heads = 64.
Sequence length L: 4,096 tokens.
Global batch (tokens/step): 4,096 sequences × 4,096 tokens ≈ 16.8M tokens/step (illustrative).
Total training tokens: 1.4T (Chinchilla‑ish for 70B).
Rough steps: 1.4T / 16.8M ≈ 83,000 steps.
On large clusters (e.g., thousands of H100s), that’s weeks of training.
Memory reality:
Weights (BF16 ~2 bytes/param): 70B → ~140 GB just for weights.
With optimizer states (Adam: ~8–10 bytes/param before sharding), you need massive sharding.
That’s why ZeRO + tensor/pipeline parallel is non‑negotiable.
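The step count above, spelled out (all numbers illustrative):

seqs_per_step, seq_len = 4096, 4096
tokens_per_step = seqs_per_step * seq_len           # ≈ 16.8M tokens per optimizer step
total_tokens = 1.4e12                               # 1.4T training tokens
steps = total_tokens / tokens_per_step
print(f"{tokens_per_step/1e6:.1f}M tokens/step, ~{steps:,.0f} steps")   # ~83,000 steps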
2.3 A tiny, neat, concrete training example (so it “clicks”)
Goal: show exactly how embeddings and weights move for a toy CLM.
Toy world:
Vocab |V| = 10 tokens: [cat, dog, sat, on, the, mat, is, happy, sad, END]
d_model = 4, 1 layer, 1 head (keep it tiny).
Data (14 tokens total):
cat sat on the mat END dog is happy END cat is sad END
Parameters:
Embedding table E: [|V|, d_model] = [10, 4] (~40 params)
Attention + MLP weights (~200 more)
Output projection: [d_model, |V|] = [4, 10] (40 params)
→ ~272 parameters total.
One training step (predict token 2 from 0..1):
Sequence: cat (0), sat (1), on (2), ...
Inputs: tokens [cat, sat] → embeddings → through the layer → output logits for the next token.
True next token is "on".
Suppose the model predicted: P("on") = 0.20, P("the") = 0.10, P("mat") = 0.05, ...
Loss (cross‑entropy) is high because P("on") is low.
Backprop:
Increases weights that would push "on" higher next time.
Nudges the "cat" and "sat" embeddings so that, together, they favor continuing with "on".
Repeat for all positions in all sequences.
After many passes, you’ll see the toy model generate:
cat sat on the mat END
dog is happy END
cat is sad END
It learned the transition regularities purely from next‑token prediction.
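If you want to watch this happen, here is a runnable miniature of the toy setup, assuming PyTorch. The architecture is deliberately simplified (one single-head attention layer, no MLP or normalization), so the exact parameter count differs from the ~272 above; the training dynamics are the point.

import math
import torch
import torch.nn.functional as F

vocab = ["cat", "dog", "sat", "on", "the", "mat", "is", "happy", "sad", "END"]
stoi = {w: i for i, w in enumerate(vocab)}
corpus = "cat sat on the mat END dog is happy END cat is sad END".split()
data = torch.tensor([stoi[w] for w in corpus])               # 14 token IDs

V, d_model = len(vocab), 4

class TinyCLM(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = torch.nn.Embedding(V, d_model)            # the [10, 4] embedding table
        self.q = torch.nn.Linear(d_model, d_model, bias=False)
        self.k = torch.nn.Linear(d_model, d_model, bias=False)
        self.v = torch.nn.Linear(d_model, d_model, bias=False)
        self.out = torch.nn.Linear(d_model, V, bias=False)   # output projection [4, 10]

    def forward(self, ids):                                  # ids: [L]
        x = self.emb(ids)                                    # [L, d_model]
        q, k, v = self.q(x), self.k(x), self.v(x)
        att = (q @ k.T) / math.sqrt(d_model)                 # [L, L]
        mask = torch.triu(torch.ones(len(ids), len(ids), dtype=torch.bool), 1)
        att = att.masked_fill(mask, float("-inf")).softmax(-1)   # causal attention weights
        return self.out(x + att @ v)                         # residual + logits over the vocab

model = TinyCLM()
opt = torch.optim.AdamW(model.parameters(), lr=0.03)
for step in range(1000):                                     # many passes over the tiny corpus
    logits = model(data[:-1])                                # each position predicts the next token
    loss = F.cross_entropy(logits, data[1:])
    opt.zero_grad(); loss.backward(); opt.step()

ids = [stoi["cat"]]                                          # greedy generation from "cat"
while vocab[ids[-1]] != "END" and len(ids) < 8:
    ids.append(model(torch.tensor(ids))[-1].argmax().item())
print(" ".join(vocab[i] for i in ids))                       # e.g. "cat sat on the mat END"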
This tiny example is exactly what happens at frontier scale, just with:
10 → 100k vocab
272 → 70B+ parameters
14 → 1T+ training tokens
1 GPU → thousands of GPUs
3) Internals that matter (and why answers are fast)
3.1 Why transformers beat RNNs for this
Attention can connect any two positions directly (no vanishing memory).
It’s parallelizable: whole sequences processed at once (prefill).
3.2 KV cache (the speed trick)
During generation, don’t reprocess the entire prompt.
Store past K/V for each layer once.
Each new token attends only against those cached K/V.
Complexity drops from quadratic (prefill) to almost linear per token (decode).
3.3 FlashAttention, fused kernels, and quantization
FlashAttention: reduces memory movement → massive speedup.
Fused ops: fewer kernel launches, better GPU occupancy.
Quantization (int8/int4): smaller weights/activations → faster, lower memory.
3.4 Serving at scale
Continuous batching/paged attention (vLLM‑style): mix many user requests, keep GPUs hot.
Speculative decoding: small draft model generates k tokens; big model verifies & accepts many at once.
GQA/MQA: cut KV memory by sharing K/V across heads.
Router/load balancing (for MoE or multi‑GPU sharding).
4) What your data “does” at inference (privacy, persistence)
During your call: your tokens/embeddings live in RAM/VRAM; the model’s weights are frozen.
Afterwards: behavior depends on product settings:
Enterprise: often no training on your data, limited retention.
Consumer: some providers retain logs for quality unless you opt out.
Fine‑tuning: only when you explicitly provide data for training do weights change.
Key truth: your single prompt does not imprint on the base model during normal use.
5) Where translation, reasoning, and tools come from (internally)
Translation: One shared subword/byte vocab across languages; parallel corpora during pretraining; “Translate X→Y:” is just conditioning.
Reasoning: Chain‑of‑thought patterns and multi‑step problem texts in pretraining produce emergent hierarchical skills; decoding settings (temperature, top‑p) affect style.
Tool use/functions: The model is trained (SFT) to emit structured JSON‑like “function calls”; the runtime executes tools and feeds results back into the context. Internally it’s still next‑token prediction over a grammar‑constrained format.
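An illustrative (not provider-specific) shape of that loop:

import json

model_output = '{"tool": "get_order_status", "arguments": {"order_id": "A1234"}}'  # hypothetical emission
call = json.loads(model_output)                   # runtime parses the structured call
result = {"status": "delayed", "eta_days": 3}     # runtime executes the tool (stubbed here)
# The result is serialized back into the context; the model keeps predicting next tokens,
# now conditioned on it.
next_context_chunk = "TOOL_RESULT " + json.dumps(result)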
6) Failure modes (and why they happen)
Hallucinations: objective is “most likely next token,” not “verified truth.” Without retrieval or grounding, it can choose fluent fiction.
Lost in the middle: attention may overweight the beginning and end; mitigations include windowing and special “sink” tokens.
Prompt injection: if external content is injected (via RAG/tools), the model may follow malicious instructions. Needs input sanitization, allow‑lists, and safety prompts.
7) Concrete memory & cost math (so it’s not hand‑wavy)
Assume:
d_model = 8192, heads = 64, head_dim = 128, layers = 80, L = 8k tokens, fp16.
KV per token per layer: heads * head_dim * 2 (K+V) * 2 bytes = 64 * 128 * 2 * 2 = 32,768 bytes = 32 KB.
Per layer (L = 8k): 32 KB * 8,192 ≈ 256 MB.
All layers: 256 MB * 80 ≈ 20.5 GB per sequence (why we need GQA/MQA, quantization, and paged attention; real systems reduce this by 4–8×+ and share across batched requests).
Weights (BF16 ~2 bytes/param): 70B → ~140 GB (before tensor/pipeline sharding).
Training optimizer states: ~6–8× weights unless sharded (ZeRO).
This is why serving and training both need distributed systems.
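The weight/optimizer numbers above, as quick arithmetic (the optimizer-state overhead depends on the exact mixed-precision recipe, so 12 bytes/param is an assumption):

params = 70e9
weights_gb = params * 2 / 1e9            # BF16 weights: 2 bytes/param ≈ 140 GB
opt_states_gb = params * 12 / 1e9        # e.g. fp32 master copy + Adam m and v ≈ 12 bytes/param
print(f"weights ≈ {weights_gb:.0f} GB, optimizer states ≈ {opt_states_gb:.0f} GB before ZeRO sharding")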
8) Minimal pseudo‑code for a training step (single GPU, toy)
# X: batch of token IDs [B, L]
# model: embeddings + N transformer layers + output head
# targets are just X shifted by one position (next-token prediction)

# 1) Embed and add positional info
H = embed(X)                    # [B, L, d_model]

# 2) Forward through N layers
for layer in model.layers:
    H = layer(H)                # masked self-attn + MLP + residuals

# 3) Output logits for each position
logits = out_proj(H)            # [B, L, |V|]

# 4) Compute loss (cross-entropy over all positions)
loss = cross_entropy(logits[:, :-1, :], X[:, 1:])   # position i predicts token i+1

# 5) Backprop + update
loss.backward()
optimizer.step()
optimizer.zero_grad()
Scale this up with data/tensor/pipeline parallelism, ZeRO sharding, gradient accumulation, activation checkpointing.
9) A complete “training day in the life”
Data mix refresh arrives (web/code/books multilingual).
Preprocessor runs: dedupe, language ID, toxicity filter, PII redaction, license checks, URL reputation, boilerplate removal.
Tokenizer converts text to token IDs; packer fills 4k–8k sequences tightly.
Scheduler dispatches shards to GPU workers; each worker loads a microbatch.
Forward pass with BF16 mixed precision; checkpoints activations to save memory.
Backward pass computes grads; ZeRO shards and reduces across nodes.
Optimizer updates; LR follows a cosine decay; gradients clipped if needed.
Metrics logged; bad spikes trigger auto‑rollback or reduced LR.
After N steps, eval job runs held‑out perplexity + task suites; quality dashboards update.
At milestones, a checkpoint is persisted; infra replicates it to multiple regions.
For chat assistants, SFT + preference optimization fine‑tunes a copy of the base.
Serving deploy uses quantized weights, FlashAttention kernels, paged attention, and speculative decoding. Canary first, then full rollout.
10) What to remember (the essence)
Inside view at inference:
tokens → embeddings → [attention+MLP]×N → logits → softmax → next token (repeat)
with KV cache so it’s fast.
Inside view at training:
same forward pass, then compare to the true next token, compute loss, backprop, update weights. Do this billions of times on trillions of tokens.
Why it seems to “understand”:
embeddings + attention discover statistical structure of language/world from massive data. What looks like reasoning is often pattern completion plus real generalization.
Why it’s fast:
GPU‑optimized matmuls, KV cache, quantization, fused kernels, smart batching.
Why it sometimes fails:
objective is plausibility, not truth; guard with retrieval, citations, and safety rails.