the annotated microgpt
TL;DR: MicroGPT is ~200 lines of pure Python that captures the complete algorithm behind GPT: a custom autograd engine, character tokenizer, multi-head causal attention, MLP, and SGD training loop. This post simplifies it a little and annotates every section from scratch.
Last post I discussed what happens when you press ‘submit’ on chatgippity; this post covers the actual code for it in the simplest way possible.
ChatGPT and the script below share the exact same algorithm. Scale and engineering differ; the math does not. ChatGPT runs on thousands of H100s trained on the internet. This runs on your laptop in a few minutes, trained on random names.
Prerequisites: know what a gradient is. If not: read neural networks, the math, and the PyTorch layer first. This builds on all three.
Core idea: A language model has one job: predict the next token given the context so far. Everything in this post serves that goal.
1. the data (lines 1-23)
The dataset: 32,000 human first names from Karpathy’s makemore project. Instead of Byte Pair Encoding (which is what gippity uses as its tokenizer) we use a character-level tokenizer - the alphabet maps to IDs 0-25, and one special token sits at ID 26: BOS (beginning-of-sequence). Vocab size = 27.
import math, random, urllib.request, os  # math/random are used later (Value.log, weight init, sampling)

url = "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt"
if not os.path.exists("names.txt"):
    urllib.request.urlretrieve(url, "names.txt")

docs = open("names.txt").read().splitlines()  # list of ~32k names

# "".join(docs) concatenates every name into one giant string,
# set() removes duplicates, sorted() gives us a stable alphabetical order
chars = sorted(set("".join(docs)))  # ['a','b',...,'z'] - 26 chars
BOS = len(chars)                    # ID 26
vocab_size = len(chars) + 1         # 27
BOS doubles as end-of-name: every name gets wrapped as [BOS, c1, c2, ..., cn, BOS]. When the model emits BOS during inference, that’s the stop signal (here EOS and BOS are the same)
Tokenization for “emma”:
Training targets are one step ahead. The model sees previous input tokens [:-1] and must predict [1:] at each position (next token). “emma” generates 5 training pairs:
doc = "emma"
char_tokens = []
for c in doc:
    char_tokens.append(chars.index(c))
tokens = [BOS] + char_tokens + [BOS]
# tokens = [26, 4, 12, 12, 0, 26]
input_tokens = tokens[:-1]   # [26, 4, 12, 12, 0]
target_tokens = tokens[1:]   # [4, 12, 12, 0, 26]
At position t, the model sees tokens[0..t] and must predict tokens[t+1]. No labels required (the data labels itself)
BOS doubling as EOS works because the model learns from examples: every name in the training data ends with BOS. After 20 steps it starts producing BOS as the termination signal naturally.
2. the autograd engine (lines 25-121)
Training means finding parameters that minimize loss. Gradient descent needs dLoss/d(every parameter). For 4,192 parameters, that’s 4,192 separate derivative calculations per step. Autograd computes all of them by tracking every operation and reversing it.
the Value class
Every scalar is a Value node. It stores four things: the raw data, the accumulated grad, the _children that produced it, and local_grads - d(this)/d(each child), stored as a tuple in the same order as _children. Each operation captures its derivative formula at construction time.
# Python "dunder" methods like __add__ let you define what happens when
# you write a + b: Python calls a.__add__(b) under the hood.
# This means Value objects can participate in normal arithmetic expressions.
class Value:
    def __init__(self, data, _children=(), local_grads=()):
        self.data = data
        self.grad = 0                   # dLoss/d(this) - accumulated by backward()
        self._children = _children      # tuple of Value nodes that produced this one
        self.local_grads = local_grads  # d(this)/d(each child) - same order

    # d(a+b)/da = 1, d(a+b)/db = 1
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, _children=(self, other))
        out.local_grads = (1.0, 1.0)  # gradient of sum w.r.t. each input is 1
        return out

    # d(a*b)/da = b, d(a*b)/db = a
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, _children=(self, other))
        out.local_grads = (other.data, self.data)  # cross-multiply
        return out

    # d(x^n)/dx = n * x^(n-1)
    def __pow__(self, other):
        return Value(self.data**other, (self,), (other * self.data**(other-1),))

    # d(log x)/dx = 1/x
    def log(self):
        return Value(math.log(self.data + 1e-10), (self,), (1/(self.data + 1e-10),))

    # d(exp x)/dx = exp(x)
    def exp(self):
        return Value(math.exp(self.data), (self,), (math.exp(self.data),))

    # d(relu x)/dx = 1 if x>0, else 0
    def relu(self):
        return Value(max(0, self.data), (self,), (float(self.data > 0),))

    # All convenience ops reduce to the primitives above
    def __neg__(self): return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other**-1

    # __radd__, __rmul__ etc. are "reflected" operators.
    # Python calls __radd__(b, a) when a.__add__(b) fails (e.g. int + Value).
    # Without these, `3 * some_value` would raise TypeError.
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other
Convenience operators (__neg__, __sub__, __truediv__) add no new gradient rules. They reduce to the three primitives. Backprop traces through them without any extra code.
backward()
This method goes inside the Value class you defined above, as a peer to __add__, __mul__, etc.
Topological sort first (DFS post-order): every node appears after its children. Then reverse the order - that’s the backward order. Seed the root’s gradient to 1 (dLoss/dLoss = 1). Walk in reverse: at each node, multiply its local_grads by its own grad and add to children. This is the chain rule in calculus.
backward() propagates gradients through the ENTIRE computation graph in one pass - not layer by layer, but through every node that participated in computing the loss.
    def backward(self):
        topological_graph = []
        visited = set()
        def build_top(v):
            if v not in visited:
                visited.add(v)
                for child in v._children:
                    build_top(child)         # recurse into children first
                topological_graph.append(v)  # append AFTER children = post-order
        build_top(self)
        self.grad = 1  # dLoss/dLoss = 1
        for v in reversed(topological_graph):  # process root before leaves
            for child, local_grad in zip(v._children, v.local_grads):
                child.grad += local_grad * v.grad  # chain rule
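As a sanity check, the chain rule can be verified on a tiny expression. This sketch uses a stripped-down copy of the Value class above (add, mul, and backward only) so it runs standalone:

```python
class Value:
    # stripped-down copy of the class above: add, mul, backward only
    def __init__(self, data, _children=(), local_grads=()):
        self.data, self.grad = data, 0
        self._children, self.local_grads = _children, local_grads

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1
        for v in reversed(topo):
            for child, lg in zip(v._children, v.local_grads):
                child.grad += lg * v.grad

# f(a, b) = a*b + b  =>  df/da = b = 3, df/db = a + 1 = 3
a, b = Value(2.0), Value(3.0)
f = a * b + b
f.backward()
print(a.grad, b.grad)  # 3.0 3.0
```

Note that b.grad accumulates contributions from both places b appears in the graph (the multiply and the add) - exactly the += accumulation backward() relies on.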
3. hyperparameters and weight init (lines 124-160)
n_embed = 16 # every vector in the model has this length
n_head = 4 # attention heads (each sees head_dim=4 dimensions)
n_layer = 1 # one transformer block
block_size = 16 # max sequence length
head_dim = n_embed // n_head # = 4
Weights are initialized from a small Gaussian (sigma=0.08) to prevent exploding activations: if weights start large, each layer multiplies activations by a large number and values overflow, so we start small and learn from there.
def matrix(rows, cols):
    # Small Gaussian init (sigma=0.08) prevents activation explosion
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(Value(random.gauss(0, 0.08)))
        result.append(row)
    return result
The full parameter list:
All six matrices are stored as lists so each transformer block gets its own independent copy:
wte = matrix(vocab_size, n_embed)  # 27 x 16 token embeddings
wpe = matrix(block_size, n_embed)  # 16 x 16 position embeddings

wq = []
wk = []
wv = []
wo = []
w_fc1 = []
w_fc2 = []
for i in range(n_layer):
    wq.append(matrix(n_embed, n_embed))  # 16 x 16
    wk.append(matrix(n_embed, n_embed))  # 16 x 16
    wv.append(matrix(n_embed, n_embed))  # 16 x 16
    wo.append(matrix(n_embed, n_embed))  # 16 x 16
    w_fc1.append(matrix(n_embed * 4, n_embed))  # 64 x 16 (output_dim, input_dim - hence n*4, n and not the other way around)
    w_fc2.append(matrix(n_embed, n_embed * 4))  # 16 x 64
lm_head = matrix(vocab_size, n_embed)  # 27 x 16
Every Value node gets collected into one flat list so the SGD loop can update all 4,192 parameters in a single pass:
def flatten(mat):
    result = []
    for row in mat:
        for val in row:
            result.append(val)
    return result

params = flatten(wte) + flatten(wpe)
for i in range(n_layer):
    params += flatten(wq[i]) + flatten(wk[i]) + flatten(wv[i]) + flatten(wo[i])
    params += flatten(w_fc1[i]) + flatten(w_fc2[i])
params += flatten(lm_head)
print(f"num params: {len(params)}")  # 4192
This is what makes for param in params: param.data -= lr * param.grad in the training loop work (all parameters from all layers in one flat iterable)
Embeddings are vectors, not integer IDs, because vectors capture similarity. ‘anna’ and ‘ana’ share characters; their embeddings should be close in space. An integer ID for ‘a’ (=0) has no geometric relationship to ‘n’ (=13). A learned 16-dim vector does.
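To make "close in space" concrete, here is a cosine-similarity sketch. The vectors are made up 3-dim examples, not the model's learned embeddings:

```python
import math

def cosine(u, v):
    # cosine similarity: dot product normalized by vector lengths
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v)

# hypothetical embeddings: two similar characters vs. a dissimilar one
a1 = [0.9, 0.1, 0.3]
a2 = [0.8, 0.2, 0.3]
n  = [-0.5, 0.7, -0.2]

print(round(cosine(a1, a2), 2))  # close to 1: similar directions
print(round(cosine(a1, n), 2))   # negative: pointing away
```

Integer IDs have no such notion of distance; learned vectors get it for free.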
4. utility functions (lines 163-186)
linear: matrix-vector multiply. This is the forward pass of a single layer.
def linear(x, w):
    # output[i] = dot(w[i], x) for each row of the weight matrix
    result = []
    for row in w:  # each row = one output neuron's weights
        total = Value(0.0)
        for weight, inp in zip(row, x):  # zip pairs each weight with its input
            total = total + weight * inp  # accumulate dot product
        result.append(total)
    return result
softmax: converts logits to probabilities. The numerical stability trick:
\[\frac{e^{x_i - c}}{\sum_j e^{x_j - c}} = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
The constant $c$ cancels, so subtract the max before exponentiating to avoid overflow.
def softmax(logits):
    max_logit = max(l.data for l in logits)  # stability: subtract max before exp
    exp_logits = []
    for logit in logits:
        exp_logits.append((logit - max_logit).exp())
    sum_exp = Value(0.0)
    for e in exp_logits:
        sum_exp = sum_exp + e
    result = []
    for e in exp_logits:
        result.append(e / sum_exp)
    return result
def rmsnorm(x):
    # Normalize by root-mean-square:
    #   mean_sq = average of (xi squared)
    #   rms = sqrt(mean_sq)
    #   output = xi / rms
    eps = 1e-8
    sq_total = Value(0.0)
    for xi in x:
        sq_total = sq_total + xi ** 2
    mean_sq = sq_total / len(x)
    rms = (mean_sq + eps) ** 0.5
    result = []
    for xi in x:
        result.append(xi / rms)
    return result
After softmax, the output is a probability distribution - non-negative values that sum to exactly 1.0. In attention_block this turns raw dot-product scores into attention weights. In the training loop it turns the model’s final logits into token probabilities.
Normalization stabilizes training by keeping activations in a reasonable range; nothing fancy here.
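A quick numeric check with plain floats shows what rmsnorm guarantees - direction is preserved, scale is normalized:

```python
import math

x = [3.0, -4.0]  # toy activations
rms = math.sqrt(sum(v * v for v in x) / len(x))  # sqrt((9 + 16) / 2) = sqrt(12.5)
normed = [v / rms for v in x]

mean_sq = sum(v * v for v in normed) / len(normed)
print(mean_sq)  # ~1.0: the normalized vector has unit root-mean-square
```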
5. attention (lines 189-235)
the problem attention solves
An MLP processes each position independently. Token at position 3 has no access to position 0. But predicting the ‘a’ in “emm_” is easier with the full context. Attention gives each token a channel to read from every past position.
Q, K, V
Before computing Q, K, V, ask: who should each token pay attention to? A token representing ‘m’ in position 2 of ‘emma’ should probably attend to position 1 (‘e’) more than position 0 (BOS). Q, K, V give the model the machinery to figure that out from content.
Each token projects its embedding through three learned matrices to produce three vectors:
- Q (query): “what am I looking for?”
- K (key): “what do I advertise?”
- V (value): “what do I carry?”
Formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$
This code implements exactly that, one scalar at a time.
causal masking
score(t, s) = dot(Q[t], K[s]) / sqrt(head_dim). The dot product measures alignment: if the head's slices of Q[t] and K[s] (4-dim each, since head_dim=4) point in similar directions, their dot product is large. A large score means a high attention weight, which means s’s value gets pulled into t’s output. Basically the model routes information based on what the query is asking and what the key is advertising.
Without the sqrt(head_dim) scaling, dot products grow with the head dimension. At high dimension, a few scores dominate, softmax saturates (one weight goes to 1.0, the rest go to 0), and gradients vanish. The scaling keeps the score variance roughly constant regardless of dimensionality.
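A small Monte Carlo sketch (random Gaussian vectors, not the model's actual Q/K) shows what the scaling buys: the variance of a raw dot product grows like the dimension d, while the scaled version stays near 1 at every d:

```python
import math, random

random.seed(0)

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

results = {}
for d in (4, 64, 256):
    samples = []
    for _ in range(2000):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        samples.append(dot(q, k) / math.sqrt(d))  # scaled score
    results[d] = sum(s * s for s in samples) / len(samples)
print(results)  # each variance hovers around 1.0, independent of d
```

Drop the division by sqrt(d) and the variances would come out near 4, 64, and 256 instead.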
Without the causal mask every position can see all future tokens (the model learns to copy from the future rather than predict it; cheating)
Result? Remove range(t+1) and replace with range(len(x_seq)) (which means seeing all tokens) and the model immediately hits zero loss on training data while failing completely on generation.
Causal mask: the loop runs s in range(t+1), so future tokens never exist in the score list. No explicit mask matrix (just don’t compute the scores)
This is O(T²) per head per layer. For T=16: at most 16×16=256 dot products per head (fewer with the causal mask), ~1024 total. Fine here. At T=100K (long-context models) the cost explodes, which is why algorithms like FlashAttention exist.
multi-head
Split Q, K, V into n_head=4 slices of head_dim=4 each. Run attention independently per head. Heads specialize: one might track character repetition (“mm” in “emma”), another common name endings, another positional distance. This comes from training. Concatenate the 4 outputs back to 16 dims, then project through wo.
The output of each attention head is a weighted average of value vectors: head_out[d] = sum over s of weights[s] * V[s][d].
Attention is just a learnable weighted sum.
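Here is that weighted sum on toy numbers - one query against two positions, single head, made-up plain-float values:

```python
import math

q  = [1.0, 0.0]                    # query for the current token
ks = [[1.0, 0.0], [0.0, 1.0]]      # keys for positions 0 and 1
vs = [[10.0, 0.0], [0.0, 10.0]]    # values for positions 0 and 1
d  = len(q)

# scaled dot-product scores, then softmax, then weighted sum of values
scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
m = max(scores)
exps = [math.exp(s - m) for s in scores]
weights = [e / sum(exps) for e in exps]
out = [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(d)]

print([round(w, 2) for w in weights])  # position 0 wins: its key aligns with q
print([round(o, 2) for o in out])      # output is pulled toward vs[0]
```

The output blends both value vectors, weighted by how well each key matched the query - the whole mechanism in four lines.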
the residual connection
After attention: x = x + attn_out. This is the skip connection from ResNets. Without it, gradients multiply through every layer and shrink toward zero before reaching early parameters. With it, the gradient path from loss to input includes a direct addition - the derivative of x + f(x) w.r.t. x is 1 + df/dx, so the gradient is never zero even if f learns nothing. See attention-residuals.
def attention_block(x_seq, wq_i, wk_i, wv_i, wo_i):
    Q = []
    for x in x_seq:
        Q.append(linear(x, wq_i))
    K = []
    for x in x_seq:
        K.append(linear(x, wk_i))
    V = []
    for x in x_seq:
        V.append(linear(x, wv_i))
    out_seq = []
    for t in range(len(x_seq)):
        x_attn = []
        for h in range(n_head):
            start = h * head_dim
            end = (h + 1) * head_dim
            qh = Q[t][start:end]
            scores = []
            for s in range(t + 1):  # causal: no peeking at future
                score = Value(0.0)
                for qi, ki in zip(qh, K[s][start:end]):
                    score = score + qi * ki
                score = score / math.sqrt(head_dim)  # scale to stabilize softmax
                scores.append(score)
            weights = softmax(scores)
            head_out = []
            for d in range(head_dim):
                # weighted sum of values across attended positions
                val = Value(0.0)
                for s in range(t + 1):
                    val = val + weights[s] * V[s][start + d]
                head_out.append(val)
            for elem in head_out:
                x_attn.append(elem)
        out_seq.append(linear(x_attn, wo_i))  # mix across heads
    return out_seq
6. the MLP (lines 238-248)
Attention mixes information across positions. MLP processes each position’s vector independently. Expand 16 -> 64 (4x width), apply ReLU, project back 64 -> 16.
The logic is simple: the wider hidden layer gives the model more capacity to transform each token’s representation before projecting back down.
MLP = per-token compute. Attention = cross-token communication. The transformer alternates between these two operations: attention lets tokens talk to each other, MLP lets each token process what it heard.
In larger models, the MLP layers are where most factual associations are stored - things like ‘anna’ implies female name. Attention routes information; MLP transforms it.
def mlp_block(x, w_fc1_i, w_fc2_i):
    x = linear(x, w_fc1_i)  # 16 -> 64: project to wider space
    relu_out = []
    for xi in x:
        relu_out.append(xi.relu())  # non-linearity (ReLU, not GELU)
    x = linear(relu_out, w_fc2_i)  # 64 -> 16: project back
    return x
7. the full forward pass (lines 251-271)
Positional embeddings are not optional. The attention mechanism is permutation equivariant: without position information, ‘emma’ and ‘mame’ produce identical representations because attention treats the set of tokens the same regardless of order. wpe encodes order so the model can distinguish position 0 from position 3.
def gpt(tokens):
    # 1. Build input embeddings: add token embedding + position embedding
    # enumerate(tokens) gives (0, tok0), (1, tok1), ...
    # wte[tok] and wpe[pos] are each 16-dim lists; we add them element by element
    x_seq = []
    for pos, tok in enumerate(tokens):
        emb = []
        for te, pe in zip(wte[tok], wpe[pos]):
            emb.append(te + pe)
        x_seq.append(emb)
    # x_seq is now a list of T vectors, each of length 16
    for i in range(n_layer):
        # 2. Normalize, run attention, add residual (x = x + attn_out)
        normed = []
        for x in x_seq:
            normed.append(rmsnorm(x))
        attn_out = attention_block(normed, wq[i], wk[i], wv[i], wo[i])
        new_x_seq = []
        for attn, x in zip(attn_out, x_seq):
            combined = []
            for a, b in zip(attn, x):  # add skip connection
                combined.append(a + b)  # new_x[d] = attn_out[d] + old_x[d]
            new_x_seq.append(combined)
        x_seq = new_x_seq
        # 3. Normalize, run MLP, add residual (same pattern)
        new_x_seq = []
        for x in x_seq:
            mlp_out = mlp_block(rmsnorm(x), w_fc1[i], w_fc2[i])
            combined = []
            for a, b in zip(mlp_out, x):
                combined.append(a + b)
            new_x_seq.append(combined)
        x_seq = new_x_seq
    # 4. Project each position's 16-dim vector to 27 logits (one per vocab token)
    logits_seq = []
    for x in x_seq:
        logits_seq.append(linear(x, lm_head))
    return logits_seq
Shape walkthrough for input “em” (2 tokens, e=4, m=12):
tokens = [26, 4, 12] # BOS, 'e', 'm' shape: [3]
x_seq = [[...], [...], [...]] # after wte+wpe shape: [3, 16]
normed = [[...], [...], [...]] # after rmsnorm shape: [3, 16]
attn_out = [[...], [...], [...]] # after attention shape: [3, 16]
x_seq (residual) = x_seq + attn_out shape: [3, 16]
mlp_out = [[...], [...], [...]] # after mlp shape: [3, 16]
x_seq (residual) = x_seq + mlp_out shape: [3, 16]
logits = [[...], [...], [...]] # after lm_head shape: [3, 27]
Each token and its position each contribute a learned 16-dim vector. The two vectors are added element by element.
The result is the model’s complete representation of “this token at this position.”
Pre-norm (RMSNorm before each sub-layer) vs post-norm (after): pre-norm keeps the residual stream unmodified by the norm. The original Transformer used post-norm; GPT-2 switched to pre-norm after finding it more stable at scale. LLaMA kept pre-norm with RMSNorm.
Residual connections let the model start from identity (pass x through unchanged) and learn deltas. The derivative of x + f(x) with respect to x is 1 + df/dx, so the gradient always has a clean path back through the 1 term; without it, gradients vanish through many layers.
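A finite-difference check of the 1 + df/dx claim, with a made-up quadratic standing in for the sub-layer:

```python
def f(x):
    return 0.5 * x * x   # stand-in sub-layer (not the real attention/MLP)

def residual(x):
    return x + f(x)

x, h = 3.0, 1e-6
numeric = (residual(x + h) - residual(x - h)) / (2 * h)  # central difference
analytic = 1 + x   # 1 + df/dx, since df/dx = x here
print(numeric, analytic)  # both ~= 4.0: the "1" term guarantees gradient flow
```

Even if f were flat (df/dx = 0), the gradient through the residual would still be 1.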
The model is differentiable. Every operation (embeddings, attention scores, softmax, weighted sum, MLP, final projection) is composed of Value nodes. loss.backward() computes dLoss/d(every weight) in one pass.
8. training loop (lines 274-318)
loss: cross-entropy
Cross-entropy loss = -log(p[target]).
If the model assigns 90% probability to the correct token: loss = -log(0.9) = 0.11.
If it assigns 1%: loss = -log(0.01) = 4.6.
The log penalizes confident wrong answers harshly and rewards confident correct ones.
Why log? Cross-entropy is the negative log-likelihood of the data under the model (log turns products of probabilities into sums, numerically stable and easier to differentiate)
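That product-to-sum property is easy to check numerically, assuming a toy sequence where the model assigned these (made up) probabilities to the correct tokens:

```python
import math

p = [0.9, 0.5, 0.8]  # per-token probabilities of the correct targets

# likelihood of the whole sequence is the product of probabilities;
# taking -log turns the product into a sum of per-token cross-entropy losses
nll_product = -math.log(p[0] * p[1] * p[2])
nll_sum = sum(-math.log(pi) for pi in p)
print(abs(nll_product - nll_sum) < 1e-9)  # True: same quantity
```

Summing small losses is numerically far better behaved than multiplying many probabilities that shrink toward zero.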
backward + SGD
num_steps = 1000       # matches the training log below
batch_size = 8
learning_rate = 0.01   # illustrative value; not specified earlier in this post

for step in range(num_steps):
    total_loss = Value(0.0)
    for b in range(batch_size):
        # modulo wraps around so we cycle through all docs repeatedly
        doc = docs[(step * batch_size + b) % len(docs)]
        # chars.index(c) converts each character to its integer ID
        char_tokens = []
        for c in doc:
            char_tokens.append(chars.index(c))
        tokens = [BOS] + char_tokens + [BOS]
        tokens = tokens[:block_size + 1]  # truncate if name > 16 chars
        input_tokens = tokens[:-1]   # model sees this
        target_tokens = tokens[1:]   # model must predict this
        logits_seq = gpt(input_tokens)
        loss = Value(0.0)
        for logits, target in zip(logits_seq, target_tokens):
            probs = softmax(logits)
            loss = loss + (-probs[target].log())  # negative log-likelihood
        total_loss = total_loss + loss / len(logits_seq)
    avg_loss = total_loss / batch_size
    avg_loss.backward()  # backprop through the ENTIRE Value graph
    # SGD: param -= lr * grad
    for param in params:
        param.data -= learning_rate * param.grad
        param.grad = 0  # must reset or gradients accumulate
    print(f"step {step+1} / {num_steps} | loss {avg_loss.data:.4f}")
Gradients accumulate by += in backward(). If you skip the zero step, stale gradients from step 1 add to fresh gradients from step 2, inflating the effective update and causing divergence. Zero every step.
Loss starts at ~3.3 (-log(1/27) ≈ 3.3, random model with 27 classes). It drops to ~2.37 over 20 steps. The model has seen 160 names (20 steps × 8 batch). Output at this point: names are phonetically plausible but rough - “aria”, “eman”, “marn”. Train for 500 steps and they get much better.
$ python train.py
num docs: 32033
vocab size: 27
num params: 4192
step 1 / 1000 | loss 3.3660
step 6 / 1000 | loss 2.9452
step 11 / 1000 | loss 2.7964
step 13 / 1000 | loss 3.0544
...
9. inference (lines 321-343)
Start from [BOS]. Run the full forward pass. Take logits at the last position. Apply temperature. Softmax. Sample. Append. Repeat until BOS comes back out.
Logits are unnormalized scores before softmax. A logit of 5.0 doesn’t mean ‘probability 5.0’ - it just means this token is more likely than one with logit 2.0. Softmax turns the relative differences into a probability distribution.
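One consequence worth verifying: shifting every logit by the same constant changes nothing after softmax - only the gaps matter. A plain-float sketch:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

a = softmax([5.0, 2.0])
b = softmax([3.0, 0.0])  # same gaps, every logit shifted down by 2
print(a)
print(b)  # identical distribution
```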
That’s all autoregressive generation is: run the model, get probabilities over next tokens, pick one, condition on it, repeat. The model never “sees” what it’s about to generate - it only sees what it has generated so far.
Temperature is logit / T before softmax. If the model’s top logit is 2.0 and second is 1.5:
- T=0.5: logits become 4.0 and 3.0 - gap widens, top token dominates
- T=1.0: unchanged
- T=2.0: logits become 1.0 and 0.75 - gap narrows, more even distribution
Temperature controls the entropy of the output distribution.
- T=0 gives argmax (always pick the most likely token, no diversity)
- T=1 samples from the raw model distribution
- T>1 flattens the distribution (more surprising outputs)
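Plugging the numbers from above into code (same toy logits, 2.0 and 1.5):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    return [e / sum(es) for e in es]

logits = [2.0, 1.5]
for T in (0.5, 1.0, 2.0):
    probs = softmax([l / T for l in logits])
    print(T, [round(p, 3) for p in probs])
# 0.5 [0.731, 0.269]   <- sharper: top token dominates
# 1.0 [0.622, 0.378]   <- the raw model distribution
# 2.0 [0.562, 0.438]   <- flatter: more diversity
```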
Sampling instead of argmax gives diversity: two runs with the same context produce different names. This is why entering the same prompt twice in ChatGPT can give different results.
T=0.5 produces repetitive but valid looking names. T=1.5 produces novel but sometimes nonsensical ones.
Attention weights depend on the content of Q and K, not their positions. Position information enters only through the initial wpe embeddings. Once inside the transformer, everything is content-based.
Name generation:
temperature = 0.5
for _ in range(20):  # generate 20 names (matches the samples below)
    context = [BOS]
    name = ""
    for _ in range(block_size):
        logits_seq = gpt(context)
        # apply temperature: divide logits before softmax
        # lower temperature = model picks safer, more common tokens
        logits_t = []
        for l in logits_seq[-1]:
            logits_t.append(l / temperature)
        probs = softmax(logits_t)
        # random.choices samples one item from range(vocab_size)
        # weighted by probability - higher prob tokens get picked more often
        # [0] because random.choices returns a list; we want the single item
        prob_data = []
        for p in probs:
            prob_data.append(p.data)
        next_token = random.choices(range(vocab_size), weights=prob_data)[0]
        if next_token == BOS:  # BOS doubles as end-of-name sentinel
            break
        context.append(next_token)
        name += chars[next_token]
    print(name)
Running this after 1000 training steps:
sample 1: kamon
sample 2: ann
sample 3: karai
sample 4: jaire
sample 5: vialan
sample 6: karia
sample 7: yeran
sample 8: anna
sample 9: areli
sample 10: kaina
sample 11: konna
sample 12: keylen
sample 13: liole
sample 14: alerin
sample 15: earan
sample 16: lenne
sample 17: kana
sample 18: lara
sample 19: alela
sample 20: anton
The model has never seen most of these; it learned the pattern (vowel/consonant rhythm, common endings like -a, -an, -ina) from examples in 1000 steps.
10. from microgpt to chatgpt
| microGPT | ChatGPT |
|---|---|
| 32K names | trillions of internet tokens |
| 27-token char vocab | ~50K BPE subword tokens |
| 4,192 params, 1 layer | 175B params, 96 layers |
| single CPU | thousands of H100s |
| vanilla SGD | Adam + LR schedule + grad clipping |
| no post-training | SFT + RLHF + safety alignment |
| ReLU | GELU |
| RMSNorm | LayerNorm (GPT-2) / RMSNorm (LLaMA) |
Same algorithm. Scale and engineering differ.
The core attention mechanism is from the Attention Is All You Need paper (2017) and hasn’t changed. What changed between 2017 and GPT-4 is the data pipeline, post-training (SFT + RLHF), infrastructure, and compute budget.
references
- Karpathy’s makemore - the dataset and spirit
- Andrej Karpathy: The spelled-out intro to language modeling
- Jay Alammar: The Illustrated Transformer
- Jay Alammar: The Illustrated GPT-2
- Vaswani et al. (2017): Attention Is All You Need
- Own posts: neural networks, the math, tensor library, minitorch, attention + residuals