TL;DR: MicroGPT is ~200 lines of pure Python that captures the complete algorithm behind GPT: a custom autograd engine, character tokenizer, multi-head causal attention, MLP, and Adam training loop. This post simplifies it a little and annotates every section from scratch.

Last post I covered what happens when you press ‘submit’ on chatgippity; this post covers the actual code behind it, in the simplest way possible.

ChatGPT and the script below share the exact same algorithm. Scale and engineering differ; the math does not. ChatGPT runs on thousands of H100s trained on the internet. This runs on your laptop in a few minutes, trained on random names.

Prerequisites: know what a gradient is. If not: read neural networks, the math, and the PyTorch layer first. This builds on all three.

Core idea: A language model has one job: predict the next token given the context so far. Everything in this post serves that goal.


1. the data (lines 1-23)

The dataset: 32,000 human first names from Karpathy’s makemore project. Instead of BytePairEncoding (which is what gippity uses as its tokenizer) we use a character-level tokenizer - the alphabet maps to IDs 0-25, and one special token sits at ID 26: BOS (beginning-of-sequence). Vocab size = 27.

import urllib.request, os

url = "https://raw.githubusercontent.com/karpathy/makemore/master/names.txt"
if not os.path.exists("names.txt"):
    urllib.request.urlretrieve(url, "names.txt")
docs = open("names.txt").read().splitlines()  # list of ~32k names

# "".join(docs) concatenates every name into one giant string,
# set() removes duplicates, sorted() gives us a stable alphabetical order
chars = sorted(set("".join(docs)))   # ['a','b',...,'z'] - 26 chars
BOS = len(chars)                     # ID 26
vocab_size = len(chars) + 1          # 27

BOS doubles as end-of-name: every name gets wrapped as [BOS, c1, c2, ..., cn, BOS]. When the model emits BOS during inference, that’s the stop signal (here EOS and BOS are the same token).

Tokenization for “emma”:

[26, 4, 12, 12, 0, 26]   # BOS, e=4, m=12, m=12, a=0, BOS

Each character maps to an integer ID. BOS wraps the sequence.

Training targets are one step ahead. The model sees previous input tokens [:-1] and must predict [1:] at each position (next token). “emma” generates 5 training pairs:

  context              target
  [BOS]                e
  [BOS, e]             m
  [BOS, e, m]          m
  [BOS, e, m, m]       a
  [BOS, e, m, m, a]    BOS (end)

Each row is one training pair: the context grows left to right, and the target is the next token.
doc = "emma"
char_tokens = []
for c in doc:
    char_tokens.append(chars.index(c))
tokens = [BOS] + char_tokens + [BOS]
# tokens = [26, 4, 12, 12, 0, 26]

input_tokens  = tokens[:-1]  # [26, 4, 12, 12, 0]
target_tokens = tokens[1:]   # [4, 12, 12, 0, 26]

At position t, the model sees tokens[0..t] and must predict tokens[t+1]. No labels required (the data labels itself)

BOS doubling as EOS works because the model learns from examples: every name in the training data ends with BOS. After 20 steps it starts producing BOS as the termination signal naturally.
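The pair construction can be spelled out as a tiny helper (`training_pairs` is not part of the script, just an illustration of the slicing above):

```python
def training_pairs(tokens):
    # one (context, target) pair per position:
    # context = tokens[:t+1], target = tokens[t+1]
    return [(tokens[:t + 1], tokens[t + 1]) for t in range(len(tokens) - 1)]

tokens = [26, 4, 12, 12, 0, 26]   # BOS e m m a BOS
pairs = training_pairs(tokens)
# 5 pairs: the first predicts 'e' (4) from [BOS],
# the last predicts BOS (26) from the full name
```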


2. the autograd engine (lines 25-121)

Training means finding parameters that minimize loss. Gradient descent needs dLoss/d(every parameter). For 4,192 parameters, that’s 4,192 separate derivative calculations per step. Autograd computes all of them by tracking every operation and reversing it.

the Value class

Every scalar is a Value node. It stores four things:

For a=2, b=3, out=a*b=6: the out node stores data=6.0, grad=0.0, _children=(a, b), and local_grads=(3, 2), i.e. d(out)/d(a)=b=3 and d(out)/d(b)=a=2.

local_grads are stored as a tuple. Each operation captures its derivative formula at construction time.

# Python "dunder" methods like __add__ let you define what happens when
# you write a + b: Python calls a.__add__(b) under the hood.
# This means Value objects can participate in normal arithmetic expressions.

import math, random   # math for log/exp/sqrt below, random for weight init and sampling later

class Value:
    def __init__(self, data, _children=(), local_grads=()):
        self.data = data
        self.grad = 0            # dLoss/d(this) - accumulated by backward()
        self._children = _children  # tuple of Value nodes that produced this one
        self.local_grads = local_grads  # d(this)/d(each child) - same order

    # d(a+b)/da = 1, d(a+b)/db = 1
    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, _children=(self, other))
        out.local_grads = (1.0, 1.0)   # gradient of sum w.r.t. each input is 1
        return out

    # d(a*b)/da = b, d(a*b)/db = a
    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, _children=(self, other))
        out.local_grads = (other.data, self.data)  # cross-multiply
        return out

    # d(x^n)/dx = n * x^(n-1)
    def __pow__(self, other):
        return Value(self.data**other, (self,), (other * self.data**(other-1),))

    # d(log x)/dx = 1/x
    def log(self):
        return Value(math.log(self.data + 1e-10), (self,), (1/(self.data + 1e-10),))

    # d(exp x)/dx = exp(x)
    def exp(self):
        return Value(math.exp(self.data), (self,), (math.exp(self.data),))

    # d(relu x)/dx = 1 if x>0, else 0
    def relu(self):
        return Value(max(0, self.data), (self,), (float(self.data > 0),))

    # All convenience ops reduce to the primitives above
    def __neg__(self):   return self * -1
    def __sub__(self, other): return self + (-other)
    def __truediv__(self, other): return self * other**-1

    # __radd__, __rmul__ etc. are "reflected" operators.
    # Python calls __radd__(b, a) when a.__add__(b) fails (e.g. int + Value).
    # Without these, `3 * some_value` would raise TypeError.
    def __radd__(self, other): return self + other
    def __rmul__(self, other): return self * other

Convenience operators (__neg__, __sub__, __truediv__) add no new gradient rules. They reduce to the three primitives. Backprop traces through them without any extra code.

backward()

This method goes inside the Value class you defined above, as a peer to __add__, __mul__, etc.

Topological sort first (DFS post-order): every node appears after its children. Then reverse the order - that’s the backward order. Seed the root’s gradient to 1 (dLoss/dLoss = 1). Walk in reverse: at each node, multiply its local_grads by its own grad and add to children. This is the chain rule in calculus.

backward() propagates gradients through the ENTIRE computation graph in one pass - not layer by layer, but through every node that participated in computing the loss.

loss = a*b + c = 7. The forward pass builds the values; backward() seeds grad=1 at the loss, then the chain rule propagates to a (grad=3), b (grad=2), c (grad=1).
def backward(self):
    topological_graph = []
    visited = set()

    def build_top(v):
        if v not in visited:
            visited.add(v)
            for child in v._children:
                build_top(child)          # recurse into children first
            topological_graph.append(v)  # append AFTER children = post-order

    build_top(self)
    self.grad = 1                          # dLoss/dLoss = 1
    for v in reversed(topological_graph): # process root before leaves
        for child, local_grad in zip(v._children, v.local_grads):
            child.grad += local_grad * v.grad  # chain rule
Checkpoint: autograd - You now have a scalar autograd engine. Every arithmetic operation records its local gradient. backward() does one reverse-topological pass and applies the chain rule everywhere.
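A quick way to convince yourself backward() is correct: compare against finite differences. Here with plain floats on the loss = a*b + c example (no Value class needed):

```python
# Central finite differences: dL/dx ≈ (L(x+h) - L(x-h)) / 2h
def loss(a, b, c):
    return a * b + c

a, b, c, h = 2.0, 3.0, 1.0, 1e-6
fd_a = (loss(a + h, b, c) - loss(a - h, b, c)) / (2 * h)   # should be b = 3
fd_b = (loss(a, b + h, c) - loss(a, b - h, c)) / (2 * h)   # should be a = 2
fd_c = (loss(a, b, c + h) - loss(a, b, c - h)) / (2 * h)   # should be 1
```

The same trick (perturb one parameter, re-run the forward pass) works as a gradient check on the full model, just slowly: one forward pass per parameter versus one backward pass for all of them.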

3. hyperparameters and weight init (lines 124-160)

n_embed = 16    # every vector in the model has this length
n_head  = 4     # attention heads (each sees head_dim=4 dimensions)
n_layer = 1     # one transformer block
block_size = 16 # max sequence length
head_dim = n_embed // n_head  # = 4

Weight matrices are initialized with small random Gaussians. Small initialization (sigma=0.08) prevents exploding activations: if weights start large, each layer multiplies by a large number and values overflow, so we start small and learn from there.

def matrix(rows, cols):
    # Small Gaussian init (sigma=0.08) prevents activation explosion
    result = []
    for r in range(rows):
        row = []
        for c in range(cols):
            row.append(Value(random.gauss(0, 0.08)))
        result.append(row)
    return result

The full parameter list:

4,192 total parameters:

  • wte: 432 (27×16) - token embeddings
  • wpe: 256 (16×16) - position embeddings
  • wq+wk+wv+wo: 1,024 (4×16×16) - attention
  • w_fc1: 1,024 (64×16) - MLP up-projection
  • w_fc2: 1,024 (16×64) - MLP down-projection
  • lm_head: 432 (27×16) - output head

MLP and attention each hold 1,024 params. Embeddings and output head mirror each other at 432 each.

All six matrices are stored as lists so each transformer block gets its own independent copy:

wq   = []
wk   = []
wv   = []
wo   = []
w_fc1 = []
w_fc2 = []

for i in range(n_layer):
    wq.append(matrix(n_embed, n_embed))          # 16 x 16
    wk.append(matrix(n_embed, n_embed))          # 16 x 16
    wv.append(matrix(n_embed, n_embed))          # 16 x 16
    wo.append(matrix(n_embed, n_embed))          # 16 x 16
    w_fc1.append(matrix(n_embed * 4, n_embed))   # 64 x 16 (output_dim, input_dim hence it's n*4,n and not the other way around)
    w_fc2.append(matrix(n_embed, n_embed * 4))   # 16 x 64

wte = matrix(vocab_size, n_embed)                # 27 x 16: token embeddings
wpe = matrix(block_size, n_embed)                # 16 x 16: position embeddings
lm_head = matrix(vocab_size, n_embed)            # 27 x 16

Every Value node gets collected into one flat list so the SGD loop can update all 4,192 parameters in a single pass:

def flatten(mat):
    result = []
    for row in mat:
        for val in row:
            result.append(val)
    return result

params = flatten(wte) + flatten(wpe)
for i in range(n_layer):
    params += flatten(wq[i]) + flatten(wk[i]) + flatten(wv[i]) + flatten(wo[i])
    params += flatten(w_fc1[i]) + flatten(w_fc2[i])
params += flatten(lm_head)

print(f"num params: {len(params)}")   # 4192

This is what makes for param in params: param.data -= lr * param.grad in the training loop work (all parameters from all layers in one flat iterable)

Embeddings are vectors, not integer IDs, because vectors capture similarity. ‘anna’ and ‘ana’ share characters; their embeddings should be close in space. An integer ID for ‘a’ (=0) has no geometric relationship to ‘n’ (=13). A learned 16-dim vector does.
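That geometric relationship can be measured with cosine similarity. The 2-dim vectors below are made up purely for illustration (real embeddings are 16-dim and learned), but they show the point: similar characters can end up close, which integer IDs can never express.

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical learned embeddings (invented for this example)
emb_a = [0.9, 0.2]    # 'a'
emb_e = [0.8, 0.3]    # 'e' - another vowel, nearby
emb_n = [-0.4, 0.9]   # 'n' - a consonant, far away
```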


4. utility functions (lines 163-186)

linear: matrix-vector multiply. This is the forward pass of a single layer.

def linear(x, w):
    # output[i] = dot(w[i], x) for each row of the weight matrix
    result = []
    for row in w:               # each row = one output neuron's weights
        total = Value(0.0)
        for weight, inp in zip(row, x):   # zip pairs each weight with its input
            total = total + weight * inp  # accumulate dot product
        result.append(total)
    return result
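The same dot-product logic with plain floats makes the shapes obvious (`linear_f` is a throwaway illustration, not the script's `linear`): a 3×2 weight matrix maps a length-2 input to a length-3 output, one value per weight row.

```python
def linear_f(x, w):
    # output[i] = dot(w[i], x): one output per row of w
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

w = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
out = linear_f([2.0, 3.0], w)   # [2.0, 3.0, 5.0]
```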

softmax: converts logits to probabilities. The numerical stability trick:

\[e^{x_i - c} / \sum_j e^{x_j - c} = e^{x_i} / \sum_j e^{x_j}\]

The constant $c$ cancels. Subtract max before exponentiating to avoid overflow.
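A numeric demonstration of why the trick matters (`naive_softmax` and `stable_softmax` are throwaway float versions, not the script's softmax): the naive form overflows on large logits, the stable form does not.

```python
import math

def naive_softmax(logits):
    exps = [math.exp(l) for l in logits]      # math.exp(1000) overflows
    return [e / sum(exps) for e in exps]

def stable_softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]  # largest exponent is exp(0) = 1
    return [e / sum(exps) for e in exps]

# naive_softmax([1000, 1001, 999]) raises OverflowError;
# the stable version handles the same logits fine
probs = stable_softmax([1000.0, 1001.0, 999.0])
```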

Naive softmax on logits [100, 101, 99] computes e^100, e^101, e^99 and overflows to NaN/Inf. The stable version subtracts max=101 first, giving e^-1, e^0, e^-2: tiny exponents, same relative proportions, correct probabilities.
def softmax(logits):
    max_logit = max(l.data for l in logits)    # stability: subtract max before exp
    exp_logits = []
    for logit in logits:
        exp_logits.append((logit - max_logit).exp())
    sum_exp = Value(0.0)
    for e in exp_logits:
        sum_exp = sum_exp + e
    result = []
    for e in exp_logits:
        result.append(e / sum_exp)
    return result

def rmsnorm(x):
    # Normalize by root-mean-square.
    # mean_sq = average of (xi squared)
    # rms = sqrt(mean_sq)
    # output = xi / rms
    eps = 1e-8
    sq_total = Value(0.0)
    for xi in x:
        sq_total = sq_total + xi ** 2
    mean_sq = sq_total / len(x)
    rms = (mean_sq + eps) ** 0.5
    result = []
    for xi in x:
        result.append(xi / rms)
    return result

After softmax, the output is a probability distribution - non-negative values that sum to exactly 1.0. In attention_block this turns raw dot-product scores into attention weights. In the training loop it turns the model’s final logits into token probabilities.

Normalization stabilizes training by keeping activations in a reasonable range; nothing fancy here.

Design choice: RMSNorm instead of LayerNorm. Simpler (no mean subtraction, no learned scale/bias parameters). Modern LLMs like LLaMA use RMSNorm for the same reason.
RMSNorm example: an input with a wide range (4.0, −1.5, 6.0, −2.5, 3.0, −5.0) has RMS ≈ 3.97; dividing by it compresses the values to roughly (1.01, −0.38, 1.51, −0.63, 0.76, −1.26). Shape preserved, magnitudes tamed.
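The rmsnorm math checks out with plain floats too (`rmsnorm_f` is a throwaway float version of the function above): the output always has unit RMS, and the ratios between elements are preserved.

```python
import math

def rmsnorm_f(x, eps=1e-8):
    rms = math.sqrt(sum(xi * xi for xi in x) / len(x) + eps)
    return [xi / rms for xi in x]

out = rmsnorm_f([3.0, -4.0])    # rms = sqrt((9 + 16) / 2) ≈ 3.5355
out_rms = math.sqrt(sum(o * o for o in out) / len(out))   # 1.0 by construction
```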

5. attention (lines 189-235)

the problem attention solves

An MLP processes each position independently. Token at position 3 has no access to position 0. But predicting the ‘a’ in “emm_” is easier with the full context. Attention gives each token a channel to read from every past position.

Q, K, V

Before computing Q, K, V, ask: who should each token pay attention to? A token representing ‘m’ in position 2 of ‘emma’ should probably attend to position 1 (‘e’) more than position 0 (BOS). Q, K, V give the model the machinery to figure that out from content.

Each token projects its embedding through three learned matrices to produce three vectors:

  • Q (query): “what am I looking for?”
  • K (key): “what do I advertise?”
  • V (value): “what do I carry?”

Formula: $\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^T}{\sqrt{d_k}}\right)V$

This code implements exactly that, one scalar at a time.

Each token projects to Q, K, V via separate learned matrices. Attention score = dot(Q[t], K[s]) / sqrt(head_dim). Output = weighted sum of Vs.

causal masking

score(t, s) = dot(Q[t], K[s]) / sqrt(head_dim). Dot product measures alignment: if Q[t] and K[s] point in similar directions in 16-dimensional space, their dot product is large. Large score means high attention weight, which means s’s value gets pulled into t’s output. Basically the model routes information based on what the query is asking and what the key is advertising.

The sqrt(head_dim) scaling keeps dot products from growing proportionally to dimensionality. Without the sqrt(head_dim) scaling, dot products grow with the embedding dimension. At high dimension, a few scores dominate, softmax saturates (one weight goes to 1.0, rest go to 0), and gradients vanish. The scaling prevents this.
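You can see this numerically: dot products of random unit-variance vectors have standard deviation ≈ sqrt(d), while the scaled version stays near 1 regardless of dimension (a throwaway experiment, not part of the script).

```python
import math, random, statistics

random.seed(0)
d = 256
dots = []
for _ in range(500):
    q = [random.gauss(0, 1) for _ in range(d)]
    k = [random.gauss(0, 1) for _ in range(d)]
    dots.append(sum(qi * ki for qi, ki in zip(q, k)))

raw_std = statistics.stdev(dots)                           # ≈ sqrt(256) = 16
scaled_std = statistics.stdev([x / math.sqrt(d) for x in dots])   # ≈ 1
```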

Without the causal mask every position can see all future tokens (the model learns to copy from the future rather than predict it; cheating)

Want proof? Replace range(t+1) with range(len(x_seq)) (every token sees every other token) and the model immediately drives training loss toward zero while failing completely at generation.

Causal mask: the loop runs s in range(t+1), so future tokens never exist in the score list. No explicit mask matrix (just don’t compute the scores)

This is O(T²) per head per layer. For T=16 that is at most 16×16=256 dot products per head, 1,024 across 4 heads - fine here. At T=100K (long-context models) the quadratic cost explodes, hence algorithms like FlashAttention.

Each query position t can only attend to key positions s ≤ t, so the attention-weight matrix is lower-triangular: everything above the diagonal (the future) is never computed.

multi-head

Split Q, K, V into n_head=4 slices of head_dim=4 each. Run attention independently per head. Heads specialize: one might track character repetition (“mm” in “emma”), another common name endings, another positional distance. This comes from training. Concatenate the 4 outputs back to 16 dims, then project through wo.

The output of each attention head is a weighted average of value vectors: head_out[d] = sum over s of weights[s] * V[s][d].

Attention is just a learnable weighted sum.

The 16-dim embedding splits into 4 heads of 4 dims each. Each head attends independently; the outputs are concatenated and projected through Wo back to 16 dims.

the residual connection

After attention: x = x + attn_out. This is the skip (residual) connection from ResNet. Without it, gradients multiply through every layer and shrink toward zero before reaching early parameters. With it, the gradient path from loss to input includes a direct addition - the derivative of x + f(x) w.r.t. x is 1 + df/dx, so the gradient is never zero even if f learns nothing. See attention-residuals.
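A finite-difference sketch of that claim: even when the sub-layer contributes nothing (a ReLU that is dead at this input), the residual keeps d(out)/dx = 1 instead of 0.

```python
def residual(x):
    f = max(0.0, -1.0 * x)   # toy sub-layer: ReLU is "dead" for x > 0
    return x + f             # skip connection around it

h = 1e-6
x = 2.0
grad = (residual(x + h) - residual(x - h)) / (2 * h)   # ≈ 1.0, not 0
```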

def attention_block(x_seq, wq_i, wk_i, wv_i, wo_i):
    Q = []
    for x in x_seq:
        Q.append(linear(x, wq_i))
    K = []
    for x in x_seq:
        K.append(linear(x, wk_i))
    V = []
    for x in x_seq:
        V.append(linear(x, wv_i))

    out_seq = []
    for t in range(len(x_seq)):
        x_attn = []
        for h in range(n_head):
            start = h * head_dim
            end = (h + 1) * head_dim
            qh = Q[t][start:end]

            scores = []
            for s in range(t + 1):  # causal: no peeking at future
                score = Value(0.0)
                for qi, ki in zip(qh, K[s][start:end]):
                    score = score + qi * ki
                score = score / math.sqrt(head_dim)  # scale to stabilize softmax
                scores.append(score)

            weights = softmax(scores)

            head_out = []
            for d in range(head_dim):
                # weighted sum of values across attended positions
                val = Value(0.0)
                for s in range(t + 1):
                    val = val + weights[s] * V[s][start + d]
                head_out.append(val)
            for elem in head_out:
                x_attn.append(elem)

        out_seq.append(linear(x_attn, wo_i))  # mix across heads
    return out_seq
Checkpoint: attention - Attention computes content-based weighted averages of value vectors. Multi-head runs this in parallel across different learned projections. Causal masking prevents looking at future tokens. The math is three linear projections, a scaled dot product, a softmax, and a weighted sum.

6. the MLP (lines 238-248)

Attention mixes information across positions. MLP processes each position’s vector independently. Expand 16 -> 64 (4x width), apply ReLU, project back 64 -> 16.

The logic is simple: more dimensions give the model more capacity to represent facts about the data (intuitive when you picture any dataset and add dimensions).

MLP = per-token compute. Attention = cross-token communication. The transformer alternates between these two operations: attention lets tokens talk to each other, MLP lets each token process what it heard.

In larger models, the MLP layers are where most factual associations are stored - things like ‘anna’ implies female name. Attention routes information; MLP transforms it.

Design choice: ReLU instead of GELU. GELU is smoother and used in GPT-2/GPT-3. ReLU is simpler and works fine here.
16 inputs fan out to 64 hidden nodes; ReLU zeroes some of them; the survivors fan back in to 16 outputs, the same shape as the input.
def mlp_block(x, w_fc1_i, w_fc2_i):
    x = linear(x, w_fc1_i)          # 16 -> 64: project to wider space
    relu_out = []
    for xi in x:
        relu_out.append(xi.relu())   # non-linearity (ReLU, not GELU)
    x = linear(relu_out, w_fc2_i)   # 64 -> 16: project back
    return x

7. the full forward pass (lines 251-271)

Positional embeddings are not optional. The attention mechanism is permutation equivariant: without position information, ‘emma’ and ‘mame’ produce identical representations because attention treats the set of tokens the same regardless of order. wpe encodes order so the model can distinguish position 0 from position 3.

def gpt(tokens):
    # 1. Build input embeddings: add token embedding + position embedding
    # enumerate(tokens) gives (0, tok0), (1, tok1), ...
    # wte[tok] and wpe[pos] are each 16-dim lists; we add them element by element
    x_seq = []
    for pos, tok in enumerate(tokens):
        emb = []
        for te, pe in zip(wte[tok], wpe[pos]):
            emb.append(te + pe)
        x_seq.append(emb)
    # x_seq is now a list of T vectors, each of length 16

    for i in range(n_layer):
        # 2. Normalize, run attention, add residual (x = x + attn_out)
        normed = []
        for x in x_seq:
            normed.append(rmsnorm(x))
        attn_out = attention_block(normed, wq[i], wk[i], wv[i], wo[i])
        new_x_seq = []
        for attn, x in zip(attn_out, x_seq):
            combined = []
            for a, b in zip(attn, x):          # add skip connection
                combined.append(a + b)          # new_x[d] = attn_out[d] + old_x[d]
            new_x_seq.append(combined)
        x_seq = new_x_seq

        # 3. Normalize, run MLP, add residual (same pattern)
        new_x_seq = []
        for x in x_seq:
            mlp_out = mlp_block(rmsnorm(x), w_fc1[i], w_fc2[i])
            combined = []
            for a, b in zip(mlp_out, x):
                combined.append(a + b)
            new_x_seq.append(combined)
        x_seq = new_x_seq

    # 4. Project each position's 16-dim vector to 27 logits (one per vocab token)
    logits_seq = []
    for x in x_seq:
        logits_seq.append(linear(x, lm_head))
    return logits_seq

Shape walkthrough for input “em” (BOS plus two characters, e=4, m=12):

tokens = [26, 4, 12]             # BOS, 'e', 'm'        shape: [3]
x_seq  = [[...], [...], [...]]   # after wte+wpe         shape: [3, 16]
normed = [[...], [...], [...]]   # after rmsnorm          shape: [3, 16]
attn_out = [[...], [...], [...]] # after attention        shape: [3, 16]
x_seq (residual) = x_seq + attn_out                      shape: [3, 16]
mlp_out = [[...], [...], [...]]  # after mlp              shape: [3, 16]
x_seq (residual) = x_seq + mlp_out                       shape: [3, 16]
logits = [[...], [...], [...]]   # after lm_head          shape: [3, 27]

Each token and its position each contribute a learned 16-dim vector. The two vectors are added element by element.

The result is the model’s complete representation of “this token at this position.”

Full forward pass: tokens [T] → wte + wpe → x_seq [T × 16] → (RMSNorm → attention → residual → RMSNorm → MLP → residual) × n_layer → lm_head → logits [T × 27]. Pre-norm (RMSNorm before attention/MLP) is more stable than post-norm; a residual connection follows each sub-layer.

Pre-norm (RMSNorm before each sub-layer) vs post-norm (after): pre-norm keeps the residual stream unmodified by the norm. The original Transformer used post-norm; GPT-2 switched to pre-norm after finding it more stable at scale. LLaMA kept pre-norm with RMSNorm.

Residual connections let the model start from identity (pass x through unchanged) and learn deltas. The derivative of x + f(x) with respect to x is 1 + df/dx, so the gradient always has a clean path back through the 1 term; without it, gradients vanish through many layers.

The model is differentiable. Every operation (embeddings, attention scores, softmax, weighted sum, MLP, final projection) is composed of Value nodes. loss.backward() computes dLoss/d(every weight) in one pass.


8. training loop (lines 274-318)

loss: cross-entropy

Cross-entropy loss = -log(p[target]).

If the model assigns 90% probability to the correct token: loss = -log(0.9) = 0.11.

If it assigns 1%: loss = -log(0.01) = 4.6.

The log function penalizes confident wrong answers harshly and rewards confident correct ones, and its gradient through softmax is simple and numerically stable.

Why log? Cross-entropy is the negative log-likelihood of the data under the model (log turns products of probabilities into sums, numerically stable and easier to differentiate)
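The numbers above, computed directly:

```python
import math

def cross_entropy(p_target):
    # negative log-likelihood of the correct token
    return -math.log(p_target)

confident_right = cross_entropy(0.9)      # ≈ 0.105
confident_wrong = cross_entropy(0.01)     # ≈ 4.605
random_model = cross_entropy(1 / 27)      # ≈ 3.296, the loss before training
```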

A random model gives uniform probability ≈ 1/27, so loss ≈ 3.3. A good model gives high probability to the correct token, loss near 0. Training pushes the model rightward along the -log(p) curve.

backward + SGD

num_steps = 1000       # length of the run shown below
batch_size = 8         # names per gradient step
learning_rate = 0.01   # SGD step size (an assumed value; tune as needed)

for step in range(num_steps):
    total_loss = Value(0.0)

    for b in range(batch_size):
        # modulo wraps around so we cycle through all docs repeatedly
        doc = docs[(step * batch_size + b) % len(docs)]
        # chars.index(c) converts each character to its integer ID
        char_tokens = []
        for c in doc:
            char_tokens.append(chars.index(c))
        tokens = [BOS] + char_tokens + [BOS]
        tokens = tokens[:block_size + 1]    # truncate if name > 16 chars

        input_tokens  = tokens[:-1]         # model sees this
        target_tokens = tokens[1:]          # model must predict this

        logits_seq = gpt(input_tokens)

        loss = Value(0.0)
        for logits, target in zip(logits_seq, target_tokens):
            probs = softmax(logits)
            loss = loss + (-probs[target].log())   # negative log-likelihood
        total_loss = total_loss + loss / len(logits_seq)

    avg_loss = total_loss / batch_size
    avg_loss.backward()          # backprop through the ENTIRE Value graph

    # SGD: param -= lr * grad
    for param in params:
        param.data -= learning_rate * param.grad
        param.grad = 0           # must reset or gradients accumulate

    print(f"step {step+1:4d} / {num_steps} | loss {avg_loss.data:.4f}")

Gradients accumulate by += in backward(). If you skip the zero step, gradients from step 1 add to gradients from step 2, inflating the effective learning rate more with every step and causing divergence. Zero every step.
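A toy float simulation of what goes wrong: the same gradient (3.0) arrives on each step with lr = 0.1, and the only difference is whether we zero between steps.

```python
def train(zero_grad, steps=2):
    p, g = 0.0, 0.0
    for _ in range(steps):
        g += 3.0          # backward() accumulates with +=
        p -= 0.1 * g      # SGD update
        if zero_grad:
            g = 0.0       # the reset step
    return p

with_zero = train(True)     # -0.3 - 0.3 = -0.6
without = train(False)      # -0.3 - 0.6 = -0.9: step 2 used a doubled gradient
```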

Surprise: this is vanilla SGD, not Adam. Three lines: subtract gradient, zero gradient, move on. Adam tracks per-parameter momentum and variance, which matters at scale but adds 2x memory and complexity. For 4K parameters and 20 steps, SGD works. If you want Adam (which works great at scale and adds two extra moment terms per parameter), look at the corresponding lines in Karpathy's repo.

Loss starts at ~3.3 (-log(1/27) ≈ 3.3, random model with 27 classes). It drops to ~2.37 over 20 steps. The model has seen 160 names (20 steps × 8 batch). Output at this point: names are phonetically plausible but rough - “aria”, “eman”, “marn”. Train for 500 steps and they get much better.

$ python train.py
num docs: 32033
vocab size: 27
num params: 4192
step    1 / 1000 | loss 3.3660
step    6 / 1000 | loss 2.9452
step   11 / 1000 | loss 2.7964
step   13 / 1000 | loss 3.0544
...
Loss descends noisily from ~3.3 (the random baseline) to ~2.37 over 20 steps. The spike at step 13 is real: gradient steps are noisy by design.
Checkpoint: training - Forward pass computes loss. backward() fills every param.grad. SGD subtracts lr*grad. grad=0 resets for next step. This is the complete training algorithm.

9. inference (lines 321-343)

Start from [BOS]. Run the full forward pass. Take logits at the last position. Apply temperature. Softmax. Sample. Append. Repeat until BOS comes back out.

Logits are unnormalized scores before softmax. A logit of 5.0 doesn’t mean ‘probability 5.0’ - it just means this token is more likely than one with logit 2.0. Softmax turns the relative differences into a probability distribution.

That’s all autoregressive generation is: run the model, get probabilities over next tokens, pick one, condition on it, repeat. The model never “sees” what it’s about to generate - it only sees what it has generated so far.

Temperature is logit / T before softmax. If the model’s top logit is 2.0 and second is 1.5:

  • T=0.5: logits become 4.0 and 3.0 - gap widens, top token dominates
  • T=1.0: unchanged
  • T=2.0: logits become 1.0 and 0.75 - gap narrows, more even distribution

Temperature controls the entropy of the output distribution.

  • T=0 gives argmax (always pick the most likely token, no diversity)
  • T=1 samples from the raw model distribution
  • T>1 flattens the distribution (more surprising outputs)
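The worked example above, in code (`softmax_f` is a throwaway float softmax; the logits 2.0 and 1.5 are the ones from the example):

```python
import math

def softmax_f(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return [e / sum(exps) for e in exps]

logits = [2.0, 1.5]
p_sharp = softmax_f([l / 0.5 for l in logits])[0]   # T=0.5: top token dominates
p_plain = softmax_f([l / 1.0 for l in logits])[0]   # T=1.0: raw distribution
p_flat = softmax_f([l / 2.0 for l in logits])[0]    # T=2.0: closer to uniform
```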

Sampling instead of argmax gives diversity: two runs with the same context produce different names. This is why entering the same prompt into ChatGPT more than once gives different results.

T=0.5 produces repetitive but valid looking names. T=1.5 produces novel but sometimes nonsensical ones.

Attention weights depend on the content of Q and K, not their positions. Position information enters only through the initial wpe embeddings. Once inside the transformer, everything is content-based.

Same underlying logits: T=0.5 concentrates probability on the likely tokens, T=1.0 samples the natural distribution, T=2.0 flattens toward uniform.

Name generation, token by token: starting from [BOS], each sampled token is appended to the context and fed back in, until BOS is emitted again - end of name.
temperature = 0.5
for _ in range(20):
    context = [BOS]
    name = ""
    for _ in range(block_size):
        logits_seq = gpt(context)
        # apply temperature: divide logits before softmax
        # lower temperature = model picks safer, more common tokens
        logits_t = []
        for l in logits_seq[-1]:
            logits_t.append(l / temperature)
        probs = softmax(logits_t)
        # random.choices samples one item from range(vocab_size)
        # weighted by probability - higher prob tokens get picked more often
        # [0] because random.choices returns a list; we want the single item
        prob_data = []
        for p in probs:
            prob_data.append(p.data)
        next_token = random.choices(range(vocab_size), weights=prob_data)[0]
        if next_token == BOS:    # BOS doubles as end-of-name sentinel
            break
        context.append(next_token)
        name += chars[next_token]
    print(name)

Running this after 1000 training steps:

sample  1: kamon
sample  2: ann
sample  3: karai
sample  4: jaire
sample  5: vialan
sample  6: karia
sample  7: yeran
sample  8: anna
sample  9: areli
sample 10: kaina
sample 11: konna
sample 12: keylen
sample 13: liole
sample 14: alerin
sample 15: earan
sample 16: lenne
sample 17: kana
sample 18: lara
sample 19: alela
sample 20: anton

The model has never seen most of these; it learned the pattern (vowel/consonant rhythm, common endings like -a, -an, -ina) from examples in 1000 steps.


10. from microgpt to chatgpt

microGPT                 ChatGPT
32K names                trillions of internet tokens
27-token char vocab      ~50K BPE subword tokens
4,192 params, 1 layer    175B params, 96 layers
single CPU               thousands of H100s
vanilla SGD              Adam + LR schedule + grad clipping
no post-training         SFT + RLHF + safety alignment
ReLU                     GELU
RMSNorm                  LayerNorm (GPT-2) / RMSNorm (LLaMA)

Same algorithm. Scale and engineering differ.

The core attention mechanism is from the “Attention Is All You Need” paper (2017) and hasn’t changed. What changed between 2017 and GPT-4 is the data pipeline, post-training (SFT + RLHF), infrastructure, and compute budget.

