TL;DR: Covers linear algebra, calculus, and optimization through the lens of deep learning: what a neuron computes, how loss functions measure error, why the chain rule is the backbone of backprop, and how gradient descent finds the minimum. Code alongside every concept.

Why do we even need math for AI? Because computers understand numbers, not words; they don’t understand the meaning of words, just like we didn’t when we were born. Only after rigorous training (school) and reading did we finally understand what one word meant relative to another, how words and sentences are formed, and the meaning behind them. Teaching this to a computer does not take the 12 years it took us; it’s much simpler. It’s basic math.

So how do we teach these computers? We must first ask: how did we teach ourselves?

Geoffrey Hinton (cognitive psychologist and computer scientist), while studying psychology, wanted to know how a human learns. We all have neurons in our brains that somehow capture relationships between things in our lives. How do these neurons work? They have connections to other neurons in the brain, pass messages along those connections, and learn over time.

For example -> how do we know the difference between a car and a bike? Simple: since childhood we have seen millions of cars and bikes, people around us have told us “this is a car” or “this is a bike”, and our brain has retained that information over time.

Geoffrey Hinton wanted to figure out how to make computers do the same, how do machines learn in a similar way. Not by hard-coding rules, but by showing them examples and making sure they have these so called ‘neurons’ that learn by looking at these examples and labels.

These neurons, when connected together -> form a neural network (just like ours is called a brain)

Making these neurons in a neural network learn to perform some task is the basis of AI.

Before jumping into neural networks, what does 1 neuron actually compute?

A neuron can also be thought of as a function: some inputs -> function -> output

These inputs have weights associated with them. They define how significant that input is to the function.

A neuron:

  • Receives inputs
  • Performs some computation
  • Gives back output

[figure: neuron]

This computation has 2 stages

  • Multiply each weight (importance we give to the input) with inputs and sum them.
  • Sum of weighted inputs is passed through an ‘activation function’ to produce output

Now you might have some questions. The first stage makes sense: it’s essentially the line equation (y = mx + c) we studied as kids, with the weights playing the role of the slope, telling you how much each input contributes to the output.

$ y = \text{activation_function}\left( b + \sum_i w_i x_i \right) $

here b = bias; it allows the sum to be non-zero even if all inputs ($ x_i $) are 0 -> think of the bias as just another weight
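The formula above, sketched in plain Python (the sigmoid used here is just one example activation; the weights, bias, and inputs are made-up toy values):

```python
import math

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # stage 1: weighted sum of the inputs, plus the bias
    z = bias + sum(w * x for w, x in zip(weights, inputs))
    # stage 2: pass the sum through the activation function
    return activation(z)

out = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.1, activation=sigmoid)
print(out)  # sigmoid(0.1 + 0.5 - 0.5) = sigmoid(0.1) ≈ 0.525
```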

What the hell is an activation function and why are we passing our sum (stage1) output to it?

  • Activation functions decide if a neuron should “fire” based on its input.
  • If the input looks important for prediction, the function activates the neuron. Otherwise it doesn’t.

Different neurons can have different activation functions

  • Linear function (where activation function is just y = x) ~ Used for regression tasks
  • Step function (where the activation function is y = 0 if x <= 0, and y = 1 if x > 0) ~ Used for classification tasks
  • Many more (we’ll discuss them in a bit)

So what problem are we solving here?

Basically, for any ML/deep-learning problem we choose the weights and biases (also known as parameters) that best represent our dataset -> this is called ‘training’

So when we were learning about cars and bikes as kids, there were features (tyres, engine, etc.) common to both but with different values. We looked at millions of these cars’ and bikes’ features (our brain adjusting its own weights and biases all the while) and learned which is which, so even if we see a new car (or something that looks like a car) we can confidently say “oh, that’s a car”.

To do this ‘training’ we need a loss function.

Job of the loss function: for current parameter values, how wrong is the model with respect to representing our data.

For example, we show the model a photo of a car and it predicts truck; we need a loss function to tell the model “oops, you made an error there, that’s not an accurate representation of our data, please adjust yourself to learn it’s a car”

A good loss function -> sum of squared errors.

$ \text{Loss} = \sum (y_{\text{output}} - y_{\text{prediction}})^2 $

  • y_output is the actual output, in this case “car”
  • y_prediction is the output our model predicted, in this case “truck”

  • We need our loss function to be at its minimum; that’s when we are closest to our ground truth. Averaged over all N points in the dataset:

\[\text{Loss}(w_1, w_2, \ldots, w_n, b) = \frac{1}{N} \sum \left( y - \text{model}(x) \right)^2\]
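As code, with made-up targets and predictions (N = 3, values chosen only for illustration):

```python
def mse_loss(y_true, y_pred):
    # average of squared differences between targets and predictions
    n = len(y_true)
    return sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n

loss = mse_loss([1.0, 0.0, 1.0], [0.9, 0.2, 0.4])
print(loss)  # (0.01 + 0.04 + 0.36) / 3 ≈ 0.1367
```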

Now we know how ‘wrong’ our model is, we can minimise it.

By ‘it’ here I mean updating the parameters (weights and bias) in a way that reduces the loss (making our model less wrong)

How do we do this??

Gradient Descent

Toy problem time

[figure: toy_example]

  • x -> input
  • y -> output
  • f is our function y = weights * x + bias

Putting Loss(w = 2, b = -2) we get 10. This is our initial loss.

Our goal is to minimise this. How? By moving in the direction that decreases the loss, which is the negative of the gradient.

So we need to find gradient of this loss function (which consists of weights and bias)

You may ask what even is a gradient??

time for some calculus

All you need to know is partial derivatives

  • How does the output change when we wiggle just one input?

  • In single-variable calculus, the derivative dy/dx tells us the rate of change of y with respect to x.
  • But neural networks have millions of parameters. A loss function might depend on weights $w_1, w_2, \ldots, w_{1000000000}$.

  • We can’t simply ask “what’s the slope?” because the function slopes differently in every direction. Instead, we ask: “How does the output change if I wiggle just one input while holding all others constant?”

  • The gradient is a vector of these partial derivatives, one entry per parameter.

Computing partial derivatives for every weight in a neural network is known as backpropagation, and using those gradients to pick the direction that reduces the loss the most is known as optimization.

Stepping back to calculate the partial derivative for a general neuron, not just the toy problem.

\[L(w_1, w_2, w_3, \ldots, w_n, b) = \sum \left( \text{output} - f(x) \right)^2\]

A partial derivative is basically the derivative of the function with respect to 1 input at a time; the others are treated as constants.

For example for $ L = x^2 + y^2 $ :

\[\frac{\partial L}{\partial x} = 2x \quad \text{(since } y^2 \text{ is treated as a constant)}\] \[\frac{\partial L}{\partial y} = 2y \quad \text{(since } x^2 \text{ is treated as a constant)}\]
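A finite-difference sketch to sanity-check this: wiggle one variable at a time while holding the other fixed.

```python
def L(x, y):
    return x ** 2 + y ** 2

h = 1e-6
x, y = 3.0, 4.0

# wiggle x only -> approximates dL/dx = 2x = 6
dL_dx = (L(x + h, y) - L(x - h, y)) / (2 * h)
# wiggle y only -> approximates dL/dy = 2y = 8
dL_dy = (L(x, y + h) - L(x, y - h)) / (2 * h)

print(dL_dx, dL_dy)  # ≈ 6.0, 8.0
```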

Okay back to our problem.

\[L(w_1, w_2, \ldots, w_n, b) = \sum \left( \text{output} - f(x) \right)^2\]

Here X is the error: $ X = \text{output} - \text{prediction} = y - f(x) $

\[\frac{\partial L}{\partial w_i} = \sum 2X \cdot \frac{\partial X}{\partial w_i} = \sum 2X \left( - \frac{\partial f(x)}{\partial w_i} \right)\]

We also apply the chain rule here.

Derivative of (y-y’)^2 is 2 * (y-y’) * derivative of (y-y’)

\[\frac{\partial L}{\partial w_i} = -2 \sum (y - f(x)) \frac{\partial f(x)}{\partial w_i}\]

[figure: pd]

more on chain rule

No chain rule, no backprop, no gradient descent, no modern AI. It’s that important.

To find the derivative of a composition, multiply the derivatives of each step.

Example:

$ y = (3x + 1)^2 $

Let $ u = 3x + 1 \quad \Rightarrow \quad \frac{du}{dx} = 3 $

Since $ y = u^2 \quad \Rightarrow \quad \frac{dy}{du} = 2u $

By the chain rule, $ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 $

Substituting back ( u = 3x + 1 ), $ \frac{dy}{dx} = 6(3x + 1) $
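torch.autograd applies exactly this rule; a quick check of the example at x = 2:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = (3 * x + 1) ** 2   # composition: u = 3x + 1, then y = u^2
y.backward()           # chain rule applied automatically

print(x.grad)          # dy/dx = 6(3x + 1) = 6 * 7 = 42
```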

Consider:

\[z = wx + b\] \[a = \sigma(z)\] \[L = (a - y)^2\]
  • ∂L/∂a = 2(a-y)
    • The dependency on z is handled in the next chain rule step (∂L/∂a is only the derivative of the loss with respect to a; we also need ∂a/∂z)
  • ∂a/∂z = σ’(z) = a(1-a)
  • ∂z/∂w = x (remember z = wx + b, so the partial derivative with respect to w is just x)
  • For the bias, ∂z/∂b = 1 (since z = wx + b), so ∂L/∂b = 2(a-y) * a(1-a) * 1

  • Hence ∂L/∂w (partial derivative of final loss with respect to initial weights) :

∂L/∂a * ∂a/∂z * ∂z/∂w = 2(a-y) * a(1-a) * x

\[\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w} = 2(a - y)\, a(1 - a)\, x\]

Dependencies and computation graph for this:

\[w \xrightarrow{\;x\;} z \xrightarrow{\;\sigma\;} a \xrightarrow{\;(\cdot - y)^2\;} L\]

Can also be thought of as : 2(a-y) * a(1-a) * x = error term * activation derivative * input
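A sketch comparing the hand-derived expression with autograd (w, b, x, y are arbitrary toy values):

```python
import torch

w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(0.1, requires_grad=True)
x, y = torch.tensor(2.0), torch.tensor(1.0)

z = w * x + b          # weighted input
a = torch.sigmoid(z)   # activation
L = (a - y) ** 2       # squared-error loss
L.backward()

# hand-derived gradient: dL/dw = 2(a - y) * a(1 - a) * x
manual = (2 * (a - y) * a * (1 - a) * x).item()
print(manual, w.grad.item())  # the two values should match
```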

  • Direction of Steepest Ascent
    • The gradient ∇f points in the direction where f increases fastest. If you’re standing on a hill, the gradient points uphill.
    • To minimize f (like a loss function), move in the opposite direction: new_params = old_params - small_step * gradient
    • The magnitude ∣∣∇f∣∣ tells you how steep the slope is. Large gradient = steep terrain = big updates.

[figure: backprop_for_MSE]

more on backpropagation

  • When training a neural network, we need $∂L/∂w_i$ for every weight $w_i$
  • Chain rule lets us compute these efficiently by propagating gradients backward through the network

Back to our toy problem:

  • L(w=2, b=-2) = 10 # we calculated this

Now let’s try to find the vector of gradient for this:

[figure: toy_example_final]

how do we take this “step” in the right direction

  • Subtract ‘small multiple’ of the gradient from current value of our parameter
  • This ‘small multiple’ is called learning rate -> how much we should move at one step (e.g -> learning_rate = 0.01)

new = old - learning_rate * gradient

new_weight = 2 - 0.01 * 22 = 1.78

new_bias = -2 - 0.01 * 4 = -2.04
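Repeating that update is the whole training loop. A minimal sketch on made-up data (assumed here to follow y = 3x + 1):

```python
# fit f(x) = w*x + b by gradient descent on the MSE loss
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 4.0, 7.0, 10.0]   # generated from y = 3x + 1

w, b, learning_rate = 0.0, 0.0, 0.05
for step in range(2000):
    # partial derivatives of the mean squared error
    grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # step in the negative gradient direction
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(round(w, 3), round(b, 3))  # ≈ 3.0, 1.0
```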

[figure: plot_contour]

  • Applying the same to our classification problem fails, since the derivative of y_predicted is 0 everywhere for the step function (so no gradient signal gets through).
  • To fix this, we make it smooth -> the sigmoid function

So from now on our new activation function is sigmoid function.

[figure: derivative_sigmoid]

Now we’ve seen how 1 neuron works. How does it generalise to a “neural network”?

feed forward neural network (FFN)

Hyperparameter: tunable aspects of the model NOT updated by training (e.g. step size / learning_rate, number of iterations, etc.)

[figure: nn]

weights: 4 inputs in the first layer × 3 neurons in the second (4 × 3 = 12) + 3 neurons in the second × 3 in the third (3 × 3 = 9)

  • plus 3 neurons in the third × 2 final outputs (3 × 2 = 6)

(these middle layers are also called hidden layers)

= 27 weights in total

also biases: 3 for the first hidden layer + 3 for the second + 2 for the final = 8

Hence the number of parameters in this neural network = 27 + 8 = 35.
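We can double-check that count in torch with a sketch of the same 4 → 3 → 3 → 2 network (layer sizes taken from the figure above):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),  # 4*3 = 12 weights + 3 biases
    torch.nn.Linear(3, 3),  # 3*3 = 9 weights + 3 biases
    torch.nn.Linear(3, 2),  # 3*2 = 6 weights + 2 biases
)

total = sum(p.numel() for p in model.parameters())
print(total)  # 27 weights + 8 biases = 35
```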

This is what it means when we say a model like LLAMA3 has 405 billion parameters: 405 billion tunable parameters (weights and biases) which we tweak during training to best fit the data.

  • How does one computation look like in this neural network?

[figure: example_for_comp]

Question

  • Why is using a neural network with hidden layers better than using a single layer?
  • Answer: They aren’t better if we use linear activation functions.
    • Why?
  • If we compose 2 linear functions -> the result is just another linear function
    • With linear f(x) and g(x): passing the input through f then g gives g(f(x)), which is still linear, so the extra layer DOES NOT HELP; it adds nothing new
  • Hence non-linearity is required if we want stacking layers to add any value.
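To see the collapse concretely: two torch.nn.Linear layers with no activation in between compose into a single linear map (a sketch; biases omitted for clarity):

```python
import torch

torch.manual_seed(0)
f = torch.nn.Linear(3, 4, bias=False)
g = torch.nn.Linear(4, 2, bias=False)

x = torch.randn(5, 3)
stacked = g(f(x))                 # two "layers", zero non-linearity

# the exact same map as ONE linear layer with W = W_g @ W_f
combined = x @ (g.weight @ f.weight).T

print(torch.allclose(stacked, combined, atol=1e-5))  # True
```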

Examples of non-linearity

[figure: nonlin]

All in all, this is the entire training process:

  • Save activations on forward pass
  • Backward pass computes gradients
  • Update weights and biases using gradients computed

This whole loop is also called backprop (strictly speaking, backpropagation is the backward pass that computes the gradients).

how do we train this neural network?

  • By training parameters of this neural network (weights, biases) -> by gradient descent

  • For each data point, the loss function needs to aggregate over the output dimensions.

[figure: gd]

What is the loss in a neural network?

  • For any point $ j $ in the dataset, the loss of the model on that point = the sum over the output neurons of the squared difference between target and activation

[figure: loss_in_nn]

backpropagation

The key quantity is the partial derivative of the loss with respect to a neuron’s weighted input z (the weighted sum of inputs at that neuron):

G = $∂L/∂z$

G -> it will determine the weight updates we perform for every edge coming into that neuron

[figure: backprop]

[figure: update]

  • Average the partial derivatives, but NOT over the entire dataset
  • Pick a random subset and calculate over that subset -> works just as well: we “can move roughly in the right direction & still reduce error rapidly”
  • Best way to choose this “random sample” is to shuffle the dataset and group it into batches; then for each batch -> compute activations (FORWARD PASS) and partial derivatives (BACKWARD PASS), then UPDATE!!
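The shuffle-and-batch recipe, sketched with a stand-in dataset of 10 points:

```python
import random

random.seed(0)
data = list(range(10))    # stand-in for 10 (input, label) pairs
batch_size = 4

random.shuffle(data)      # reshuffle at the start of every epoch
batches = [data[i:i + batch_size] for i in range(0, len(data), batch_size)]

for batch in batches:
    # per batch: FORWARD PASS (activations) -> BACKWARD PASS (gradients) -> UPDATE
    print(batch)
```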

types of optimisers

Vanilla gradient descent has no memory. Each step, it looks at the current gradient, moves, and forgets everything. Doesn’t know if it’s been heading the same direction for 1000 iterations or just turned around.

Problems:

  • Oscillation: In narrow valleys, SGD bounces back and forth between steep walls. Wasted computation.
  • Slow progress: Along flat directions, gradients are tiny. SGD just crawls.
  • Think of a ball rolling down a hill taking tiny baby steps when really it should be picking up speed early on and slowing down near the bottom.

The fix comes from physics. Standard SGD treats parameters like a massless particle: apply a force (gradient), it moves; remove the force, it stops.

batch gradient descent

In batch gradient descent, we compute the gradient using the entire dataset:

\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{N} \sum_{i=1}^{N} \nabla L(\theta_t; x_i, y_i)\]

Advantages

  • Gradient is accurate (no sampling noise)
  • Stable, predictable convergence
  • Easy to analyze theoretically

Disadvantages

  • Extremely slow (one step = full dataset pass)
  • Requires all data in memory
  • Can’t escape local minima

Batch GD is impractical for modern deep learning. Training GPT on the full internet for one gradient update would take years.

stochastic gradient descent (SGD)

Other extreme: stochastic GD uses a single random sample:

\[\theta_{t+1} = \theta_t - \eta \cdot \nabla L(\theta_t; x_i, y_i), \quad \text{where } (x_i, y_i) \text{ is a single randomly chosen sample.}\]

The gradient from one sample is noisy. But that noise isn’t always bad: it can kick the parameters out of shallow local minima and saddle points.

mini-batch SGD

In practice, we use mini-batches. Small random subsets of the data.

\[\theta_{t+1} = \theta_t - \eta \cdot \frac{1}{|B|} \sum_{i \in B} \nabla L(\theta_t; x_i, y_i), \quad \text{where } B \text{ is a mini-batch (e.g., } |B| = 32, 64, 128, 256 \text{).}\]

Advantages

  • Variance Reduction: averaging over |B| samples scales the gradient noise (its standard deviation) down by a factor of 1/sqrt(|B|)
  • Hardware Efficiency: GPUs are optimized for matrix ops. A batch of 64 is nearly as fast as a batch of 1.

Disadvantages

  • Memory Constraint: Batch must fit in GPU memory. Larger batches = more memory usage.

The noise from random mini-batches “kicks” the parameters. Even if the average gradient is zero, individual batch gradients are likely non-zero in the escape direction.
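In torch, the whole recipe is a DataLoader plus one optimizer step per mini-batch. A minimal sketch (the shapes, learning rate, and random regression data are all arbitrary choices for illustration):

```python
import torch

torch.manual_seed(0)
X = torch.randn(100, 3)            # 100 samples, 3 features
y = torch.randn(100, 1)            # random regression targets

model = torch.nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True
)

for epoch in range(5):
    for xb, yb in loader:              # (xb, yb) is one mini-batch B
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)  # loss averaged over the batch
        loss.backward()                # gradients from this batch only
        optimizer.step()               # one parameter update per mini-batch

print(loss.item())                     # final mini-batch loss
```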

activation functions

  • For now we were using sigmoid for binary class classification (2 classes: car, truck)
  • What if we had multi class problem (predicting correct number given image of the number (MNIST))

What is the problem with sigmoid then??

[figure: sigmoid_prob]

Alternative : Softmax activation

  • Only for output / final layer for classification problem
  • For hidden we still use sigmoid, since the problem was only for final layer as we saw above

[figure: softmax]

code

import torch

def softmax(X):
    # numerically stable softmax (if you still want explicit probs):
    # subtracting the row-wise max before exp avoids overflow
    X_shift = X - X.max(dim=1, keepdim=True).values
    X_exp = torch.exp(X_shift)
    partition = X_exp.sum(1, keepdim=True)
    return X_exp / partition


# Torch has this built in btw (torch.softmax).

X = torch.rand((2,5))
print(X)
Xprob = softmax(X)
print(Xprob, Xprob.sum(1)) # rows should sum to 1

'''
tensor([[0.2075, 0.2380, 0.2615, 0.1074, 0.1856],
        [0.2310, 0.2398, 0.2227, 0.1600, 0.1465]])

tensor([1., 1.])
'''

What it achieves:

  • The largest input gets the highest probability (pushed toward 1)
  • The other inputs get pushed toward 0

This is what we want!!

increasing one $x_i$ decreases others -> sensible probabilities

At inference / testing time we can take the softmax outputs -> get class probabilities.

Now the question is::

How to combine this new softmax activation function with a loss function that gives us effective gradient descent updates?

We can’t use mean squared error (MSE) Loss.

Why?

  • After softmax, your model is not predicting numbers, it’s predicting a categorical probability distribution.

  • In classification, the true label is usually one-hot encoded: y = [0, 0, 1, 0] (This is also a probability distribution: 100% probability on the correct class, 0% elsewhere)

So the problem becomes: “How different is my predicted probability distribution from the true distribution?”

For this problem, MSE doesn’t work… MSE treats probabilities like ordinary numbers, but probabilities are coupled (increasing one decreases others)

Also with MSE + Sigmoid, the derivative is a little complicated, even with MSE + Softmax the derivative is complicated as shown below:

[figure: problem_with_mse]

Solution? Cross Entropy Loss

[figure: ce]

Not going to derive the partial derivative here, but you can do it in your free time. It’ll be:

[figure: derivative_ce]

  • With softmax + cross entropy, the derivative is clean, stable, strong signal even when model is wrong.

  • When we take a step in the -grad direction -> we increase the activation for the correct class (where the target is 1) and decrease all the others.

LLMs use cross-entropy loss for next-token prediction - penalize wrong guesses, reward right ones. If the model is confident about a wrong token, it gets punished hard.

cross entropy loss code

'''
This is the loss function used to train almost all LLMs.

Measures how wrong were the predicted probabilities for the correct class?

It's a combination of LogSoftmax + Negative Log Likelihood
Math:
-log(softmax)

Given:
- model outputs scores (logits)
- correct class index

It answers:
“How much probability did you assign to the correct answer?”

- High probability → low loss
- Low probability → high loss
'''
import torch

def main():
    logits = torch.tensor([[2.0, 1.0, 0.1]])  # (batch, num_classes)
    target = torch.tensor([0])               # correct class index (batch_size,)

    loss_fn = torch.nn.CrossEntropyLoss()

    loss = loss_fn(logits, target)
    print(loss) # tensor(0.4170) (Our goal is to minimise this using backprop)

main()


'''
Running through the output
Step1: log(softmax)
log_probs = log(softmax(logits))

Step2: picking correct class
correct_log_prob = log_probs[range(batch), target]

Step3: negate + average
loss = -mean(correct_log_prob)

CrossEntropyLoss expects raw logits; it applies log-softmax automatically

In LLMs:
Given:
(batch, seq_len, vocab_size)  ← logits
(batch, seq_len)              ← target token IDs

We reshape:
logits = logits.view(-1, vocab_size) # as 1 list
targets = targets.view(-1) # as 1 list
loss = CrossEntropyLoss(logits, targets) # apply loss fn

For each token:

“How surprised was the model by the correct next token?”

- Predicts common token → low loss
- Predicts rare token → higher loss
- Predicts wrong token → very high loss

Average over all tokens.

How is it implemented:

def loss(logits, correct_class):
    probs = softmax(logits)
    return -log(probs[correct_class])
'''

how to scale neural networks?

Some problems:

1. vanishing gradients

How does a neural network scale through multiple layers and what goes wrong:

[figure: scale]

  • Unfortunately the derivative of sigmoid has very small values
  • If we multiply by sigmoid derivatives repeatedly (every time we go backwards through the network) our gradients get smaller and smaller at each layer -> this is known as the vanishing gradient problem.
  • Information from the loss function propagates back, but the early layers barely get updated since their gradient steps are too small (they vanish)
  • For sigmoid the max derivative value is 0.25 (it occurs at 0: sigmoid’(0) = sigmoid(0) * (1 - sigmoid(0)) = 0.5 * 0.5 = 0.25), so after n layers the signal becomes tiny (~0.000008)
  • The gradient shrinks by at least 75% at every layer, hence early layers don’t really learn anything during backprop.
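You can see the collapse with two lines of arithmetic: even in the best case (derivative exactly 0.25 at every layer), the signal is essentially gone after ~10 layers.

```python
max_sigmoid_grad = 0.25   # sigmoid'(0), the largest the derivative ever gets

signal = 1.0
for layer in range(10):   # backprop through 10 sigmoid layers
    signal *= max_sigmoid_grad

print(signal)  # 0.25**10 ≈ 9.5e-07, effectively zero for the early layers
```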

Solution: use another activation function (ReLU, for example: its derivative is 0 or 1, so there is no multiplicative decay when the neuron is active).

[figure: relu]

BTW ReLU: f(x) = max(0, x). It is continuous everywhere but has a corner at x = 0 (not differentiable there). Modern LLMs tend to use smoother gated activations such as SwiGLU (which combines Swish and GLU) for better training dynamics.

relu code

'''
Simplest non-linearity.
Why do we even need non-linearity? Because if we just pass
f(x) to the next layer we get g(f(x)), and the layer after that gives z(g(f(x))).
If every function in the stack is linear, the composition is just another linear function, and we can't extract any extra expressive power from it.
Linear → Linear → Linear collapses into One big Linear



torch.nn.ReLU: Kill negative values. Let positive values pass

 RELU breaks linearity so networks can:
 -model complex functions
 -learn hierarchies
- represent interactions

ReLU(x)=max(0,x)
No parameters, no learning, just adds non-linearity. 

'''

import torch

def main():
    x = torch.tensor([[-2.0, -0.5, 0.0, 1.5, 3.0]])

    relu = torch.nn.ReLU()

    y = relu(x)

    print(y) # tensor([[0.0000, 0.0000, 0.0000, 1.5000, 3.0000]])

main()

'''
Running through the output
Input: any shape
Output: same shape
Internally:
if x < 0:
    x = 0

If we backprop, d/dx(RELU(x)) = 1 if x > 0 else 0 (positive neurons -> learn normally, negative neurons -> zero gradient)

In LLMs a variant of Relu called GELU() is used which is smoother, nonzero gradient everywhere and better for language.

You can look up graphs of RELU and GELU and instantly tell why it is better.
'''

2. overfitting

  • Memorises the training set; really bad on new data (happens when the network is too big)
  • Solution: regularise the model using Dropout (drop some of the neurons randomly)
  • Randomly zero out some of the neurons/activations during training (the drop probability p is a hyperparameter)
  • In classic dropout, you adjust at testing time: multiply the weights by (1-p) to compensate for the neurons you removed during training. (PyTorch instead uses “inverted” dropout: it scales by 1/(1-p) during training so that testing needs no adjustment.)

Solution: Dropout

'''
During training, randomly zeroes some of the elements of the input tensor with probability p.
We do this to reduce overfitting: the network can't rely on any single neuron being present every time. It also makes training faster.

To compensate for the neurons we dropped, we also scale the surviving outputs by a factor of 1/(1-p) during training.
This means during evaluation the module simply computes an identity function (y = x); we don't have to rescale at test time because we already did it during training.

Parameters:
p (float) - probability of an element to be zeroed. Default: 0.5
inplace (bool) - If set to True, will do this operation in-place. Default: False

Writing this mathematically as a function
y = 0 with probability p and xi / (1-p) with probability (1-p) # scaling (So the expected value stays the same)

By default p = 0.5 hence scale factor will be 1 / (1-0.5) = 2:
- So we zero out randomly 50% of the neurons / inputs during training and with the remaining neurons / inputs we multiply them by 2. Makes sense.

'''

import torch
def main():
    input_data = [[1., 2., 3.],
                 [4., 5., 6.]]   # shape: (rows/batch=2, columns/features=3) 
    input_data_tensor = torch.Tensor(input_data)
    dropout = torch.nn.Dropout(p=0.5)

    after_dropout = dropout(input_data_tensor)

    print(f"Shapes of input and output remain the same: {input_data_tensor.shape}, {after_dropout.shape}")
    print(f"Input : {input_data}")
    print(f"Output : {after_dropout}")


main()

'''
Running through the output:
Shapes of input and output remain the same: torch.Size([2, 3]), torch.Size([2, 3])


Input : [
            [1.0, 2.0, 3.0], 
            [4.0, 5.0, 6.0]
        ]
Output : tensor(
        [
            [ 0.,  4.,  0.],
            [ 0.,  0., 12.]
        ])
50% of the inputs have been zeroed out randomly (running this multiple times will lead to different results (different inputs being zeroed out as shown below)) and non-zeroed out inputs are scaled by 2.


After running it again (different results):

Output : tensor(
        [[ 0.,  0.,  0.],
        [ 8., 10., 12.]])

Training mode (model.train())
- Random masking
- Scaling applied

Eval mode (model.eval())
- Dropout is DISABLED
- Output = input (identity)

In Transformers, dropout is used in:
- Attention weights
- FFN hidden layers
- Residual connections

But NOT usually on:
- Token embeddings (or very lightly)
- Final logits

Modern large LLMs often reduce or remove dropout
because massive data already regularizes well

Where is it placed while training a neural network?
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
     # HERE
    torch.nn.Dropout(p=0.1),
    torch.nn.Linear(256, 10)
)

'''

how to make these optimisers better?

What’s actually wrong with gradient descent / SGD?

  • w = w - learning_rate * gradient for every batch we train on
  • we take 1 step of size learning_rate in direction opposite of gradient.

[figures: problem_with_gd, sgd_bad]

We want a way smoother contour plot, direct path to minimum loss unlike the image above.

local minima

The classic fear: getting stuck in a local minimum, some shallow valley that isn’t the actual lowest point.

Local Minimum: A point where the function value is lower than at all neighboring points.

  • You’re at the bottom of a bowl. Every direction goes UP.
  • A close cousin is the saddle point: gradient = 0, but it’s a minimum in some directions and a maximum in others.
  • Either way: θ_new = θ_old - α * 0 = θ_old (the gradient is 0, the parameters don’t move, the model thinks it has converged)

How to make it better? ADAM.

2 key ideas:

  • Momentum : give it a little push in the direction it has been taking steps in:
    • Fixes the plateau problem
    • Fixes (shallow) local minima
  • Keep a moving average of the square of the partial derivatives
    • This estimates the ‘second moment’ of each partial derivative; it tells us about its variance, so we can normalise each step by the standard deviation -> similar-size steps -> stabilised training (less zig-zag)

[figure: adam]

Since vanilla gradient descent gets stuck, we use techniques that add “momentum” or “noise”

  • Momentum
    • Like a ball rolling downhill. If it enters a flat saddle region, its existing velocity carries it across the plateau.

    • $ v_t = \beta v_{t-1} + \eta g_t $ (the velocity accumulates the gradients)

  • Adam Optimizer combines momentum with adaptive learning rates.
  • Uses both first moment (velocity) and second moment (variance) of gradients. The default choice for most deep learning.

Every optimization problem: you have a loss function (landscape of mountains and valleys) and you want to find the lowest point.

What does your landscape look like?

  • Bowl (convex): exactly one lowest point. No matter where you start, you’ll get there. Life is good.

  • Egg crate (non-convex): countless dips and valleys. You might roll into a shallow puddle and get stuck, never finding the actual minimum.

Momentum: treat it like a heavy ball rolling downhill. It has inertia. Once moving, it keeps going in the same direction even if the local slope changes.

  • Velocity v: the ball moves based on accumulated velocity, not just current gradient. Gradient is just a force that nudges the velocity.
  • Friction beta: without it, the ball oscillates forever. Need some decay so it actually stops at the bottom.

So instead of updating weights directly with the gradient, we update a velocity vector:

\[v_{t+1} = \beta v_t + \eta \nabla L(\theta_t)\] \[\theta_{t+1} = \theta_t - v_{t+1}\]
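Those two equations on a 1-D bowl, as a sketch (the toy loss L(θ) = θ² and the values of β and η are assumptions for illustration):

```python
# momentum on L(theta) = theta**2, whose gradient is 2*theta
theta, velocity = 5.0, 0.0
beta, lr = 0.9, 0.05

for step in range(200):
    grad = 2 * theta
    velocity = beta * velocity + lr * grad  # v_{t+1} = beta*v_t + eta*grad
    theta = theta - velocity                # move by the velocity, not the raw gradient

print(theta)  # ≈ 0, the minimum
```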

another problem: learning rates

Momentum uses a single learning rate for all parameters. But neural networks have millions of parameters, and they’re not equal:

  • Parameters connected to frequent features (common words, bright pixels) get big, stable gradients.
  • Parameters connected to rare features (unusual words, edge cases) get small, noisy gradients.

Adaptive optimizers give each parameter its own learning rate based on gradient history. Rare features can finally catch up.

Example: training word embeddings. “the” appears millions of times - massive, stable gradient. “serendipity” appears twice - tiny, unreliable gradient.

  • If LR is Large “the” overshoots and oscillates. “serendipity” finally learns something.
  • If LR is Small “the” converges nicely. “serendipity” barely moves in a lifetime of training.

The fix: divide the learning rate by the magnitude of recent gradients. Big gradients get a small effective LR; small gradients get a large effective LR. Playing field leveled.

Adam (Adaptive Moment Estimation)

the final boss of optimisation. combines momentum + rmsprop

\[\textbf{First Moment (Mean):} \quad m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\] \[\textbf{Second Moment (Variance):} \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\] \[\textbf{Bias Correction:} \quad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\] \[\textbf{Adam Update:} \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t\]
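A single-parameter sketch of these updates (default β₁, β₂, ε; the toy loss θ² and the learning rate are assumptions for illustration):

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment: mean of gradients
    v = b2 * v + (1 - b2) * grad ** 2   # second moment: mean of squared gradients
    m_hat = m / (1 - b1 ** t)           # bias correction (t counts from 1)
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    grad = 2 * theta                    # gradient of L(theta) = theta**2
    theta, m, v = adam_step(theta, grad, m, v, t)

print(theta)  # close to 0, the minimum
```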

RMSprop (Root Mean Square Propagation)

Exponential moving average of squared gradients. The accumulator forgets ancient history.

\[\textbf{Moving Average:} E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2\] \[\textbf{Parameter Update:} \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot g_t\]

[figure: lr]

batch normalisation

TLDR: it just speeds up + stabilises training using mean and variance, nothing fancy here.

Before batch norm, deep networks were a pain to train. Gradients would vanish or explode. BN fixes this by normalizing inputs to each layer so training doesn’t blow up.

\[\textbf{Batch Normalization Transform}\]
\[\textbf{Step 1: Compute batch statistics:} \quad \mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2\]
\[\textbf{Step 2: Normalize:} \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \text{where } \epsilon \text{ (typically } 10^{-5} \text{) prevents division by zero}\]
\[\textbf{Step 3: Scale and shift (learnable):} \quad y_i = \gamma \hat{x}_i + \beta \quad \text{where } \gamma \text{ and } \beta \text{ are learnable parameters}\]

batch and layernorm code

'''
class torch.nn.BatchNorm1d(num_features, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True, device=None, dtype=None)
Applies Batch Normalization over a 2D or 3D input.

Why even apply batch normalization, more importantly what even is batch normalisation
More importantly : What is normalisation

It ensures that the outputs of each batch we send (as inputs) have a consistent scale and distribution as data flows through the network.

In simpler words -> stabilizes and accelerates the training of deep neural networks by normalising features

For each feature dimension, independently: it normalises the input x by subtracting the batch mean and dividing by the batch standard deviation, which makes sense -> inputs are normalised and more stable with respect to the entire batch.

x_new = (x - mean_batch) / sqrt(variance + eps)
Then it applies the affine parameters:
y = gamma * x_new + beta (here gamma (scale) and beta (shift) are learned parameters via training)

What it expects:
input: fully connected layers of shape (batch_size, num_features)
Normalization is done over batch dimension.
'''

import torch
def main():
    x = torch.tensor([[1., 2., 3.],
                      [4., 5., 6.]]) # here number of features = columns = 3 (think of features as age, name, job in a dataset)
    
    bn = torch.nn.BatchNorm1d(num_features=3)

    y = bn(x)
    print(f"Input: {x}")
    print(f"Output: {y}")
    # shapes are preserved (both have (2,3))
    print(bn.weight)  # γ (gamma)
    print(bn.bias)    # β (beta)


main()

'''
Running through the output:
Input: tensor(
        [[1., 2., 3.],
        [4., 5., 6.] ])


Output: tensor(
        [[-1.0000, -1.0000, -1.0000],
        [ 1.0000,  1.0000,  1.0000]]
        , grad_fn=<NativeBatchNormBackward0>)

Output is normalised.

You can see the bn.weight (gamma) and bn.bias(beta) {the learnable parameters}
Gamma: tensor([1., 1., 1.], requires_grad=True)
Beta: tensor([0., 0., 0.], requires_grad=True)

Internally:
x =
[[1, 2, 3],
 [4, 5, 6]]

mean = [2.5, 3.5, 4.5]
var  = [2.25, 2.25, 2.25]

normalise = (x - mean) / sqrt(var + eps)
scale and shift = γ * normalised + β (initially γ = 1 and β = 0, so the output is just the normalised values)

During training
- Uses current batch mean & variance
- Updates running_mean and running_var

During evaluation
- Uses running_mean and running_var

With BN → distributions stay centered & scaled
Where is it placed? Linear → BatchNorm → ReLU
Benefits: - Faster convergence and Smoother optimization landscape

BatchNorm is not used in Transformers since it breaks with variable sequence lengths.
Instead, Transformers use LayerNormalisation, which normalises across the feature dimension of each individual sample instead of across the batch.

Why does batchnorm break in transformers?
Transformers operate under these conditions:
- Variable sequence lengths
- Autoregressive decoding (token by token)
- Small or batch-size = 1 at inference (one token is sent at a time)
step 1 → token 1
step 2 → token 2
step 3 → token 3
Each step:
- Batch size = 1
- Sequence length grows
- BatchNorm cannot compute meaningful statistics here.
Since batchnorm computes the mean over the batch, but in transformers the batch at each decoding step is 1, the statistics are meaningless and normalisation becomes unstable.



Order matters per token, not across batch
To fix this we use LayerNorm

'''
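The breakage is easy to demonstrate. A minimal sketch (assuming the module is in its default training mode): BatchNorm1d cannot compute batch statistics from a single sample and raises an error, while LayerNorm is unaffected because its statistics come from the feature dimension:

```python
import torch

bn = torch.nn.BatchNorm1d(num_features=3)   # training mode by default
ln = torch.nn.LayerNorm(normalized_shape=3)

x = torch.randn(1, 3)  # batch size = 1, like autoregressive decoding

try:
    bn(x)  # variance over a single sample is undefined in training mode
except ValueError as e:
    print("BatchNorm failed:", e)

print("LayerNorm output:", ln(x))  # works: stats come from the feature dim
```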

# How batchnorm is implemented internally (simplified sketch)
import torch

class BatchNorm1d:
    def __init__(self, num_features, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.gamma = torch.ones(num_features)   # learnable scale (γ)
        self.beta = torch.zeros(num_features)   # learnable shift (β)
        self.running_mean = torch.zeros(num_features)
        self.running_var = torch.ones(num_features)

    def forward(self, x, training=True):
        if training:
            mean = x.mean(dim=0)
            var = x.var(dim=0, unbiased=False)
            # update running stats with an exponential moving average
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean.detach()
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var.detach()
        else:
            mean = self.running_mean
            var = self.running_var

        x_new = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_new + self.beta

layernorm

'''
LayerNorm: Mean over feature dimension 
This makes normalisation independent of batch size which is what we want.

So normalisation happens per sample
input: (batch_size, seq_len, hidden_dim) and LayerNorm normalizes over hidden_dim (the last dimension) for each token independently

class torch.nn.LayerNorm(normalized_shape, eps=1e-05, elementwise_affine=True, bias=True, device=None, dtype=None)

'''
import torch

def main():
    x = torch.tensor([
        [[1., 2., 3.],
         [4., 5., 6.]],
        
        [[7., 8., 9.],
         [10., 11., 12.]]
    ])  # shape: (2, 2, 3) -> (batch, seq_len, features): 2 samples (batch size = 2, but LayerNorm doesn't rely on this), each with 2 rows and 3 columns
    # layer normalization is applied to each input sequence individually: batch normalization is tricky to apply to sequence models (like transformers) where each input sequence can be a different length
    # layer norm will apply over every row (token vector) of both matrices (the same happens in transformers)
    layernorm = torch.nn.LayerNorm(normalized_shape=3) # features = 3
    y = layernorm(x)

    print("Input shape:", x.shape)
    print("Output shape:", y.shape) # shape is same as input, no batch interaction
    print("Output:\n", y)

main()

'''
Running through the output:
Input shape: torch.Size([2, 2, 3])
Output shape: torch.Size([2, 2, 3])


Output:
 tensor([[[-1.2247,  0.0000,  1.2247],
         [-1.2247,  0.0000,  1.2247]],

        [[-1.2247,  0.0000,  1.2247],
         [-1.2247,  0.0000,  1.2247]]], 
         
         grad_fn=<NativeLayerNormBackward0>)

Both matrices are normalised.

The Vector: For an input matrix where each row is a token (e.g., "The," "cat," "sat"), LayerNorm looks at the row of feature activations for just that one token.
The Calculation: It calculates the mean and variance using only the values within that single token's feature vector.
Independence: The normalization of the word "cat" does not use any information from the word "sat" or "the"

How it works?
For a single token vector -> [4, 5, 6], mean = 5, var = 2/3 
Normalise: [-1.22, 0.0, 1.22] (then scale + shift using learnable parameters (gamma, beta) same as BatchNorm)

Step 1: Pick one sentence in the batch (Sequence Independence).
Step 2: Pick one token (word) in that sentence (Token Independence).
Step 3: Calculate the mean and variance across all features (e.g., all 512 embedding values) for that specific token.
Step 4: Normalize that token's vector using its own statistics.

LayerNorm behaves IDENTICALLY in model.train() and model.eval()

Where does this happen?

Post-Normalization (Post-LN): In the original "Attention Is All You Need" paper, 
LayerNorm was placed after the residual connection of each sub-layer.
This configuration worked but required careful hyperparameter tuning and a specific "warm-up" phase for the learning rate to ensure stability.
x → Sublayer → Add → LayerNorm


Pre-Normalization (Pre-LN): Modern, deeper transformers typically use the Pre-LN setup, where normalization is applied before the attention and feed-forward sub-layers, inside the residual path. 
This approach results in much stabler gradients, faster and more reliable training (often without the need for learning rate warm-up), and allows for the creation of extremely deep models. 
x → LayerNorm → Sublayer → Add

Because every token is normalized using its own unique statistics, the model is not confused by variable sequence lengths or padding.

TLDR: it just speeds up + stabilises training using mean and variance, nothing fancy here.
'''

# How is LayerNorm implemented internally (simplified sketch)
import torch

class LayerNorm:
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # learnable scale (γ)
        self.beta = torch.zeros(dim)   # learnable shift (β)

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_new = (x - mean) / torch.sqrt(var + self.eps)
        return self.gamma * x_new + self.beta
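The worked example ([4, 5, 6] → mean 5, var 2/3) can be checked numerically against nn.LayerNorm. A small sketch (note: torch.var defaults to the unbiased estimator, while LayerNorm uses the population variance, hence unbiased=False):

```python
import torch

x = torch.tensor([4., 5., 6.])

# manual computation, matching the worked example above
mean = x.mean()               # 5.0
var = x.var(unbiased=False)   # population variance = 2/3
manual = (x - mean) / torch.sqrt(var + 1e-5)

ln = torch.nn.LayerNorm(normalized_shape=3)
auto = ln(x)  # gamma = 1, beta = 0 at init, so this is pure normalization

print(manual)  # ≈ [-1.2247, 0.0000, 1.2247]
print(torch.allclose(manual, auto, atol=1e-4))  # True
```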

resnets ~ residual networks

Suppose you have an image; you need to classify it. How to?

Answer -> CNN / backpropagation for multiple layers.

Problem: Vanishing Gradients because of so many layers

\[\text{output}_{\text{fn}} = f_{50}\!\left( f_{49}\!\left( \cdots f_2\!\left( f_1(\text{input}) \right) \cdots \right) \right)\]

Question: What goes wrong?

  • In Theory; deeper = more expressive, more it learns (complex features)
  • In Practice; earlier layers stop learning

You have a long call stack -> gradients computed at the output never reach the early layers properly.

These gradients vanish over time, so the first layers get no useful gradient signal and stop learning (we have discussed this before).

  • Instead of every layer being responsible for producing entire transformed output; we do
    • output = input (original data) + delta (small improvement computed by current block)

Resnet blocks don’t rewrite the whole representation; they apply patches on top of what already exists.

Residual/Skip connections make it easier for network to learn, instead of direct mappings from input x -> output y; resnet learns the residual (difference) between input & output.


def residual_block(x):
    return x + F(x) # input to block + output of a series of layers
# The addition of x (the skip connection) lets the network skip layers during training, making it easier to optimise

  • If a layer is not useful, the network can simply learn F(x) = 0 and the output becomes y = x, the original input (effectively skipping the layer)

After activation:

y = ReLU(F(x) + x)

intuition

  • You have a low-resolution image and you want your network to output the corresponding high-resolution image
    • low-res image = input and high-res image = output
    • output - input = RESIDUAL (the missing piece we need)

Now we can just do input + residual = output (the block doesn’t need to reproduce the entire signal; it learns only the bit we care about)

Why does this work so well??

  • Speeds up backpropagation + modularity
  • Since the input is passed along and available to later layers, a layer’s job is no longer to figure out everything important about the input that must be passed forward, but rather to figure out WHAT INFO it can add ON TOP of the input
  • The block doesn’t have to start by figuring out what information the input contains; it starts by passing the whole input along
  • Every layer in the network has a short path through which gradients flow (each block has one path that goes AROUND and one path that goes THROUGH)
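The ideas above can be written as a minimal PyTorch block. A simplified sketch (real ResNets use convolutions, batch norm, and dimension-matching projections; the two-Linear F(x) here is illustrative):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # F(x): the "delta" the block learns on top of the input
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        # skip connection: output = input + delta
        return torch.relu(x + self.f(x))

x = torch.randn(4, 16)        # batch of 4 vectors
block = ResidualBlock(16)
print(block(x).shape)         # torch.Size([4, 16])
```

If F(x) learns to output zeros, the block reduces to the identity, which is exactly the "skip the layer" behaviour described above.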


transformers

neural network + attention = transformer

  • Idea: self attention encoder / decoder blocks.

Input

  • All word embeddings of a document concatenated into a matrix (every row = embedding of different word)
    • What are word embeddings? Solves the problem of how to give text data to a Neural Network
      • What we want:
        • Vector representation of words
        • Vectors reasonable sized
        • Words have similar representations (King and Queen nearby) iff semantically related

Output

  • Encoding of document that can be used for various tasks (classification (cat/dog), translation (decoder original attention paper), predicting missing word (ChatGPT))

code for embeddings

'''
Learnable lookup table
Map an integer ID → a vector.

nn.Embedding(num_embeddings, embedding_dim)

it stores a matrix: (weight): (num_embeddings, embedding_dim) (Each row is the vector representation of one discrete symbol)

Neural Networks can't operate on integers directly. 
"hello" → 10321
"world" → 45

These words have no meaning geometrically.

Embedding turns token_id → continuous vector

'''

import torch

def main():
    embedding = torch.nn.Embedding(num_embeddings=10, embedding_dim=4)

    token_ids = torch.tensor([1, 3, 7]) # [1, 3, 7]      → shape (3,) input
    # lookup → add embedding_dim


    vectors = embedding(token_ids)

    # (3 tokens, 4 features)

    print(vectors.shape) # torch.Size([3, 4]) {Embedding adds one dimension at the end.}
    print(vectors) 
    '''
    tensor([[ 0.7213, -0.7668,  0.2204, -1.0903],
        [-1.6225,  0.6308, -1.6110, -0.7528],
        [-1.1358, -2.2639, -0.7726, -1.6766]], grad_fn=<EmbeddingBackward0>)
    '''

main()


'''
Running through the output

Each row of the embedding matrix is a trainable parameter.

During backprop:
- Only rows used in the batch get gradients
- Other rows are untouched

How is it used in LLMs
Token embedding:
token_embed = nn.Embedding(vocab_size, d_model)

Input:(batch, seq_len)
Output:(batch, seq_len, d_model)

This is the very first layer of an LLM.

Even for positional encoding (GPT2, BERT):

pos_embed = nn.Embedding(max_len, d_model)


How it works internally:

class Embedding:
    def __init__(self):
        self.table = random_matrix()

    def forward(self, ids):
        return self.table[ids]

How info flows:
Text
↓
Tokenizer
↓
Token IDs (integers)
↓
nn.Embedding
↓
Vectors

Let's say your sentence is ["hello", "my", "name", "is"]
Which is mapped to numbers:
{
  "hello": 1532,
  "my": 212,
  "name": 784,
  "is": 318
}

So token_ids = [1532, 212, 784, 318]

Now embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
(a real vocab would need num_embeddings large enough to cover IDs like 1532; we keep 10 here to match the toy code above)

This creates a table embedding.weight.shape = (10, 4)
ID → vector
0  → [ ... ]
1  → [ ... ]
2  → [ ... ]
...
9  → [ ... ]

hello → vector
my    → vector
name  → vector
is    → vector


Just makes it into a vector

vocab = {
  token_id: learned_vector
}

sentence = [id1, id2, id3]

vectors = [vocab[id] for id in sentence]
'''

Anyway, back to transformers.

encoder-decoder

Attention goal: for each word, how much should it ‘pay attention’ to the other words.

ALGORITHM for Attention

  • Given input = X; find Query, Key, Value (Query: what we need to find, Key: What we have, Value: Answer we are looking for)
  • Do dot product similarity (Q @ K.T) and scale by sqrt(d_k)
  • Softmax
  • Multiply with V



causal masking

  • Sentence so far: “The quick brown fox”. Model wants to predict “jumps”. It can only look at “The”, “quick”, “brown”, “fox”. Can’t see “jumps” or anything after it. That’s autoregressive generation.

  • During training though, the model sees the entire sentence at once (“The quick brown fox jumps over the dog”). Goal is to predict each word from only the previous ones.

Without masking: Self-attention lets every word look at every other word. So when learning to predict “jumps” at position 5, the model can just peek at positions 6, 7, 8 (“over”, “the”, “dog”). That’s cheating. Model learns nothing about actual causal relationships.

With masking: Token at position t can only attend to positions 0 through t. Future tokens are blocked.


Each token can only attend to itself and previous tokens. Future positions are masked with -inf before softmax.
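A minimal sketch of just the masking step, showing that softmax assigns exactly zero weight to the masked (future) positions:

```python
import torch

seq_len = 4  # "The quick brown fox"
scores = torch.randn(seq_len, seq_len)  # raw attention scores

# lower-triangular mask: position i may attend only to positions <= i
mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
scores = scores.masked_fill(~mask, float("-inf"))

weights = torch.softmax(scores, dim=-1)
print(weights)  # upper triangle is exactly 0: no attention to the future
```

Row 0 can only attend to itself (weight 1.0); each later row spreads its weights over itself and earlier tokens only.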

There are much better blogs out there that teach you transformers and how they work in depth; this is just the basic math required for them.

self attention code

'''
output = attention(Q, K, V)

PyTorch’s fused / optimized implementation of the classic attention core:
compute softmax((Q @ K.T) / sqrt(d_k)) @ V with optional masking, dropout, and a few performance features (GQA, fused kernels)

How it works, given tensors Q, K, V:
'''
import torch
import torch.nn.functional as F

def main():
    Q = torch.randn(2, 3, 4)  # (batch, tokens(rows), dimensions(columns))
    K = torch.randn(2, 3, 4)
    V = torch.randn(2, 3, 4)

    out = F.scaled_dot_product_attention(Q, K, V)

    print(f"Output shape: {out.shape}")
    print(f"Output : {out}")
          
main()

'''
Running through the output
Output shape: torch.Size([2, 3, 4]) # same as (Q,K,V) (2 batches, 3 rows, 4 columns)
Output : tensor([
        [
            [-1.1798,  0.0962, -0.4744,  0.3941],
            [ 0.1781,  0.2997, -0.0952, -0.4089],
            [-1.3043,  0.0902, -0.5171,  0.4701]],

            [[-0.2180,  0.4441,  1.9291, -1.0765],
            [ 0.2576,  0.2707,  0.9700,  0.7783],
            [ 0.0606,  0.2904,  1.4156, -0.0605]]])
'''

'''
Can also use masking (token t can only attend to tokens ≤ t)
- Block future tokens (causal) 
- Masked positions get -inf.

Can also set dropout (at training):

out = F.scaled_dot_product_attention(
    q, k, v,
    attn_mask=None,
    dropout_p=(0.1 if model.training else 0.0),
    is_causal=True
)

LLMs do:

Linear → Q, K, V
↓
scaled_dot_product_attention
↓
Linear

'''
# How is this implemented:
import torch
import math
import torch.nn.functional as F

def scaled_dot_product_attention( Q, K, V, attn_mask=None, dropout_p=0.0, is_causal=False, training=True):
    # shape: Q, K, V: (batch, heads, seq_len, head_dim)

    B, H, seq_len, head_dim = Q.shape
    _, _, S, _ = K.shape

    # dot product -> similarity score
    # (B, H, L, D) @ (B, H, D, S) → (B, H, L, S)
    scores = torch.matmul(Q, K.transpose(-2, -1))

    # scale by sqrt(head_dim) to keep the variance of the scores stable
    scores = scores / math.sqrt(head_dim)

    # causal mask apply using torch.tril -> For each query position i; token i can attend to tokens j where j ≤ i
    '''
    Visually
    i\j  0  1  2  3  4
    0   1  0  0  0  0
    1   1  1  0  0  0
    2   1  1  1  0  0
    3   1  1  1  1  0
    4   1  1  1  1  1
    ❌ Look ahead
    ✅ Look backward (and self)
    '''
    if is_causal:
        causal_mask = torch.tril( torch.ones(seq_len, S, device=Q.device, dtype=torch.bool))
        scores = scores.masked_fill(~causal_mask, float("-inf")) # fill with -inf

    # attention mask: Which tokens are valid to attend to for this input?
    '''
    Input:  [Hello, world, <PAD>, <PAD>]
    Mask:   [1,     1,      0,     0]
    '''
    if attn_mask is not None:
        if attn_mask.dtype == torch.bool:
            scores = scores.masked_fill(~attn_mask, float("-inf"))
        else:
            scores = scores + attn_mask # additive mask (e.g. 0 = keep, -inf = block)

    # softmax
    attn_weights = F.softmax(scores, dim=-1)

    # dropout
    if training and dropout_p > 0.0:
        attn_weights = F.dropout(attn_weights, p=dropout_p)

    # Weighted sum of values
    # (B, H, L, S) @ (B, H, S, D) → (B, H, L, D)
    output = torch.matmul(attn_weights, V)

    return output


'''
This is the heart of attention used in Transformers; this is how similarity / how much each word should 'pay attention' to other words is computed.
Word embeddings -> similar words are together (king and queen are close in vector space) -> attention computes similarity via the dot product of Q and K (the V vectors are then mixed according to those scores)
If the dot product is large -> the vectors point in the same direction -> similar; otherwise they aren't.

Q, K, V are projections of the input -> Query, Key, Value

Query (Q): "What am I looking for?" - each token broadcasts what info it needs.
Key (K): "What do I have to offer?" - each token advertises what info it contains.
Value (V): "Here's my actual content." - the payload that gets mixed in based on Q*K scores.

Q dot K tells you how relevant two tokens are to each other. High score = pay more attention to that token's V.

Example: "The cat sat on the mat because it was tired"
When processing "it" - Q for "it" gets compared against K for every other word.
K for "cat" scores high (animal, can be tired), K for "mat" scores low.
So "it" pulls in mostly the V of "cat" and barely any V of "mat".
Result: the model figures out "it" refers to "cat".

'''

# Another simpler implementation
# attention.py
import torch, math
def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q,K,V: (batch, heads, seq_len, d_k)
    scores = Q @ K.transpose(-2,-1) / math.sqrt(Q.size(-1))
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

# tiny demo
B, H, S, d = 2, 1, 4, 8
Q = torch.randn(B,H,S,d); K = torch.randn(B,H,S,d); V = torch.randn(B,H,S,d)
out, w = scaled_dot_product_attention(Q,K,V)
print(out.shape, w.shape)

multi-head attention code

'''
class torch.nn.MultiheadAttention(embed_dim, num_heads, dropout=0.0, bias=True, add_bias_kv=False, add_zero_attn=False, kdim=None, vdim=None, batch_first=False, device=None, dtype=None)

We know attention = softmax((Q @ K.T) /  sqrt(d_model)) @ V

Let's take an example to see how this works.

Input -> "I am good"
3 words -> one vector per token; let's take the size of each token vector = 2

[
[0.2 0.1] -> for "I"
[-0.9 0.4] -> for "am"
[0.7 0.8] -> for "good"
] dimension = [3,2] # 3 rows, 2 columns (seq_len, d_model)

so d_model = 2 (each word is represented by a 2-dimensional vector)
and seq_len = 3 (there are 3 words in our sentence)

- Compute Key, Query, Values
PyTorch initialises random
weights of shape (3 * d_model, d_model) -> three stacked (d_model, d_model) blocks = [6,2]
and biases of shape (3 * d_model, 1) -> 3 * 2 = [6,1]

Visually
W = first two dots are for W_k, second two for W_q, last two for W_v (each has dimn 2,2 or d_model, d_model)
[
    . . 
    . .

    . .
    . .

    . .
    . .
]

B = first two for Bias_k, next two for Bias_q, last two for Bias_v (each has dimn 2,1)
[
    .
    .

    .
    .

    .
    .
]

Now we can compute K, Q, V (we take transpose so shapes match)
K = Input (3,2) * W_k (2,2).T + B_k (2,1).T = [3,2] + [1,2] = [3,2] is the shape for K
Q = Input * W_q.T + B_q.T = [3,2] shape of Q
V = Input * W_v.T + B_v.T = [3,2] shape of V

Now calculating attention
Q @ K.T = [3,2] * [2,3] = [3,3]; dividing by sqrt(d_model) and applying softmax -> dimensions don't change.

Our Q @ K.T / sqrt(d_model)

[
. . .
. . .
. . .
]

After softmax, all scores are between 0 -> 1, this is called scores

      I   am  good
I     0.1 0.3 0.6
am    0.7 0.1 0.2
good  0.5 0.4 0.1


This is our softmax(Q @ K.T / sqrt(d_model)) matrix of attention scores, which tells how much attention each word needs to pay TO each other word.

now out = scores [3,3] @ V [3,2] = [3,2] matrix of output. This isn't the final output.

PyTorch creates out_w = [d_model, d_model] and out_b = [d_model, 1]

final_out = out (3, 2) * out_w (2, 2).T + out_b (2, 1).T
          = (3, 2) + (1, 2)
          = (3, 2) is the final output of this layer -> which goes for further processing (normalise etc..)

This whole computation is one head (head1). MHA just does concat(head_1, head_2, ..., head_h) followed by the output projection.

Each head does this in parallel; embed_dim is split evenly across the heads.

torch.nn.MultiheadAttention does:
(“Which tokens matter in different ways at the same time?”)
- Projects inputs → Q, K, V
- Runs attention per head
- Concatenates heads + output projection

'''

import torch
def main():
    x = torch.randn(3, 2, 2) # (seq_len, batch, d_model)
    mha = torch.nn.MultiheadAttention(embed_dim=2, num_heads=2)
    out, attn_weights = mha(x, x, x)

    print(f"Output shape: {out.shape}") # final transformed embeddings
    print(f"Attention Weights: {attn_weights.shape}") # attention matrix (for debugging, LLMs do not use attention weights during inference)

main()

'''
Running through the output:
Output shape: torch.Size([3, 2, 2]) # (seq_len, batch, embed_dim); each batch element is the (3, 2) we predicted!!
Attention Weights: torch.Size([2, 3, 3])


Revising process:

Q = X @ W_q
K = X @ W_k
V = X @ W_v

Q → (seq_len, batch, H, D)
K → (seq_len, batch, H, D)
V → (seq_len, batch, H, D)

softmax(Q_h K_hᵀ / sqrt(D)) @ V_h

Concat(head_1, ..., head_H)
→ shape: (seq_len, batch, E)

out = concat @ W_o

Causal Mask in MHA:
attn_mask = torch.triu(torch.ones(L, L), diagonal=1)
attn_mask = attn_mask.bool()
key_padding_mask = (tokens == PAD_ID)


attn_mask → blocks (i → j)
key_padding_mask → blocks tokens entirely
'''

# How it works inside the hood
'''
class MultiHeadAttention:
    def forward(self, x):
        Q, K, V = linear(x)
        split heads
        for each head:
            head = attention(Q, K, V)
        concat heads
        return output_projection
'''

positional encoding

'''
Self-attention processes all tokens in parallel - it has no idea which word came first. "The cat chased the dog" and "The dog chased the cat" would look identical without position info. So we add positional encodings to each word's embedding to tell the model where each token sits in the sequence.

Example:

Let's take the words good boy
good position = 0, boy position = 1
and d_model = 4

Positional Encoding has dimension -> [d_model, 1]

Formulas:
PE(p, 2i)   = sin(p / 10000^(2i/d))
PE(p, 2i+1) = cos(p / 10000^(2i/d))

p is the position, d = d_model, and i indexes the embedding dimensions (a list of length d_model)
i
[0
 1
 2
 3]

i / 2 = [
        0
        0
        1
        1
        ]

Final position encoding for the word 'good' (p = 0):
dim 0 (even, i = 0) -> sin(0 / 10000^(0/4)) = 0
dim 1 (odd,  i = 0) -> cos(0 / 10000^(0/4)) = 1

same for all dimensions, and the same process for the word 'boy' (p = 1)

Each word gets its own (d_model, 1) positional vector; stacking them gives the (seq_len, d_model) positional encoding matrix.

After this you can add this position embedding to your original input vector tokens and pass it to attention:
embeddings = token_embeddings + sinusoidal_pos_enc(seq_len, d_model)[:seq_len]
'''
import torch, math

def sinusoidal_pos_enc(seq_len, d_model):
    pos = torch.arange(seq_len).unsqueeze(1)
    i = torch.arange(d_model).unsqueeze(0)
    angle = pos / (10000 ** (2*(i//2)/d_model))
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(angle[:, 0::2])
    enc[:, 1::2] = torch.cos(angle[:, 1::2])
    return enc  # (seq_len, d_model)

# Example: Encode positions for a sequence
seq_len = 10      # 10 tokens in sequence
d_model = 512     # Embedding dimension

pos_enc = sinusoidal_pos_enc(seq_len, d_model)
print(pos_enc.shape)  # torch.Size([10, 512])

Additional Math

linear algebra

Two operations you need to know:

  • Dot product (alignment)
  • Norms (magnitude)

dot product

  • Takes two vectors of equal length, returns a single scalar.
  • Also written as a.T * b (Row vector * Column vector, why transpose? To match shapes)
  • Measures alignment between vectors. Bridges algebra and geometry.

What it means geometrically:

a⋅b = ∣∣a∣∣⋅∣∣b∣∣cos(θ)

if Result:

  • Positive -> Vectors point in roughly the same direction (θ < 90)
  • Negative -> Vectors point in the opposite direction (θ > 90)
  • Zero -> Vectors are orthogonal (perpendicular, θ = 90)


Matrix Multiplication is just batched dot product.

Quick refresher on matrix multiplication:

  • Always row x column. 2x4 = 2 rows, 4 columns.
  • To multiply: columns on the left must equal rows on the right. A is (m x n), B is (n x p), AB works. BA doesn’t (unless p = m).
  • Result shape: rows of left x columns of right. (m x n) * (n x p) = (m x p).
  • Element at row i, column j = dot product of (row i of left) and (column j of right).


Assume we have a row with 3 elements (a1 a2 a3) and a column with 3 elements (b1 b2 b3). (a1 a2 a3) · (b1 b2 b3) = a1b1 + a2b2 + a3b3 = sum of aibi where 1 ≤ i ≤ 3. The same process holds for rows and columns of other sizes.

the i, jth position in the matrix is the dot product of the ith row (transpose) with the jth column.

  • Transformer attention is a scaled dot product: it calculates relevance scores between tokens in the attention mechanism
  • Also used in every neural network layer (y = activation(X_input @ Weights + bias))
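A quick sketch of both operations, and of matrix multiplication as batched dot products:

```python
import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])

# dot product: alignment, a single scalar
print(torch.dot(a, b))       # 1*4 + 2*5 + 3*6 = 32
# norm: magnitude of a vector
print(torch.linalg.norm(a))  # sqrt(1 + 4 + 9) = sqrt(14)

# matrix multiplication = batched dot products
A = torch.randn(2, 4)
B = torch.randn(4, 3)
C = A @ B                    # (2,4) @ (4,3) -> (2,3)
# element (i, j) is the dot product of row i of A with column j of B
print(torch.allclose(C[0, 0], torch.dot(A[0], B[:, 0])))  # True
```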

probability

conditional probability

How does learning new info change what we believe?

Question: What is the chance of A happening given that B has already happened?

Another way to say this is P(A given B)

Solution to this is:

  • P(A given B) = P(intersection(A, B)) / P(B) = Probability that both A and B occur / Probability that B occurs

  • Why divide?
    • Once B has occurred, the universe of possibilities collapses to only outcomes where B is true.
  • Original sample space → all outcomes
  • New sample space → only outcomes consistent with B
  • We renormalize probabilities so they sum to 1 again
  • Every time an LLM generates a token, the model computes a conditional probability distribution over tokens: P(next token given all previous tokens)

  • LLMs don’t “understand” text. They update beliefs about the next token given new evidence (context).
  • They don’t learn whole sentences at once. One conditional distribution at a time.
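A toy sketch of one such conditional distribution (the vocabulary and logits here are made up for illustration): softmax renormalizes the model’s raw scores into probabilities that sum to 1, exactly the renormalization described above.

```python
import torch

vocab = ["jumps", "sleeps", "over", "the"]
logits = torch.tensor([2.0, 0.5, 0.1, -1.0])  # model scores given the context

# P(next token | context): softmax turns scores into a distribution
probs = torch.softmax(logits, dim=-1)
for word, p in zip(vocab, probs.tolist()):
    print(f"P({word!r} | 'The quick brown fox') = {p:.3f}")
print(probs.sum())  # sums to 1: a valid probability distribution
```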

bayes theorem

Just conditional probability viewed from the other side:

P(A given B) = P(B given A) * P(A) / P(B)

  • Prior: P(A) - what you believed before seeing data
  • Likelihood: P(B given A) - how compatible the data is with that belief
  • Posterior: P(A given B) - updated belief after seeing data

Bayes = belief updating, LLMs amortize those updates into weights
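A tiny numeric sketch of the update (the numbers are invented: a condition with 1% prior probability, and a test with a 90% true-positive and 5% false-positive rate):

```python
# Bayes: P(A|B) = P(B|A) * P(A) / P(B)
p_disease = 0.01               # prior P(A)
p_pos_given_disease = 0.90     # likelihood P(B|A)
p_pos_given_healthy = 0.05     # false positive rate

# total probability of the evidence: P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# posterior: updated belief after seeing a positive test
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # 0.154
```

Even with a positive test, the posterior is only ~15% because the prior was so low — seeing evidence updates the belief, it doesn’t replace it.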

additional math code in PyTorch

cosine similarity

  • Scores from -1 (opposite) to 1 (identical). How similar are two vectors?
  • Dot product of vectors divided by product of their magnitudes. That’s just the cosine of the angle between them.
  • Smaller angle = higher similarity
'''
Returns cosine similarity between x1 and x2, computed along dim
dim = dimension where cosine similarity is computed (default: 1)
eps = 1e-8 (small value added to avoid division by 0)

Formula = (x1 · x2) / max(|x1| * |x2|, eps)


How similar are two vectors by direction, not magnitude? (Are these two embeddings semantically similar?)
'''
import torch

def main():
    x = torch.tensor([[1., 0., 0.]])
    y = torch.tensor([[0.9, 0.1, 0.]])

    cos = torch.nn.CosineSimilarity(dim=1)

    sim = cos(x, y)
    print(sim) # tensor([0.9939])

main()

'''
Internal:

dot = (x * y).sum(dim)
norm_x = sqrt((x * x).sum(dim))
norm_y = sqrt((y * y).sum(dim))

cos_sim = dot / (norm_x * norm_y + eps) 

This is used in LLMs to calculate embedding similarity
query_embedding / document_embedding
↓
CosineSimilarity

Also used in re-ranking retrieved documents in RAG:
LLM embedding → cosine sim → top-k → LLM reasoning

Why attention doesn't use cosine similarity -> attention uses the raw dot product Q @ K.T
So magnitude is not normalised away -> magnitude matters (it encodes how strongly one word should attend to another)

def semantic_similarity(x, y):
    return cos(angle_between(x, y))

'''

gradient descent

import torch

# Create parameter θ
theta = torch.tensor(5.0, requires_grad=True) # Enable gradient tracking

# Define hyperparameters
eta = 0.1 # Learning rate
total_steps = 15 # No. of iterations

print("Step 0 (initial θ):", theta.item())
print("Initial loss:", (theta ** 2).item())
print()

# Iterate for total_steps 
for step in range(1, total_steps + 1):
    # Forward pass
    loss = theta ** 2
    
    # Backward pass (calculates d(θ²)/dθ = 2θ and stores in theta.grad)
    loss.backward() 
    
    # Store gradient and loss to print later
    gradient = theta.grad.item()
    loss_value = loss.item()

    # Gradient descent update: θ = θ - η·∇J(θ)
    # Use no_grad() to prevent PyTorch from tracking this parameter update
    with torch.no_grad():
        theta -= eta * theta.grad

    print(f"Step {step}:")
    print(f"loss = {round(loss_value, 3)}")
    print(f"θ = {round(theta.item(), 3)}")
    print(f"gradient = {round(gradient, 3)}")
    print(f"update = {round(eta * gradient, 3)}")

    # Reset gradients for next iteration
    theta.grad.zero_()

print("Final θ:", round(theta.item(), 3))
print("Final loss:", round((theta ** 2).item(), 3))

'''
Output
Step 0 (initial θ): 5.0
Initial loss: 25.0

Step 1:
loss = 25.0
θ = 4.0
gradient = 10.0
update = 1.0

Step 2:
loss = 16.0
θ = 3.2
gradient = 8.0
update = 0.8

Step 3:
loss = 10.24
θ = 2.56
gradient = 6.4
update = 0.64

Step 4:
loss = 6.554
θ = 2.048
gradient = 5.12
update = 0.512

Step 5:
loss = 4.194
θ = 1.638
gradient = 4.096
update = 0.41

Step 6:
loss = 2.684
θ = 1.311
gradient = 3.277
update = 0.328

Step 7:
loss = 1.718
θ = 1.049
gradient = 2.621
update = 0.262

Step 8:
loss = 1.1
θ = 0.839
gradient = 2.097
update = 0.21

Step 9:
loss = 0.704
θ = 0.671
gradient = 1.678
update = 0.168

Step 10:
loss = 0.45
θ = 0.537
gradient = 1.342
update = 0.134

Step 11:
loss = 0.288
θ = 0.429
gradient = 1.074
update = 0.107

Step 12:
loss = 0.184
θ = 0.344
gradient = 0.859
update = 0.086

Step 13:
loss = 0.118
θ = 0.275
gradient = 0.687
update = 0.069

Step 14:
loss = 0.076
θ = 0.22
gradient = 0.55
update = 0.055

Step 15:
loss = 0.048
θ = 0.176
gradient = 0.44
update = 0.044

Final θ: 0.176
Final loss: 0.031
'''

ffn (feed forward network)

# class torch.nn.Linear(in_features, out_features, bias=True, device=None, dtype=None)
# y = x * W.transpose + b
'''
x → input vector (size = in_features)
W → weight matrix (out_features x in_features)
b → bias vector (out_features)
y → output vector (out_features)
'''
import torch

def main():
    input_data = [[1., 2., 3.],
                  [4., 5., 6.]]   # shape: (rows/batch=2, columns/features=3)
    x = torch.tensor(input_data) # convert to tensor
    formula = torch.nn.Linear(in_features=3, out_features=1, bias=True)
    y = formula(x)
    print(f"Weights : {formula.weight} and Bias : {formula.bias}")
    print(f"input shape: {x.shape}") 
    print(f"output shape: {y.shape}") 
    print(f"output of y = mx + c : {y}")
main()

'''
Running through the output

Our input is the matrix:
[
[1, 2, 3],
[4, 5, 6]
]  
Shape: (2 * 3) 2 rows and 3 columns

Our random weights (the W in y = WX + B and Bias) shape will be (1 * 3) -> (out_features, in_features)
Weights: tensor([[0.4468, 0.0444, 0.4144]], requires_grad=True) and 
Bias : tensor([0.3921], requires_grad=True)

Which makes sense since our in_features = 3 (weights) and bias = 1
So shape of our random W matrix = (1x3) and shape of B = (1,) # scalar

Now we need to do y = x·Wᵀ + b (the slope equation y = mx + c, generalised) -> this is exactly what torch.nn.Linear computes
Let's write it in shape form
y = (2 * 3) * (1 * 3) + bias (1,)

Can't matrix multiply like this, since matrix multiply needs (p*q) * (q*r) = (p*r), hence we need to transpose our weight matrix
W.transpose !! so now shape of W becomes (3 * 1) and y = (2 * 3) * (3 * 1) + bias (1,) which can now be done

Another question you might have -> once we get x · Wᵀ of shape (2*1), how are we adding bias (1,) to it? If the bias is not of the same shape, how are two tensors of different shapes being added?
PyTorch (you beauty) handles this internally by broadcasting the bias from (1,) to (2, 1) {2 rows, 1 column, both holding the original bias value 0.3921} -> now x · Wᵀ + b can be added element-wise because both sides have the same shape.

So after matrix multiply output shape will be (2 * 1), we can verify this by seeing output shape of y:
input shape: torch.Size([2, 3])
weight shape after transpose: torch.Size([3, 1])
output shape: torch.Size([2, 1]) which means 2 rows 1 column

What do these 2 rows and 1 column mean -> one output per input row: row 1 is the model's output for [1, 2, 3] and row 2 for [4, 5, 6]. There is 1 column because out_features = 1.

nn.Linear does this whole thing (including the transpose) in its forward pass, and since its parameters have requires_grad=True, autograd records the computation graph so gradients can be computed later with .backward().

output of y = mx + c : tensor([[2.1708], [4.8874]], grad_fn=<AddmmBackward0>)

Note: these two numbers are predictions, not gradients. Gradients only exist once we define a loss (how wrong these predictions are versus the targets) and call loss.backward(); the optimiser then moves the weights and bias in the negative direction of those gradients to minimise the loss.
'''
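We can verify this shape bookkeeping directly: a minimal sketch that reproduces nn.Linear by hand and confirms the bias broadcast (torch.allclose checks the two results match):

```python
import torch

x = torch.tensor([[1., 2., 3.],
                  [4., 5., 6.]])                  # (2, 3)
linear = torch.nn.Linear(in_features=3, out_features=1)

# manual forward: x @ W.T + b; the bias (1,) broadcasts to (2, 1)
manual = x @ linear.weight.T + linear.bias        # (2,3) @ (3,1) -> (2,1)

print(torch.allclose(manual, linear(x)))          # True
print(manual.shape)                               # torch.Size([2, 1])
```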

# formula can also be written as Linear class
'''
This is what happens under the hood for Linear.

class Linear:
    def __init__(self):
        self.W = random()
        self.b = random()

    def forward(self, x):
        return x @ self.W.T + self.b

'''
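The pseudocode above can be made runnable; a minimal sketch using torch tensors (the class name TinyLinear is mine, not PyTorch's):

```python
import torch

class TinyLinear:
    """Bare-bones sketch of nn.Linear: y = x @ W.T + b."""
    def __init__(self, in_features, out_features):
        # random init, same shape convention as nn.Linear: (out_features, in_features)
        self.W = torch.randn(out_features, in_features, requires_grad=True)
        self.b = torch.randn(out_features, requires_grad=True)

    def forward(self, x):
        return x @ self.W.T + self.b

layer = TinyLinear(3, 1)
y = layer.forward(torch.tensor([[1., 2., 3.]]))
print(y.shape)   # torch.Size([1, 1])
```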

'''
Why this matters for Transformers / LLMs

Everywhere you see: 
- W_q, W_k, W_v
- W_o
- FFN/MLP layers

They are just nn.Linear layers with different shapes.
Example:
W_q = nn.Linear(d_model, d_model)
Same concept. Bigger matrices.
'''

single neuron


import torch

# data OR gate (4 options: 00, 01, 10, 11)

x = torch.tensor(
    [[0.,0.],[0,1.],[1,0.],[1,1.]] # outputs (y) = 0,1,1,1 (shape: x: (4, 2))
    )


y = torch.tensor([0., 1., 1., 1.,]).unsqueeze(1) # add new dim (shape: y: (4, 1))

# here nn.Sequential is used since we are chaining multiple nn modules; we could also use nn.Linear and nn.Sigmoid separately (works the same)
model = torch.nn.Sequential(
    torch.nn.Linear(2, 1), # in_features = 2 (like 01), out_features = 1 (like 1)
    torch.nn.Sigmoid(),
)

# using stochastic gradient descent
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_function = torch.nn.MSELoss() # better to use BCELoss() for binary classification

epochs = 1000
for epoch in range(epochs):
    y_pred = model(x)
    loss = loss_function(y_pred, y) # calc mse 
    optimiser.zero_grad() # clear gradients accumulated from the previous iteration (PyTorch accumulates .grad by default)
    loss.backward() # magic!! simply backward pass through the layers to find all gradients wrt weights and biases
    optimiser.step() # this basically performs weight = weight - learning_rate * gradient (the update)

print("Prediction:", (model(x) > 0.5).int().squeeze().tolist()) 


'''
Running through the output:
> uv run single_neuron.py
Input: [0, 0], [0, 1], [1, 0], [1, 1] 
Prediction: [0, 1, 1, 1] # CORRECT!! 

Model starts out predicting garbage -> backprop finds gradients of the loss wrt weights and bias, then each iteration takes a lr=0.1 step in the negative direction of the gradient until the loss is minimised.
'''
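Since BCELoss() is the better fit for binary classification (as noted in the comment above), here is the same OR-gate neuron trained with it; a sketch with a fixed seed for reproducibility:

```python
import torch

torch.manual_seed(0)  # deterministic init for reproducibility

x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([0., 1., 1., 1.]).unsqueeze(1)

model = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())
optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_function = torch.nn.BCELoss()  # binary cross-entropy: the natural loss for 0/1 targets

for _ in range(1000):
    loss = loss_function(model(x), y)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()

print("Prediction:", (model(x) > 0.5).int().squeeze().tolist())
```

For a single sigmoid neuron, BCE makes the loss convex (this is just logistic regression), so training is typically faster and more stable than with MSE.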

how autograd works (backprop using PyTorch)

# how autograd works

import torch

weight = torch.randn(3, requires_grad=True) # 3 features, save this gradient
x = torch.tensor([1.,2.,3.])
y = (weight * x).sum() # sum of dot products 
y.backward() # gradients are calculated here using partial derivatives (and the chain rule for composed ops)
# y = w1*x1 + w2*x2 + w3*x3, so dy/dw_i = x_i -> the gradient tensor is [1, 2, 3]

print("grad:", weight.grad)  # derivative dy/dw == x
'''
Running through the output
Verified:

grad: tensor([1., 2., 3.])

So we don't have to do this manually, torch does this calculus magic for us by calling .backward()
'''
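To see the chain-rule part explicitly, compose two operations and let autograd multiply the local derivatives; a minimal sketch:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)

u = w * x      # inner function:  du/dw = x = 3
y = u ** 2     # outer function:  dy/du = 2u = 12

y.backward()   # autograd applies the chain rule: dy/dw = dy/du * du/dw = 12 * 3
print("grad:", w.grad)   # tensor(36.)
```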

transformers using PyTorch

encoder layer
'''
Let's go through the entire Encoder process

High level overview:

nn.TransformerEncoderLayer(
    d_model,
    nhead,
    dim_feedforward=2048,
    dropout=0.1,
    activation='relu',
    layer_norm_eps=1e-5,
    batch_first=False,
    norm_first=False
)

It is a prewired composition of multihead attention, feed forward network, residual connection, layernorm and dropout

Input shape: (seq_length, batch, dimension_model) or (batch, seq_length, dimension_model) if batch_first=True 
Output shape: same

It works exactly like Transformer Encoder model with architecture:

x (input embeddings with positional encoding added (assumption))
↓
Self-Attention
↓
Dropout
↓
Add (residual)
↓
LayerNorm
↓
FeedForward
↓
Dropout
↓
Add (residual)
↓
LayerNorm

As functions:
x = LN(x + Dropout(SelfAttention(x)))
x = LN(x + Dropout(FFN(x)))



--- 

Connecting all blocks with an example:
"I am samit" -> Tokeniser -> [t0, t1, t2] IDs corresponding to ["I", "am", "samit"] so seq_length = 3, batchsize=1
Hidden model size = d_model = 8, number of heads = 2, thus head_dimension = d_model/nhead = 4
FFN size = dim_feedforward = 32 (in GPT it's 2048)
Use batch_first=False default, so tensors are of shape (seq_len, batch, d_model) = (3,1,8)
Use norm_first=True (pre-norm, used in Transformer++ models; stabilises gradients better than post-norm)
No padding / Causal mask (This is encoder)

Step 0: Tokenization -> token IDs (text -> tokenizer -> ids)
"I am samit"  ->  token_ids = [1532, 212, 784]  (these are feeded into nn.Embedding) to create vector embeddings of these tokens.

Step 1: Embedding + positional encoding = input tensor X
token_emb = token_embedding(token_ids)
pos_emb = pos_embeddings(positions)
X = token_emb + pos_emb # shape: (3,1,8)

X[0] = vector for "I"    shape (1, 8)
X[1] = vector for "am"   shape (1, 8)
X[2] = vector for "samit" shape (1, 8)

Step 2: TransformerEncoderLayer (with norm_first=True)
- x = x + self_atten(LayerNorm(x))
- x = x + FFN(LayerNorm(x))

Breaking this down:
First we compute LayerNorm -> x_norm = Layernorm(X)
LayerNorm normalizes each token vector across its 8 features: mean and var computed per token (per position), then x_norm = (x - μ)/sqrt(var+eps) scaled by learnable γ and shifted by β. No shape change.

Then nn.MultiheadAttention is used, which does three linear projections from d_model -> d_model:
- For each token we compute Q_i = W_q * X_norm + b_q. Same for K_i and V_i

After splitting into heads reshape to (seq_length, batch, nhead, head_dim) -> transpose to common attention shape (batch, nhead, seq_length, head_dim)
Q.shape = (B, nhead, sequence_length, head_dim) = (1, 2, 3, 4)
K.shape = (1, 2, 3, 4)
V.shape = (1, 2, 3, 4)

Then we do scaled dot product attention per head: score_i_j = softmax_j((Q_i · K_j)/sqrt(head_dim)) where i = query token position and j = key position
Then the weighted sum with V -> out_i = Σ_j score_i_j * V_j
Do this for every head, every token position.
Concatenate heads and output projection -> final shape same as input -> (3,1,8) -> apply linear projection W_o (d_model, d_model) to mix head outputs

Apply dropout (p = 0.1 by default in this layer) ~ randomly zeroes that fraction of activations during training to reduce overfitting

Then add residual (add original X)
X = X + Dropout(MHA(LayerNorm(X))) # shape preserved (3,1,8)

Then we have the FFN.
- LayerNorm
X_norm2 = LayerNorm(X) # shape still (3,1,8)
hidden = Linear1(X_norm2) # shape = (3,1,dim_feedforward) = (3,1,32) projecting to higher space
hidden_activation = activation(hidden) # ReLU / GELU / SwiGLU
hidden_drop = Dropout(hidden_activation)
out_ffn = Linear2(hidden_drop) # shape: (3,1,8) again (project back down to the original dimension)

Again add residual to this:
X = X + Dropout(out_ffn)

Final output of the layer
Shape remains (seq_len, batch, d_model) = (3,1,8):
X[0] is the updated embedding for token "I", now informed by "am" and "samit".
X[1] updated embedding for "am".
X[2] updated embedding for "samit".
All values are differentiable; gradients will flow back through FFN, attention, LayerNorm, embeddings to update parameters.

'''

import torch
import torch.nn as nn
import torch.nn.functional as F
import math

sentence = ["I", "am", "samit"]
token_to_id = {"I": 0, "am": 1, "samit": 2}

token_ids = torch.tensor([token_to_id[w] for w in sentence])
seq_len = len(token_ids)
batch_size = 1
d_model = 8
nhead = 2
head_dim = d_model // nhead
dim_ff = 32

# embedding
embedding = nn.Embedding(num_embeddings=10, embedding_dim=d_model)
x = embedding(token_ids)              # (seq, d_model)
x = x.unsqueeze(1)                    # (seq, batch, d_model)

print("\n TOKEN EMBEDDINGS")
print("x shape:", x.shape)
print(x)

# layernorm (pre-norm)
ln1 = nn.LayerNorm(d_model)
x_norm = ln1(x)

print("\nAFTER LAYERNORM (PRE-ATTN)")
print(x_norm)

# qkv projection
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(x_norm)
K = W_k(x_norm)
V = W_v(x_norm)

print("Q shape:", Q.shape)
print(Q)

# reshape for heads
def split_heads(t):
    return t.view(seq_len, batch_size, nhead, head_dim).permute(1, 2, 0, 3)

Qh = split_heads(Q)
Kh = split_heads(K)
Vh = split_heads(V)

scores = torch.matmul(Qh, Kh.transpose(-2, -1)) / math.sqrt(head_dim)

print("scores shape:", scores.shape)
print(scores)

weights = F.softmax(scores, dim=-1)

print("\nATTENTION WEIGHTS (SOFTMAX)")
print(weights)

attn_out = torch.matmul(weights, Vh)

print("\nATTENTION OUTPUT PER HEAD")
print(attn_out)

# concatenate heads
attn_out = attn_out.permute(2, 0, 1, 3).contiguous()
attn_out = attn_out.view(seq_len, batch_size, d_model)

W_o = nn.Linear(d_model, d_model, bias=False)
attn_out = W_o(attn_out)

# residual
x = x + attn_out

print("\nAFTER ATTENTION + RESIDUAL")
print(x)

# ffn (pre-norm)
ln2 = nn.LayerNorm(d_model)
x_norm = ln2(x)

fc1 = nn.Linear(d_model, dim_ff)
fc2 = nn.Linear(dim_ff, d_model)

ff_hidden = fc1(x_norm)
ff_hidden_act = F.gelu(ff_hidden)
ff_out = fc2(ff_hidden_act)

print("\nFFN HIDDEN (AFTER GELU)")
print(ff_hidden_act)

# residual
x = x + ff_out

print("\nFINAL OUTPUT OF ENCODER LAYER")
print("output shape:", x.shape)
print(x)
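The manual walkthrough above is exactly what the prewired nn.TransformerEncoderLayer does in one call; a quick shape check with the same hyperparameters (weights are random, so only shapes are comparable):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.TransformerEncoderLayer(
    d_model=8, nhead=2, dim_feedforward=32,
    dropout=0.1, norm_first=True,   # pre-norm, matching the walkthrough
)
layer.eval()                        # disable dropout for a deterministic forward pass

x = torch.randn(3, 1, 8)            # (seq_len, batch, d_model), batch_first=False default
out = layer(x)
print("output shape:", out.shape)   # torch.Size([3, 1, 8]) -> shape preserved
```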


'''
The Transformer++ Model
Modern LLM engineering practices

Pre-norm (LayerNorm before sublayer) is preferred for stability in deep stacks.

Large LLM implementations typically reimplement the encoder/decoder block manually for performance: fused QKV projections, scaled_dot_product_attention using FlashAttention or other memory-efficient kernels, and custom FFNs (SwiGLU / gated activations).

Many state-of-the-art LLMs replace LayerNorm with RMSNorm or small variants for efficiency/stability.

Dropout is often reduced or removed for very large models (regularization comes from scale and data).

FFNs in modern LLMs often use gated activations (SwiGLU) with different hidden-size scaling (e.g., 2× or 4× d_model with different parameter efficiency).
'''
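RMSNorm, mentioned above, drops LayerNorm's mean subtraction and bias and only rescales by the root mean square of the features; a minimal sketch (my own implementation, not torch's built-in):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(d_model))  # learnable scale, no bias

    def forward(self, x):
        # no mean subtraction: just rescale by the root mean square over the feature dim
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms

x = torch.randn(3, 1, 8)
out = RMSNorm(8)(x)
print(out.shape)   # torch.Size([3, 1, 8]) -> shape preserved, like LayerNorm
```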

decoder layer
'''
Let's go through the entire Decoder process

High level overview:

nn.TransformerDecoderLayer(
    d_model,
    nhead,
    dim_feedforward=2048,
    dropout=0.1,
    activation='relu',
    layer_norm_eps=1e-5,
    batch_first=False,
    norm_first=False
)

It is a prewired composition of:
- masked multihead self-attention
- cross (encoder-decoder) attention
- feed forward network
- residual connections
- layernorm
- dropout

Input shapes:
- tgt (decoder input): (tgt_seq_length, batch, d_model)
- memory (encoder output): (src_seq_length, batch, d_model)

Output shape:
- same as tgt: (tgt_seq_length, batch, d_model)

---------------------------------

Decoder architecture (pre-norm variant, norm_first=True):

x (decoder input embeddings with positional encoding added)
↓
Masked Self-Attention (causal)
↓
Dropout
↓
Add (residual)
↓
Cross-Attention (attend to encoder memory)
↓
Dropout
↓
Add (residual)
↓
FeedForward Network
↓
Dropout
↓
Add (residual)

As functions (pre-norm):

x = x + Dropout(SelfAttention(LayerNorm(x)))        # causal self-attn
x = x + Dropout(CrossAttention(LayerNorm(x), memory))
x = x + Dropout(FFN(LayerNorm(x)))

---------------------------------

Connecting all blocks with an example:

Sentence: "I am samit"

Tokenizer → token IDs:
"I am samit" → [t0, t1, t2]
seq_length = 3
batch_size = 1

Model parameters:
d_model = 8
nhead = 2 → head_dim = 4
dim_feedforward = 32
batch_first = False → tensors are (seq_len, batch, d_model)
norm_first = True (pre-norm, used in modern Transformers / LLMs)

---------------------------------

Step 0: Tokenization
Text → tokenizer → token IDs
Decoder input receives shifted tokens during training (teacher forcing).

---------------------------------

Step 1: Embedding + positional encoding

token_emb = token_embedding(token_ids)
pos_emb   = positional_embedding(positions)
X = token_emb + pos_emb

Shape:
X → (3, 1, 8)

X[0] = embedding for "I"
X[1] = embedding for "am"
X[2] = embedding for "samit"

---------------------------------

Step 2: Masked Self-Attention (Causal)

x_norm = LayerNorm(X)

Q, K, V = Linear(x_norm)

After splitting heads:
Q, K, V → (batch, nhead, seq_len, head_dim) = (1, 2, 3, 4)

Apply causal mask:
- token i can only attend to tokens ≤ i
- prevents information leakage from future tokens

Compute:
scores = (Q · Kᵀ) / sqrt(head_dim)
scores[j > i] = -inf
weights = softmax(scores)

out_self = weights · V

Concatenate heads + output projection:
out_self → (3, 1, 8)

Residual:
X = X + out_self

---------------------------------

Step 3: Cross-Attention (Encoder-Decoder Attention)

Purpose:
- Decoder tokens attend to encoder outputs ("memory")
- This is how the decoder conditions on source text or retrieved context

x_norm = LayerNorm(X)

Q = Linear(x_norm)           # from decoder
K = Linear(memory)           # from encoder
V = Linear(memory)

Shapes:
Q → (1, 2, 3, 4)
K,V → (1, 2, src_seq_len, 4)

Compute:
scores = (Q · Kᵀ) / sqrt(head_dim)
weights = softmax(scores)
out_cross = weights · V

Concatenate heads + projection:
out_cross → (3, 1, 8)

Residual:
X = X + out_cross

---------------------------------

Step 4: Feed Forward Network (FFN)

x_norm = LayerNorm(X)

hidden = Linear1(x_norm)          # (3, 1, 32)
hidden = activation(hidden)       # GELU / SwiGLU
out_ffn = Linear2(hidden)         # (3, 1, 8)

Residual:
X = X + out_ffn

---------------------------------

Final output:

Shape:
X → (3, 1, 8)

Interpretation:
- Each decoder token embedding is now informed by:
  - previous decoder tokens (via causal self-attention)
  - encoder tokens (via cross-attention)
  - non-linear feature transformation (via FFN)

Gradients flow through all paths:
FFN ← attention ← embeddings ← tokenizer

---------------------------------

Modern LLM usage notes:

- Decoder-only LLMs (GPT, LLaMA):
  - remove cross-attention
  - keep masked self-attention + FFN
- Encoder-Decoder models (T5, BART):
  - use full decoder with cross-attention
- Production models:
  - use pre-norm + RMSNorm
  - use FlashAttention / SDPA
  - use SwiGLU FFNs
  - use KV caching during inference

This class is educational and correct,
but large-scale LLMs reimplement it manually for performance.
'''
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

sentence = ["I", "am", "samit"]
token_to_id = {"I": 0, "am": 1, "samit": 2}

tgt_ids = torch.tensor([token_to_id[w] for w in sentence])
src_ids = torch.tensor([0, 1, 2])  # pretend encoder input

tgt_len = len(tgt_ids)
src_len = len(src_ids)
batch_size = 1
d_model = 8
nhead = 2
head_dim = d_model // nhead
dim_ff = 32

# embeddings
embedding = nn.Embedding(10, d_model)

x = embedding(tgt_ids).unsqueeze(1)       # decoder input (tgt)
memory = embedding(src_ids).unsqueeze(1)  # encoder output (memory)

print("\nDECODER INPUT x")
print(x.shape)
print(x)

print("\nENCODER MEMORY")
print(memory.shape)
print(memory)

# ---- helpers ----
def split_heads(t):
    return t.view(-1, batch_size, nhead, head_dim).permute(1, 2, 0, 3)

def causal_mask(L):
    return torch.tril(torch.ones(L, L)).bool()

# ---- Sublayer 1: Masked Self-Attention ----
ln1 = nn.LayerNorm(d_model)

Wq = nn.Linear(d_model, d_model, bias=False)
Wk = nn.Linear(d_model, d_model, bias=False)
Wv = nn.Linear(d_model, d_model, bias=False)
Wo = nn.Linear(d_model, d_model, bias=False)

x_norm = ln1(x)
Q = split_heads(Wq(x_norm))
K = split_heads(Wk(x_norm))
V = split_heads(Wv(x_norm))

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(head_dim)
scores = scores.masked_fill(~causal_mask(tgt_len), float("-inf"))

weights = F.softmax(scores, dim=-1)
out_self = torch.matmul(weights, V)

out_self = out_self.permute(2, 0, 1, 3).contiguous().view(tgt_len, batch_size, d_model)
x = x + Wo(out_self)

print("\nAFTER MASKED SELF-ATTENTION")
print(x)

# ---- Sublayer 2: Cross-Attention ----
ln2 = nn.LayerNorm(d_model)

Wq_c = nn.Linear(d_model, d_model, bias=False)
Wk_c = nn.Linear(d_model, d_model, bias=False)
Wv_c = nn.Linear(d_model, d_model, bias=False)
Wo_c = nn.Linear(d_model, d_model, bias=False)

x_norm = ln2(x)
Q = split_heads(Wq_c(x_norm))
K = split_heads(Wk_c(memory))
V = split_heads(Wv_c(memory))

scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(head_dim)
weights = F.softmax(scores, dim=-1)

out_cross = torch.matmul(weights, V)
out_cross = out_cross.permute(2, 0, 1, 3).contiguous().view(tgt_len, batch_size, d_model)

x = x + Wo_c(out_cross)

print("\nAFTER CROSS-ATTENTION")
print(x)

# ---- Sublayer 3: FFN ----
ln3 = nn.LayerNorm(d_model)

fc1 = nn.Linear(d_model, dim_ff)
fc2 = nn.Linear(dim_ff, d_model)

x_norm = ln3(x)
ff = fc2(F.gelu(fc1(x_norm)))

x = x + ff

print("\nFINAL DECODER OUTPUT")
print(x.shape)
print(x)

'''
Decoder has two attentions: masked self-attn (causal) + cross-attn (attend to encoder).
Decoder enforces time flow in self-attn; encoder uses full (non-causal) self-attention.
Decoder is used in: Encoder-Decoder models (e.g., T5, original Transformer for translation).
Decoder-only LLMs (they omit cross-attention and only use masked self-attention).
Cross-attention is the core conditioning mechanism in seq2seq and RAG-style pipelines (retrieval results as memory).
'''
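For comparison, the prewired nn.TransformerDecoderLayer runs all three sublayers in one call; a shape sketch with the walkthrough's hyperparameters:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.TransformerDecoderLayer(
    d_model=8, nhead=2, dim_feedforward=32, norm_first=True,
)
layer.eval()   # disable dropout for a deterministic forward pass

tgt = torch.randn(3, 1, 8)      # decoder input   (tgt_len, batch, d_model)
memory = torch.randn(3, 1, 8)   # encoder output  (src_len, batch, d_model)

# causal mask: True marks positions attention is NOT allowed to look at
tgt_mask = torch.triu(torch.ones(3, 3), diagonal=1).bool()

out = layer(tgt, memory, tgt_mask=tgt_mask)
print(out.shape)                # torch.Size([3, 1, 8]) -> same as tgt
```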

conclusion

that’s all the math. if you made it this far, you now have enough to read any ML paper without getting lost in the notation. the best way to actually learn this is to implement it - go build something.

  • To read more great math stuff for ML, check out tensortonic

