TL;DR: This post walks through neural networks from scratch: perceptrons, gradient descent, backpropagation, and the math behind them. Then we build a handwritten digit classifier in Python with no frameworks, just NumPy.

index

  1. Introduction
  2. Gradient Descent
  3. Backpropagation
  4. Time 4 Math
  5. Code
  6. Problems
  7. Conclusion
  8. Optional: More Math
  9. References

introduction

Your brain learns new things by adapting to situations. That’s what we’re trying to replicate with a computer.

Neural networks are behind everything interesting in AI right now: image recognition, language models, self-driving cars. Once you get the core ideas, everything else is variations on the same theme.

perceptron

The idea of neural networks originates from perceptrons, developed in the late 1950s.

Perceptron

Networks composed of perceptrons where outputs from one layer feed into the next are called feedforward neural networks.

Perceptrons were limited to linear decision boundaries. Modern neural networks use multiple layers and non-linear activation functions to get around that.

problem statement

Given an image of a handwritten number, how do you identify the number?

We know what strokes 2’s have: a little horizontal line as a base and a slightly curved S-shaped line. That pattern is obvious to us. Expressing it precisely enough for a computer is another problem.

How can we make a computer learn this?

Visual pattern recognition is hard to program explicitly. Intuitions like “a 9 has a loop at the top, and a vertical stroke in the bottom right” fall apart the moment you try to make them precise.


The idea is:

  • Take a large number of handwritten digits (training data)
  • Develop a system that learns from this data (infers rules for recognizing handwritten digits)
  • Test on new images and see how well it predicts.

Try to think about this in a smarter way.

framework

  • Each image is 28 * 28 pixels (which can be represented as numbers)
  • Each of these pixels has a weight value attached to it which indicates how much that pixel matters.
    • Example: in a ‘2’, only the pixels containing the horizontal base line and the S-shaped curve matter; all the other pixels are just dead weight.
    • A large weight value indicates that a pixel matters a lot. By varying the weights and the threshold, we can get different models of decision-making.
  • Somehow make the network learn some underlying pattern from these images so it can recognise what each image is showing.
    • This happens via trial and error: predicting garbage at first (known as the forward pass)
    • Then hopefully minimising that error and making some sense out of it (known as the backward pass)
  • Given a new image, how well can the neural network predict what number the image shows?

basics revisited

  • Neuron: A thing that holds a number between 0 and 1, its activation value (for our grayscale pixels, this is the pixel’s shade).
  • Activation Functions: Introduce non-linearity; without them, every layer is just a matrix multiplication of the previous one.
    • Examples: Sigmoid, tanh, ReLU (more on these later; just think of them as math functions which transform your data x into f(x))
  • Weights: Determine the strength of connections.
  • Biases: Shift activations to improve flexibility.

bias and weights

bias: the intercept

Without bias, zero input forces zero output. That’s a problem because it means every line your model learns has to pass through the origin (0, 0).

Bias is the constant $c$ in $y = mx + c$. It shifts the output up or down, giving the model flexibility to fit data that doesn’t conveniently pass through zero.

Mathematically: $\text{output} = w_1 x_1 + w_2 x_2 + \dots + b$

Here, (b) (bias) lets the neuron’s “neutral” point shift up or down, so the model can better fit the data.

weights: the slope

Weights determine the strength of the connection between inputs and outputs. They’re the slope (m) in (y = mx + c) - how much influence each input $x_i$ has on the output.

Large weights = this input matters a lot. Small weights = not so much. Training is just figuring out which inputs matter most.

neural network

  • Input : 784 neurons → Output : 10 neurons (the probability of the image being each digit, 0 to 9)
  • In between hidden layers → activation of first layer determines activation of second layer and so on… Just how neurons in our brain work.

Visually:

Neural Network Neural Network2

Training data for the network will consist of many 28 * 28 pixel images of scanned handwritten digits, and so the input layer contains 784=28×28 neurons.

The input pixels are greyscale, with a value of 0.0 representing white, a value of 1.0 representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by n, and we’ll experiment with different values for n. The example shown illustrates a small hidden layer, containing just n=15 neurons.

The output layer of the network contains 10 neurons. If the first neuron fires (i.e., has an output close to 1), that indicates the network thinks the digit is a 0. If the second neuron fires, the network thinks the digit is a 1. And so on.

Notice how in these drawings each neuron from one layer is connected to every neuron of the next. Each activation has some influence on the activations in the next layer.

But not all connections are equal. Some connections are more equal than others (shoutout George Orwell). Figuring out how strong these connections should be is the whole game.

neural network setup

  • Think about y = mx + c: we know x (input) and y (expected output), and we’re trying to learn m (weight) and c (bias).
  • Any ML problem follows this pattern: we already have x (our sample training data) and y (the correct answers), we just need to find m (weights) and c (bias) so that we can fit a model to our input and get some meaningful predictions from it.

Question: What happens in one layer?

  • Weights from the first-layer neurons to a next-layer neuron : $w_i$
  • Inputs (activations) of the first-layer neurons : $x_i$

  • Calculate the slope line for all of them: $y = mx + c$ becomes $z = \sum_i w_i x_i + b$ ($x_i$ is replaced by $a_i$ here after ‘activating’ all inputs $x_i$)

Weights tell you what pixel pattern this neuron in the second layer is picking up on, and the bias tells you how big that weighted sum needs to be before the neuron gets meaningfully active.

The basic equation $y = mx + c$ translates to $y = \sum(\text{weights} \times \text{activations}) + \text{bias}$ for all neurons. But this y is just a bunch of raw numbers; they don’t make sense on their own. We need to squash this data into a fixed range so that we can truly find some meaning in it. That’s where activation functions come in.

A lot of functions can do this; one of them is sigmoid, which is just a fancy name for the function $\sigma(z) = \frac{1}{1 + e^{-z}}$ that translates any real-valued number to a value between 0 and 1.

You put this $z = \sum(\text{weights} \times \text{activations}) + \text{bias}$ into the sigmoid activation function to get $\sigma(z)$.
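The two steps above (weighted sum, then squash) fit in a few lines of plain Python. A minimal sketch; the weights, activations, and bias here are made-up numbers:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Made-up numbers for one neuron with three inputs.
weights = [0.5, -1.2, 0.8]
activations = [0.9, 0.3, 0.5]
bias = 0.1

# z = sum(w_i * a_i) + b
z = sum(w * a for w, a in zip(weights, activations)) + bias
output = sigmoid(z)  # a number between 0 and 1
```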

Once again:

  • Input : 28 * 28 = 784 neurons. Let’s say I have 10 neurons in the second layer and 10 in the output layer.
  • Total : 784 × 10 weights + 10 biases (second layer) + 10 × 10 weights + 10 biases (output layer) = 7,960 ≈ 8000 total weights and biases that can be tweaked and turned per your choice.

Learning: Finding the right weights and biases (so that our network learns the underlying patterns of these numbers, i.e. strokes)

The equation for a single neuron’s output, often written as: $\sigma\left(\sum_i a_i w_i + b\right)$ can be cumbersome to write, especially for entire layers. We can use the tools of linear algebra - vectors and matrices - to express this much more cleanly.

representing layers with vectors and matrices

  • Activations ((a)): The outputs from a previous layer can be represented as a single column vector:
\[a = \begin{bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}\]
  • Weights ((W)): All the weights connecting the previous layer to the current layer can be organized into a single matrix:
\[W = \begin{bmatrix} w_{0,0} & w_{0,1} & \dots & w_{0,n} \\ w_{1,0} & w_{1,1} & \dots & w_{1,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{k,0} & w_{k,1} & \dots & w_{k,n} \end{bmatrix}\]

Here, $w_{j,i}$ is the weight connecting the $i$-th neuron of the previous layer to the $j$-th neuron of the current layer.

  • Biases ((b)): The biases for each neuron in the current layer can be represented as a column vector:
\[b = \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ \vdots \\ b_k \end{bmatrix}\]

Using this notation, the calculation for the pre-activation values $z$ of an entire layer becomes a matrix-vector operation:

\[z = W a + b\]

Finally, to get the output activations of the current layer, we apply the sigmoid function $\sigma$ element-wise:

\[a' = \sigma(z) = \begin{bmatrix} \sigma(z_0) \\ \sigma(z_1) \\ \sigma(z_2) \\ \vdots \\ \sigma(z_k) \end{bmatrix}\]
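In NumPy the whole layer computation $a' = \sigma(Wa + b)$ is one line of matrix arithmetic. A minimal sketch with randomly initialised parameters and a hypothetical 784 → 15 layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

n_in, n_out = 784, 15                    # e.g. input pixels -> a hidden layer of 15 neurons
a = rng.random(n_in)                     # activations of the previous layer, shape (784,)
W = rng.standard_normal((n_out, n_in))   # weight matrix, shape (15, 784)
b = rng.standard_normal(n_out)           # bias vector, shape (15,)

z = W @ a + b          # pre-activations: a single matrix-vector product
a_next = sigmoid(z)    # element-wise sigmoid, shape (15,)
```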

A neuron is just a function that takes the outputs of all neurons in the previous layer and spits out a number between 0 and 1.

Also no one uses sigmoid anymore.

They use $\text{ReLU}(a) = \max(0, a)$ (if any input is negative → make it 0, otherwise → let it be)

Why is this the case? The problem is that the sigmoid function becomes super flat at the extremes, when the weighted sum being passed in has a large magnitude, so the gradient there is nearly zero and learning stalls (tldr; relu is better, trust me).

ReLU doesn’t flatten out for positive inputs, so wiggling the weights always gives useful feedback about how the network should change.

Training becomes much faster, especially with more layers - which is where the “deep” in “deep learning” comes from.
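You can see the saturation problem numerically: the sigmoid's slope $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ collapses as $|z|$ grows, while ReLU's slope stays at 1 for any positive input. A quick sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)       # largest at z = 0, where it equals 0.25

def relu_grad(z):
    # ReLU(z) = max(0, z): slope 1 for z > 0, slope 0 for z < 0.
    return (z > 0).astype(float)

z = np.array([0.0, 2.0, 5.0, 10.0])
print(sigmoid_grad(z))   # shrinks toward zero as z grows
print(relu_grad(z))      # stays at 1 for every positive z
```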

What we’ll do is assign a weight to each of the connections between our neuron and the neurons from the first layer; these weights are just random numbers (for now).

Each weight is an indication of how its neuron in the first layer is correlated with this new neuron in the second layer.

If the neuron in the first layer is on, then a positive weight suggests that the neuron in the second layer should also be on, and a negative weight suggests that the neuron in the second layer should be off.

So to actually compute the value of this second-layer neuron, you take all the activations from the neurons in the first layer, and compute their weighted sum.

Something like this:

$w_1 a_1 + w_2 a_2 + w_3 a_3 + w_4 a_4 + \dots + w_n a_n$

This expression represents a weighted sum.

Two ways to write this more formally:

writing in summation form

\(\sum_{i=1}^{n} w_i a_i\)

  • $\sum$ is the Greek letter Sigma, which means “sum up.”
  • $i = 1$ below the Sigma indicates that the sum starts with the index $i$ equal to 1.
  • $n$ above the Sigma indicates that the sum ends when the index $i$ reaches $n$.
  • $w_i a_i$ is the expression for each term in the sum.

This notation compactly says: “Sum the product of $w_i$ and $a_i$ for all integer values of $i$ from 1 to $n$.”

writing in linear algebra (standard form)

In linear algebra this is the dot product of two vectors. This is how NumPy and PyTorch actually implement it.

First, we define a weight vector (W) and an activation (or input) vector (a):

\[W = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix}, \quad a = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix}\]

The weighted sum can then be written as the dot product $W \cdot a$, or more commonly in machine learning literature, as a matrix multiplication involving a transpose:

\[W^T a\]
  • $W^T$ is the transpose of the vector $W$, which turns the column vector into a row vector: $[w_1, w_2, \dots, w_n]$.
  • When you multiply the row vector $W^T$ by the column vector $a$, you get the exact same weighted sum:
\[W^T a = [w_1, w_2, \dots, w_n] \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{bmatrix} = w_1 a_1 + w_2 a_2 + \dots + w_n a_n\]

(Same result)
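All three notations (the explicit sum, the dot product, and $W^T a$) compute the same number, which is easy to check in NumPy. The vectors here are made up:

```python
import numpy as np

w = np.array([0.2, -0.5, 1.0, 0.7])
a = np.array([0.9, 0.1, 0.4, 0.0])

explicit = sum(w_i * a_i for w_i, a_i in zip(w, a))  # the summation form
dot = np.dot(w, a)                                   # the dot product W . a
matmul = w.T @ a                                     # W^T a (for 1-D arrays, .T is a no-op)

# all three equal 0.53
```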

So maybe first layer picks up on the edges → Second layer picks up on the patterns → Final layer actually knows what number we have as input.

Image

So how does it learn? The whole point of training the model is to make the error as small as possible → How does it do that?

defining the error

How does a computer define error? Especially in a machine learning model.

There are multiple error functions, but the basic premise we need to understand is: \(\text{error} = \hat{y} - y\)

where,

  • $\hat{y}$ = what our neural network predicts.
  • $y$ = the actual number in the image (the correct label)

Image2

Our job is to make these the same, which means making the error as small as possible.

How do you make a function as small as possible? Put on yo thinking caps, it’s calculus time.

gradient descent

One solution is to start with random weights → calculate the error → tweak the weights and calculate the error again (expensive; it takes too much time to do this for all neurons and layers).

The error is basically predicted_output ($\hat{y}$) - actual_output ($y$), and

  • $\hat{y} = \text{output} = w_1 x_1 + w_2 x_2 + \dots + b$

So $\hat{y}$ depends on the weights and biases, which means the loss depends on the weights and biases (the parameters of the curve, to be precise).

This means our loss is a function of the parameters → hence called the loss function. So we need to find the configuration of these parameters that gives the minimum loss.

Plugging the coefficients obtained by minimising the loss function into the curve equation gives us the best description of the data.

How do we find these configurations? More importantly how do we find minimum loss of a function? We differentiate it.

We could keep iteratively changing the knobs of the weights/biases one at a time to see whether the resulting curve is a better fit (This is bad. Slow. It does work though).

How can we be more intelligent?

Calculus. Since these are curves, they are differentiable, which allows us to find the optimized knob settings much more efficiently.

Computer tells us which direction to change the knob and by how much.

We are asking the computer to predict the future and estimate the effect of knob adjustments on the loss function without doing trial & error.

revisiting the setup

Neural Network

  • Input: 784 numbers / pixels
  • Output: 10 numbers
  • Parameters: 8000 weights / biases

Initialise weights and bias randomly. \(\text{Loss} = (\text{Predicted} - \text{Actual})^2\)

In expanded form

\[C = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2\]

Why do we square it? We take the average of this cost function, and without squaring, positive and negative errors could cancel out, giving the illusion that the cost function is at a minimum when it actually isn’t. Squaring makes every term non-negative (and penalizes large errors more). Hence the square.
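A tiny example of that cancellation, with hypothetical per-example errors:

```python
# Hypothetical (prediction - target) errors on four examples.
errors = [2.0, -2.0, 1.0, -1.0]

mean_error = sum(errors) / len(errors)                    # 0.0: looks perfect, but isn't
mean_squared = sum(e ** 2 for e in errors) / len(errors)  # 2.5: reveals the real error
```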

Assume we send an image of ‘9’: the error is large when the network doesn’t seem to know what it’s doing, and small when its output is close to 9.

The cost $C(w, b)$ becomes small, $C(w,b) ≈ 0$ precisely when $\hat{y} = y$ for all inputs x.

  • Input : The number 9
  • Output: Grey and white colours on multiple neurons (0-9)

This is what the output looks like

Output2

This is what we want

Output

What you gave me is utter trash (prediction), I need the output to be 9 (actual) ~ only cell 9 should be activated/fired.

So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) ≈ 0.$

By contrast, it’s not doing so well when $C(w,b)$ is large.

Aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases.

Gradient descent updates weights and biases iteratively to minimize the cost function:

\(\mathbf{W} \leftarrow \mathbf{W} - \eta \frac{\partial C}{\partial \mathbf{W}}, \quad \mathbf{b} \leftarrow \mathbf{b} - \eta \frac{\partial C}{\partial \mathbf{b}}\) We’ll talk more about this in a while.


tuning knobs to minimize loss

Your model has one “knob” (parameter) $w$. Turning it up or down changes the prediction, and thus the loss. We want to find the setting of $w$ that makes loss as small as possible.

Taking only one parameter: weights for now (let’s say bias is fixed)

plotting loss vs. ($w$)

  • Horizontal axis ($x$) = parameter $w$
  • Vertical axis ($y$) = loss $L(w)$

This plot is a curve $y=L(w)$. We want the minimum point on that curve.

Plot

the derivative

The derivative at a point tells us the instantaneous slope of the curve:

\[\frac{dL}{dw} = \lim_{\Delta w \to 0} \frac{L(w + \Delta w) - L(w)}{\Delta w}\]
  • If $\tfrac{dL}{dw} > 0$, the curve is rising as we move right (loss increases if $w$ increases).
  • If $\tfrac{dL}{dw} < 0$, the curve is falling as we move right (loss decreases if $w$ increases).

ball on a hill

Picture a ball sitting on the curve. The derivative is “which way is downhill?”

  • If slope positive, the ball rolls left (decrease $w$).
  • If slope negative, the ball rolls right (increase $w$).

The gradient gives the direction of steepest ascent. So to minimise the function → take steps in the opposite direction.

the update rule

We take a small step opposite the slope:

\[w \leftarrow w - \underbrace{\eta}_{\text{learning rate}} \frac{dL}{dw}\]
  • $\eta$ controls the step size.

  • We repeat: compute slope, step downhill, repeat… until the slope is (nearly) zero.

same but for two variables k1 and k2 (weights and bias in our case)

How do we measure change with respect to multi variables? Multivariate calculus.

Partial Derivatives : only focus on the term wrt which you’re differentiating, treat everything else like a constant.

Now suppose we have two knobs $(k_1, k_2)$.

We measure how the loss changes if we wiggle one parameter at a time:

\[\frac{\partial L}{\partial k_1} \quad \text{and} \quad \frac{\partial L}{\partial k_2}\]
  • $\tfrac{\partial L}{\partial k_1}$ = rate of change if we nudge $k_1$ while keeping $k_2$ fixed.
  • $\tfrac{\partial L}{\partial k_2}$ = rate of change if we nudge $k_2$ while keeping $k_1$ fixed.

the gradient vector

Collect these partials into the gradient:

\[\nabla L = \begin{bmatrix} \dfrac{\partial L}{\partial k_1} \\ \dfrac{\partial L}{\partial k_2} \end{bmatrix}\]
  • This points in the direction of steepest ascent on the loss surface.
  • To go downhill, we move in the opposite direction.

updating again

\[\begin{bmatrix} k_1 \\ k_2 \end{bmatrix} \leftarrow \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} - \eta\, \nabla L = \begin{bmatrix} k_1 \\ k_2 \end{bmatrix} - \eta\, \begin{bmatrix} \tfrac{\partial L}{\partial k_1} \\ \tfrac{\partial L}{\partial k_2} \end{bmatrix}\]

And in $n$-dimensional parameter space $\mathbf{k}=(k_1,\dots,k_n)$:

\[\mathbf{k} \leftarrow \mathbf{k} - \eta\, \nabla L(\mathbf{k}), \quad \nabla L(\mathbf{k}) = \begin{bmatrix} \dfrac{\partial L}{\partial k_1} \\ \vdots \\ \dfrac{\partial L}{\partial k_n} \end{bmatrix}\]
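As a sketch of that update rule with two knobs, take a made-up bowl-shaped loss $L(k_1, k_2) = (k_1 - 3)^2 + (k_2 + 1)^2$, whose minimum sits at $(3, -1)$:

```python
import numpy as np

def grad(k):
    # Partial derivatives of L(k1, k2) = (k1 - 3)^2 + (k2 + 1)^2,
    # collected into the gradient vector.
    return np.array([2 * (k[0] - 3), 2 * (k[1] + 1)])

eta = 0.1
k = np.array([10.0, 10.0])   # arbitrary starting point

for _ in range(100):
    k = k - eta * grad(k)    # step opposite the gradient

print(k)   # converges to roughly [3, -1]
```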

Continuing with our ball rolling example:

  • Draw the hill: Visualize loss as a surface over your parameters.
  • Find the slope: Compute (partial) derivatives - these tell you “uphill” vs. “downhill.”
  • Step downhill: Update each parameter by subtracting a fraction (learning rate) of its derivative.
  • Iterate: Keep repeating until you reach a flat spot (derivative ≈ 0), i.e. a minimum.

Gradient descent is simply “roll the ball downhill, one small step at a time.”

We can figure out the slope directly & the direction we need to adjust our weight without going all the way back to our neural network and calculating. Pretty cool.

Question: How do we change these weights and biases so it makes it better?

Answer: Shift the inputs to the left if the slope is positive, shift the inputs to the right if the slope is negative.

Input in this case are the parameters of the cost function (weights and bias)

If you keep repeating → Approach local minimum of the function (ball rolling down a hill : lowest point)

Bowl

We need to reach the bottom of the bowl.

  • Randomly choose a starting point for an (imaginary) ball, then simulate the motion of the ball as it rolls down.
  • We can do this by computing derivatives of the cost function wrt the parameters (weights and bias): those derivatives tell us everything about the local shape of the valley, and therefore how our ball should roll.
  • What law or laws of motion could we pick that would make it so the ball always rolls to the bottom of the valley?

The learning rate $\eta$ controls step size. Big learning rate = big steps, might overshoot the minimum. Small learning rate = small steps, stable but slow.

Gradient Descent is just taking small steps in the direction that decreases the cost function the most.

Let’s say you’re on a hill and you want to go from one village to another, and all you can see are paths. We can stand at the edge of a terrace and look for the steepest route to the terrace below. That’s also the shortest path down to the next piece of level ground. If we repeat the process from terrace to terrace, we will eventually reach the village. In doing so we will have taken the path of steepest descent (it might not be a straight line; it can zig-zag).

What we did was evaluate the slope/gradient of the hillside as we looked in different directions while standing at the edge of a terrace, and then took the steepest path down each time.

This is gradient descent: an algorithm used to compute the minimum of the cost function.

Which direction decreases C(weights, bias) most quickly?

The gradient/derivative of a function gives you the direction of steepest increase (the direction you should step in to increase the function most quickly).

Taking the negative of the gradient gives you the direction of the step which decreases the function most quickly : WHAT WE WANT

This can be visualized using the ball rolling diagram; draw a few slopes on the curve:

  • if you’re on the left : you need to move right to reach the bottom
  • if you’re on the right : you need to move left to reach the bottom

Image3

Algorithm:

  • Compute the gradient $\nabla C$ of the cost function.
  • Take small step in the opposite direction of gradient : -Gradient(C) direction.
  • Repeat.
def gradient_descent_pure_python(learning_rate=0.1, num_iterations=100):
    """
    Minimizes the function f(x) = x^2 using gradient descent.
    The derivative f'(x) = 2x.
    """
    weight = 10.0  # Initial guess for x

    print(f"Starting x: {weight}, f(x): {weight**2:.4f}")

    for i in range(num_iterations):
        gradient = 2 * weight # x^2 derivative is 2x

        # Update weight by moving in the negative direction of the gradient
        weight = weight - learning_rate * gradient

        if (i + 1) % 10 == 0 or i == 0:
            print(
                f"Iteration {i+1}: x = {weight:.4f}, f(x) = {weight**2:.4f}, "
                f"Gradient = {gradient:.4f}"
            )

    print(f"\nOptimization finished. Minimum found at x = {weight:.4f}")
    print(f"Minimum function value f(x) = {weight**2:.4f}")

Output

smol@samit ~  > python3 a.py
Starting x: 10.0, f(x): 100.0000
Iteration 1: x = 8.0000, f(x) = 64.0000, Gradient = 20.0000
Iteration 10: x = 1.0737, f(x) = 1.1529, Gradient = 2.6844
Iteration 20: x = 0.1153, f(x) = 0.0133, Gradient = 0.2882
Iteration 30: x = 0.0124, f(x) = 0.0002, Gradient = 0.0309
Iteration 40: x = 0.0013, f(x) = 0.0000, Gradient = 0.0033
Iteration 50: x = 0.0001, f(x) = 0.0000, Gradient = 0.0004
Iteration 60: x = 0.0000, f(x) = 0.0000, Gradient = 0.0000
Iteration 70: x = 0.0000, f(x) = 0.0000, Gradient = 0.0000
Iteration 80: x = 0.0000, f(x) = 0.0000, Gradient = 0.0000
Iteration 90: x = 0.0000, f(x) = 0.0000, Gradient = 0.0000
Iteration 100: x = 0.0000, f(x) = 0.0000, Gradient = 0.0000
Optimization finished. Minimum found at x = 0.0000
Minimum function value f(x) = 0.0000

As you can clearly tell, the minimum of $f(x) = x^2$ is at $x = 0$, $f(x) = 0$, which is exactly what the output shows.

Here we only optimized the weight because that was the only parameter; in our original equation of a line we have 2 parameters, weight and bias, and we need to optimize both.

How do you differentiate a function and find its minimum wrt 2 parameters? Partial derivatives BAM. Hell yeah. More on this later.

Here is that update rule in concise mathematical form.

\[\text{Compute gradients:} \quad \frac{\partial C}{\partial W}, \quad \frac{\partial C}{\partial b}\] \[\text{Gradient‐descent update:} \quad W \leftarrow W - \eta\,\frac{\partial C}{\partial W}, \qquad b \leftarrow b - \eta\,\frac{\partial C}{\partial b}\]

Usually we use a learning rate of 0.1 or 0.01, so we take small steps and don’t move too fast (we might miss the bottom of the bowl and jump straight to the other side).

Each component of the negative gradient vector tells us two things:

  • The sign, of course, tells us whether the corresponding component of the input vector should be nudged up or down.
  • The relative magnitudes of the components in this gradient tell us which of those changes matters more.

Backpropagation (efficiently calculating the gradients (derivatives) that Gradient Descent needs to update the weights of a neural network)

It’s just a fancy name for computing all those gradients efficiently, by applying the chain rule layer by layer.

backpropagation

Not only do we have to compute the error in the forward pass (when we go from input layer -> hidden layer -> output layer), we also need to “backpropagate” from output layer -> hidden layer -> input layer, adjusting the weights and biases based on the partial derivatives we calculated, so as to minimise the error.

How does it calculate the error, what do you mean? Why does this work? What is learning? Which weights and biases minimise a certain cost function?

Algorithm for computing the gradient: Don’t be scared by these equations, they don’t mean shit for now, but it’ll make sense later. If you know enough math to understand this in the first go… why are you still reading this?

\[\delta^{L} = (\hat{\mathbf{y}} - \mathbf{y}) \odot \sigma'(z^{L})\] \[\delta^{l} = ((\mathbf{W}^{(l+1)})^T \delta^{(l+1)}) \odot \sigma'(z^{(l)})\] \[\frac{\partial C}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (a^{(l-1)})^T\] \[\frac{\partial C}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}\]

Fuck this, that’s what I said to myself when I saw it

Focus on one example: the input is a 2, and let’s say these are our activations based on random assignments of weights and biases.

Activations

0 : 0.5
1 : 0.8
2 : 0.2
3 : 1.0 (white)
4 : 0.4
5 : 0.6
6 : 1.0 (white)
7 : 0.0 (black)
8 : 0.2
9 : 0.1

This is utter trash like we talked about. We want it to be like:

Activations

0 : 0.0
1 : 0.0
2 : 1.0 (white)
3 : 0.0
4 : 0.0
5 : 0.0
6 : 0.0
7 : 0.0
8 : 0.0
9 : 0.0

We want that third value to get nudged up while all others get nudged down.

The sizes of these nudges should be proportional to how far away each current value is from target value.

What I mean by this:

  • An increase in the number-2 activation is more important than a decrease in the number-8 neuron (which at 0.2 is already close to 0)

So let’s take the digit-2 neuron: $0.2 = \sigma\left(\sum_n w_n a_n + b\right)$

What can we change:

  • Increase bias.
  • Increase weights.
  • Change activations from previous layer.

It makes sense to increase each weight in proportion to the activation of the neuron feeding into it.

If a neuron already has a high activation (a high probability of being part of that number), you want to increase the weight of its connection, since that gives more importance to that neuron.

We need to increase weights of neurons that contribute to neuron 2 in a more powerful way. That makes sense.

Similarly, we change the activations of the previous layer in proportion to the weights.

The biggest increases in weights - the biggest strengthening of connections - happen between neurons that are the most active and the ones which we wish to become more active.

In practice it’s really time-consuming for the computer to calculate all the gradients and update the weights for ALL the training data at once. Hence we use stochastic (randomized) gradient descent, where we:

  • Shuffle Training Data.
  • Randomly subdivide into mini batches.
  • Go through all the mini batches and make these adjustments until the loss converges.
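The three steps above can be sketched on a toy problem. This is not our digit network; it fits a single made-up weight $w$ in $y \approx wx$ with mini-batch SGD, but the shuffle/batch/update loop is the same shape:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: y = 3x plus a little noise (stand-in for images and labels).
X = rng.random(1000)
y = 3.0 * X + 0.01 * rng.standard_normal(1000)

w, eta, batch_size = 0.0, 0.5, 32

for epoch in range(20):
    order = rng.permutation(len(X))              # 1. shuffle the training data
    for start in range(0, len(X), batch_size):   # 2. subdivide into mini batches
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # 3. adjust w using the gradient of the squared error on this batch only
        grad = np.mean(2 * (w * xb - yb) * xb)
        w -= eta * grad

print(w)   # close to the true slope, 3.0
```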

The bias associated with the digit-2 neuron should be increased, because that will cause the activation of the digit-2 neuron to increase. All the other output activations should be decreased, which requires decreasing their associated biases.

Backpropagation Error Flow

time 4 math

Feel free to skip this part if you are not going to make it.

Backprop Chain Rule

Backpropagation is an algorithm for calculating the gradient of the cost function of a network.

Entries of gradient vector are partial derivatives of the cost function with respect to weights and biases.

Once you know how sensitive the cost function is to activations in this second-to-last layer, you can just repeat the process for all the weights and biases feeding into that layer.

At the heart of every neural network: matrix multiplication, derivatives, and the chain rule. That’s it.

  • Matrix multiplication encodes how signals flow between layers. Each weight amplifies or dampens its inputs.
  • The cost function turns “learning” into “find the minimum.” Derivatives point us downhill.
  • The chain rule lets us trace how a small tweak to any weight ripples through every layer.

Line these up in the right order and you get backpropagation.

So what do these scary equations actually mean?

Consider a simple neural network → 4 neurons connected in a chain via 3 links, hence 4 activations, 3 weights and 3 biases (one for each neuron except the first) Simple NN

Goal

  • How sensitive the cost function is to these variables?
  • Which knobs to turn to make the cost function minimal?

We’re only going to focus on the activations of the last 2 neurons.

The weight, previous activation, and bias are combined to compute the pre-activation:

\[z = \sum_i w_i a_i + b\]

This in turn is passed through the sigmoid activation function:

\[a = \sigma(z)\]

Finally, the activation $a$, along with the true label $y$, is used to compute the cost (e.g., using mean squared error or cross-entropy).

$a^{(L-1)}$ is itself computed from:

  • its own weights and biases,
  • and activations from the previous layer $a^{(L-2)}$,

and this pattern repeats backward through the network all the way to the input.

Each layer’s output becomes the next layer’s input, all the way to the final cost.

Image1 Image2 Image3

sensitivity of the cost to a single weight

Our goal is to measure how a tiny change (“nudge”) in one weight in the last layer, $w^{(L)}$, affects the loss $C$.
In calculus terms we want the partial derivative:

\[\frac{\partial C}{\partial w^{(L)}}\]

When you see $\partial w^{(L)}$, think “a small nudge to $w^{(L)}$”;
when you see $\partial C$, think “the resulting nudge to the cost.”
Their ratio $\tfrac{\partial C}{\partial w^{(L)}}$ is the instantaneous rate of change of the loss with respect to that weight.


1. chain rule decomposition

Because $C$ depends on $w^{(L)}$ only through the pre‐activation $z^{(L)}$ and the activation $a^{(L)}$, the chain rule gives

\[\frac{\partial C}{\partial w^{(L)}} = \frac{\partial C}{\partial a^{(L)}} \;\times\; \frac{\partial a^{(L)}}{\partial z^{(L)}} \;\times\; \frac{\partial z^{(L)}}{\partial w^{(L)}}.\]

A neat trick to verify this is right: cancel each denominator with the matching numerator in the next factor, and you’ll recover the original derivative.


2. individual derivatives

  1. Pre‐activation
\[z^{(L)} = w^{(L)}a^{(L-1)} + b^{(L)} \quad\Longrightarrow\quad \frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}.\]
  2. Activation (sigmoid)
\[a^{(L)} = \sigma\left(z^{(L)}\right) \quad\Longrightarrow\quad \frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma'\left(z^{(L)}\right) = \sigma\left(z^{(L)}\right)\left(1 - \sigma\left(z^{(L)}\right)\right).\]
  3. Loss for one example (Mean Squared Error)
\[C_0 = \left(a^{(L)} - y\right)^2 \quad\Longrightarrow\quad \frac{\partial C_0}{\partial a^{(L)}} = 2\left(a^{(L)} - y\right).\]

3. combined gradient

Putting it all together,

\[\frac{\partial C_0}{\partial w^{(L)}} = \underbrace{2\bigl(a^{(L)} - y\bigr)}_{\frac{\partial C_0}{\partial a^{(L)}}} \;\times\; \underbrace{\sigma'\!\bigl(z^{(L)}\bigr)}_{\frac{\partial a^{(L)}}{\partial z^{(L)}}} \;\times\; \underbrace{a^{(L-1)}}_{\frac{\partial z^{(L)}}{\partial w^{(L)}}} \;=\; 2\,\bigl(a^{(L)} - y\bigr)\,\sigma'\!\bigl(z^{(L)}\bigr)\,a^{(L-1)}.\]

This tells us exactly how a small change in $w^{(L)}$ for one training example will change the loss.
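As a sanity check (a small sketch, not part of the derivation above), we can compare this formula against a finite-difference estimate for a single sigmoid neuron with made-up values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# one sigmoid neuron: a = sigmoid(w * a_prev + b), cost C0 = (a - y)^2
w, b, a_prev, y = 0.5, -0.2, 0.8, 1.0

def cost(w_val):
    a = sigmoid(w_val * a_prev + b)
    return (a - y) ** 2

# analytic gradient from the formula: 2 (a - y) * sigma'(z) * a^(L-1)
z = w * a_prev + b
a = sigmoid(z)
analytic = 2 * (a - y) * a * (1 - a) * a_prev

# finite-difference estimate of dC0/dw
eps = 1e-6
numeric = (cost(w + eps) - cost(w - eps)) / (2 * eps)

print(analytic, numeric)  # the two agree to many decimal places
```

If the two numbers disagree, the analytic gradient is wrong; this kind of gradient check is a standard way to debug backprop code.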


4. averaging over the dataset

For $N$ training examples, we average these per-example gradients to get the corresponding component of the gradient of the full cost $C$:

\[\frac{\partial C}{\partial w^{(L)}} = \frac{1}{N}\sum_{i=1}^{N} \frac{\partial C^{(i)}}{\partial w^{(L)}}.\]

5. update rule (gradient descent)

Finally, we update the weight by stepping opposite to the gradient:

\(w^{(L)} \;\leftarrow\; w^{(L)} \;-\;\eta\, \frac{\partial C}{\partial w^{(L)}},\) where $\eta$ is the learning rate.
Likewise for the bias: \(b^{(L)} \;\leftarrow\; b^{(L)} \;-\;\eta\, \frac{\partial C}{\partial b^{(L)}}.\)

stochastic gradient descent:

When your dataset is huge, computing the exact gradient
$\displaystyle \nabla C = \frac{1}{N}\sum_{i=1}^N \nabla C^{(i)}$
on all $N$ examples each update can be very slow.
Stochastic Gradient Descent (SGD) speeds things up by estimating the true gradient with a small random mini-batch of $m\ll N$ examples:

\[\nabla C_{\text{batch}} = \frac{1}{m}\sum_{i\in\text{batch}}\nabla C^{(i)}.\]
  1. Sample a mini-batch of $m$ training examples at random.
  2. Compute the average gradient $\nabla C_{\text{batch}}$.
  3. Update the parameters:
\[\theta \leftarrow \theta - \eta \nabla C_{\text{batch}}\]

where $\eta$ is the learning rate.

why it works

  • Cost per update falls from $O(N)$ to $O(m)$.
  • Unbiased estimate: $\mathbb{E}[\nabla C_{\text{batch}}] = \nabla C$.

Think of it like political polling: poll 1,000 voters (a mini-batch) instead of the entire electorate (the full dataset). The average still points in the right direction.
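To see the polling intuition numerically, here is a toy sketch (the per-example "gradients" are synthetic random numbers, not from a real network):

```python
import numpy as np

rng = np.random.default_rng(0)

# pretend these are per-example gradient values for one weight, one per training example
per_example_grads = rng.normal(loc=1.5, scale=2.0, size=50_000)

# exact gradient: average over the entire "electorate" (expensive in practice)
full_gradient = per_example_grads.mean()

# SGD estimate: average over a small random "poll" of 64 examples
batch = rng.choice(per_example_grads, size=64, replace=False)
estimate = batch.mean()

print(full_gradient, estimate)  # close, at a tiny fraction of the cost
```

The mini-batch average is a noisy but unbiased estimate of the full average, which is all gradient descent needs to make progress.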


partial derivative w.r.t. bias

For a neuron with pre-activation \(z = W\,x \;+\; b,\) the bias $b$ enters linearly, so

\[\frac{\partial z}{\partial b} \;=\; 1 \;\;\Longrightarrow\;\; \frac{\partial C}{\partial b} = \frac{\partial C}{\partial z}\,\frac{\partial z}{\partial b} = \frac{\partial C}{\partial z}.\]

the full gradient vector

We’ve already seen how to compute two entries of the gradient vector for the last layer:

  1. Weight gradient
\[\frac{\partial C}{\partial W^{(L)}} = \delta^{(L)}\,(a^{(L-1)})^{\top}\]
  2. Bias gradient
\[\frac{\partial C}{\partial b^{(L)}} = \delta^{(L)}\]

where $\delta^{(L)} = \tfrac{\partial C}{\partial z^{(L)}}$.

By propagating these errors backward through the network, multiplying along the chain of dependencies, we obtain every partial derivative $\tfrac{\partial C}{\partial w}$ and $\tfrac{\partial C}{\partial b}$ for every layer. Stacking them gives the gradient vector:

\[\nabla C \;=\; \begin{bmatrix} \vdots \\ \displaystyle \frac{\partial C}{\partial w_1} \\ \displaystyle \frac{\partial C}{\partial b_1} \\ \vdots \end{bmatrix}.\]

Once we have $\nabla C$, we perform gradient descent (or SGD)
to step downhill and minimize the loss:

\[\theta \;\leftarrow\; \theta \;-\;\eta\,\nabla C.\]

The components of the gradient vector tell us which direction, and how strongly, to adjust each parameter.

If, for a particular configuration:

  • the partial derivative of the loss with respect to a parameter $k_1$ is positive, then increasing $k_1$ increases the loss;
  • hence we decrease $k_1$ (step with the opposite sign of the gradient) to decrease the loss.
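The sign rule can be watched in action on a one-dimensional made-up loss $C(k_1) = (k_1 - 3)^2$, whose minimum is at $k_1 = 3$:

```python
# gradient descent on C(k) = (k - 3)^2, minimized at k = 3
k = 0.0             # start left of the minimum, where dC/dk = 2(k - 3) is negative
learning_rate = 0.1

for _ in range(100):
    grad = 2 * (k - 3)          # partial derivative of the loss w.r.t. k
    k -= learning_rate * grad   # step opposite the gradient's sign

print(k)  # close to 3
```

Starting left of the minimum the gradient is negative, so subtracting it pushes $k$ up toward 3; starting right of it, the gradient is positive and $k$ is pushed down.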

How do we access the derivative of the loss function in the first place? The Chain Rule → it tells you how to compute the derivative of a composition of functions.

The chain rule tells us how to compute the derivative of a function that is itself the input to another function. In symbols, if \(h(x) = f\bigl(g(x)\bigr),\) then \(\frac{d}{dx}h(x) = \frac{d}{dx}\bigl[f\bigl(g(x)\bigr)\bigr] = f'\bigl(g(x)\bigr)\;\cdot\;g'(x).\)

intuition: two “machines” in series

  1. Machine 1

    • Input: $x$
    • Output: $g(x)$
    • Local slope: $g'(x)$
  2. Machine 2

    • Input: $u = g(x)$
    • Output: $f(u)$
    • Local slope: $f'(u)$

When you feed $x$ into the first machine, it spits out $g(x)$.
That becomes the input to the second machine, which outputs $f\bigl(g(x)\bigr)$.

small nudge argument

  1. Nudge $x$ by a tiny amount $\Delta x$.
  2. Machine 1 outputs

\(g(x + \Delta x) \approx g(x) + g'(x)\,\Delta x\) So the change in its output is \(\Delta u \approx g'(x)\,\Delta x,\) where $u = g(x)$.

  3. Machine 2 then sees its input change by $\Delta u$, so its output changes by
\[f(u + \Delta u) \approx f(u) + f'(u)\,\Delta u\]
  4. Combined change in the final output:
\[\Delta h = f'\bigl(g(x)\bigr)\,\Delta u \approx f'\bigl(g(x)\bigr)\,\bigl[g'(x)\,\Delta x\bigr] = \bigl[f'\bigl(g(x)\bigr)\,g'(x)\bigr]\,\Delta x\]
  5. Dividing by $\Delta x$ and taking the limit $\Delta x \to 0$ gives the chain rule:
\[\frac{d}{dx}f\bigl(g(x)\bigr) = f'\bigl(g(x)\bigr) \cdot g'(x)\]
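We can verify the limit numerically with a small sketch (using $g(x) = x^2$ and $f(u) = \sin u$, chosen arbitrarily):

```python
import math

def g(x): return x * x          # machine 1
def f(u): return math.sin(u)    # machine 2
def h(x): return f(g(x))        # the two machines in series

x = 1.3
chain_rule = math.cos(g(x)) * (2 * x)   # f'(g(x)) * g'(x)

# finite-difference estimate of dh/dx
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)

print(chain_rule, numeric)  # the two agree closely
```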

why this matters for neural networks

Backpropagation is just many of these compositions chained together. Each layer’s output feeds into the next layer’s activation function.

The chain rule lets us multiply all the local derivatives together to find how the final loss changes with respect to any parameter deep in the network.

For the full chain rule derivation with worked examples and more calculus, see the math post.

To calculate the derivative of the cost function with respect to weights and bias for a simple sigmoid-activated neuron:

We want to find the gradient of our Cost Function, which is usually the Mean Squared Error (MSE), with respect to the weights $w$ and bias $b$.

1. Define the building blocks:

  • Input Features: $X$ (a vector or matrix of input values, $x_1, x_2, \dots, x_n$)
  • Weights: $W$ (a vector of weights, $w_1, w_2, \dots, w_n$)
  • Bias: $b$ (a single scalar value)
  • Activation Function: Sigmoid function,
\[\sigma(z) = \frac{1}{1 + e^{-z}}\]
  • The derivative of the sigmoid function is
\[\sigma'(z) = \sigma(z)(1 - \sigma(z))\]
  • Actual Labels: $Y$ (the true output values)
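Before using the derivative identity, it's worth convincing yourself of it numerically (a quick sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5, 5, 11)

# right-hand side of the identity: sigma(z) * (1 - sigma(z))
identity = sigmoid(z) * (1 - sigmoid(z))

# finite-difference estimate of sigma'(z)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

print(np.max(np.abs(identity - numeric)))  # essentially zero
```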

2. The Forward Pass

First, let’s understand how we get from inputs to our error.

  • Linear Combination (Weighted Sum + Bias): This is the input to the activation function.
\[z = X \cdot W + b\]

In code, this is z = np.dot(features, weights) + bias. This is a sum of $x_i w_i$ for all input features, plus the bias.

  • Predicted Output (y_hat): Apply the sigmoid activation function to (z).
\[\hat{y} = \sigma(z)\]

In code, this is predictions = sigmoid(z). This is your predicted_output.

  • Cost Function (Mean Squared Error - MSE): We measure how “wrong” our prediction is. For a single training example, the error is:
\[(\hat{y}_i - y_i)^2\]

For multiple samples, we typically take the mean.

\[C = \text{MSE} = \frac{1}{N} \sum_{i=1}^N (\hat{y}_i - y_i)^2\]

In code, this is mse = np.mean((predictions - labels) ** 2). This is your Cost.

The process of calculating $z$, $\hat{y}$, and $C$ is called the forward pass.

\[z^{(l)} = \mathbf{W}^{(l)} a^{(l-1)} + \mathbf{b}^{(l)}, \quad a^{(l)} = \sigma(z^{(l)})\]
# forward pass
z = np.dot(features, weights) + bias # sum(wx + b)
predictions = sigmoid(z) # apply sigmoid function to this (this is y_hat)
mse = np.mean((predictions - labels) ** 2) # take mean square error (this is C)
mse_values.append(round(mse, 4)) # add it to mean square error list

3. The Backward Pass

Now, we use the Chain Rule to find how the Cost $C$ changes with respect to each weight $w_j$ and the bias $b$.
The Chain Rule breaks down complex derivatives into a product of simpler ones.

We want to find $\frac{\partial C}{\partial w_j}$ and $\frac{\partial C}{\partial b}$.

Let’s trace the dependencies backward from $C$:

\[C \leftarrow \hat{y} \leftarrow z \leftarrow (W, b)\]

First, the derivative of the Cost with respect to a single Predicted Output $\hat{y}_i$:

\[\frac{\partial C}{\partial \hat{y}_i} = \frac{\partial}{\partial \hat{y}_i} \left( \frac{1}{N} \sum_k (\hat{y}_k - y_k)^2 \right) = \frac{2}{N}(\hat{y}_i - y_i)\]

In code, error = predictions - labels. So $\frac{\partial C}{\partial \hat{y}_i}$ is proportional to error; the $\frac{2}{N}$ factor is applied when we average over the $N$ samples.

Second, the derivative of the Predicted Output ($\hat{y}$) with respect to $z$:

This is the derivative of the sigmoid function. \(\frac{\partial \hat{y}}{\partial z} = \frac{\partial}{\partial z} \sigma(z) = \sigma(z)(1 - \sigma(z))\)

In code, this is sigmoid_derivative = predictions * (1 - predictions).

Combining the first two steps to get $\frac{\partial C}{\partial z_i}$:

Using the chain rule, $\frac{\partial C}{\partial z_i} = \frac{\partial C}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z_i}$

\[\frac{\partial C}{\partial z_i} = \frac{2}{N}(\hat{y}_i - y_i) \cdot \sigma(z_i)(1 - \sigma(z_i))\]

In code, delta = error * sigmoid_derivative calculates (predictions - labels) * sigmoid_derivative for each sample. The (2/len(labels)) factor is applied later when we average over samples, so delta represents $(\hat{y}_i - y_i) \cdot \sigma'(z_i)$ for each sample.

Now, for the final gradients:

Derivative of Cost with respect to a Weight $w_j$: Using the chain rule: $\frac{\partial C}{\partial w_j} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial w_j}$ We know that $z = X \cdot W + b = x_1 w_1 + x_2 w_2 + \dots + x_j w_j + \dots + x_n w_n + b$

So, $\frac{\partial z}{\partial w_j} = x_j$

Therefore

\[\frac{\partial C}{\partial w_j} = \frac{2}{N} \sum_{i} (\hat{y}_i - y_i) \cdot \sigma(z_i)(1 - \sigma(z_i)) \cdot x_{ij}\]

(where $x_{ij}$ is the $j$-th feature of the $i$-th sample).

In code: partial_derivative_error_wrt_weights = (2 / len(labels)) * np.dot(features.T, delta). Here np.dot(features.T, delta) performs the sum $\sum_i \text{delta}_i \cdot x_{ij}$ for each feature $j$, which is exactly what the formula requires.

Derivative of Cost with respect to the Bias (b):

Using the chain rule $\frac{\partial C}{\partial b} = \frac{\partial C}{\partial z} \cdot \frac{\partial z}{\partial b}$

We know $z = X \cdot W + b$

So $\frac{\partial z}{\partial b} = 1$ Therefore

\[\frac{\partial C}{\partial b} = \left( \frac{2}{N} \sum (\hat{y}_i - y_i) \cdot \sigma(z_i)(1 - \sigma(z_i)) \right) \cdot 1\] \[\frac{\partial C}{\partial b} = \frac{2}{N} \sum (\hat{y}_i - y_i) \cdot \sigma(z_i)(1 - \sigma(z_i))\]

In code: partial_derivative_error_wrt_bias = (2 / len(labels)) * np.sum(delta). Here np.sum(delta) computes $\sum_i \text{delta}_i$, and the $\frac{2}{N}$ prefactor supplies the averaging, which is what the formula requires. (Using np.mean here as well would divide by $N$ twice.)

Calculate error at the output

\[\delta^{(L)} = (\hat{\mathbf{y}} - \mathbf{y}) \odot \sigma'(z^{(L)})\]

Propagate errors backward through the hidden layers:

\[\delta^{(l)} = \bigl( (W^{(l+1)})^{\top} \delta^{(l+1)} \bigr) \odot \sigma'(z^{(l)})\]

Code for the Backward Pass:

error = predictions - labels # This is (y' - y) for each sample
sigmoid_derivative = predictions * (1 - predictions) # This is sigma'(z) for each sample
delta = error * sigmoid_derivative # This is (y' - y) * sigma'(z) for each sample

# Calculate the partial derivatives (gradients) of the Cost with respect to weights and bias
# partial_derivative_error_wrt_weights corresponds to dC/dW
# partial_derivative_error_wrt_bias corresponds to dC/db
partial_derivative_error_wrt_weights = (2 / len(labels)) * np.dot(features.T, delta)  # using chain rule
partial_derivative_error_wrt_bias = (2 / len(labels)) * np.sum(delta) # using chain rule (sum, not mean: the 2/N factor already averages)

# update weights and bias using the calculated gradients (Gradient Descent step)
weights -= learning_rate * partial_derivative_error_wrt_weights
bias -=  learning_rate * partial_derivative_error_wrt_bias
gradients = [(partial_derivative_error_wrt_weights, partial_derivative_error_wrt_bias)] # store the gradients (dC/dW, dC/db) for later inspection

It is called "backward" because the gradients are computed by propagating sensitivities backward, from the output layer to the input layer.

Once these gradients are found, we tweak the parameters based on them.

Slightly nudge each knob opposite to the gradient: \(w \;\leftarrow\; w \;-\; \eta\,\frac{\partial C}{\partial w}.\)

After each adjustment we redo the forward and backward passes, since the activations (and hence the gradients) change with the new parameters.

Performing this loop of:

  1. Forward pass
  2. Backward pass
  3. Nudge the knobs
  4. Repeat steps 1–3

is the essence of training modern machine learning systems.

As long as your fancy model can be decomposed as a sequence of differentiable functions, you can apply backprop to optimize the parameters.

why is it fast?

Why not just brute-force the gradient? You could treat the cost as a function of the weights, $C = C(w)$, and estimate $\partial C/\partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$: bump each weight by a tiny amount and see what happens.

1. brute-force gradient descent

To estimate $\tfrac{\partial C}{\partial w_j}$ by finite differences, you’d do

\[\frac{\partial C}{\partial w_j} \approx \frac{C\bigl(w + \varepsilon\,e_j\bigr)\;-\;C(w)}{\varepsilon},\]

where $e_j$ is the unit vector in direction $j$. If you have $N$ weights, that requires $N+1$ full forward passes through the network per training example!
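A sketch that counts the work (the one-layer "network" and its sizes here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

forward_passes = 0  # count how much work finite differences cost

def cost(w, x, y):
    """One full forward pass through a toy one-layer network."""
    global forward_passes
    forward_passes += 1
    return float(np.mean((sigmoid(x @ w) - y) ** 2))

rng = np.random.default_rng(0)
n_weights = 100
x = rng.normal(size=(32, n_weights))   # 32 examples, 100 input features
y = rng.random((32, 1))                # targets
w = rng.normal(size=(n_weights, 1))

# finite-difference gradient: one baseline pass + one perturbed pass per weight
eps = 1e-6
base = cost(w, x, y)
grad = np.zeros_like(w)
for j in range(n_weights):
    w_pert = w.copy()
    w_pert[j, 0] += eps
    grad[j, 0] = (cost(w_pert, x, y) - base) / eps

print(forward_passes)  # 101: N + 1 passes for N = 100 weights
```

For a real network with millions of weights, that per-update cost is hopeless; backprop gets the same gradient from a single forward and backward pass.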

2. backpropagation via the chain rule

Backprop avoids that explosion by computing all partial derivatives
$\tfrac{\partial C}{\partial w_j}$ at once with:

  1. One forward pass (compute all activations).
  2. One backward pass (propagate “error signals” and multiply local derivatives).

2.1 Small‐Change Analysis

A small change $\Delta w_{l,j,k}$ in weight $w_{l,j,k}$ causes a small change in the neuron’s activation:

\[\Delta a_{l,j} \approx \frac{\partial a_{l,j}}{\partial w_{l,j,k}}\, \Delta w_{l,j,k}.\]

That, in turn, perturbs the next layer’s activation:

\[\Delta a_{l+1,q} \approx \frac{\partial a_{l+1,q}}{\partial a_{l,j}}\, \Delta a_{l,j}.\]

… and so on, all the way to the cost (C):

\[\Delta C \approx \frac{\partial C}{\partial w_{l,j,k}}\, \Delta w_{l,j,k}.\]

2.2 Chain Rule for a Single Weight

By following every path from $w_{l,j,k}$ to the final cost,
and multiplying the local partials along each edge,
then summing over all such paths, you get:

\[\frac{\partial C}{\partial w_{l,j,k}} = \sum_{\text{paths }p} \prod_{\substack{\text{edges }(u\to v)\\\in p}} \frac{\partial v}{\partial u}.\]

Backpropagation is simply a dynamic‐programming implementation of this sum‐of‐products in linear time $O(N)$, rather than $O(N^2)$ or worse.


  • Finite‐difference: $O(N)$ forward passes for $N$ weights → slow.
  • Backpropagation: 1 forward pass + 1 backward pass → fast.
  • Uses the chain rule to share and reuse intermediate derivatives,
    computing the entire gradient vector $\nabla C$ efficiently.

Think of the backpropagation algorithm as providing a way of computing the sum over the rate factors for all these paths. Or, to put it slightly differently, backpropagation is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.

coding time coding time coding time

The math can be intimidating but it’s actually pretty simple once you sit down with it.

Enough theory. Let’s build a program that correctly identifies numbers from handwritten images. The hello world of neural networks.

about the dataset

MNIST: 60,000 training images and 10,000 test images of handwritten digits. Scanned from 250 people - half Census Bureau employees, half high school students.

MNIST

Greyscale, 28x28 pixels. The test data comes from a different set of 250 people so we know the network can generalize to handwriting it’s never seen before.

What we want it to predict (Ideal Output)

Output

MNIST from scratch (using NumPy)

import numpy as np
"""
if we want to create a Network object with 2 neurons in the first layer, 3 neurons in the second layer, and 1 neuron in the final layer:
    net = Network([2, 3, 1])

net.weights[1] is a Numpy matrix storing the weights connecting the second and third layers of neurons (Python counts from 0)
"""
class Network:
    def __init__(self, sizes) -> None:
        ''' This creates a network with layers and indicates how many neurons in each layer (sizes)'''
        self.num_layers = len(sizes)
        self.sizes = sizes
        # randomly initialize the weights and bias
        self.bias = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

    def sigmoid(self, z) -> float:
        ''' Activation function : squash any real number into the range 0 -> 1 '''
        return 1.0 / (1.0 + np.exp(-z))
        # relu : return max(0, z) 

    def sigmoid_derivative(self, z) -> float:
        ''' Derivative of sigmoid : we need this for calculating partial derivative of cost function '''
        return self.sigmoid(z) * (1 - self.sigmoid(z)) 

    def forward(self, activation) -> float:
        ''' Return output of neural network '''
        for b, w in zip(self.bias, self.weights):
            activation = self.sigmoid(np.dot(w, activation) + b) # sigmoid(wx + b) (wx = dot product if you think about how matrix works)
        return activation

    def SGD(self, training_data, epochs, mini_batch_size, learning_rate, test_data=None):
        """
        Train the neural network using mini-batch stochastic gradient descent.  
        The "training_data" is a list of tuples "(x, y)" representing the training inputs and the desired outputs.
        If "test_data" is provided then the network will be evaluated against the test data after each epoch, and partial progress printed out.
an epoch is one full pass over the entire training data.
        """
        if test_data:
            n_test = len(test_data)
        n = len(training_data)
        for epoch in range(epochs):
            np.random.shuffle(training_data)
            mini_batches = [training_data[k : k + mini_batch_size] for k in range(0, n, mini_batch_size)]
            # for each mini_batch we apply a single step of gradient descent
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, learning_rate)
            
            if test_data:
                accuracy = self.evaluate(test_data)
                accuracy_percentage = (accuracy / n_test) * 100
                print(f"Epoch {epoch}: {accuracy} / {n_test}")
            else:
                print(f"Epoch {epoch} complete")

        if test_data:
            print(f"Accuracy : {accuracy_percentage:.2f}%")
            
    def update_mini_batch(self, mini_batch, learning_rate):
        ''' Update weights and biases by applying Gradient Descent to single mini batch (list of tuples (x,y) representing training inputs and their desired outputs for this mini batch) '''
        # initialize gradients to 0
        # the sum of the gradients of the loss function with respect to the biases for all examples in the mini-batch.
        sum_of_bias_gradients = [np.zeros_like(b) for b in self.bias]
        sum_of_weight_gradients = [np.zeros_like(w) for w in self.weights]
        # Calculate gradient for each example in mini batch.
        for x, y in mini_batch:
            grad_b, grad_w = self.backprop(x, y)
            # update sum of bias and weight gradients (elementwise addition for gradients)
            # the overall gradient used to update the weights and biases is the average of the gradients computed for each individual training example within that mini-batch.
            sum_of_bias_gradients = [bias_sum + gradient_bias for bias_sum, gradient_bias in zip(sum_of_bias_gradients, grad_b)]
            sum_of_weight_gradients = [weight_sum + gradient_weight for weight_sum, gradient_weight in zip(sum_of_weight_gradients, grad_w)]
        
        mini_batch_size = len(mini_batch)
        learning_rate_scaled = learning_rate / mini_batch_size
        # update weights and biases using avg gradients
        self.weights = [
            w - learning_rate_scaled * weight_sum
            for w, weight_sum in zip(self.weights, sum_of_weight_gradients)
        ]

        self.bias = [
            b - learning_rate_scaled * bias_sum
            for b, bias_sum in zip(self.bias, sum_of_bias_gradients)
        ]

    def backprop(self, x, y):
        """
        Calculates the gradient for the cost function C_x
        for a single training example (x, y) using backpropagation.

        Args:
            x (np.ndarray): The input feature vector.
            y (np.ndarray): The true label/target output vector.

        Returns:
            tuple[list[np.ndarray], list[np.ndarray]]: A tuple containing
            (grad_b, grad_w), which are layer-by-layer lists of numpy
            arrays representing the gradients for biases and weights,
            respectively.
        """
        grad_b, grad_w = [np.zeros_like(b) for b in self.bias], [np.zeros(w.shape) for w in self.weights]
        # forward pass
        # input layer activation and list to store all activations (layer by layer)
        activation = x
        activations = [x]
        all_z = [] # list to store all zs = (wx + b) and activation is basically sigmoid(z)
        for b, w in zip(self.bias, self.weights):
            z = np.dot(w, activation) + b
            all_z.append(z)
            activation = self.sigmoid(z)
            activations.append(activation)

        # backward pass
        # activations[-1] is the output layer's activation (by output layer I mean last layer)
        # all_z[-1] is the weighted input for the output layer
        error = self.cost_derivative(activations[-1], y) * self.sigmoid_derivative(all_z[-1])
        # update last layers
        grad_b[-1] = error
        grad_w[-1] = np.dot(error, activations[-2].T) # error · (a^(L-1))^T = partial derivative of cost wrt weights of last layer

        # Iterate backward through the layers (from second-to-last to input)
        # We iterate in reverse order of layers (from output towards input).
        # Layer 1 is the first hidden layer (weights[0], biases[0]).
        # Layer L-1 is the last hidden layer (weights[-2], biases[-2]).
        for layer_idx in reversed(range(self.num_layers - 1)):
            # skip last layer (we handled that above)
            if layer_idx == self.num_layers - 2: 
                continue 
            z = all_z[layer_idx]
            sd = self.sigmoid_derivative(z)
            # Error propagation: delta for current layer from delta of next layer
            error = np.dot(self.weights[layer_idx + 1].T, error) * sd

            # Gradients for current layer's biases and weights
            grad_b[layer_idx] = error
            grad_w[layer_idx] = np.dot(error, activations[layer_idx].T)

        return grad_b, grad_w

    
    def cost_derivative(self, output_activations, y):
        ''' Return prediction - actual '''
        return output_activations - y


    def evaluate(self, test_data):
        ''' Return the number of test inputs for which the network outputs correct result '''
        test_results = [
                        (np.argmax(self.forward(x)), y) for (x, y) in test_data
                        ]
        return sum(int(x == y) for (x, y) in test_results)

# mnist_loader is a helper module (as in Nielsen's "Neural Networks and Deep Learning" repo) that returns (input, label) pairs
training_data, validation_data, testing_data = mnist_loader.load_data_wrapper()

# 784 input image pixels, 30 hidden neurons, 10 output neurons (one activation per digit)
net = Network([784, 30, 10])

# use stochastic gradient descent to learn from the MNIST training_data over 30 epochs, with a mini-batch size of 10, and a learning rate of η = 0.01
net.SGD(training_data, 30, 10, 0.01, test_data=testing_data)

# Predicting a new image
from pathlib import Path
from PIL import Image

def preprocess_image(image_path):
    """
    Loads an image, converts to grayscale, resizes to 28x28,
    normalizes pixels, and reshapes to (784, 1) numpy array.
    """
    try:
        img = Image.open(image_path).convert('L') # grayscale
        img = img.resize((28, 28)) 
        img_array = np.array(img) 

        # MNIST pixels are 0 (black) to 255 (white).
        # Flatten to a (784, 1) column vector and scale pixels to [0, 1]
        img_vector = np.reshape(img_array, (784, 1)) / 255.0
        return img_vector

    except FileNotFoundError:
        print(f"Error: Image file not found at {image_path}")
        return None
    except Exception as e:
        print(f"Error processing image {image_path}: {e}")
        return None

image_path = Path(__file__).parent / "data" / "sample_digit.jpg"
processed_image = preprocess_image(image_path)
if processed_image is not None:
    output_activation = net.forward(processed_image)
    prediction = np.argmax(output_activation) # idx of neuron with highest activation
    print("\nNetwork Output Activations (confidence for each digit 0-9):")
    for i, val in enumerate(output_activation.flatten()):
            print(f"  Digit {i}: {val * 100:.2f}%")
    print(f"\nPredicted digit: {prediction}")
else:
    print("Could not make a prediction due to image processing error.")


Trying it on this sample image:

9

Output

Epoch 0: 2718 / 10000
Epoch 1: 3465 / 10000
Epoch 2: 3928 / 10000
Epoch 3: 4300 / 10000
Epoch 4: 4650 / 10000
Epoch 5: 5049 / 10000
Epoch 6: 5531 / 10000
Epoch 7: 5907 / 10000
Epoch 8: 6255 / 10000
Epoch 9: 6504 / 10000
Epoch 10: 6704 / 10000
Epoch 11: 6888 / 10000
Epoch 12: 7043 / 10000
Epoch 13: 7165 / 10000
Epoch 14: 7266 / 10000
Epoch 15: 7358 / 10000
Epoch 16: 7441 / 10000
Epoch 17: 7508 / 10000
Epoch 18: 7583 / 10000
Epoch 19: 7661 / 10000
Epoch 20: 7722 / 10000
Epoch 21: 7793 / 10000
Epoch 22: 7830 / 10000
Epoch 23: 7878 / 10000
Epoch 24: 7914 / 10000
Epoch 25: 7957 / 10000
Epoch 26: 7991 / 10000
Epoch 27: 8052 / 10000
Epoch 28: 8073 / 10000
Epoch 29: 8119 / 10000
Accuracy : 81.19% (It means it predicted 8119 images correctly out of 10000)

Network Output Activations (probability for each digit 0-9):
  Digit 0: 0.21%
  Digit 1: 0.08%
  Digit 2: 0.06%
  Digit 3: 0.53%
  Digit 4: 17.36%
  Digit 5: 2.49%
  Digit 6: 2.85%
  Digit 7: 0.12%
  Digit 8: 48.34%
  Digit 9: 70.89% # highest

Predicted digit: 9 # HELL YEAH

This was tedious, hence we have libraries like PyTorch/TensorFlow that already do this calculus for us.

basic model in libraries like PyTorch/TensorFlow

import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

# Define a transform to normalize the data
transform = transforms.Compose([
    transforms.ToTensor(), # Converts image to tensor and scales to [0, 1]
    transforms.Normalize((0.1307,), (0.3081,)) # Standardize with MNIST's mean and std
])

# Download and load the training, testing data
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Load data
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=1000, shuffle=False)

class PyTorchNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.inp_to_hidden = nn.Linear(784, 128) # inp to hidden layer
        self.relu = nn.ReLU() # activation function
        self.hidden_to_pred = nn.Linear(128, 10) # hidden to output layer

    # Takes input
    def forward(self, x):
        x = x.view(x.size(0), -1) # flatten (batch, 1, 28, 28) images to (batch, 784)
        x = self.inp_to_hidden(x)
        x = self.relu(x)
        x = self.hidden_to_pred(x)
        return x

model = PyTorchNet()
loss = nn.CrossEntropyLoss() # measures how far the predicted logits are from the true labels

optimizer = optim.Adam(model.parameters(), lr=0.01) # adaptive gradient descent: weights -= lr * gradient, with per-parameter scaling

epochs = 2000
for epoch in range(epochs):
    for xb, yb in train_loader:
        optimizer.zero_grad() # reset gradients to 0 before each batch (PyTorch accumulates them by default)
        pred = model(xb) # calculate prediction
        final_loss = loss(pred, yb) # loss of prediction
        final_loss.backward() # apply backpropagation to this 
        optimizer.step() # update weights correctly
    if epoch % 500 == 0: # every 500th iteration : print loss
        print(f"Epoch {epoch}, Loss : {final_loss.item():.4f}")

# prediction on a batch of test images
with torch.no_grad():
    features, targets = next(iter(test_loader)) # one batch from the test set
    pred = model(features)
    probs = torch.softmax(pred, dim=1)
    prediction = torch.argmax(probs, dim=1) # take the class with highest probability (the predicted digit)
    print("\nPredicted class probabilities:\n", probs) # probability of each class (digits 0 - 9)
    print("Predicted classes:", prediction) # predicted outputs
    print("True classes:", targets) # actual outputs

MNIST from MiniTorch (custom implementation of PyTorch)

MiniTorch

import numpy as np
from minitorch.tensor import Tensor
from minitorch.layers import Linear, ReLU
from minitorch.loss import cross_entropy_loss
from minitorch.optim import SGD, Adam

from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

def run_mnist_example():
    print("Fetching MNIST data...")
    mnist = fetch_openml('mnist_784', version=1)
    X = mnist.data.astype(np.float32) / 255.0
    y = mnist.target.astype(np.int64)
    # converting data to usable data (between 0 to 1 etc...)
    X_train_all, X_test_all, y_train_all, y_test_all = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train = X_train_all[:1000]
    y_train = y_train_all[:1000]
    X_test = X_test_all[:200]
    y_test = y_test_all[:200]

    def one_hot(labels, num_classes=10):
        oh = np.zeros((labels.shape[0], num_classes), dtype=np.float32)
        oh[np.arange(labels.shape[0]), labels] = 1.0
        return oh

    y_train_oh = one_hot(y_train, 10)
    y_test_oh = one_hot(y_test, 10)

    X_t = Tensor(X_train, requires_grad=False)
    y_t = Tensor(y_train_oh, requires_grad=False)

    # defining layers (wx + b) and activation functions
    layer1 = Linear(784, 128)
    relu = ReLU()
    layer2 = Linear(128, 10)

    params = layer1.parameters() + layer2.parameters()
    opt = SGD(params, lr=0.01, momentum=0.9)

    epochs = 10
    batch_size = 100
    num_batches = X_train.shape[0] // batch_size

    print("Training")

    for epoch in range(epochs):
        epoch_loss = 0.0
        for i in range(num_batches):
            start = i * batch_size
            end = start + batch_size
            xb = Tensor(X_train[start:end], requires_grad=False)
            yb = Tensor(y_train_oh[start:end], requires_grad=False)
            # forward pass
            out1 = layer1(xb)
            act1 = relu(out1)
            logits = layer2(act1)
            loss = cross_entropy_loss(logits, yb) # calculate loss
            # backward pass
            # we do .zero_grad() to reset all the gradients since we don't want gradients of previous layer to affect this new layer
            opt.zero_grad()
            loss.backward()
            opt.step()

            epoch_loss += loss.data

        avg_loss = epoch_loss / num_batches
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

    X_test_t = Tensor(X_test, requires_grad=False)
    out1_test = layer1(X_test_t)
    act1_test = relu(out1_test)
    logits_test = layer2(act1_test)
    preds = np.argmax(logits_test.data, axis=1)
    accuracy = np.mean(preds == y_test) # checking if what we predicted = real output
    print(f"Test Accuracy: {accuracy * 100:.2f}%")

if __name__ == "__main__":
    run_mnist_example()

Output

(.venv) smol@samit ~/fun/minitorch git:(master) > python3 mnist_example.py 
Fetching MNIST data...
Training
Epoch 1/10, Loss: 71.2395
Epoch 2/10, Loss: 24.0194
Epoch 3/10, Loss: 11.0858
Epoch 4/10, Loss: 6.6417
Epoch 5/10, Loss: 4.4537
Epoch 6/10, Loss: 3.4053
Epoch 7/10, Loss: 2.5522
Epoch 8/10, Loss: 1.8609
Epoch 9/10, Loss: 1.4868
Epoch 10/10, Loss: 1.0654
Test Accuracy: 78.50% # I know this sucks, can improve it via CNN, but it works.


All of this is similar: think of it as the same sentence in a different language. The underlying pattern behind MiniTorch/PyTorch/TensorFlow is the same (they all build on NumPy-style array math, which is just basic calculus).

For better results you can also use a CNN (Convolutional Neural Network), which is just a fancy term for a network that optimizes recognition through multiple stages of feature extraction. You can see this video for more.

But at a quick glance this is how it works:

CNN

  • Input: 28 * 28 pixel image.

  • Convolution layer: slides a small filter (e.g., a 3x3 kernel) over the image, taking a dot product at each position to produce a new matrix that learns the underlying patterns of the image (in this case, the strokes)

A 3x3 kernel slides across the input image, computing a dot product at each position to produce the feature map.

  • Max Pooling: keeps the strongest activations and shrinks the image (28 * 28 → 14 * 14)

  • This goes through another convolution layer and another max pooling layer to pick up higher-level patterns. We end up with 64 feature maps of 7 * 7 each, which are flattened into a single array of 7 * 7 * 64 = 3136 numbers

  • After flattening, a dense layer maps those 3136 numbers to 10 outputs (digits 0 to 9), and the digit with the highest activation / probability is the prediction (i.e. what number our image is)
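The shape bookkeeping in those steps can be sketched in plain NumPy. The image and kernel here are random, for shapes only; a trained CNN learns its kernel values:

```python
import numpy as np

def conv2d(img, kernel):
    # slide the kernel over the image, computing a dot product at each position
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(img, size=2):
    # keep the largest value in each size x size window, halving each dimension
    h, w = img.shape
    return img[:h - h % size, :w - w % size] \
        .reshape(h // size, size, w // size, size).max(axis=(1, 3))

img = np.random.rand(28, 28)
kernel = np.random.rand(3, 3)

# zero-pad by 1 so a 3x3 kernel keeps the 28x28 size ("same" convolution)
feat = conv2d(np.pad(img, 1), kernel)
pooled = max_pool(feat)
print(feat.shape, pooled.shape)  # (28, 28) (14, 14)
```

Running the conv + pool pair twice is what gives the 28 → 14 → 7 chain above.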

loss vs accuracy graph:

AccuracyLoss

Makes sense, loss goes down and accuracy of the model goes up over time.

This is the heart of it. Every ML/AI program you’ll ever write boils down to chain rule calculus and matrix multiplications to learn the right parameters (weights and bias).

problems with gradient descent

vanishing gradient problem

What happens when you backprop through a lot of layers? You’re using sigmoid/tanh to squash numbers into a small range (sigmoid: 0 to 1, tanh: -1 to 1), then multiplying their derivatives together via the chain rule.

If you have A LOT of layers, by the time the gradient reaches the early weights it’s basically 0. They never learn.

Sigmoid derivatives are small: the maximum is $0.25$, and it's usually much less. Multiply a bunch of small numbers together during backprop ($0.25 \times 0.25 \times 0.25 \dots$) and the result vanishes fast.

Early layers get an almost zero gradient signal. Network learns nothing. Cool.

Gradients shrink exponentially through layers with sigmoid activation. ReLU and residual connections solve this by keeping gradients strong.
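A quick numeric sanity check (the depths here are illustrative, and we use sigmoid's best-case derivative at every layer):

```python
import numpy as np

def sigmoid_grad(x):
    # derivative of sigmoid: s(x) * (1 - s(x))
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# best case: pre-activation of 0, where the sigmoid derivative peaks
print(sigmoid_grad(0.0))  # 0.25

# chain rule through n layers multiplies n such factors together
for n in (5, 10, 20):
    print(n, 0.25 ** n)
```

Even in the best case, 20 sigmoid layers shrink the gradient by a factor of roughly $10^{-12}$.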

dead neuron problem

If the weighted sum going into a ReLU neuron is negative, ReLU outputs 0 ($f(a) = \max(0, a)$). Negative input means the output is 0 and the derivative is 0. Derivative is 0 means gradient is 0. Gradient is 0 means the weights never update.

The neuron is dead. Permanently. Always outputs 0, never learns, never contributes. Just sitting there taking up space.

Fix: Leaky ReLU, whose gradient never fully zeroes out.

\(f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\) Where:

  • $x$: Input to the activation function.
  • $\alpha$: A small positive constant (e.g., $0.01$) that determines the slope for negative input values.
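A minimal sketch of both activations and their gradients, using $\alpha = 0.01$:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # exactly 0 for negative inputs: the dead-neuron case
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # never exactly 0, so the weights can still update
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(relu_grad(x))        # [0. 0. 1. 1.]
print(leaky_relu_grad(x))  # [0.01 0.01 1. 1.]
```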

conclusion

In a larger sense, these parameters/weights end up encoding probabilities. In an LLM like ChatGPT, the model's output is a probability distribution over every token in its vocabulary.

LLMs have billions of these parameters. They start random and get refined on text. Backpropagation tweaks all of them so the model becomes a little more likely to pick the right next word instead of a wrong one.

Example: I am going to the bank to deposit some _

Probabilities: money, cash, blood, ashes, etc… (the actual last word here is money)
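Sketching that idea with made-up numbers (the logits below are invented for illustration; a real LLM produces one score per token in its vocabulary):

```python
import numpy as np

words = ["money", "cash", "blood", "ashes"]
logits = np.array([4.0, 3.0, 0.5, 0.2])  # hypothetical raw scores from the model

# softmax turns raw scores into a probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()

for w, p in zip(words, probs):
    print(f"{w}: {p:.3f}")
print("prediction:", words[int(np.argmax(probs))])  # money
```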

A good video on this: LLMs explained

Image classification, text generation (GPT) - all of it uses these same ideas at the core.

Transformers, Attention, every cutting edge AI paper - it all comes back to y = mx + c, finding the slope and intercept to minimize loss. That’s it.
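That claim is easy to demo end to end: gradient descent recovering the slope and intercept of a line (toy data generated from y = 2x + 1; two parameters instead of billions):

```python
import numpy as np

# toy data generated from y = 2x + 1
x = np.linspace(0, 1, 50)
y = 2 * x + 1

m, c = 0.0, 0.0  # start from zero and let gradient descent find them
lr = 0.5
for _ in range(2000):
    err = (m * x + c) - y
    # gradients of mean squared error w.r.t. m and c
    m -= lr * 2 * np.mean(err * x)
    c -= lr * 2 * np.mean(err)

print(round(m, 3), round(c, 3))  # converges to roughly 2.0 and 1.0
```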

Behind all the hype it’s just high school math and common sense. I spent the weekend reading these resources, writing code and it felt good to be back in touch with school math (although I was never good at it) and discover AI from first principles. Was going to write about attention and transformers too but this post is already way too long :P

Code for all of this can be found here → MNIST

references

