TL;DR: Walkthrough of NumPy and PyTorch: arrays, broadcasting, tensors, autograd, and training loops. Building an MNIST digit classifier two ways (raw NumPy and PyTorch) to see how the pieces fit together.

This is an intro to everything you need to know about NumPy and PyTorch before writing real training code.

NumPy primer

NumPy gives you fast, fixed-type arrays in Python. That’s it. That’s the pitch. Everything else - broadcasting, ufuncs, linear algebra - is built on top of that one idea.

(Visualizations in this section are from Jay Alammar’s excellent visual numpy guide - CC BY-NC-SA 4.0)

why not just use python lists?

Say you want a dot product of two million-element vectors. Pure Python:

import time

def python_dot(a, b):
    """Computes dot product of two lists."""
    result = 0
    for i in range(len(a)):
        result += a[i] * b[i]
    return result

# Let's try it with million-element lists
size = 1_000_000
list_a = list(range(size))
list_b = list(range(size))

start_time = time.time()
py_result = python_dot(list_a, list_b)
end_time = time.time()

print(f"Python dot product result: {py_result}")
print(f"Time taken with Python lists: {(end_time - start_time) * 1000:.2f} ms")

Same thing with NumPy:

import numpy as np

# Let's try it with million-element NumPy arrays
array_a = np.arange(size)
array_b = np.arange(size)

start_time = time.time()
np_result = np.dot(array_a, array_b)
end_time = time.time()

print(f"NumPy dot product result: {np_result}")
print(f"Time taken with NumPy arrays: {(end_time - start_time) * 1000:.2f} ms")

Run both. The NumPy version is 100x to 1000x faster. Two reasons:

  1. Memory layout: NumPy arrays are dense, contiguous, fixed-type blocks. Python lists are arrays of pointers to heap-allocated objects - indirection everywhere.
  2. Vectorization: NumPy ops are C/Fortran loops under the hood. You write Python, it runs at near-C speed. That’s the whole deal.

the ndarray

Everything in NumPy revolves around the ndarray. You create one from a list:

Creating a NumPy array

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

Every array has a few attributes you’ll use constantly:

  • .ndim: number of dimensions. Our matrix has 2.
  • .shape: size along each dimension. (2, 3) = 2 rows, 3 columns.
  • .size: total element count. 6 here.
  • .dtype: data type. NumPy arrays are homogeneous - every element is the same type. This is what makes them fast.

print(f"Dimensions: {matrix.ndim}")  # Prints: 2
print(f"Shape: {matrix.shape}")      # Prints: (2, 3)
print(f"Size: {matrix.size}")        # Prints: 6
print(f"Data type: {matrix.dtype}")  # Prints: int64 (or similar, depending on your system)

Set the dtype on creation or cast later with .astype() (returns a new array).

float_matrix = matrix.astype(np.float32)
print(f"New data type: {float_matrix.dtype}") # Prints: float32

creating arrays

In practice you almost never build arrays from Python lists. You use these:

ones, zeros, random

# Create an array of all zeros
zeros_array = np.zeros((2, 3)) 
print(f"Zeros:\n{zeros_array}\n")

# Create an array of all ones
ones_array = np.ones((3, 2), dtype=np.int16)
print(f"Ones:\n{ones_array}\n")

# Create a 2x2 identity matrix (1s on the diagonal)
identity_matrix = np.eye(2)
print(f"Identity:\n{identity_matrix}\n")

# Create an array with a range of elements (like Python's range)
ranged_array = np.arange(0, 10, 2) # Start, stop (exclusive), step
print(f"Ranged array: {ranged_array}\n")

# Create an array with evenly spaced numbers over an interval
linspace_array = np.linspace(0, 1, 5) # Start, stop (inclusive), number of points
print(f"Linspace array: {linspace_array}\n")

ufuncs (element-wise ops)

This is where vectorization actually shows up. Instead of writing this garbage:

matrix = np.array([[1, 2, 3], [4, 5, 6]])
new_matrix = np.zeros(matrix.shape)

for i in range(matrix.shape[0]):
    for j in range(matrix.shape[1]):
        new_matrix[i, j] = matrix[i, j] + 10

You just write:

matrix = np.array([[1, 2, 3], [4, 5, 6]])
new_matrix = matrix + 10
print(f"Matrix + 10:\n{new_matrix}")

array arithmetic

Works for all basic arithmetic (+, -, *, /, **), and between two arrays of the same shape.

subtract, multiply, divide

For aggregation - summing, min, max, etc. - use the axis parameter:

aggregation

  • axis=0 collapses rows (result per column).
  • axis=1 collapses columns (result per row).

matrix aggregation with axis

x = np.array([[1, 2], [3, 4]])

print(f"Sum of all elements: {np.sum(x)}")         # Prints: 10
print(f"Sum of each column: {np.sum(x, axis=0)}")  # Prints: [4 6]
print(f"Sum of each row: {np.sum(x, axis=1)}")      # Prints: [3 7]

indexing and slicing

Python list slicing but for n-dimensions. You specify a slice per dimension, separated by commas.

slicing

basic slicing

a = np.array([[1, 2, 3, 4], 
              [5, 6, 7, 8], 
              [9, 10, 11, 12]])

# Get the first two rows and columns 1 and 2
# (row 0 and 1, column 1 and 2)
b = a[:2, 1:3] 
print(f"Sliced array b:\n{b}") # Prints [[2 3], [6 7]]

fancy indexing and boolean masks

Integer array indexing (“fancy indexing”): pass arrays of indices to grab exactly the elements you want.

# Create an array
a = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
print(f"Original array a:\n{a}\n")

# Create an array of indices to select one element from each row
b_indices = np.array([0, 2, 0, 1])

# Select one element from each row of a using the indices in b
# This selects a[0,0], a[1,2], a[2,0], a[3,1]
selected = a[np.arange(4), b_indices]
print(f"Selected elements: {selected}")  # Prints: [ 1  6  7 11]

# We can also use this to modify elements
a[np.arange(4), b_indices] += 100
print(f"\nModified array a:\n{a}")

Boolean indexing: filter elements by a condition. You’ll use this all the time.

bool_idx = (a > 10) # This creates a boolean array!
print(f"\nBoolean mask (a > 10):\n{bool_idx}\n")

# Use the boolean mask to select only the elements that are > 10
print(f"Elements > 10: {a[bool_idx]}") # or just a[a > 10]

views vs. copies (this will bite you)

Basic slicing returns a view, not a copy. Same underlying memory. Modify the view, you modify the original. This trips up everyone at least once.

.reshape() also returns a view when it can. The -1 trick lets NumPy infer a dimension: 12 elements + .reshape(3, -1) = (3, 4). PyTorch’s .view() does the same thing.

reshape

# Create an array with initial shape
original = np.arange(12)
print("Original shape:", original.shape)
print("Original array:", original)

# Reshape using -1 trick
reshaped = original.reshape(2, -1)
print("\nReshaped to (2, -1):\n", reshaped)
print("New shape:", reshaped.shape)

ary = np.array([[1, 2, 3], [4, 5, 6]])
print(f"Original array:\n{ary}\n")

# Create a view of the first row
first_row_view = ary[0]
first_row_view += 100 # Modify the view

print(f"Modified first row: {first_row_view}")
print(f"\nOriginal array is CHANGED:\n{ary}")

This is a performance thing - avoids copying huge arrays. If you want a copy, ask for one explicitly:

copied_first_row = ary[0].copy()
copied_first_row -= 50 # Modify the copy

print(f"\nModified copy: {copied_first_row}")
print(f"Original array is UNCHANGED:\n{ary}")

Rule of thumb:

  • Basic slicing creates views.
  • Advanced indexing (integer or boolean) creates copies.
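
A minimal way to check which one you got is np.shares_memory:

```python
import numpy as np

a = np.arange(10)

basic = a[2:5]          # basic slice -> view of the same buffer
fancy = a[[2, 3, 4]]    # integer-array indexing -> copy
masked = a[a > 6]       # boolean mask -> copy

print(np.shares_memory(a, basic))   # True
print(np.shares_memory(a, fancy))   # False
print(np.shares_memory(a, masked))  # False

basic[0] = 99
print(a[2])  # 99 - writing through the view changed the original
```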

squeeze, unsqueeze, view

Quick PyTorch preview since these are basically the tensor version of reshape.

unsqueeze(dim) adds a size-1 dimension, squeeze(dim) removes one. You’ll use unsqueeze(0) constantly to add a batch dimension before feeding a single sample to a model.

import torch

# Create a tensor
tensor = torch.tensor([1, 2, 3, 4, 5])
print("Original shape:", tensor.shape)
print("Original tensor:", tensor)

# Add a dimension at position 0 (adds batch dimension)
with_batch = tensor.unsqueeze(0)
print("\nAfter unsqueeze(0):")
print("Shape:", with_batch.shape)
print("Tensor:\n", with_batch)

# Remove the dimension we just added
back_to_original = with_batch.squeeze(0)
print("\nAfter squeeze(0):")
print("Shape:", back_to_original.shape)
print("Tensor:", back_to_original)

# This is especially useful when there's a shape mismatch error between the model and the input data (common when working with single data points instead of batches).

# view is like .reshape in numpy - it lets you specify the desired shape of the tensor.
tensor = torch.tensor([0, 1, 2, 3, 4]) # tensor([0, 1, 2, 3, 4])
reshaped = tensor.view(5, 1) # Reshape to 5 rows, 1 column
print(reshaped.shape) # Output: torch.Size([5, 1])

# Same -1 trick as numpy: let PyTorch infer the size of that dimension
dynamic_reshaped = tensor.view(-1, 1)
print(dynamic_reshaped.shape) # Output: torch.Size([5, 1])

slicing + math

tensor = torch.tensor([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12]
])

second_row_third_column = tensor[1, 2] # second_row_third_column = tensor[1][2] same thing
last_row = tensor[-1]
second_row_third_column # tensor(7)
last_row                # tensor([9, 10, 11, 12])


# Math operations
data = torch.tensor([10.0, 20.0, 30.0, 40.0, 50.0])
data.mean() # tensor(30.)
data.std()  # tensor(15.8114)

# Dot Product
t1 = torch.tensor([1, 2, 3])
t2 = torch.tensor([4, 5, 6])

torch.dot(t1, t2) # tensor(32)
# 1 * 4 + 2 * 5 + 3 * 6
# 4 + 10 + 18
# 32

# Derivatives

x = torch.tensor(2.0, requires_grad=True) # tensor(2., requires_grad=True)
y = x ** 2                                # tensor(4., grad_fn=<PowBackward0>)
y.backward() # x^2 derivative = 2x (2*2) = 4
x.grad                                    # tensor(4.)

x = 2, y = x^2, derivative of x^2 is 2x, so x.grad = 2 * 2 = 4. That’s it. PyTorch just did calculus for you.

reshaping and concatenating

Your data is never in the right shape. Get used to this.

a = np.arange(12) # A 1D array of 0-11
print(f"Original: {a}\n")

# Reshape it into a 3x4 matrix
b = a.reshape(3, 4)
print(f"Reshaped to 3x4:\n{b}\n")

# Flatten it back to 1D
c = b.ravel() # ravel() creates a view if possible
d = b.flatten() # flatten() always creates a copy
print(f"Raveled: {c}\n")

# Transpose (swap rows and columns)
print(f"Transposed:\n{b.T}\n")
# visual: https://jalammar.github.io/images/numpy/numpy-transpose.png

# Join arrays together
ary1 = np.array([[1,2], [3,4]])
ary2 = np.array([[5,6]])
# Join along rows (axis=0)
print(f"Concatenated (axis 0):\n{np.concatenate([ary1, ary2], axis=0)}\n")
# Join along columns (axis=1) - shapes must match!
print(f"Concatenated (axis 1):\n{np.concatenate([ary1, ary1], axis=1)}\n")

broadcasting

Broadcasting lets you do math on arrays of different shapes without writing loops. Simplest case: array + 5 broadcasts the scalar to every element.

More interesting: adding a vector to each row of a matrix.

broadcasting

# A 4x3 matrix
x = np.array([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]])
# A 1x3 vector
v = np.array([1, 0, 1])

# Add v to each row of x
y = x + v 

print(f"Original matrix x:\n{x}\n")
print(f"Vector v: {v}\n")
print(f"Result of broadcasting y = x + v:\n{y}")

Broadcasting

How broadcasting stretches a (1x3) vector to match a (4x3) matrix, and when it fails.

The rules are simple - NumPy compares shapes right-to-left. Two dimensions are compatible if they’re equal or one of them is 1; NumPy “stretches” the size-1 dimension to match.
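
A quick sketch of the rules in action, including a failure case:

```python
import numpy as np

x = np.zeros((4, 3))

print((x + np.ones(3)).shape)       # (4, 3): trailing dims are 3 and 3 - equal, ok
print((x + np.ones((4, 1))).shape)  # (4, 3): the size-1 dim gets stretched

try:
    x + np.ones(4)  # (4,3) vs (4,): comparing right-to-left, 3 != 4 and neither is 1
except ValueError as e:
    print("broadcast failed:", e)

# fix: make the (4,) vector an explicit column so it lines up as (4, 1)
print((x + np.ones(4).reshape(4, 1)).shape)  # (4, 3)
```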

linear algebra

dot product

m = np.array([[1,2,3], [4,5,6]]) # (2,3)
v = np.array([1,0,1]) # (3,)

# Matrix-vector multiplication
# The @ symbol is the modern, preferred way
result = m @ v 
print(f"m @ v = {result}") # (2,3) @ (3,) -> (2,)

# For explicit matrix-matrix multiplication, shapes must be compatible
# (m, n) @ (n, p) -> (m, p)
m1 = np.arange(6).reshape(2,3)
m2 = np.arange(6).reshape(3,2)

print(f"\nMatrix multiplication:\n{m1 @ m2}")
# You can also use np.matmul(m1, m2)



PyTorch primer

PyTorch makes training neural networks way easier than doing math by hand - loading data, training, loss, gradients, all of it.

internals

PyTorchOverview

  • The tensor library is very similar to NumPy, with GPU acceleration on top.
  • The autograd library is the heart of PyTorch: no more computing backprop by hand. PyTorch records a graph of operations over weights and biases so it can compute gradients and update the weights for you.
  • The DL library has pretty cool things too: prebuilt loss functions, models, datasets, and optimizers, all used in modern training code.

installation

pip install torch
import torch
print(torch.__version__)
# 2.0.0

print(torch.cuda.is_available())
# False # this prints True with an NVIDIA GPU (e.g. on Google Colab or a personal GPU), not on CPU-only machines or Apple silicon
# (for M-series chips, check torch.backends.mps.is_available() - the MPS backend is well-supported in recent PyTorch versions and makes inference much faster)

what is a tensor?

Fancy word for a data container / array.

  • Scalar is a tensor of rank 0
  • Vector is tensor of rank 1
  • Matrix is tensor of rank 2

tensor

How are they stored? A tensor is a mathematical concept. But to represent it on our computers, we have to define some sort of physical representation for them. The most common representation is to lay out each element of the tensor contiguously in memory, writing out each row to memory.


  • Whenever a function is called (like torch.dot(x, y)), PyTorch internally does two dispatches:
    • dynamic dispatch to the right kernel implementation for the device (CPU implementation, CUDA implementation)
    • the same for dtype (float, double, int) - this dispatch is just a simple switch statement over whatever dtypes a kernel chooses to support
      • it should also make sense that we need a dispatch here: the CPU code (or CUDA code, as it may be) that implements multiplication on float is different from the code for int


  • The device, the description of where the tensor’s physical memory is actually stored, e.g., on a CPU, on an NVIDIA GPU (cuda), or perhaps on an AMD GPU (hip) or a TPU (xla). The distinguishing characteristic of a device is that it has its own allocator, that doesn’t work with any other device.
  • The layout, which describes how we logically interpret this physical memory. The most common layout is a strided tensor (which maps the mathematical tensor -> physical location in memory like we talked about), but sparse tensors have a different layout involving a pair of tensors, one for indices, and one for data; MKL-DNN tensors may have even more exotic layout, like blocked layout, which can’t be represented using merely strides.
  • The dtype, which describes what it is that is actually stored in each element of the tensor. This could be floats or integers, or it could be, for example, quantized integers.
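
All three of these properties are inspectable on any tensor. A quick look (the stride also shows how the strided layout maps tensor coordinates to memory locations):

```python
import torch

t = torch.arange(6, dtype=torch.float32).reshape(2, 3)

print(t.device)    # cpu (unless you created it on a GPU)
print(t.layout)    # torch.strided - the default dense layout
print(t.dtype)     # torch.float32
print(t.stride())  # (3, 1): step 3 elements to move one row, 1 to move one column
```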

So when you call a function like torch.add, what actually happens? If you remember the discussion about dispatching, you already have the basic picture:

  • We translate from the Python realm to the C++ realm (Python argument parsing).
  • We handle variable dispatch (VariableType - "Type", by the way, doesn’t really have anything to do with programming-language types; it’s just a gadget for doing dispatch).
  • We handle device type / layout dispatch (Type).
  • We hit the actual kernel, which is either a modern native function or a legacy TH function.

Tensors hold multi-dimensional data where each dimension represents a feature. Take a cricketer, Virat Kohli: a Virat Kohli tensor would have multiple dimensions (Batting, Bowling, Fielding, Coaching), basically a big matrix with these features and their values.

# How to create tensors and play with them
tensor0d = torch.tensor(1)
tensor1d = torch.tensor([1,2,3])
tensor2d = torch.tensor([[1, 2], [3, 4]])
tensor3d = torch.tensor([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])

print(tensor3d.dtype) # torch.int64 (PyTorch adopts the default 64-bit integer data type from Python)
print(tensor2d.shape) # torch.Size([2, 2]) matrix of 2x2
# You can reshape tensors - reshape returns a NEW tensor, it doesn't modify in place
print(tensor2d.reshape(4, 1)) # a 2x2 tensor has 4 elements, so the target shape must also hold 4 (e.g. (4,1), not (1,2))
print(tensor2d) # unchanged: still [[1, 2], [3, 4]]

matrix1 = torch.tensor([
                            [1, 2, 3],
                            [4, 5, 6]
                       ]) # shape = 2 * 3

print(matrix1.shape)
print(matrix1.reshape(3,2)) # we do this a lot while multiplying matrices, since shapes must be (p, q) @ (q, m) - the inner dimensions have to match

'''
This prints matrix of 3 * 2 while preserving all elements and now we can use this to multiply to a matrix of (2 x whatever)
tensor([[1, 2],
        [3, 4],
        [5, 6]])
'''
  • Instead of reshape we can use view, which is the same as .reshape in NumPy:

print(matrix1.view(3,2))

  • For transposing matrices we can use .T (e.g. matrix1.T). We covered squeeze/unsqueeze above.

The .view() method reshapes a tensor without copying the data. It’s a “free” operation.
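
A small check that .view() really is free - same storage, no copy:

```python
import torch

t = torch.arange(6)
v = t.view(2, 3)

print(t.data_ptr() == v.data_ptr())  # True: same underlying storage, no copy

v[0, 0] = 100
print(t)  # tensor([100, 1, 2, 3, 4, 5]) - the view wrote through

# .view() needs contiguous memory; .reshape() falls back to copying when it must
print(v.t().is_contiguous())  # False: transposing only changes the strides
```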

matrix2 = matrix1.T # (3, 2) - the same Matrix2 written out below
multiply = matrix1 @ matrix2  # (2 x 3) @ (3 x 2) = final matrix (2 x 2)
print(f"After multiplication: {multiply} of shape {multiply.shape}")
# After multiplication: tensor(
#         [[14, 32],
#         [32, 77]]) of shape torch.Size([2, 2])



Matrix1 = [
            [1, 2, 3],
            [4, 5, 6]
          ]

Matrix2 = [
            [1, 4],
            [2, 5],
            [3, 6]
          ]

Matrix1 @ Matrix2 = every element of a row of m1 gets multiplied with the matching element of a column of m2, then summed:
{1*1 + 2*2 + 3*3} = {14}, which is the first number.
Similarly, {1*4 + 2*5 + 3*6} = {32}, which is the second number.

and so on...
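
The hand computation above, checked in torch:

```python
import torch

m1 = torch.tensor([[1, 2, 3],
                   [4, 5, 6]])   # (2, 3)
m2 = torch.tensor([[1, 4],
                   [2, 5],
                   [3, 6]])      # (3, 2) - just m1.T

product = m1 @ m2                # (2, 3) @ (3, 2) -> (2, 2)
print(product)
# tensor([[14, 32],
#         [32, 77]])

# first entry by hand: row 0 of m1 dot column 0 of m2
print(1*1 + 2*2 + 3*3)  # 14
```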

PyTorch dimensions

tensor = torch.tensor([1,2,3,4,5]) 
print(tensor.shape) # torch.Size([5])
s = torch.sum(tensor, dim=0)
print(s, s.shape) # summing over the only dim leaves a scalar with empty shape
# tensor(15) torch.Size([])

More operations

tensor = torch.tensor([
        [1,2,3], [4,5,6]
]) # 2x3

print("Dimension 0")
dimo0 = torch.sum(tensor, dim=0) # adds both lists (1 + 4, 2 + 5, 3 + 6)
print(dimo0)
print()
print("Dimension 1")
dimo1 = torch.sum(tensor, dim=1) # adds inner list (1+2+3, 4+5+6)
print(dimo1)

print("Innermost dim (dim=-1, like negative indexing in lists)")
dims = torch.sum(tensor, dim=-1)
print(dims)
'''
Dimension 0
tensor([5, 7, 9])

Dimension 1
tensor([ 6, 15])

Innermost dim (dim=-1, like negative indexing in lists)
tensor([ 6, 15])
'''
# 3d!!
tensor = torch.tensor([
    [ [1,2,3], [4,5,6] ],
    [ [7,8,9], [10,11,12] ]
])
print(tensor.shape) # torch.Size([2, 2, 3]): 2 blocks, each with 2 rows of 3 elements
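
Same dim rules apply in 3D - each dim= collapses one axis of the (2, 2, 3) shape:

```python
import torch

t = torch.tensor([
    [[1, 2, 3], [4, 5, 6]],
    [[7, 8, 9], [10, 11, 12]]
])  # shape (2, 2, 3)

# summing over a dim removes that axis from the shape
print(torch.sum(t, dim=0).shape)  # torch.Size([2, 3])
print(torch.sum(t, dim=1).shape)  # torch.Size([2, 3])
print(torch.sum(t, dim=2).shape)  # torch.Size([2, 2])

print(torch.sum(t, dim=0))
# tensor([[ 8, 10, 12],
#         [14, 16, 18]])
```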

autograd

PyTorch’s autograd computes gradients automatically by building a dynamic computation graph - a directed graph of every operation you run. No manual calculus.

how autograd works (pictures)


  • PyTorch destroys the graph after .backward() to free memory. Set retain_graph=True if you need it again.
  • Call .backward() on the loss and PyTorch computes gradients for all leaf nodes, stored in .grad. No manual looping.
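
Both bullets in one small demo:

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # keep the graph alive for another pass
print(x.grad)  # tensor(6.) - dy/dx = 2x at x = 3

y.backward()   # works only because we retained the graph above
print(x.grad)  # tensor(12.) - gradients accumulate into .grad

x.grad.zero_() # reset - the same job optimizer.zero_grad() does in a training loop
```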

Create neural net:

  • Subclass torch.nn.Module. Define layers in __init__, wire them up in forward.

model

NeuralNetwork(
  (layers): Sequential(
    (0): Linear(in_features=50, out_features=30, bias=True)
    (1): ReLU()
    (2): Linear(in_features=30, out_features=20, bias=True)
    (3): ReLU()
    (4): Linear(in_features=20, out_features=3, bias=True)
  )
)
  • Linear layer = y = xW^T + b. That’s a fully connected / feedforward layer.
  • Use manual_seed for reproducible weight init.
  • grad_fn=<AddmmBackward0> = this tensor came from a matmul + add. Addmm = Add + matrix multiply.
  • At inference time, skip graph construction with torch.no_grad() - saves memory, you don’t need gradients.
  • Convention: models return raw logits. Apply softmax yourself at inference time for class probabilities.
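
A minimal sketch of a class that would produce the repr above. The layer sizes (50 → 30 → 20 → 3) come straight from the printout; the constructor signature is an assumption:

```python
import torch

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs=50, num_outputs=3):
        super().__init__()
        self.layers = torch.nn.Sequential(
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),
            torch.nn.Linear(20, num_outputs),  # output layer returns raw logits
        )

    def forward(self, x):
        return self.layers(x)  # no softmax here - apply it yourself at inference time

torch.manual_seed(123)  # reproducible weight init
model = NeuralNetwork(50, 3)
print(model)
```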

data loaders

Custom dataset class -> DataLoader wraps it -> handles batching, shuffling, parallel loading.

num_workers

num_workers=0 means data loading happens in the main process. For small datasets this is fine. For anything real on a GPU, set it higher - otherwise the GPU sits idle waiting for the next batch while the CPU fetches data. With num_workers > 0, worker processes load batches in the background while the model trains.

validation set

There’s usually a third split: validation set for tuning hyperparameters. You can peek at it repeatedly while tweaking. The test set you look at once - otherwise you’re cheating.

  • model.train() / model.eval() toggle training vs eval mode. Dropout and batchnorm behave differently.
  • optimizer.zero_grad() every step. PyTorch accumulates gradients by default.
# -- autograd --
import torch
import torch.nn.functional as F
from torch.autograd import grad

y = torch.tensor([1.0]) # true label
x1 = torch.tensor([1.1]) # inp
w1 = torch.tensor([2.2], requires_grad=True)
b = torch.tensor([0.0], requires_grad=True)
z = x1 * w1 + b
a = torch.sigmoid(z)
loss = F.binary_cross_entropy(a, y)
print(loss)
grad_L_w1 = grad(loss, w1, retain_graph=True)
print(f"Gradient of Loss wrt Weight: {grad_L_w1}")
grad_L_b = grad(loss, b, retain_graph=True)
print(f"Gradient of Loss wrt Bias: {grad_L_b}")

loss.backward() # instead of doing all that jazz, just call this
print(w1.grad) # same results
print(b.grad)

import torch
import torch.nn.functional as F
target = torch.tensor([1.0]) 
input = torch.tensor([1.1])
weight = torch.tensor([1.2])
bias = torch.tensor([0.0]) 


# y = mx + c
forward_pass = input * weight + bias

# applying activation function to this
prediction = torch.sigmoid(forward_pass)

loss = F.binary_cross_entropy(prediction, target)
print(loss) # tensor(0.2368) So we have a loss of 0.23 (our goal is to minimise this but that is for later)

This is what the computation graph for this looks like: graph


  • If any leaf (input) node has requires_grad=True, PyTorch builds the computation graph automatically. Gradients flow through it via backprop (chain rule).


Start from the loss, work backward to the input. Compute dL/d(each parameter). Use those gradients to update weights via gradient descent.

  • Gradient = vector of all partial derivatives of a function.
  • Chain rule lets you decompose the gradient through a graph of operations.

Same thing we did by hand in numpy, but now PyTorch does it for us:

import torch
import torch.nn.functional as F
from torch.autograd import grad
target = torch.tensor([1.0]) 
input = torch.tensor([1.1])
weight = torch.tensor([1.2], requires_grad=True) # we need to save this and bias for it to compute partial derivative
bias = torch.tensor([0.0], requires_grad=True) 


# y = mx + c
forward_pass = input * weight + bias

# applying activation function to this
prediction = torch.sigmoid(forward_pass)

loss = F.binary_cross_entropy(prediction, target)
grad_loss_weight = grad(loss, weight, retain_graph=True)
grad_loss_bias = grad(loss, bias, retain_graph=True)
print(grad_loss_weight, grad_loss_bias) # (tensor([-0.2319]),) (tensor([-0.2108]),) [We need to update the weights and bias in the opposite direction of this (shift this) so that loss becomes minimum (same thing we were doing in numpy)]

Trainable params live in nn.Linear layers (y = xW^T + b). Weights get initialized with small random numbers - different every time unless you seed with manual_seed.

A note about randomness: it shows up in many places - parameter initialization, dropout, data ordering, etc. Determinism is particularly useful when debugging, so you can reproduce a bug and hunt it down. There are three places to set the random seed, and you should set all of them just to be safe:

# Torch
seed = 0
torch.manual_seed(seed)
# NumPy
import numpy as np
np.random.seed(seed)
# Python
import random
random.seed(seed)

torch.manual_seed(42)

model = NeuralNetwork(50, 3) # the model printed above: 50 inputs, 3 outputs
print(model.layers[0].weight) # weights of the first linear layer
'''

Parameter containing:
tensor([[ 0.1081,  0.1174, -0.0331,  ...,  0.0253,  0.0718, -0.0862],
        [-0.1400, -0.0546, -0.1085,  ..., -0.0477, -0.0501, -0.1368],
        [-0.0810,  0.0353, -0.0187,  ...,  0.1142,  0.1288, -0.1121],
        ...,
        [-0.0031, -0.0573,  0.0515,  ...,  0.0271, -0.0928, -0.1175],
        [-0.0444, -0.1318, -0.0660,  ...,  0.0647, -0.1230, -0.0531],
        [ 0.0023, -0.1223,  0.0797,  ...,  0.0369,  0.0862,  0.1328]],
       requires_grad=True)
'''

X = torch.rand((1, 50)) # random input tensor
y = model(X) # forward pass
print(y) # 3 outputs

# tensor([[ 0.1685, -0.1599,  0.2402]], grad_fn=<AddmmBackward0>)

# Addmm stands for matrix multiplication (mm) followed by an addition (Add) Which is the same as y = mx + c (forward pass)

forward pass and inference

Forward pass = input goes in, flows through all layers, output comes out. That’s it.

At inference time you don’t need gradients or the computation graph - just wrap in torch.no_grad() to save memory:

model = NeuralNetwork(50, 3) # 50 inp, 3 out
X = torch.rand((1, 50)) # random input tensor
with torch.no_grad():
    output = model(X)
print(output)

# tensor([[ 0.1073, -0.0125, -0.0168]]) - note there's no grad_fn: we aren't building a graph, just computing the network's output, which is all we need at inference time


Raw logits are useless for interpretation - apply softmax to get probabilities in [0, 1]:

model = NeuralNetwork(50, 3) # 50 inp, 3 out
X = torch.rand((1, 50)) # random input tensor
with torch.no_grad():
    output = torch.softmax(model(X), dim=1) # softmax to convert to probabilities
print(output)
# tensor([[0.3101, 0.3544, 0.3355]]) - take the highest (the second number here): that's our prediction. All class probabilities sum to 1; softmax is a neat math trick that maps any numbers into [0, 1] such that they sum to 1.

The softmax formula: \(\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\)
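
Sanity-checking the formula against torch.softmax (the logits here are the example values from the forward-pass output above):

```python
import torch

logits = torch.tensor([[0.1685, -0.1599, 0.2402]])

# softmax(z_i) = exp(z_i) / sum_j exp(z_j)
manual = torch.exp(logits) / torch.exp(logits).sum(dim=1, keepdim=True)
builtin = torch.softmax(logits, dim=1)

print(builtin)                          # probabilities in [0, 1]
print(builtin.sum())                    # sums to 1
print(torch.allclose(manual, builtin))  # True
```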

mlp

Three things you need for any model: load the data, define the model, train it.

  • torchvision.datasets - pre-packaged datasets (MNIST, CIFAR, etc). Downloads and organizes everything.
  • DataLoader - batching, shuffling, parallel loading. You loop over it during training.
  • transforms.ToTensor() - converts images to tensors, scales pixels from [0,255] to [0.0, 1.0].

loading the dataset

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Define transforms: Convert to tensor and Normalize
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)) # MNIST mean and std
])

# Download datasets
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

# Create DataLoaders
batch_size = 64
train_dataloader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: {batch_size}")

Batch shape: (64, 1, 28, 28) = 64 images, 1 channel (grayscale), 28x28 pixels. Labels: (64,) with digits 0-9.

Every model subclasses nn.Module. Define layers in __init__, wire them in forward:

  • nn.Flatten(): (1, 28, 28) -> (784)
  • nn.Linear(in, out): y = xW^T + b
  • nn.ReLU(): max(0, x) - without this the whole network is just one big linear function

the model

class MLP(nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        # flatten
        self.flatten = nn.Flatten()
        # first linear layer transformation (first y = mx + c): 28*28 (784) -> 128
        self.linear1 = nn.Linear(784, 128)
        self.activation = nn.ReLU() # ReLU has no parameters, so one instance is reused after both hidden layers
        self.linear2 = nn.Linear(128, 64) # second hidden layer
        self.linear3 = nn.Linear(64, 10) # take 64 inputs from hidden layer 2 and output the final 10 class labels

    def forward(self, x):
        x = self.flatten(x)
        x = self.activation(self.linear1(x))
        x = self.activation(self.linear2(x))
        x = self.linear3(x)
        return x # raw logits (no softmax here, CrossEntropyLoss handles it)
    
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
model = MLP().to(device)
print(model)

'''
MLP(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear1): Linear(in_features=784, out_features=128, bias=True)
  (activation): ReLU()
  (linear2): Linear(in_features=128, out_features=64, bias=True)
  (linear3): Linear(in_features=64, out_features=10, bias=True)
)
'''

Training needs three things: a loss function (nn.CrossEntropyLoss - softmax + NLL in one), an optimizer (Adam is the default choice), and a training loop.

training loop

Five steps, every batch:

  1. Forward pass - model(batch) -> logits
  2. Loss - how wrong are we?
  3. Backward - loss.backward() computes all gradients
  4. Step - optimizer.step() updates weights
  5. Zero grad - optimizer.zero_grad() resets for next batch
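
The five steps above as a runnable sketch. The Linear stand-in model and random data are placeholders so it runs anywhere; in the MNIST code that follows, the real model and train_dataloader from the sections above take their place:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cpu")  # swap for "cuda" / "mps" if available
torch.manual_seed(0)

# stand-in data shaped like MNIST batches: (N, 1, 28, 28) images, 10 classes
X_fake = torch.rand(256, 1, 28, 28)
y_fake = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(X_fake, y_fake), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10)).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(2):
    model.train()
    for X, y in loader:
        X, y = X.to(device), y.to(device)
        logits = model(X)          # 1. forward pass
        loss = loss_fn(logits, y)  # 2. loss
        loss.backward()            # 3. backward: compute all gradients
        optimizer.step()           # 4. step: update weights
        optimizer.zero_grad()      # 5. zero grad: reset for the next batch
    print(f"epoch {epoch + 1}, last batch loss: {loss.item():.4f}")
```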

Training + eval + inference:

print("Evaluation")

def evaluate(dataloader, model):
    model.eval() # set to evaluation mode 
    correct, total = 0, 0

    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            logits = model(x)
            preds = logits.argmax(dim=1)
            correct += (preds == y).sum().item()
            total += y.size(0)

    print(f"Test accuracy: {correct / total:.4f}")

evaluate(test_dataloader, model)

# --------
print("Starting inference on a sample image from test dataset")
def inference():
    # take a sample input from the test set, predict its digit, and show the actual label to verify
    # set to inference/eval mode
    model.eval()
    with torch.no_grad():
        sample_idx = 0
        sample_img, sample_label = test_dataset[sample_idx]
        sample_img = sample_img.to(device)
        print(f"Actual label: {sample_label}")
        sample_img_reshaped = sample_img.unsqueeze(0) # add batch dimension
        pred_logits = model(sample_img_reshaped)
        predicted_label = pred_logits.argmax(dim=1)
        print(f"Predicted label: {predicted_label}")

All together, output looks like:

Training samples: 60000
Test samples: 10000
Batch size: 64

Training completed
Test accuracy: 0.9797
Starting inference on a sample image from test dataset
Actual label: 7
Predicted label: tensor([7], device='mps:0')

same MNIST, different style

Same thing but using F.relu instead of nn.ReLU(), x.view() instead of nn.Flatten(), and SGD instead of Adam. Different spelling, same idea.

  • Input: 28x28 -> 784
  • Hidden layer 1: 784 -> H1
  • Hidden layer 2: H1 -> H2
  • Output: H2 -> 10 (10 numbers)
  • ReLU activations (if negative -> make it 0 else let it be)
  • Cross-entropy loss (logits, no softmax in forward, raw loss) - CrossEntropyLoss = LogSoftmax + NLLLoss internally.
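
You can verify that identity directly (the random logits and targets here are just for the demo):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # a batch of 4 fake predictions over 10 classes
target = torch.tensor([3, 7, 0, 9])  # the true class per sample

a = F.cross_entropy(logits, target)
b = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(torch.allclose(a, b))  # True: cross_entropy = log_softmax + nll_loss
```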
#!/usr/bin/env python3

import torch
import torch.nn.functional as F
from torch.autograd import grad
from torchvision import datasets, transforms # for datasets and transformations
from torch.utils.data import DataLoader # to load data


'''
What a training loop looks like:
- any neural network / deep learning class has torch.nn.Module as a base class
- all layers are initialised in the __init__ method
- there is a forward pass, where the y = mx + c magic happens
- there is a loss calculation phase, and .backward() for backprop
- and finally the weight update
'''
class MNIST(torch.nn.Module):
    '''
    h1 = ReLU(XW1 + b1)
    h2 = ReLU(h1W2 + b2)
    logits = h2W3 + b3
    '''
    def __init__(self, h1=256, h2=128):
        super().__init__()
        self.fc1 = torch.nn.Linear(28*28, h1)
        self.fc2 = torch.nn.Linear(h1, h2)
        self.fc3 = torch.nn.Linear(h2, 10)

    def forward(self, x):
        # x: [B, 1, 28, 28] (batch size, channels, height, width)

        x = x.view(-1, 28*28) # flatten → [B, 784]
        x = F.relu(self.fc1(x)) # hidden layer 1 with ReLU activation
        x = F.relu(self.fc2(x)) # hidden layer 2 with ReLU activation
        x = self.fc3(x)
        return x


transform = transforms.ToTensor() # transform to convert images to tensors

# make the training and testing set
train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True) 
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform)  

# load training and testing set via DataLoader 
# parameters, batch_size = how many images does the model see at once
train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=1000, shuffle=False)  

model = MNIST()
# GPU selection
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


# SGD with momentum; the learning rate controls how big a step we take toward the loss minimum
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
num_epochs = 5 # full passes over the training set

for epoch in range(num_epochs):
    model.train() # set the model to training mode (important)
    for batch_idx, (data, target) in enumerate(train_loader): # data: input batch, target: the labels we want predicted
        data, target = data.to(device), target.to(device) # move the batch to the same device as the model
        optimizer.zero_grad() # zero the gradients (necessary cleanup, otherwise they accumulate)
        output = model(data) # forward pass
        loss = F.cross_entropy(output, target) # compute the loss
        loss.backward() # backward pass
        optimizer.step() # update the weights

        if batch_idx % 100 == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}], Step [{batch_idx+1}/{len(train_loader)}], Loss: {loss.item():.4f}')

    model.eval() # set the model to evaluation mode 
    correct = 0
    total = 0
    with torch.no_grad():
        # evaluate against the test dataset
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            _, predicted = torch.max(output.data, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')

# Save the model checkpoint
torch.save(model.state_dict(), 'mnist_model.pth')
  • Output (trimmed, showing first and last epoch):
Epoch [1/5], Step [1/938], Loss: 2.3024
...
Epoch [1/5], Step [901/938], Loss: 0.3475
Accuracy of the model on the test images: 92.91%
...
Epoch [5/5], Step [901/938], Loss: 0.0206
Accuracy of the model on the test images: 97.17%
  • Now using this to evaluate on a specific number sample
import torchvision
import matplotlib.pyplot as plt
import numpy as np
def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# Load the model for inference
model = MNIST()
model.load_state_dict(torch.load('mnist_model.pth')) # load the pth model you saved

model.eval()
# Get a sample from the test dataset
dataiter = iter(test_loader)
images, labels = next(dataiter)

# Show the image
imshow(torchvision.utils.make_grid(images[:1]))

# Predict the label
output = model(images[:1]) # slice out one image: prediction on a single example
_, predicted = torch.max(output, 1)
print(f'Predicted Label: {predicted.item()}')
  • Output
> python3 ./main.py
Predicted Label: 7

So our model works just like it did in NumPy, but now it's far fewer lines of code.

doing it with raw matrices (no nn.Linear)

Same training loop, but the model defines W and b manually with nn.Parameter instead of using nn.Linear. Just to prove there’s no magic.
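The key is nn.Parameter: wrapping a tensor in it is what registers it with the module, so the optimizer sees it via model.parameters(). A tiny sketch (Tiny is a made-up class, just for illustration):

```python
import torch
import torch.nn as nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.W = nn.Parameter(torch.randn(3, 2))  # registered: shows up in parameters()
        self.buf = torch.randn(3, 2)              # plain tensor: invisible to the optimizer

m = Tiny()
print([name for name, _ in m.named_parameters()])  # ['W']
```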

#!/usr/bin/env python3
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

class MLP2Hidden(torch.nn.Module):
    def __init__(self, h1=256, h2=128):

        super().__init__()
        # Manually define weights and biases
        self.W1 = nn.Parameter(torch.randn(784, h1) * 0.01)
        self.b1 = nn.Parameter(torch.zeros(h1))
        self.W2 = nn.Parameter(torch.randn(h1, h2) * 0.01)
        self.b2 = nn.Parameter(torch.zeros(h2))
        self.W3 = nn.Parameter(torch.randn(h2, 10) * 0.01)
        self.b3 = nn.Parameter(torch.zeros(10))

    def forward(self, x):
        x = x.view(-1, 784) # Flatten

        # Layer 1: X @ W + b
        z1 = x @ self.W1 + self.b1
        h1 = F.relu(z1)

        # Layer 2
        z2 = h1 @ self.W2 + self.b2
        h2 = F.relu(z2)

        # Layer 3 (output)
        logits = h2 @ self.W3 + self.b3
        return logits

transform = transforms.ToTensor()
train_ds = datasets.MNIST(root="data", train=True, download=True, transform=transform)
test_ds = datasets.MNIST(root="data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_ds, batch_size=128, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=128)


model = MLP2Hidden()
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    total_loss = 0
    for x, y in train_loader:
        optimizer.zero_grad()
        prediction = model(x)
        loss = loss_fn(prediction, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    avg_loss = total_loss / len(train_loader)
    print(f"Epoch {epoch+1}, Loss: {avg_loss:.4f}")


# now prediction

def test():
    model.eval()
    with torch.no_grad():
        x, y = next(iter(test_loader))
        prediction = model(x)
        print("Shape: ", prediction.shape)  # [B, 10]
        loss = loss_fn(prediction, y)
        print("Loss: ", loss.item())
        predicted_classes = torch.argmax(prediction, dim=1)
        print("True labels:", y[:10].tolist())
        print("Pred labels:", predicted_classes[:10].tolist())
        accuracy = (predicted_classes == y).float().mean()
        print("Accuracy: ", accuracy.item())

def main():
    test()

main()

Loss converges as we train longer:

Epoch 1, Loss: 2.3016
...
Epoch 10, Loss: 0.5778
...
Epoch 20, Loss: 0.3305
True labels: [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]
Pred labels: [7, 2, 1, 0, 4, 1, 4, 9, 6, 9]

backprop math for 2-hidden-layer MLP

Mapping the PyTorch code directly to the math:


model definition

Input:

  • \[X \in \mathbb{R}^{B \times 784}\]

Parameters:

  • \[W_1 \in \mathbb{R}^{784 \times H_1}, \quad b_1 \in \mathbb{R}^{H_1}\]
  • \[W_2 \in \mathbb{R}^{H_1 \times H_2}, \quad b_2 \in \mathbb{R}^{H_2}\]
  • \[W_3 \in \mathbb{R}^{H_2 \times 10}, \quad b_3 \in \mathbb{R}^{10}\]

forward pass

\[\begin{aligned} z_1 &= XW_1 + b_1 \\ h_1 &= \text{ReLU}(z_1) \\ z_2 &= h_1W_2 + b_2 \\ h_2 &= \text{ReLU}(z_2) \\ z_3 &= h_2W_3 + b_3 \end{aligned}\]

\( z_3 \) = logits (raw scores before softmax).


loss (cross-entropy)

For one sample:

\[L = -\sum_{k=1}^{10} y_k \log(\hat{y}_k)\]

where:

\[\hat{y} = \text{softmax}(z_3)\]

key gradient identity (softmax + cross entropy)

\[\boxed{ \frac{\partial L}{\partial z_3} = \hat{y} - y }\]

This is why you don’t implement softmax in the forward pass - CrossEntropyLoss already handles it.
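A quick numeric check of the identity with autograd (the logits and target here are made up, just for the test):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
z3 = torch.randn(1, 10, requires_grad=True)  # hypothetical logits for one sample
target = torch.tensor([4])

loss = F.cross_entropy(z3, target)
loss.backward()

y_hat = torch.softmax(z3.detach(), dim=1)
y_onehot = F.one_hot(target, num_classes=10).float()
# autograd reproduces the closed form dL/dz3 = y_hat - y
assert torch.allclose(z3.grad, y_hat - y_onehot, atol=1e-6)
print("identity holds")
```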


backpropagation

output layer (layer 3)

\[\begin{aligned} \delta_3 &= \hat{y} - y \quad (\text{Error at output}) \\ \frac{\partial L}{\partial W_3} &= h_2^\top \delta_3 \\ \frac{\partial L}{\partial b_3} &= \sum_{i=1}^{B} \delta_3^{(i)} \end{aligned}\]

backprop through ReLU (layer 2)

ReLU derivative:

\[\text{ReLU}'(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}\] \[\delta_2 = (\delta_3 W_3^\top) \odot \mathbb{1}(z_2 > 0)\]

hidden layer 2 params

\[\begin{aligned} \frac{\partial L}{\partial W_2} &= h_1^\top \delta_2 \\ \frac{\partial L}{\partial b_2} &= \sum_{i=1}^{B} \delta_2^{(i)} \end{aligned}\]

backprop through ReLU (layer 1)

\[\delta_1 = (\delta_2 W_2^\top) \odot \mathbb{1}(z_1 > 0)\]

hidden layer 1 params

\[\begin{aligned} \frac{\partial L}{\partial W_1} &= X^\top \delta_1 \\ \frac{\partial L}{\partial b_1} &= \sum_{i=1}^{B} \delta_1^{(i)} \end{aligned}\]
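All of the equations above can be checked against autograd directly. A minimal sketch (random data, small hidden sizes chosen just for the check; the 1/B factor appears because cross_entropy averages over the batch):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, D, H1, H2, C = 8, 784, 16, 8, 10
X = torch.randn(B, D)
y = torch.randint(0, C, (B,))

W1 = (torch.randn(D, H1) * 0.01).requires_grad_()
b1 = torch.zeros(H1, requires_grad=True)
W2 = (torch.randn(H1, H2) * 0.01).requires_grad_()
b2 = torch.zeros(H2, requires_grad=True)
W3 = (torch.randn(H2, C) * 0.01).requires_grad_()
b3 = torch.zeros(C, requires_grad=True)

# forward pass + autograd backward
z1 = X @ W1 + b1; h1 = torch.relu(z1)
z2 = h1 @ W2 + b2; h2 = torch.relu(z2)
z3 = h2 @ W3 + b3
loss = F.cross_entropy(z3, y)  # mean over the batch
loss.backward()

# manual backprop, straight from the equations above
y_hat = torch.softmax(z3.detach(), dim=1)
d3 = (y_hat - F.one_hot(y, num_classes=C).float()) / B  # delta_3, /B from mean reduction
dW3, db3 = h2.detach().T @ d3, d3.sum(0)
d2 = (d3 @ W3.detach().T) * (z2.detach() > 0).float()   # delta_2, ReLU mask
dW2, db2 = h1.detach().T @ d2, d2.sum(0)
d1 = (d2 @ W2.detach().T) * (z1.detach() > 0).float()   # delta_1
dW1, db1 = X.T @ d1, d1.sum(0)

for auto, manual in [(W1.grad, dW1), (b1.grad, db1), (W2.grad, dW2),
                     (b2.grad, db2), (W3.grad, dW3), (b3.grad, db3)]:
    assert torch.allclose(auto, manual, atol=1e-6)
print("manual backprop matches autograd")
```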

gradient descent update

For any parameter \( \theta \):

\[\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}\]

\( \eta \) = learning rate.


mapping to PyTorch code

PyTorch code          Mathematical meaning
x @ W + b             affine transformation
F.relu(z)             $\max(0, z)$
CrossEntropyLoss      softmax + NLL
loss.backward()       chain rule
param.grad            $\frac{\partial L}{\partial \theta}$
optimizer.step()      gradient descent

key insight

PyTorch autograd runs the exact same backprop equations above, node by node, via the chain rule.

Nothing is approximated: autograd is exact reverse-mode automatic differentiation, not finite differences.


custom dataloaders


Building your own Dataset class. Five training examples, two features each:

from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch
X_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])
y_train = torch.tensor([0, 0, 0, 1, 1])

X_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])

y_test = torch.tensor([0, 1])

class ToyDataset(Dataset):
    def __init__(self, X, y, transform=None):
        self.X = X
        self.y = y
        self.transform = transform

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        image = self.X[idx]
        label = self.y[idx]
        if self.transform:
            image = self.transform(image)
        return image, label

train_ds = ToyDataset(X_train, y_train)
test_ds = ToyDataset(X_test, y_test)

torch.manual_seed(123)

train_loader = DataLoader(
    dataset=train_ds,
    batch_size=2,
    shuffle=True,
    num_workers=0
)


test_loader = DataLoader(
    dataset=test_ds,
    batch_size=2,
    shuffle=False,
    num_workers=0
)

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx+1}:", x, y)

'''
Batch 1: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
Batch 2: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 3: tensor([[ 2.7000, -1.5000]]) tensor([1])
'''

One full pass through the training data = one epoch. num_workers > 0 loads batches in parallel so your GPU doesn’t sit idle waiting for data.
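Notice Batch 3 above has just one example. If that ragged final batch is a problem, DataLoader's drop_last=True removes it; a minimal sketch with a made-up 5-example TensorDataset:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# 5 examples, batch_size=2 -> the final batch would have only 1 example
ds = TensorDataset(torch.randn(5, 2), torch.tensor([0, 0, 0, 1, 1]))

loader = DataLoader(ds, batch_size=2, shuffle=True)
print(len(loader))  # 3 (last batch is ragged, size 1)

loader = DataLoader(ds, batch_size=2, shuffle=True, drop_last=True)
print(len(loader))  # 2 (ragged final batch dropped)
```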


training loop (recap)

Same pattern, toy model:

import torch.nn.functional as F


torch.manual_seed(123)
model = NeuralNetwork(num_inputs=2, num_outputs=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3

for epoch in range(num_epochs):

    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)

        loss = F.cross_entropy(logits, labels) # Loss function

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        ### LOGGING
        print(f"Epoch: {epoch+1:03d}/{num_epochs:03d}"
              f" | Batch {batch_idx:03d}/{len(train_loader):03d}"
              f" | Train/Val Loss: {loss:.2f}")

    model.eval()
    # Optional model evaluation
'''
Epoch: 001/003 | Batch 000/002 | Train/Val Loss: 0.75
Epoch: 001/003 | Batch 001/002 | Train/Val Loss: 0.65
Epoch: 002/003 | Batch 000/002 | Train/Val Loss: 0.44
Epoch: 002/003 | Batch 001/002 | Train/Val Loss: 0.13
Epoch: 003/003 | Batch 000/002 | Train/Val Loss: 0.03
Epoch: 003/003 | Batch 001/002 | Train/Val Loss: 0.00
'''

SGD with lr=0.5. Logits go directly into cross_entropy (applies softmax internally). loss.backward() computes gradients, optimizer.step() applies them. zero_grad() every iteration or gradients accumulate.
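To see why zero_grad() matters, here's a minimal sketch with a single made-up weight. Skipping it makes each .backward() add onto the existing .grad:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)

# skip zero_grad and the gradients add up across iterations
for _ in range(3):
    loss = (3 * w).sum()  # dloss/dw = 3 every time
    loss.backward()
print(w.grad)  # tensor([9.]) -- three backward passes accumulated

w.grad.zero_()  # what optimizer.zero_grad() does for every parameter
loss = (3 * w).sum()
loss.backward()
print(w.grad)  # tensor([3.]) -- just the current batch's gradient
```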

# After training -> evaluate
model.eval()

with torch.no_grad():
    outputs = model(X_train)

print(outputs)

'''
tensor([[ 2.8569, -4.1618],
        [ 2.5382, -3.7548],
        [ 2.0944, -3.1820],
        [-1.4814,  1.4816],
        [-1.7176,  1.7342]])
'''
probas = torch.softmax(outputs, dim=1)
print(probas)

'''
tensor([[    0.9991,     0.0009],
        [    0.9982,     0.0018],
        [    0.9949,     0.0051],
        [    0.0491,     0.9509],
        [    0.0307,     0.9693]])
first value (column) means that the training example has a 99.91% probability of belonging to class 0 and a 0.09% probability of belonging to class 1.
'''
predictions = torch.argmax(probas, dim=1)
print(predictions) # tensor([0, 0, 0, 1, 1])

# Just use argmax of outputs along first dimension (their actual values):
predictions = torch.argmax(outputs, dim=1)
print(predictions) # tensor([0, 0, 0, 1, 1])

# Now we compare this to true training labels to see if model is correct
predictions == y_train # tensor([True, True, True, True, True])


# Better way to do this:
torch.sum(predictions == y_train) # output = 5 (100% accuracy)

cleaner eval function:

def compute_accuracy(model, dataloader):

    model = model.eval()
    correct = 0.0
    total_examples = 0

    for idx, (features, labels) in enumerate(dataloader):

        with torch.no_grad():
            logits = model(features)

        predictions = torch.argmax(logits, dim=1)
        compare = labels == predictions
        correct += torch.sum(compare)
        total_examples += len(compare)

    return (correct / total_examples).item()

'''
compute_accuracy(model, train_loader) # result = 1.0
compute_accuracy(model, test_loader) # result = 1.0
All correct
'''

  • torch.save(model.state_dict(), "model.pth") - state_dict maps each layer to its weights/biases.

  • GPU transfer:

# New: Define a device variable that defaults to a GPU.
device = torch.device("cuda")
# New: Transfer the model onto the GPU.
model.to(device)
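To see what state_dict actually contains, a small sketch with a throwaway Sequential model:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(4, 3),
    torch.nn.ReLU(),
    torch.nn.Linear(3, 2),
)
# state_dict is an ordered dict mapping parameter names to tensors;
# keys are "<layer index>.<weight|bias>" for Sequential models
print(list(model.state_dict().keys()))
# ['0.weight', '0.bias', '2.weight', '2.bias']
```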

GPU and dtypes

Tensors default to float32 on CPU. Move to GPU for parallelism:


x = torch.ones(16, 32)
w = torch.ones(32, 2)
y = x @ w
assert y.size() == torch.Size([16, 2]) # true


Operations happen per-example in a batch and per-token in a sequence:

x = torch.ones(4, 8, 16, 32)
w = torch.ones(32, 2)
y = x @ w
assert y.size() == torch.Size([4, 8, 16, 2]) 
# In this case, we iterate over values of the first 2 dimensions of x and multiply by w.
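To make that "iterate over the leading dims" claim concrete, here's the same result computed with explicit Python loops:

```python
import torch

x = torch.ones(4, 8, 16, 32)
w = torch.ones(32, 2)
y = x @ w  # broadcasted matmul over the leading (4, 8) dims

# same thing with explicit loops, to show what the broadcast is doing
y_loop = torch.stack([torch.stack([x[i, j] @ w for j in range(8)])
                      for i in range(4)])
assert torch.equal(y, y_loop)
print(y.shape)  # torch.Size([4, 8, 16, 2])
```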

FLOPS

A FLOP = one floating-point addition or multiplication. Two confusingly similar acronyms:

  • FLOPs = floating-point operations (total compute done)
  • FLOP/s = floating-point operations per second (hardware speed)

  • Training GPT-3 (2020) took 3.14e23 FLOPs
  • Training GPT-4 (2023) is speculated to have taken 2e25 FLOPs
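As a back-of-envelope example with the 784 -> 256 -> 128 -> 10 MLP from earlier (counting one multiply and one add per weight, ignoring biases and activations):

```python
# a Linear(in, out) layer costs roughly 2 * in * out FLOPs per example
flops = 2 * (784 * 256 + 256 * 128 + 128 * 10)
print(f"{flops:,}")  # 469,504 FLOPs per forward pass per example
```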

recap

# objective: y = 0.5 * (x @ w - 5)^2
# Forward pass: compute loss
x = torch.tensor([1., 2, 3])
w = torch.tensor([1., 1, 1], requires_grad=True)  # Want gradient
pred_y = x @ w
loss = 0.5 * (pred_y - 5).pow(2)
# Backward pass: compute gradients
loss.backward()
assert loss.grad is None
assert pred_y.grad is None
assert x.grad is None
assert torch.equal(w.grad, torch.tensor([1., 2., 3.]))

additional reading

PyTorch Internals

full flow (putting it all together)

# input layer: 50 inputs
# 1st hidden layer: 30 nodes plus a bias (edges represent weight connections)
# 2nd hidden layer: 20 nodes plus a bias
# output layer: 3 outputs

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader

class NeuralNetwork(torch.nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(NeuralNetwork, self).__init__()
        self.layers = torch.nn.Sequential(
            # 1st hidden layer
            torch.nn.Linear(num_inputs, 30),
            torch.nn.ReLU(),
            # 2nd hidden layer
            torch.nn.Linear(30, 20),
            torch.nn.ReLU(),
            # output layer
            torch.nn.Linear(20, num_outputs),
        )
    
    def forward(self, x):
        return self.layers(x) # logits
    
model = NeuralNetwork(50, 3)
# print(model)

num_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total number of trainable model parameters: {num_parameters}")
 
# print(model.layers[0].weight)
# print(model.layers[0].weight.shape) # 30, 50
# print(model.layers[0].bias.shape) # 30

    

torch.manual_seed(123)
# model = NeuralNetwork(50, 3)
# print(model.layers[0].weight)

x = torch.rand((1, 50)) # our network expects 50-dimensional feature vectors
out = model(x)
print(out) # 3 outputs

# disable gradient tracking, use for inference
with torch.no_grad():
    out=model(x)
print(out)

# class probabilities
with torch.no_grad():
    out=torch.softmax(model(x), dim=1)
print(out)

# 5 training examples with two features each: 3 belong to class 0, 2 to class 1; plus a test set of two entries.
x_train = torch.tensor([
    [-1.2, 3.1],
    [-0.9, 2.9],
    [-0.5, 2.6],
    [2.3, -1.1],
    [2.7, -1.5]
])
y_train = torch.tensor([0, 0, 0, 1, 1])
x_test = torch.tensor([
    [-0.8, 2.8],
    [2.6, -1.6],
])
y_test = torch.tensor([0, 1])

# toy dataset
class ToyDataset(Dataset):
    def __init__(self, x, y):
        self.features = x
        self.labels = y

    def __getitem__(self, index):
        one_x = self.features[index]
        one_y = self.labels[index]
        return one_x, one_y
    
    def __len__(self):
        return self.labels.shape[0]
    
train_ds = ToyDataset(x_train, y_train)
test_ds = ToyDataset(x_test, y_test)

# purpose: use it to instantiate dataloader 
# print(len(train_ds)) # 5

torch.manual_seed(123)
train_loader = DataLoader(dataset=train_ds, batch_size=2, shuffle=True, num_workers=0)
test_loader = DataLoader(dataset=test_ds, batch_size=2, shuffle=False, num_workers=0)

for idx, (x, y) in enumerate(train_loader):
    print(f"Batch {idx + 1}: ", x, y)


'''
Note that we specified a batch size of 2 above, but the 3rd batch only contains a single example, because five training examples are not evenly divisible by 2. In practice, a substantially smaller final batch can disturb convergence during training. To prevent this, set drop_last=True, which drops the last incomplete batch in each epoch:

Batch 1: tensor([[-1.2000,  3.1000],
        [-0.5000,  2.6000]]) tensor([0, 0])
Batch 2: tensor([[ 2.3000, -1.1000],
        [-0.9000,  2.9000]]) tensor([1, 0])
'''


# Training loop
torch.manual_seed(123)
model = NeuralNetwork(2, 3)
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (features, labels) in enumerate(train_loader):

        logits = model(features)
        loss = F.cross_entropy(logits, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        print(f"Epoch {epoch+1}/{num_epochs}, Batch {batch_idx+1}/{len(train_loader)}, Loss: {loss.item():.4f}")

    model.eval()

    with torch.no_grad():
        outputs = model(x_train)

    torch.set_printoptions(sci_mode=False)
    prob = torch.softmax(outputs, dim=1)
    print(prob)
    '''
    The first column means the training example has a 99.91% probability of belonging to class 0 and a 0.09% probability of belonging to class 1. (The set_printoptions call is used here to make the outputs more legible.)
    '''
    # convert these into class label predictions using argmax (returns the index of the highest value in each row if we set dim=1, and in each column if dim=0)
    predictions = torch.argmax(prob, dim=1)
    print(predictions) # [0,0,0,1,1] our desired output
    # verifying
    print(f"Number of correct predictions out of 5: {torch.sum(predictions==y_train)}")

    # print(outputs)

def compute_accuracy(model, dataloader):
    model = model.eval()
    correct= 0.0
    total_examples = 0
    for idx, (features, labels) in enumerate(dataloader):
        with torch.no_grad():
            logits = model(features)
        predictions = torch.argmax(logits, dim=1)
        compare = (labels == predictions)
        correct += torch.sum(compare)
        total_examples += len(compare)
    return (correct / total_examples).item()

print(compute_accuracy(model, train_loader)) # 1.0 since all are correct predictions

# saving the model
torch.save(model.state_dict(), 'model.pth')


# loading the model
model = NeuralNetwork(2, 3)  # architecture needs to match the saved model exactly
print(model.load_state_dict(torch.load('model.pth'))) # <All keys matched>

tensor_1 = torch.tensor([1., 2., 3.])
tensor_2 = torch.tensor([4., 5., 6.])

print(tensor_1 + tensor_2)
# using cuda
# tensor_1 = tensor_1.to('cuda')
# tensor_2 = tensor_2.to('cuda')
# print(tensor_1 + tensor_2)

# device = torch.device("cuda")
# New: Transfer the model onto the GPU. 
# model.to(device)
# features, labels = features.to(device), labels.to(device)    

# better method: device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
