Performance
MiniTorch is slow compared to PyTorch. That's by design - the goal is readable code, not speed. This page explains where the time goes and what PyTorch does differently.
Benchmark
On a typical machine, for 100 epochs of linear regression on 100 samples:
| Framework | Time |
|---|---|
| MiniTorch | ~25ms |
| NumPy (manual) | ~12ms |
| PyTorch | ~5ms |
MiniTorch is roughly 2x slower than hand-written NumPy and 5x slower than PyTorch. The gap gets worse on larger models.
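The script behind these numbers isn't shown on this page. As a rough sketch of how the NumPy row might be measured, assuming full-batch gradient descent on synthetic data (the function name `fit_linear_numpy`, the learning rate, and the data are made up for illustration):

```python
import time
import numpy as np

def fit_linear_numpy(x, y, epochs=100, lr=0.1):
    """Full-batch gradient descent for y ~ x @ w + b, written by hand."""
    n, d = x.shape
    w = np.zeros((d, 1))
    b = 0.0
    for _ in range(epochs):
        err = x @ w + b - y
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2.0 * (x.T @ err) / n
        grad_b = 2.0 * err.mean()
        w -= lr * grad_w
        b -= lr * grad_b
    loss = float(((x @ w + b - y) ** 2).mean())
    return w, b, loss

# Hypothetical data: 100 samples from y = 3x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
y = 3.0 * x + 1.0 + 0.01 * rng.normal(size=(100, 1))

start = time.perf_counter()
w, b, loss = fit_linear_numpy(x, y)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"100 epochs: {elapsed_ms:.2f} ms, final loss {loss:.5f}")
```

Timings from a single run like this are noisy; repeating the measurement and taking the median gives a steadier number.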
Where the time goes
1. Python object creation
Every operation creates a new Tensor object with a closure for its backward function. A simple y = x * w + b creates 3 new Tensor objects, 3 closures, 3 sets for _prev, and 3 strings for _op.
PyTorch does this bookkeeping in C++, where creating the equivalent graph node is far cheaper than creating Python objects.
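To make the per-operation cost concrete, here is a stripped-down stand-in for a tensor class. It is not MiniTorch's actual `Tensor`: the `ToyTensor` name and the `created` counter are hypothetical, and only the `_prev`, `_op`, and `_backward` attributes mirror the names used above.

```python
import numpy as np

class ToyTensor:
    created = 0  # class-level counter, for illustration only

    def __init__(self, data, _prev=(), _op=""):
        self.data = np.asarray(data, dtype=np.float32)
        self.grad = np.zeros_like(self.data)
        self._prev = set(_prev)           # a new set per tensor
        self._op = _op                    # label of the op that produced it
        self._backward = lambda: None     # a new function object per tensor
        ToyTensor.created += 1

    def __mul__(self, other):
        out = ToyTensor(self.data * other.data, (self, other), "*")

        def _backward():
            # Closure capturing self, other, and out.
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

x = ToyTensor([1.0, 2.0])
w = ToyTensor([0.5, 0.5])

before = ToyTensor.created
y = x
for _ in range(5):
    y = y * w  # each multiply builds one more ToyTensor
new_objects = ToyTensor.created - before
print(new_objects)  # 5
```

Each pass through the loop allocates a Python object, a set, a closure, and a NumPy array, all before any useful arithmetic is counted.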
2. One NumPy call per operation
x @ w + b makes two separate NumPy calls: one for matmul, one for add. Each call has Python-to-C overhead, and NumPy allocates a new array for each result.
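PyTorch can avoid this. For example, torch.addmm computes a matmul plus a bias add in a single call, and torch.compile can fuse chains of elementwise operations into one kernel. A fused matmul+bias skips the intermediate array and one round of call overhead, so it is faster than two separate calls.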
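The two calls and the temporary array can be spelled out in plain NumPy. This is only a step-by-step restatement of `x @ w + b`; the shapes are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 784)).astype(np.float32)
w = rng.normal(size=(784, 10)).astype(np.float32)
b = np.zeros(10, dtype=np.float32)

# What `x @ w + b` amounts to:
tmp = np.matmul(x, w)  # call 1: allocates a (100, 10) result
out = np.add(tmp, b)   # call 2: allocates a second (100, 10) result

# Same values as the one-line expression, but two separate buffers exist.
same_values = np.allclose(out, x @ w + b)
separate_buffers = not np.shares_memory(tmp, out)
print(same_values, separate_buffers)  # True True
```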
3. No in-place operations
MiniTorch always creates new arrays. There's no way to do x += 1 without creating a new tensor. This means extra memory allocation and copying.
PyTorch has in-place variants (like add_, mul_) that modify tensors without allocation.
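NumPy arrays themselves do support in-place updates, which gives a feel for what is lost by always allocating. The sketch below uses plain arrays, not MiniTorch tensors, and the variable names are arbitrary.

```python
import numpy as np

a = np.ones(4, dtype=np.float32)
original = a

a = a + 1  # out-of-place: `a` now names a freshly allocated array
allocated_new = a is not original

c = np.ones(4, dtype=np.float32)
alias = c

c += 1               # in-place: writes into the existing buffer
np.add(c, 1, out=c)  # same idea with an explicit output argument
reused_buffer = c is alias

print(allocated_new, reused_buffer)  # True True
print(original, alias)               # [1. 1. 1. 1.] [3. 3. 3. 3.]
```

Note that `alias` changed along with `c`. That aliasing is exactly what makes in-place updates awkward for autograd, because a backward closure may still need the old values.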
4. Autograd overhead
The backward pass walks a Python graph of Tensor objects, calling Python closures at each step. For a model with 100 operations, that's 100 Python function calls.
PyTorch's autograd is written in C++ and processes the graph much more efficiently.
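The walk described above might look like the following micrograd-style sketch over a toy scalar graph. `Node` and `backward` are hypothetical stand-ins rather than MiniTorch's real code; the counter shows one Python closure call per operation.

```python
class Node:
    def __init__(self, value, _prev=(), _op=""):
        self.value = value
        self.grad = 0.0
        self._prev = set(_prev)
        self._op = _op
        self._backward = lambda: None  # leaves have nothing to propagate

    def __add__(self, other):
        out = Node(self.value + other.value, (self, other), "+")

        def _backward():
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Node(self.value * other.value, (self, other), "*")

        def _backward():
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        out._backward = _backward
        return out

def backward(root):
    """Topologically sort the graph, then run each closure in reverse order."""
    order, seen = [], set()

    def visit(node):
        if node not in seen:
            seen.add(node)
            for parent in node._prev:
                visit(parent)
            order.append(node)

    visit(root)
    root.grad = 1.0
    op_calls = 0
    for node in reversed(order):
        node._backward()  # a Python-level call for every node
        if node._op:      # count only nodes produced by an operation
            op_calls += 1
    return op_calls

x, w, b = Node(2.0), Node(3.0), Node(1.0)
y = x * w + b
op_calls = backward(y)
print(op_calls, x.grad, w.grad, b.grad)  # 2 3.0 2.0 1.0
```

The recursive sort is fine for a small graph like this; a very deep graph would need an iterative version to stay under Python's recursion limit.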
5. No BLAS tuning
NumPy calls whatever BLAS library it was built against (OpenBLAS, MKL, etc.) but doesn't tune for specific matrix sizes. PyTorch ships with optimized BLAS and uses cuBLAS on GPU, with tuning for common shapes.
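To check which BLAS a given NumPy install is linked against, `numpy.show_config()` prints the build configuration. The wrapper below is just a convenience for capturing that printout as a string (`numpy_build_info` is a made-up name); the contents and layout vary by NumPy version and platform.

```python
import contextlib
import io
import numpy as np

def numpy_build_info() -> str:
    """Capture the text that np.show_config() prints to stdout."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        np.show_config()
    return buf.getvalue()

info = numpy_build_info()
print(info)  # look for entries mentioning openblas, mkl, accelerate, etc.
```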
What you can do about it
For MiniTorch specifically:
- Use batched operations: x @ w on a (100, 784) matrix is much faster per-sample than looping over individual samples
- Keep tensors in float32: float64 uses 2x memory and is slower on most hardware
- Use Adam: it often converges in fewer steps than SGD, so you may need fewer epochs
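The first two tips can be checked with plain NumPy. This is a sketch on random data with arbitrary shapes, not a MiniTorch benchmark, and the timings will differ from machine to machine.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 784)).astype(np.float32)
w = rng.normal(size=(784, 10)).astype(np.float32)

# Batched: one matmul call for all 100 samples.
start = time.perf_counter()
batched = x @ w
t_batched = time.perf_counter() - start

# Looped: one matmul call per sample, then stack the rows back together.
start = time.perf_counter()
looped = np.stack([row @ w for row in x])
t_looped = time.perf_counter() - start

results_match = np.allclose(batched, looped, rtol=1e-4, atol=1e-3)
print(f"batched {t_batched * 1e6:.0f} us, looped {t_looped * 1e6:.0f} us")

# float64 stores 8 bytes per element, float32 stores 4.
memory_ratio = x.astype(np.float64).nbytes / x.nbytes
print(results_match, memory_ratio)  # True 2.0
```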
For real work, use PyTorch. MiniTorch is for learning.
The trade-off
Everything in MiniTorch that makes it slow is also what makes it readable:
| Design choice | Cost | Benefit |
|---|---|---|
| Pure Python autograd | Slow backward pass | You can read every line |
| Closures for _backward | Memory overhead | Each op's gradient logic is right next to its forward logic |
| NumPy as the only backend | No GPU, no fusion | One dependency, works everywhere |
| New Tensor per operation | Extra allocation | No aliasing bugs, simple mental model |
PyTorch makes the opposite trade-offs: it is fast, but its source is millions of lines of C++ and Python spread across thousands of files.