scaling laws
OpenAI's 2020 paper Scaling Laws for Neural Language Models studies what happens to model performance as you scale parameters, data, and compute.

Scaling parameters alone, data alone, or both together (as Chinchilla did): same compute budget, different loss.
Scaling laws: simple predictive rules for how language model performance behaves as you scale.
HOW WE TEST:
- OLD: Tune hyper-parameters on big models
- NEW: Tune on small models -> extrapolate to large ones
The Scaling Laws paper shows that validation and test loss decrease as the number of parameters and layers increases along with compute.
IDEA: Do all experimentation on small models with little compute -> nail the big model in one go.
Kaplan et al. (2020) quantified this with three power law equations:
\[L(N) \sim N^{-0.076}\] \[L(D) \sim D^{-0.095}\] \[L(C) \sim C^{-0.050}\]where N = number of parameters, D = dataset size (tokens), C = compute budget (FLOPs), and L = cross-entropy loss. Loss decreases as a power law as you add more parameters, data, or compute, but the small exponents mean you get diminishing returns: each 10x increase in any factor yields only a modest drop in loss.
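These exponents imply steep diminishing returns. A small sketch, using only the exponents above (the absolute constants drop out when comparing ratios):

```python
# Sketch of the Kaplan et al. power laws above: relative loss change from
# scaling one factor 10x, using the paper's exponents. Absolute constants
# are omitted; ratios don't need them.

def loss_ratio(scale_factor: float, exponent: float) -> float:
    """Ratio L(scaled) / L(baseline) for L(x) ~ x^(-exponent)."""
    return scale_factor ** (-exponent)

# 10x more parameters, data, or compute:
for name, exp in [("params N", 0.076), ("data D", 0.095), ("compute C", 0.050)]:
    print(f"10x {name}: loss falls to {loss_ratio(10, exp):.1%} of baseline")
```

A 10x increase in any single factor only cuts loss to roughly 80-90% of baseline, which is why each order of magnitude of scale is so expensive.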
Maybe intelligence = a lot of compute applied to a lot of data with a lot of parameters.
Precursors of this scaling idea date back to the 1970s.
- Training on enough data matters: GPT-3 was undertrained.
- Chinchilla (70B parameters, well under half of GPT-3's 175B, but trained on ~4x the data) performed better.
Data Scaling Laws: a formula that maps dataset size (n) to loss.
Loss vs. n is linear on a log-log plot, i.e. a power law.
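A minimal sketch of what "linear on a log-log plot" means, using synthetic data generated from an assumed power law (the constants a and b here are made up):

```python
import math

# Synthetic power-law data: L(n) = a * n^(-b). On a log-log plot this is
# log L = log a - b * log n, a straight line with slope -b.
a, b = 5.0, 0.095
ns = [10**k for k in range(3, 9)]          # dataset sizes
losses = [a * n**(-b) for n in ns]

# Least-squares line fit in log space (stdlib only, no numpy).
xs = [math.log10(n) for n in ns]
ys = [math.log10(L) for L in losses]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
print(f"fitted slope = {slope:.3f}")       # recovers -b
```

Fitting a straight line in log space is exactly how the exponents in these papers are estimated.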

Engineering Data Laws:
How does data composition (not just size) affect model performance? -> Data composition only shifts the offset of the scaling curve, not its slope (so you can run data-selection experiments on a much smaller model).
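To illustrate "offset, not slope": two hypothetical data mixes with different constant factors but the same exponent trace parallel lines on a log-log plot (all numbers here are invented):

```python
import math

# Two data mixes following L = a * n^(-b) with different constants a but
# the same exponent b give parallel lines on a log-log plot.
def log_loss(n, a, b=0.095):
    return math.log10(a) - b * math.log10(n)

def slope(a, n1=1e6, n2=1e9):
    return (log_loss(n2, a) - log_loss(n1, a)) / (math.log10(n2) - math.log10(n1))

print(slope(6.0), slope(4.5))                    # same slope for both mixes
print(log_loss(1e6, 6.0) - log_loss(1e6, 4.5))   # constant vertical offset
```

Because the gap between the two curves is constant at every n, the ranking of data mixes measured on a small model carries over to a large one.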

More questions on data:
- We have finite data: how does repeating examples affect scaling? Up to ~4 epochs, repeated data is almost as good as new data, but after that returns diminish rapidly.
- Given that repeated data is less valuable -> data selection should adapt to scale.
- Trade-off: repeat high-quality data OR include new (lower-quality) data.
how to design a huge LM
- Architecture: LSTM vs Transformer. Transformer loss keeps decreasing as we increase parameters (mixture-of-experts is the only thing better than a vanilla Transformer).
- Optimiser: Adam is much better than SGD as training progresses (adaptive learning rate: step sizes are adjusted automatically instead of being fixed).
- Depth: layers >= 6 is good; 1 vs 2 layers makes a huge difference, but after ~6 performance plateaus.
- Batch size: a larger batch means fewer gradient steps to get through the same data, but past a certain point there are diminishing returns.
- Critical batch size: the batch size beyond which adding more examples per step no longer meaningfully reduces the number of steps needed to reach a target loss (with more compute, steps can stay the same while the batch grows).
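One way to sketch this trade-off is the empirical model of McCandlish et al. (2018), where the steps needed to hit a target loss behave roughly as S(B) = S_min * (1 + B_crit / B); the S_min and B_crit values below are hypothetical:

```python
# Sketch of the critical-batch-size trade-off (McCandlish et al., 2018):
# steps to a target loss scale roughly as S(B) = S_min * (1 + B_crit / B).
# S_min and B_crit below are made-up numbers for illustration.

S_MIN = 10_000      # hypothetical floor on optimizer steps
B_CRIT = 2048       # hypothetical critical batch size

def steps_to_target(batch_size: int) -> float:
    return S_MIN * (1 + B_CRIT / batch_size)

for B in [256, 1024, 2048, 8192, 32768]:
    print(f"B={B:>6}: ~{steps_to_target(B):,.0f} steps")
```

At B = B_crit you need exactly twice the minimum number of steps; growing the batch far beyond B_crit only asymptotically approaches S_min, which is the diminishing-returns regime.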
side note
What even are parameters
A number the model can change during training. Training = adjusting numbers so predictions improve.
Linear regression:

y = wx + b

Parameters:
- w (weight)
- b (bias)

So this model has 2 parameters. More parameters = more freedom to fit data.
Simple neural layer: y = Wx + b
If:
- input dim = 4
- output dim = 3
Then:
- W has 4 × 3 = 12 parameters
- b has 3 parameters
- Total = 15 parameters
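The same count can be written as a one-line helper (a sketch, not from the original notes):

```python
# Counting parameters for the linear layer above: y = Wx + b.
def linear_layer_params(in_dim: int, out_dim: int) -> int:
    weights = in_dim * out_dim   # entries of the W matrix
    biases = out_dim             # entries of the b vector
    return weights + biases

print(linear_layer_params(4, 3))   # the 15-parameter example above
print(linear_layer_params(1, 1))   # linear regression: w and b -> 2
```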
For LLMs:
Input embeddings
│
▼
Multi-Head Attention
│ (Wq, Wk, Wv, Wo)
▼
Feedforward Network
│ (W1, W2)
▼
Output
Every W matrix is full of parameters.
When people say “LLaMA-7B has 7 billion parameters”, they mean there are 7 billion trainable numbers inside the model. Each one is:
- A floating-point value, e.g. 16-bit or 32-bit. (How do we optimise this? Quantisation, covered later: converting FP32 -> FP16 so we store each parameter in fewer bits without compromising performance.)
- Learned during training
- Frozen at inference
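A quick back-of-the-envelope on why precision matters: weight storage alone for a 7B-parameter model at different bit widths (activations and optimizer state are ignored here):

```python
# Rough memory needed just to store the weights of a 7B-parameter model
# at different precisions (ignores activations, optimizer state, etc.).
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    return n_params * bytes_per_param / 1e9

n = 7e9  # LLaMA-7B
for name, nbytes in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {weight_memory_gb(n, nbytes):.0f} GB")
```

Halving the bits per parameter halves the memory footprint, which is the whole point of FP32 -> FP16 quantisation.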
Parameters are representational capacity, not intelligence (a common misconception is to equate the two).
Parameters alone aren’t enough. The model also needs enough training data and compute.
If:
- Model is huge
- Data is small
Then:
- Model memorizes
- Poor generalization
Hence: Big models need big data.
Compute scales with both parameters and tokens: training compute C ≈ 6 × N × D FLOPs, where N = parameters and D = training tokens.
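The standard C ≈ 6ND training-compute approximation can be made concrete by plugging in Chinchilla's reported configuration:

```python
# Standard training-compute approximation: C ≈ 6 * N * D FLOPs
# (N = parameters, D = training tokens), applied to Chinchilla's
# reported configuration (70B params, 1.4T tokens).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

c = train_flops(70e9, 1.4e12)
print(f"Chinchilla: ~{c:.2e} FLOPs")   # roughly 5.9e23
```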
recap
- Scaling laws describe how LLM performance improves predictably as you increase parameters, dataset size, or training compute.
- Scale parameters without also scaling data and compute and you hit diminishing returns. Each factor depends on the others.
- The shift to GPUs let researchers scale both model size and dataset size simultaneously. Transformers made that even more efficient.
You’re always budget-constrained, so you pick which of the three knobs to turn: parameters, data, or compute. You can turn all three - there’s no impossibility theorem - but nobody has infinite money. The Chinchilla paper showed most labs were turning the wrong knob. They were making models too big and not training them on enough data. Chinchilla (70B params, 4x the data of GPT-3) outperformed the much larger GPT-3 by allocating the compute budget more wisely.
Loss ~ f(parameters, data, compute) - all three matter, but how you balance them matters more.
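A sketch of the balance Chinchilla suggests: combining C ≈ 6ND with the rough "~20 tokens per parameter" rule of thumb gives a compute-optimal split (this is an approximation, not an exact law):

```python
import math

# Chinchilla-style compute-optimal allocation, as a sketch: with the
# rule of thumb D ≈ 20 * N tokens per parameter and C ≈ 6 * N * D,
# C ≈ 120 * N^2, so N_opt ≈ sqrt(C / 120) and D_opt = 20 * N_opt.
def compute_optimal(c_flops: float):
    n_opt = math.sqrt(c_flops / 120)
    return n_opt, 20 * n_opt

n, d = compute_optimal(5.88e23)   # roughly Chinchilla's training budget
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens")
```

Feeding in roughly Chinchilla's budget recovers roughly its published shape (~70B parameters, ~1.4T tokens), which is the sense in which GPT-3 at 175B was "too big" for its 300B-token dataset.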
Different labs make different bets on this tradeoff. ChatGPT scales up parameters and broad data with massive compute - optimizing for generality. DeepSeek goes the other way: fewer parameters, higher-quality data, and more inference-time compute - optimizing for reasoning efficiency. Both obey the same scaling laws, they just allocate their budgets differently.
limitations and future
But we are running out of data:
- Running out of (high-quality) data -> ways to make synthetic data (DeepMind's AlphaGo played against itself and trained only on synthetic data)
- Reasoning models (o1) bridge this gap via chain of thought -> the longer o1 thinks, the better it performs (a new paradigm for scaling LLMs: reasoning, which needs more inference-time compute -> the current state of AI)
- Invent a new architecture? It would need to stay numerically stable at scale (none so far; only the Transformer works)
