how to read a paper (deepseekv3)
Taking a paper (DeepSeekV3) and discussing why it works
DeepSeekv3
- First off, this is a heavy-ass paper: 53 pages, the same length as Esio Trot by Roald Dahl. That's a book, not just a research paper.
- This is from my first read of the paper, plus how I usually read papers.
- Step 1: Skim abstract + figures + tables + read conclusion
- Step 2: First glance -> read and think about it
- Step 3: Look at the formulas, upload them to GPT and have it explain them in the simplest way, upload the paper to NotebookLM, ask questions (all the stupid questions you have after the first read)
- Step 4: Second read -> just to make sure you read it right the first time.
- Step 5: Third read -> probably read a blog post on it, and write something about it explaining same stuff to yourself
breaking it down using this paper
Better than Llama 3.1 405B, Qwen, and Mistral. Trained on 2048 GPUs for 2 months, cost ≈ $6M, with no loss spikes reported.
key takeaways on first glance
major innovations I see as soon as I read this paper:
- Cheap as hell
- Performs really well
- MoE LLM scaled to 671B with only 37B active per token
- Auxiliary-loss-free load-balancing for MOE routing (keeps experts balanced without hurting performance)
- Multi-Token Prediction
If a paper is long, it almost always has 3-4 real ideas and a lot of supporting engineering.
For DeepSeek-V3, the core ideas are:
- Multi-Head Latent Attention (MLA) → fixes KV-cache / inference memory
- DeepSeekMoE with aux-loss-free balancing → fixes MoE instability
- Multi-Token Prediction (MTP) → denser training signal + faster inference
- FP8 + DualPipe (systems) → makes all of this feasible at scale
Some results for motivation:

architecture

Before reading text, pause here.
Ask:
- What replaces attention? → MLA
- What replaces FFN? → MoE
- What’s new vs standard transformer? → both attention and FFN are modified
At this point, you already know where to zoom in.
- Mixture of Experts
- Multi-Head Latent Attention
- Pretraining using 14.8T tokens
- First stage: max context length = 32k
- Second stage: max context length = 128k
- Post-training: SFT + RL on the base model of DSv3 (alignment to human preferences)
- During post-training, they distill from R1 models: a "teacher" model transfers its learned knowledge to a smaller, more efficient "student" model.
- Specifically, reasoning capability from a Chain-of-Thought (CoT) model, one of the DeepSeek R1 series, is distilled into a standard LLM, here DeepSeek-V3.
- Load Balancing Strategy and Multi-Token-Prediction Objective (MTP) for performance + inference speed
- FP8 mixed-precision training (a form of quantisation)
multi-head latent attention
MLA compresses K/V into a small latent vector, stores it in a tiny cache, then reconstructs on demand - 93% memory reduction.
introduced in DeepSeek-v2 paper, fraction of resources + outperforms standard MHA
- Multi-Head attention works, but the KV cache is enormous. MLA compresses it.
- This is how the attention scores look for a normal MHA
- i = i-th attention head, j = indexes over all previous tokens in sequence (from 1 -> t)
- dh = dimension of each attention head = ( total embedding dimension of model ) / number_of_heads
- softmax = attention score (converts logits to [0-1] probability ensuring they sum to 1) ~ normalisation
- outputs from all attention heads are concatenated into a single vector and multiplied by the output weight matrix $W_o$
- This converts back to the original embedding dimension (so attention input and output dimensions match)
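The formula those bullets describe, reconstructed in LaTeX (per-head attention over all previous tokens, then an output projection; notation follows the bullets above):

$$o_{t,i} = \sum_{j=1}^{t} \mathrm{softmax}_j\!\left(\frac{q_{t,i}^{\top} k_{j,i}}{\sqrt{d_h}}\right) v_{j,i}$$

$$u_t = W_o\,[\,o_{t,1};\, o_{t,2};\, \ldots;\, o_{t,n_h}\,]$$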
- $K_t$ and $V_t$ are stored in a cache so when new token comes in, it doesn’t have to calculate k, v for all previous tokens again to compute attention scores. Since we are only having new Q every time, K and V stay the same for previous tokens. This is smart, but we do need to store this K and V for previous tokens somewhere right? We store it in a cache. KV-Cache 101.
- Storing KV Cache for every token = memory bottleneck
- Don’t just believe what you read, what do we mean by a memory bottleneck here? How to compute memory for this?
- start
- For each token:
- [K1, K2, K3, .. K_number_of_heads] = number_of_heads * dimension_head for keys
- [V1, V2, V3, .. V_number_of_heads] = number_of_heads * dimension_head for values
- Total: 2 * number_of_heads * dimension_head
- Sequence Length = L
- Total Memory -> (2 * number_of_heads * dimension_heads * Length)
- As Length grows -> 2 * number_of_heads * dimension_heads per token becomes LARGEEE.
- fin
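The arithmetic above in runnable form. The config numbers are illustrative (roughly a 7B-class dense model), not DSv3's:

```python
# KV cache per the formula above: 2 * n_heads * d_head elements per token, per layer
def kv_cache_bytes(n_heads, d_head, seq_len, n_layers, bytes_per_elem=2):
    # bytes_per_elem=2 assumes fp16/bf16 cache entries
    return 2 * n_heads * d_head * seq_len * n_layers * bytes_per_elem

# illustrative config: 32 layers, 32 heads, d_head = 128, one 4096-token sequence
print(kv_cache_bytes(32, 128, 4096, 32) / 2**30, "GiB")  # 2.0 GiB
```

2 GiB of cache for a single sequence, and it scales linearly with both sequence length and batch size. That's the bottleneck.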
- Suffers from high KV-Cache requirements (inference bottleneck)
- MLA solves this with Low Rank Key-Value Joint Compression
- Instead of storing full key-value pairs, MLA compresses them into a shared latent space and reconstructs K and V only when needed
# Original MHA
key = Weight_K @ input_token # Full size key
value = Weight_V @ input_token # Full size value
# MLA Compression:
compressed_kv = Weight_down @ input_token # Compressed latent vector (store only this) low dim repre that stores only essential info
# Weight_down = down projection matrix of Weight that reduces the dimension
# when needed -> reconstruct key and values
key = Weight_up_k @ compressed_kv
value = Weight_up_v @ compressed_kv
# These can also be applied to queries to reduce memory usage
- What are these Weight_down and Weight_up? Projections. Instead of storing K and V, store a small compressed vector.
You’re just projecting the K and V vectors down to a smaller dimension, storing that, and projecting back up when you need them. Standard linear algebra - nothing fancy about it.
- How to do this? Store weight_up_key, weight_up_val (Fixed weights of model, one-time storage cost)
- MLA does have large projection matrices, they’re part of the model parameters (stored once) rather than per-token memory requirements.
- Per-token memory (what we need to cache during inference) is just the small d_c-dimensional vector called compressed_kv

# d_c = compressed latent dimension (much smaller than number_of_heads * dimension_heads)
# fixed weights of the model:
Weight_up_k: (number_of_heads * dimension_heads) × d_c
Weight_up_v: (number_of_heads * dimension_heads) × d_c
# caching per token
compressed_kv: dimension d_c only
# DIFFERENCE
# mha
mha_tokens_memory = sequence_length * (2 * number_of_heads * dimension_heads)
# mla
mla_tokens_memory = sequence_length * d_c # Much smaller!
model_params = 2 * (number_of_heads * dimension_heads * d_c) # Fixed, one-time cost
- Only storing small compressed vectors for each token instead of full key-value pairs. Reduces KV Cache by 93.3%
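Plugging the document's illustrative dimensions into the two cache formulas above (d_model = number_of_heads × dimension_heads = 4096, d_c = 512) lands close to the paper's reported 93.3% figure:

```python
seq_len, d_model, d_c = 4096, 4096, 512
mha_cache = seq_len * 2 * d_model   # full K and V per token
mla_cache = seq_len * d_c           # one compressed latent per token
print(f"KV cache reduction: {1 - mla_cache / mha_cache:.1%}")  # 93.8%
```

The exact percentage depends on the real head counts and latent dimension, but the order of magnitude is the point: per-token cache shrinks by more than an order of magnitude.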
one problem: RoPE is incompatible with the naive low-rank compression, because RoPE makes keys and queries position-sensitive, which breaks the matrix-absorption trick
- RoPE is position-sensitive
- Low-rank compression is plain linear algebra
- Traditional RoPE applies position encoding to both K and Q
- This becomes a problem with compressed KV pairs: the position-dependent rotation sits between the projection matrices, and matrix multiplication isn't commutative
- So in MLA, we can't simply merge RoPE into the compressed representations
“The main issue is that RoPE is sensitive to the exact position of a token. The compression step in MLA, however, involves matrix multiplication which is not commutative (A B ≠ B A). This means you can’t just apply RoPE before compression and expect it to work correctly afterward. The positional information gets scrambled.”
To solve this, DeepSeek decouples it: they first create the compressed representations and then apply positional encodings to generate the final, position-aware keys and queries.
- Decouple positional and compressed components
- Apply RoPE after reconstruction
This is a classic example of:
A practical hack driven by math constraints
- The compressed query is projected to obtain decoupled queries
- RoPE is then applied to these to produce position-aware queries: one set of positional queries across all attention heads
- Similarly, the input token is projected to obtain a decoupled key, and RoPE is applied to make it position-aware (this key is shared across heads)
- For both queries and keys, the two vectors are concatenated (compressed representation + positional component)
- The attention score is then calculated on the concatenated vectors
# New approach - Decoupled RoPE:
# 1. Separate position-aware components
query_R = RoPE(Weight_QR @ compressed_query)  # per-head position-aware queries
key_R   = RoPE(Weight_KR @ input_token)       # ONE position-aware key, shared across all heads
# 2. Concatenate content part and positional part, per head
query_i = concat(query_C_i, query_R_i)        # compressed/content part + RoPE part
key_i   = concat(key_C_i, key_R)              # every head reuses the same key_R
# 3. Attention is computed on the concatenated vectors
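The incompatibility is easy to demonstrate with a minimal standalone RoPE sketch (illustrative, not DeepSeek's implementation): rotating after a projection is not the same as projecting after rotating, so the up-projection can't be absorbed once RoPE sits in between.

```python
import torch

def rope(x, pos, theta=10000.0):
    # rotate consecutive pairs of dims by position-dependent angles (standard RoPE)
    d = x.shape[-1]
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos * freqs
    cos, sin = torch.cos(ang), torch.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

torch.manual_seed(0)
W = torch.randn(8, 8)     # stand-in for an up-projection matrix
x = torch.randn(8)
a = rope(W @ x, pos=5)    # RoPE applied after the projection
b = W @ rope(x, pos=5)    # projection applied after RoPE
print(torch.allclose(a, b))  # False: the rotation can't be folded into W
```

Since the two orders disagree, the position-dependent rotation genuinely blocks merging the up-projection into the query path, which is exactly why the decoupled design is needed.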
memory requirements summary

let’s look at the code for MLA
'''
Replace normal multi-head attention, Reduce KV cache size, Share latent representation
Standard Attention
Q = X * Wq
K = X * Wk
V = X * Wv
Attention(Q, K, V)
MLA
Q = X * Wq
Latent = X * W_latent
K = Latent * Wk_recon
V = Latent * Wv_recon
Attention(Q, K, V)
'''
import torch
import torch.nn as nn
import torch.nn.functional as F
class MLA(nn.Module):
    def __init__(self, d_model, n_heads, d_latent, dropout=0.1):
        super().__init__()
        assert d_model % n_heads == 0
        self.h = n_heads
        self.dh = d_model // n_heads
        self.q = nn.Linear(d_model, d_model, bias=False)
        self.z = nn.Linear(d_model, d_latent, bias=False)
        self.kv = nn.Linear(d_latent, 2 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        B, T, D = x.shape
        q = self.q(x)
        k, v = self.kv(self.z(x)).chunk(2, dim=-1)

        def split(t):
            return t.view(B, T, self.h, self.dh).transpose(1, 2)

        q, k, v = map(split, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / (self.dh ** 0.5)
        attn = self.drop(attn.softmax(dim=-1))
        out = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(out)
class MLABlock(nn.Module):
    def __init__(self, d_model, n_heads, d_latent):
        super().__init__()
        self.attn = MLA(d_model, n_heads, d_latent)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model)
        )
        self.n1 = nn.LayerNorm(d_model)
        self.n2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.n1(x + self.attn(x))
        return self.n2(x + self.ffn(x))
# transformer
class MLATransformer(nn.Module):
    def __init__(self, vocab, d_model=4096, n_heads=32, d_latent=512, n_layers=24):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.blocks = nn.ModuleList(
            [MLABlock(d_model, n_heads, d_latent) for _ in range(n_layers)]
        )
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        x = self.embed(ids)
        for b in self.blocks:
            x = b(x)
        return self.head(x)
# MLA compresses keys and values through a shared latent space, reducing KV-cache memory from O(2·d_model) to O(d_latent) per token, without changing the attention equation.
# let embed_dim = 4096 and latent_dim = 512
# | Method | Stored per token |
# | ------------ | ---------------- |
# | Standard MHA | `2 × 4096` |
# | MLA | `512` |
# ~16× KV memory reduction (8192 / 512) :: this is HUGE.
MOE
MoE: only a fraction of the parameters are active per token, so per-token compute is a small slice of an equally sized dense model's cost.
The router selects the top-K experts per token (classic MoE uses top-2; DeepSeek-V3 routes each token to 8 of 256 routed experts). Different tokens activate different experts - only a fraction of parameters are used per forward pass.
Q. What are dense models?
Models where every parameter participates in processing every token (standard transformers like LLaMA).

Q. What are sparse models?
Models where only a subset of parameters is active for any given token (MoE models like DSv3).
- Instead of one huge brain doing everything, MoE uses many small specialist brains, and a router that decides which ones to consult for each token.
How a standard transformer layer works (recap): Token → Attention → Feed-Forward Network (FFN) → Next layer
- Every token uses the same FFN
- Compute cost grows linearly with model size
- Bigger model = slower + more expensive
- In Mixture of Experts, the FFN is replaced with many FFNs (experts)

Token → Attention → Router → Selected Experts → Combine → Next layer
Instead of 1 FFN for all tokens we get: N experts (e.g. 256 FFNs) from which only K experts are used per token (e.g. K = 2)
Routing is per-token, not per-sequence - different tokens in the same input can go to different experts.
Intuition
In a hospital, we have
- Cardiologist
- Neurologist
- Orthopedic
- Dermatologist
We don’t send every patient to every doctor. A nurse (the router) decides who sees whom
- MoE = hospital
- Experts = doctors
- Router = triage nurse
- Token = patient
Similarly DeepSeekv3 uses:
- Many experts per MoE layer
- Sparse activation: Only top-K experts are active per token
Experts are:
- Independent FFNs
- Same input/output shape
Outputs are weighted and summed
Examples
So for a token like “gradient”:
- Expert #17 (math)
- Expert #42 (ML) might get activated (top K here is 2)
For “Shakespeare”:
- Expert #3 (literature)
- Expert #91 (language style) might get activated
How is this better than vanilla dense model?
| Model | Total Params | Active Params per token |
|---|---|---|
| Dense | 70B | 70B |
| MoE | 230B | ~30B |
- Knowledge of a 230B model
- Cost of a ~30B model
Q. What does the router actually do? token_embedding → router → scores for each expert. Then:
- Pick top-K experts
- Normalize scores
- Dispatch token to those experts
Important: Routing happens per token, not per sentence.
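The pick / normalize / dispatch steps above in minimal form (the scores are made-up numbers):

```python
import torch

scores = torch.tensor([0.10, 0.70, 0.05, 0.90, 0.20])  # router scores for 5 experts
topv, topi = scores.topk(2)       # pick top-K experts
weights = topv / topv.sum()       # normalize scores of the selected experts
print(topi.tolist(), weights.tolist())  # experts [3, 1], weights ≈ [0.5625, 0.4375]
```

The token would then be dispatched only to experts 3 and 1, and their outputs combined with those weights.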

architecture of MOE
MOE architecture consists of the following 2 components-
- Experts
- Gating Network / Router
Token embedding hits the router, router scores all experts, top-K get selected (sparse activation), each selected expert processes the input independently, outputs are weighted by gating scores and summed. That sum replaces the standard FFN output and moves on to the next layer.

Where can this be a problem? Router collapse (same expert always picked)
- expert1: 5%
- expert2: 2%
- expert3: 1%
- expert4: 6%
- expert5: 86%
This is not balanced at all. Workload isn’t being distributed properly. If the same experts get picked:
- Others don’t train
- Capacity is wasted
- Performance drops
Q. What is the problem? Not only is the distribution of chosen experts uneven, but some experts will hardly be trained at all. This causes issues during both training and inference.
Instead, we want roughly equal importance among experts during training and inference. So: how do we stop the router from collapsing onto the same few experts?
How to solve this? Load Balancing!!!!
- DeepSeek v3 does Bias-Based Routing for DeepseekMoE

- score_for_routing = original_score + bias
- For each expert, after each training step:
  - If the expert is overloaded: bias -= γ (makes it less likely to get picked next time)
  - If the expert is underloaded: bias += γ (makes it more likely to get picked next time)
- The bias term is dynamically adjusted by γ based on whether an expert is over- or under-loaded during training. This maintains quality (the original affinity scores still drive the gating weights) while achieving balance (through bias adjustments that affect only expert selection).
Instead of adding a big auxiliary loss (which hurts quality), they:
- Add a bias term to routing scores
- Adjust it dynamically based on expert load
This is subtle but important:
- Keeps routing mostly semantic
- Gently nudges balance
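A toy sketch of this update rule (γ and the load measure are simplified stand-ins for the paper's version; as noted above, the bias only affects which experts get selected, not their gating weights):

```python
import torch

def update_bias(bias, load, gamma=0.001):
    # load: fraction of routed tokens each expert received this step
    target = 1.0 / bias.numel()      # perfectly balanced share
    bias[load > target] -= gamma     # overloaded -> picked less next step
    bias[load <= target] += gamma    # underloaded -> picked more next step
    return bias

torch.manual_seed(0)
affinity = torch.rand(16, 8)                        # 16 tokens, 8 experts
bias = torch.zeros(8)
chosen = (affinity + bias).topk(2, dim=-1).indices  # bias shifts selection only
load = torch.bincount(chosen.flatten(), minlength=8).float() / chosen.numel()
bias = update_bias(bias, load)
print(bias)  # each entry nudged by ±0.001
```

No auxiliary loss term ever enters the gradient; balance is enforced purely through this out-of-band nudge.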
When reading papers, look for:
“Does this fix introduce a new problem elsewhere?”
Here, the answer seems to be mostly no.
- For each token:
- We want many experts available, but we want to run only a few & combine their outputs intelligently.
- Routing answers two questions:
- Which experts should run?
- How much should each selected expert contribute?
Let $u_t$ (Everything the model knows so far about this token) be the embedding of the t-th token coming into the FFN/MoE layer
DeepSeek-style MoE splits experts into:
- A. Shared experts (always on)
- Capture general-purpose transformations
- B. Routed experts (conditionally on)
- Only a few are activated, captures specialized behavior
- Shared experts → no gating weights (always applied)
- Routed experts → weighted by routing scores
output = input + shared expert outputs + routed expert outputs
Here’s the formula for the same (it’s the input + the FFN outputs of the shared experts + the gated ($g_{i,t}$) FFN outputs of the routed experts)
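Reconstructed in LaTeX from that description (shared experts $\mathrm{FFN}^{(s)}$ always applied, routed experts $\mathrm{FFN}^{(r)}$ gated by $g_{i,t}$):

$$h'_t = u_t + \sum_{i=1}^{N_s} \mathrm{FFN}^{(s)}_i(u_t) + \sum_{i=1}^{N_r} g_{i,t}\,\mathrm{FFN}^{(r)}_i(u_t)$$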

how routing decides which experts to use
Step 1: Compute an affinity score for each routed expert
For each routed expert i, compute:
- s(i, t) = sigmoid(input.T * $e_i$), where $e_i$ = a learned vector representing expert i (the dot product here measures similarity, and sigmoid squashes the score to [0, 1])
The core of this formula is the dot product input.T * $e_i$. In vector math, a dot product is a measure of similarity.
This step asks: ‘How similar is the current token’s meaning to the specialty of expert i?’ The result is a similarity score.
Meaning: How suitable is expert i for token t?
Step 2: Sparsity: only keep the top-K experts. Instead of using all routed experts:
- Select the top K experts with highest scores
- Set all others to zero

Step 3: Normalize the selected experts (soft weighting)

So they sum up to 1 (makes outputs stable, prevents exploding activations)
Step 4: Apply routed experts

Final MoE output:
- Start with the original input (residual)
- Add shared expert outputs
- Add weighted routed expert outputs
Each token asks a few experts “how would you process this?”, weighs their answers, and combines them - instead of asking the entire model.
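The routing steps above, reconstructed in LaTeX (select $K_r$ routed experts out of $N_r$):

$$s_{i,t} = \mathrm{Sigmoid}\left(u_t^{\top} e_i\right)$$

$$g'_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{Topk}\left(\{s_{j,t}\},\, K_r\right) \\ 0, & \text{otherwise} \end{cases}$$

$$g_{i,t} = \frac{g'_{i,t}}{\sum_{j=1}^{N_r} g'_{j,t}}$$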
| Design choice | Reason |
|---|---|
| Dot product with e_i | Learn expert specialization |
| Sigmoid | Stable scores |
| Top-K | Sparsity & efficiency |
| Normalization | Stable training |
| Weighted sum | Smooth expert blending |
| Residual connection | Training stability |
let’s look at the code for MOE
'''
In a Mixture-of-Experts layer, each token is routed to only a small subset of experts (top-k) instead of all parameters being activated. A lightweight router assigns tokens to experts, and only the selected experts process that token. This keeps compute roughly constant while allowing the total parameter count to scale.
x_after_mla ──► router scores ─────────► top-k experts selected
│ (gating network)
▼
selected experts compute
weighted outputs → aggregated
g_i' = Router(x) # raw scores for each expert
g_i = top-k + normalize
output = Σ (g_i * Expert_i(x))
for token:
pick top-k experts
run only those
'''
# Top-K MOE
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoE(nn.Module):
    def __init__(self, d_model, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.GELU(),
                nn.Linear(4 * d_model, d_model)
            )
            for _ in range(n_experts)
        ])

    def forward(self, x):
        B, T, D = x.shape
        scores = self.router(x)  # (B,T,E)
        probs = scores.softmax(dim=-1)
        topk_val, topk_idx = probs.topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for i in range(self.k):
            expert_ids = topk_idx[..., i]
            expert_wts = topk_val[..., i].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = (expert_ids == e)
                if mask.any():
                    out[mask] += expert(x[mask]) * expert_wts[mask]
        return out
# MOE
class MoEBlock(nn.Module):
    def __init__(self, d_model, n_heads, n_experts, k):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.moe = MoE(d_model, n_experts, k)
        self.n1 = nn.LayerNorm(d_model)
        self.n2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.n1(x + self.attn(x, x, x)[0])
        return self.n2(x + self.moe(x))
# model
class SimpleMoE(nn.Module):
    def __init__(self, vocab, d_model=512, n_heads=8,
                 n_layers=6, n_experts=8, k=2):
        super().__init__()
        self.emb = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList([
            MoEBlock(d_model, n_heads, n_experts, k)
            for _ in range(n_layers)
        ])
        self.head = nn.Linear(d_model, vocab)

    def forward(self, ids):
        x = self.emb(ids)
        for l in self.layers:
            x = l(x)
        return self.head(x)
joining the dots
- What does the transformer block for DSv3 look like?
- FFN -> MOE
- Attention -> MLA (low rank key-val joint compression)
Input
↓
MLA (attention)
↓
MoE (FFN replacement)
↓
Output
┌─────────────────────────┐
│ Input Tokens (X) │
│ shape: (B, T, D) │
└─────────────────────────┘
│
▼
┌───────────────────────────────────────┐
│ Input Embeddings │
│ (or output from previous layer) │
└───────────────────────────────────────┘
│
▼
┌──────────────────┐
│ MLA Layer │
└──────────────────┘
│
┌──────────────────────────────────────┐
│ 1) Query projection Q │
│ 2) Latent projection Z │
│ 3) Reconstruct K,V from latent Z │
│ 4) Attention(Q,K,V) │
└──────────────────────────────────────┘
│
▼
┌────────────────────────┐
│ MLA Output (Attn Out) │
└────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Add & LayerNorm │
│ (residual connection) │
└──────────────────────────────────────┘
│
▼
┌──────────────────┐
│ MoE Layer │
└──────────────────┘
│
┌─────────────────────────────────────────────────┐
│ MoE Inside │
│ 1) Router computes scores for routed experts │
│ 2) Take Top-K experts per token │
│ 3) Normalize gating weights │
│ 4) Run only selected experts │
│ 5) Weighted sum of expert outputs + shared FFNs│
└─────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────┐
│ Add & LayerNorm (MoE residual) │
└──────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Output to Next Block │
└─────────────────────────┘
Now let’s look at the final optimisation DSv3 does:
multi-token prediction
- Most standard language models (GPT, LLaMA, etc.) are trained with next-token prediction (NTP)
- This is what people mean when they say the word autoregressive (one token at a time)
- Multi-token prediction is predicting multiple future tokens at once (e.g., the next 𝑛 tokens) from the same input context
- Instead of $p(x_{t+1} \mid x_{1:t})$, we model $p(x_{t+1}, x_{t+2}, \ldots, x_{t+n} \mid x_{1:t})$
- Each head predicts one future token. These predictions are trained jointly.
- During training, the model predicts multiple future tokens. At inference, these extra prediction heads can be used for speculative decoding to speed up generation.
- Shared backbone: no extra transformer compute
- Works best with long sequences
Q. Why?
- Denser learning signal
- With next-token prediction, each position yields one scalar loss.
- With multi-token prediction, each position yields n prediction losses simultaneously.
- More gradient signal per token (faster learning per example)
- Inference acceleration
- If you can predict multiple tokens in one pass, you reduce the number of sequential forward passes needed
- Fewer sequential decoding steps
Q. How does this speed up inference?
- A standard autoregressive model generates one token at a time, requiring one full forward pass of the model for each token.
- To generate 100 tokens, it needs 100 sequential passes.
- With MTP, if the model can predict, say, 3 tokens in a single pass, you might only need around 33 passes to generate the same 100 tokens. This reduction in the number of sequential steps is a major source of acceleration.
DeepSeek-V3 does the same: it doesn’t just train on next-token prediction. It also learns to predict several future tokens from each position, jointly, during training.
how is it different from the original MTP
- The original academic MTP proposal predicts all future tokens in parallel using independent heads at each position.
- DeepSeek’s implementation is different: instead of fully independent parallel heads, it predicts the additional tokens sequentially, with chained MTP modules that keep the complete causal chain at each prediction depth.
- Module k takes the hidden state from the previous depth, combines it with the embedding of the next known token, and predicts token t+k+1.
- All of these predictions still happen in the same training step, which is what provides the rich training signal.
- DeepSeek-V3 uses this chained manner to predict extra future tokens beyond the standard next token.
- Advantages
- Because the model learns to anticipate more outcomes from a single context position, it gets more training signal per token seen
- Empirical results in the paper show that multi-token prediction improves performance on generative tasks (e.g., code generation tasks)

- All losses are accumulated, giving a denser total objective per position
let’s look at the code for MTP
'''
Simplified parallel-head MTP (one shared projection predicts K future tokens;
DeepSeek's actual MTP chains sequential modules instead):
hidden_t ── linear ──► [P(t+1), P(t+2), P(t+3)]
'''
import torch
import torch.nn as nn
import torch.nn.functional as F
class MTPTransformer(nn.Module):
    def __init__(self, vocab, d_model=512, n_layers=6, n_heads=8, K=3):
        super().__init__()
        self.K = K
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(2048, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # One projection predicts K future tokens
        self.head = nn.Linear(d_model, K * vocab)
        self.vocab = vocab

    def forward(self, ids):
        B, T = ids.shape
        pos = torch.arange(T, device=ids.device).unsqueeze(0)
        x = self.tok(ids) + self.pos(pos)
        h = self.encoder(x)  # (B,T,D)
        logits = self.head(h)  # (B,T,K·V)
        return logits.view(B, T, self.K, self.vocab)  # (B,T,K,V)
# loss
def mtp_loss(logits, ids):
    """
    logits: (B, T, K, V)
    ids:    (B, T)
    """
    B, T, K, V = logits.shape
    loss = 0.0
    for k in range(K):
        pred = logits[:, :-k-1, k]  # positions that can see target t+k+1
        tgt = ids[:, k+1:]          # ground truth, shifted by k+1
        loss += F.cross_entropy(
            pred.reshape(-1, V),
            tgt.reshape(-1)
        )
    return loss
# usage
model = MTPTransformer(vocab=32000, K=3).cuda()
opt = torch.optim.AdamW(model.parameters(), 3e-4)
ids = torch.randint(0, 32000, (8, 128)).cuda()
logits = model(ids)
loss = mtp_loss(logits, ids)
loss.backward()
opt.step()
print(loss.item())
summary
Evidence?
Look for:
- Ablations (does removing X hurt?)
- Costs (not just accuracy)
- Stability (loss spikes, divergence)
DeepSeek-V3 shows:
- MTP improves downstream performance
- MoE balancing stabilizes training
- FP8 does not cause instability at scale
Every good paper has limits.
DeepSeek-V3:
- Requires massive infra (2048 H800s)
- Custom kernels are non-trivial to reproduce
- Deployment unit is still large
- Inference speed ≠ solved forever
Input
↓
MLA → fixes memory
↓
MoE → fixes compute
↓
MTP → fixes learning signal
↓
FP8 + DualPipe → makes it trainable
