understanding object detection
TL;DR: YOLO reframed object detection as a single regression problem: one forward pass through a CNN predicts bounding boxes and class probabilities for the entire image at once. This post covers the grid system, loss function, NMS, mAP, and traces the architecture evolution from YOLOv1 through v10.
You look at a street and instantly spot a car, a person, a traffic light, a dog. You don’t scan pixel by pixel. You just see objects. Getting a computer to do this was slow for a long time - scan a patch, classify it, move to the next patch, repeat thousands of times.
Then in 2015: You Only Look Once (YOLO).
One look. One forward pass. Done. This post covers how YOLO works, why it works, where it falls apart, and how it evolved from v1 to v10.
table of contents
- Pipeline
- Before YOLO
- Why YOLO: The Motivation
- YOLO High Level Overview
- YOLO Architecture
- Grid Cells & Predictions
- Bounding Box Format & Encoding
- Prediction Vector Breakdown
- Training Process
- Loss Function Explained
- Inference: From Predictions to Bounding Boxes
- Post-Processing: IOU & NMS
- Evaluation Metrics: mAP
- Performance & Results
- Limitations of YOLOv1
- Evolution: From v1 to v10
- Implementation: PyTorch Code
- Extra
pipeline
┌─────────────────────────────────────────────────────────────────────────┐
│ YOLO OBJECT DETECTION │
└─────────────────────────────────────────────────────────────────────────┘
Input Image (448×448)
│
├──► Step 1: Grid Division
│ ┌───┬───┬───┬───┬───┬───┬───┐
│ │ │ │ │ │ │ │ │ 7×7 grid
│ ├───┼───┼───┼───┼───┼───┼───┤ Each cell = 64×64 pixels
│ │ │ P │ │ │ │ │ │ P = Person center
│ ├───┼───┼───┼───┼───┼───┼───┤ D = Dog center
│ │ │ │ │ D │ │ │ │
│ └───┴───┴───┴───┴───┴───┴───┘
│
├──► Step 2: CNN Forward Pass (24 conv layers)
│ Features extracted: 7×7×1024
│
├──► Step 3: Predictions per Cell
│ Each cell outputs 30 values:
│ • 2 boxes: (x,y,w,h,conf) × 2 = 10 values
│ • 20 class probabilities = 20 values
│ Total: 7×7×30 = 1,470 predictions
│
├──► Step 4: Post-Processing
│ • Filter by confidence (threshold = 0.2)
│ • Apply Non-Maximum Suppression (NMS)
│ • Keep top predictions per class
│
└──► Final Output
┌──────────────────┐
│ Person: 95.3% │ [Bounding boxes with labels]
│ Dog: 87.6% │
└──────────────────┘
metrics
| Metric | Value | Comparison |
|---|---|---|
| Speed | 45 fps | 9× faster than Faster R-CNN |
| Accuracy | 63.4% mAP | -7% vs Faster R-CNN |
| Grid Size | 7×7 cells | 49 possible detections |
| Parameters | ~20M | Custom GoogLeNet-inspired backbone (the implementation below uses ResNet34) |
| Input Size | 448×448 | Fixed resolution |
before YOLO
traditional computer vision
Earlier approaches extracted hand-crafted mathematical features from the image matrix - where hue, contrast, and pixel density were high - and used those features to classify images, with CNN classifiers like ResNet later taking over that classification step.
The Problem: This doesn’t work well for complex scenarios like CCTV footage with:
- Dozens of objects
- Heavy occlusion
- Complex features
- Real-time requirements (running a classifier like ResNet over many image regions is far too slow, even at inference time)
two-stage detectors: R-CNN & Faster R-CNN
Two-stage detection pipeline:
Image → Backbone (CNN/ResNet) → Convolutional Feature Map
→ Stage 1: Generate Bounding Box Proposals
→ Stage 2: Classification + Box Refinement (Regression Head)
The feature map encodes what’s in the image. Stage 1 proposes where objects might be. Stage 2 refines those proposals and assigns class labels.
Why “Two-Stage”? The first pass generates a set of proposals (potential object locations), and the second pass refines these proposals to make final predictions.
Advantages:
- More accurate than previous methods
- Each component can be optimized separately
Disadvantages:
- Multi-stage pipeline is complex
- Each component trained separately
- Not suitable for real-time applications (~5 fps on GPU)
- Computationally expensive
comparison: object detection methods (2014-2016)
| Method | Year | Speed (fps) | mAP (%) | Pipeline Type | Proposals | Real-time? |
|---|---|---|---|---|---|---|
| R-CNN | 2014 | 0.02 | 66.0 | Two-stage | Selective Search (2000+) | ❌ |
| Fast R-CNN | 2015 | 0.5 | 70.0 | Two-stage | Selective Search | ❌ |
| Faster R-CNN | 2015 | 7 | 73.2 | Two-stage | RPN (300) | ❌ |
| DPM | 2015 | <1 | 30.4 | Sliding window | Dense sampling | ❌ |
| YOLOv1 | 2016 | 45 | 63.4 | Single-stage | None | ✅ |
why YOLO: the motivation
The core idea: treat object detection as a single regression problem.
Instead of:
1. Generate proposals → 2. Classify proposals
Do: Single network → Box detection + Category prediction simultaneously
key advantages of single-stage detection:
- Speed: One forward pass through the network
- Simplicity: End-to-end training with a unified loss function
- Global reasoning: The network sees the entire image, understanding context
- Real-time performance: 45+ fps (compared to 5 fps for two-stage detectors)
the trade-off:
- Slightly lower accuracy on metrics
Joseph Redmon et al. (2016) built exactly this: one network, one pass, boxes + classes out the other end.
YOLO high level overview
Object detection as a single regression problem:
concept
- Input: Image of any size (resized to 448×448)
- Process: Divide image into S×S grid cells (S=7 in the paper)
- Predict: For each grid cell, predict B bounding boxes and C class probabilities
- Output: Tensor of shape S×S×(B×5+C)
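A quick sanity check on those numbers, using the paper's settings (this is just arithmetic, not part of the model):
S, B, C = 7, 2, 20
output_shape = (S, S, B * 5 + C)      # (7, 7, 30)
total_values = S * S * (B * 5 + C)    # 1,470 predicted values per image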
why “you only look once”?
Unlike sliding window approaches (like DPM) that run a classifier at every position, or two-stage detectors that process thousands of region proposals, YOLO looks at the image once in a single forward pass.
single pass, end-to-end
- No separate proposal generation
- No batch processing of regions
- One network, one forward pass
- Direct prediction from pixels to bounding boxes and classes
Figure: YOLO’s single-stage detection pipeline - the entire image is processed once to produce bounding boxes and class predictions
Figure: The complete YOLO detection algorithm from input image to final predictions
the architecture
YOLO is a CNN. 24 convolutional layers for feature extraction, 2 fully connected layers at the end to produce the final output.
Early layers detect low-level stuff (edges, color patches). Deeper layers combine those into higher-level features (eyes, wheels, textures). The FC layers at the end take all of that and produce bounding box predictions.
Backbone (Feature Extraction):
- 24 convolutional layers with alternating 1×1 and 3×3 filters
- 4 max-pooling layers for spatial downsampling
- Leaky ReLU activation (for all layers except the last)
Detection Head:
- 2 fully connected layers (4096 neurons → 1470 neurons)
- Linear activation on the final layer

Figure: Convolutional Neural Network layers and how they extract features from images

Figure: Detailed view of convolutional operations and feature map generation

Figure: How convolution filters slide across the image to detect patterns
input → output flow
Input: 448×448×3 (RGB image)
↓
[24 Convolutional Layers with MaxPooling]
↓
Feature Map: 7×7×1024
↓
[Flatten: 7×7×1024 = 50,176 features]
↓
[FC Layer 1: 50,176 → 4096]
↓
[FC Layer 2: 4096 → 1470]
↓
[Reshape: 1470 → 7×7×30]
↓
Output: 7×7×30 tensor
why stack convolutional layers?
Purpose: Extract spatial features from the image while preserving structure.
The Problem: Simply stacking convolutions doesn't add non-linearity - without activations in between, a stack of convolutions collapses to a single linear operation.
The Solution:
- Use non-linear activation functions (Leaky ReLU) between layers
- Use 1×1 convolutions to reduce dimensionality and add non-linearity
- Use 3×3 convolutions to extract spatial features and downsample
why leaky ReLU?
Standard ReLU: f(x) = max(0, x) → can cause “dying neurons” (always output 0)
Leaky ReLU: f(x) = x if x > 0 else 0.1*x → allows small negative gradients, preventing dead neurons
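A minimal comparison in PyTorch, using the paper's 0.1 slope:
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.0, 3.0])
F.relu(x)                             # tensor([0.0, 0.0, 0.0, 1.0, 3.0]) - negatives die
F.leaky_relu(x, negative_slope=0.1)   # tensor([-0.2, -0.05, 0.0, 1.0, 3.0]) - small signal survives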
regularization techniques:
- Dropout (rate=0.5) after the first FC layer
- Data augmentation (random scaling, translation, HSV adjustments)
Figure: Complete YOLOv1 network architecture with 24 convolutional layers and 2 fully connected layers
Figure: Layer-by-layer breakdown showing filter sizes, dimensions, and feature map transformations
Figure: The full network from input (448×448×3) to output (7×7×30) tensor
how YOLO works
1. Take the image and lay a grid over it. The original YOLO uses a 7×7 grid. This divides the image into 49 cells.
2. Each grid cell gets a job. It's responsible for detecting any object whose center point falls within that cell.
3. Make a prediction for every cell. A single, powerful neural network looks at the whole image and, for every single one of the 49 grid cells, it spits out a prediction. This prediction answers a few key questions:
- "Is an object's center in here? How confident am I?"
- "If so, where is the bounding box for that object?"
- "And what class is it (a dog, a person, a car)?"
That’s it. One look, one pass through the network, and out comes a flood of predictions from all 49 cells at once. No proposals, no second stage. Just a direct mapping from image pixels to bounding boxes and class probabilities.
the grid system
Step 1: Divide the 448×448 image into a 7×7 grid (S=7 in the paper)
- Each grid cell: 64×64 pixels (448/7 = 64)
Step 2: Assign responsibility
- Each cell is responsible for detecting one object
- Which object? The one whose center point falls into that cell
Figure: Image divided into 7×7 grid with object center points marked - each cell is responsible for objects whose centers fall within it

Figure: Identifying which grid cell is responsible for each object based on center point location
Figure: Close-up view of how object center points determine grid cell responsibility

Figure: Computing the center point and assigning it to the appropriate grid cell
what does “responsible” mean?
If an object’s center point falls into a grid cell, that cell must:
- Predict the bounding box coordinates
- Predict the confidence that an object is present
- Predict the class probabilities
example
- Person’s center at (100, 200) → falls into grid cell (1, 3)
- Horse’s center at (300, 150) → falls into grid cell (4, 2)
- Grid cell (1, 3) is responsible for detecting “person”
- Grid cell (4, 2) is responsible for detecting “horse”
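Here's a minimal sketch of that assignment, assuming the 448×448 image and 7×7 grid used throughout this post:
def responsible_cell(center_x, center_y, image_size=448, S=7):
    """Return the (col, row) of the grid cell containing an object's center point."""
    cell_size = image_size / S   # 64 pixels per cell
    return int(center_x // cell_size), int(center_y // cell_size)

responsible_cell(100, 200)  # (1, 3) -> this cell is responsible for the person
responsible_cell(300, 150)  # (4, 2) -> this cell is responsible for the horse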
But there’s a catch… a big one.
What happens if the center of two objects falls into the same grid cell? Imagine a person standing directly in front of a car.
YOLOv1’s limitation: Each grid cell can only detect ONE object.
This is a fundamental rule of the original YOLO. Each cell proposes two bounding boxes (more on that below), but it can only output one set of class probabilities. Two objects in the same cell? Too bad, one gets ignored.
This is YOLOv1’s biggest weakness. Crowds, flocks of birds, anything with clustered small objects - it chokes. Later versions fix this. For now: one cell, one vote.
Fixed in YOLOv2+: Anchor boxes allow multiple detections per cell
Impact: Maximum 49 objects per image (7×7 grid). Crowded scenes (flocks of birds, dense crowds) will have missed detections.
bounding box format and encoding
Ground truth bounding box for person: (100, 200, 130, 202)
- These are large absolute values
- Hard for the network to predict directly
- Solution: Make predictions relative to the grid cell
target encoding (ground truth → YOLO format)
For a bounding box with center (x, y), width w, height h:
Center Coordinates (relative to grid cell):
x' = (x - x_anchor) / cell_width
y' = (y - y_anchor) / cell_height
Where:
- (x_anchor, y_anchor) = top-left corner of the grid cell
- cell_width = cell_height = 64 (for a 448/7 grid)
Basically: starting from the top-left corner of the grid cell, how far (as a fraction of the cell size) to the object’s center?
Width & Height (relative to entire image):
w' = w / image_width
h' = h / image_height
Where image_width = image_height = 448
Box size as a fraction of the total image.
Figure: Computing relative offsets (Δx, Δy) from grid cell top-left corner to object center
Figure: Detailed breakdown of how center coordinates are normalized relative to grid cell dimensions
Figure: Visualizing the x,y center point encoding process with actual coordinate values
Figure: Final normalized ground truth values (x’, y’, w’, h’) ready for training
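Putting the two formulas together, a minimal encoding sketch (using the person box (100, 200, 130, 202) from above, read as center x, center y, width, height):
def encode_box(x, y, w, h, image_size=448, S=7):
    """Encode an absolute-pixel box into YOLO targets: cell index + relative coords."""
    cell_size = image_size / S
    col, row = int(x // cell_size), int(y // cell_size)
    x_offset = (x - col * cell_size) / cell_size   # 0..1 within the grid cell
    y_offset = (y - row * cell_size) / cell_size
    return (col, row), (x_offset, y_offset, w / image_size, h / image_size)

encode_box(100, 200, 130, 202)
# -> cell (1, 3), targets ≈ (0.5625, 0.125, 0.290, 0.451)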
label encoding for training
For each grid cell, we create a target vector:
If cell contains an object:
- Bounding box: (x', y', w', h', confidence=1.0)
- Class: one-hot encoding [0, 0, 1, 0, ..., 0] (20 values for Pascal VOC)
If cell contains no object:
- All zeros: (0, 0, 0, 0, 0, ..., 0) (30 values total)
Figure: 30-dimensional vector format with bounding box coordinates, confidence, and class probabilities
prediction vector
what does each grid cell predict?
Each grid cell outputs a 30-dimensional vector:
Bounding Boxes (B=2 boxes):
- Box 1: (x₁', y₁', w₁', h₁', c₁) - 5 values
- Box 2: (x₂', y₂', w₂', h₂', c₂) - 5 values
Where:
- x', y' = center offset relative to the top-left corner of the grid cell (0 to 1)
- w', h' = width/height relative to the entire image (0 to 1)
- c = objectness confidence = Pr(Object) × IOU_pred^truth
Class Probabilities (C=20 for Pascal VOC):
- [p₁, p₂, ..., p₂₀] - 20 values
- These are conditional probabilities: Pr(Class_i | Object) - "What is the probability of each class, given that an object exists in this cell?"
Total: 5×2 + 20 = 30 values per grid cell
For the entire image: 7×7×30 = 1,470 values
why predict 2 boxes per cell?
The Problem: Some objects may have different aspect ratios or sizes.
The Solution: Each cell predicts 2 boxes, and during training, we assign each ground truth object to the box with the highest IOU with that object.
At inference: We keep only the box with the highest confidence score for each cell.
understanding confidence score
The confidence c encodes two things:
- Pr(Object): Probability that the box contains an object
- IOU_pred^truth: How well the predicted box fits the object
c = Pr(Object) × IOU_pred^truth
Three scenarios:
- No object in cell: c = 0
- Object present, poor fit: c = low (e.g., 0.3)
- Object present, good fit: c = high (e.g., 0.9)
Figure: How class probabilities are combined with objectness confidence to produce final class-specific confidence scores
inference: from predictions to bounding boxes
step 1: get raw predictions
Network outputs: 7×7×30 tensor
For grid cell (i, j):
- Box 1: (x₁', y₁', w₁', h₁', c₁)
- Box 2: (x₂', y₂', w₂', h₂', c₂)
- Classes: [p₁, p₂, ..., p₂₀]
step 2: convert to image coordinates
Center coordinates:
x_anchor, y_anchor = i * 64, j * 64 # top-left of grid cell
x₁ = x₁' * 64 + x_anchor
y₁ = y₁' * 64 + y_anchor
Width and height:
w₁ = w₁' * 448
h₁ = h₁' * 448
Now we have absolute pixel coordinates for the bounding box.
Figure: Converting relative YOLO format coordinates back to absolute pixel coordinates for visualization
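As a minimal sketch, the whole decoding step for one box (assuming the 7×7 grid and 448×448 input used throughout this post):
def decode_box(x_off, y_off, w_norm, h_norm, col, row, image_size=448, S=7):
    """Convert a cell-relative prediction back to absolute corner coordinates."""
    cell_size = image_size / S
    x_center = (col + x_off) * cell_size
    y_center = (row + y_off) * cell_size
    w, h = w_norm * image_size, h_norm * image_size
    return (x_center - w / 2, y_center - h / 2,   # top-left corner
            x_center + w / 2, y_center + h / 2)   # bottom-right corner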
step 3: compute class-specific confidence
For each box, multiply objectness by class probabilities:
class_confidence = c × max(p₁, p₂, ..., p₂₀)
Example:
- Box 1 confidence: c₁ = 0.85
- Highest class probability: p₁₄ = 0.9 (person class)
- Final confidence: 0.85 × 0.9 = 0.765
step 4: select best box per cell
Each cell predicted 2 boxes → keep the one with highest class-specific confidence.
Example:
- Box 1: c₁' = 0.765
- Box 2: c₂' = 0.621
- Keep Box 1, discard Box 2
Figure: Target confidence values and how they’re computed from IOU between predicted and ground truth boxes
step 5: filter by confidence threshold
Remove all boxes with confidence below a threshold (e.g., 0.5):
final_boxes = [box for box in boxes if box.confidence > 0.5]
step 6: apply NMS
Many grid cells may detect the same object → use Non-Maximum Suppression to remove duplicates.
post-processing: IOU and NMS
intersection over union (IOU)
What is IOU? A measure of overlap between two bounding boxes. It tells you how well two boxes align.
Formula:
IOU = Area of Intersection / Area of Union
Interpretation:
- IOU = 0.0 → No overlap
- IOU = 0.5 → Decent overlap
- IOU = 0.8+ → Excellent overlap (typically considered a correct detection)
Figure: Intersection over Union (IOU) calculation showing overlap between predicted box (blue) and ground truth box (green), with intersection area highlighted
converting between box formats
Center format: (x_center, y_center, width, height)
Corner format: (x₁, y₁, x₂, y₂) where (x₁,y₁) = top-left, (x₂,y₂) = bottom-right
Conversion:
x₁ = x_center - width/2
y₁ = y_center - height/2
x₂ = x_center + width/2
y₂ = y_center + height/2
computing IOU
Step 1: Find intersection rectangle coordinates
x_left = max(x₁_box1, x₁_box2)
y_top = max(y₁_box1, y₁_box2)
x_right = min(x₂_box1, x₂_box2)
y_bottom = min(y₂_box1, y₂_box2)
Step 2: Compute intersection area
if x_right < x_left or y_bottom < y_top:
intersection = 0 # Boxes don't overlap
else:
intersection = (x_right - x_left) * (y_bottom - y_top)
Step 3: Compute union area
area_box1 = (x₂_box1 - x₁_box1) * (y₂_box1 - y₁_box1)
area_box2 = (x₂_box2 - x₁_box2) * (y₂_box2 - y₁_box2)
union = area_box1 + area_box2 - intersection
Step 4: Compute IOU
iou = intersection / union

Figure: Two common bounding box representations - corner format (x₁,y₁,x₂,y₂) vs center format (x_center, y_center, w, h) and how to convert between them
IOU implementation
def iou(pred, gt):
"""
Calculate Intersection over Union between two bounding boxes.
Args:
pred: [x1, y1, x2, y2] predicted box
gt: [x1, y1, x2, y2] ground truth box
Returns:
iou: float between 0 and 1
"""
pred_x1, pred_y1, pred_x2, pred_y2 = pred
gt_x1, gt_y1, gt_x2, gt_y2 = gt
# Find intersection rectangle
x_topleft = max(pred_x1, gt_x1)
y_topleft = max(pred_y1, gt_y1)
x_bottomright = min(pred_x2, gt_x2)
y_bottomright = min(pred_y2, gt_y2)
# Check if boxes overlap at all
if x_bottomright < x_topleft or y_bottomright < y_topleft:
return 0.0
# Calculate areas
intersection = (x_bottomright - x_topleft) * (y_bottomright - y_topleft)
pred_area = (pred_x2 - pred_x1) * (pred_y2 - pred_y1)
gt_area = (gt_x2 - gt_x1) * (gt_y2 - gt_y1)
union = pred_area + gt_area - intersection
iou = intersection / union
return iou
non-maximum suppression (NMS)
The Problem: Multiple grid cells often detect the same object, creating duplicate/redundant boxes.
The Solution: Keep only the box with the highest confidence and remove all others that significantly overlap with it.

NMS algorithm
For each class separately:
- Sort all detections by confidence (descending)
- Select the box with highest confidence
- Remove all boxes with IOU > threshold (e.g., 0.5) with the selected box
- Repeat until no boxes remain
Why per-class? We want to suppress duplicate detections of the same object, but not suppress a person detection because it overlaps with a car detection.
Picking NMS Threshold
Choosing the wrong IOU threshold for NMS drastically affects performance.
Threshold too low (e.g., 0.3):
- Suppresses too many boxes
- Result: Misses objects that are close together
- Example: Two people standing next to each other → only one detected
Threshold too high (e.g., 0.7):
- Suppresses too few boxes
- Result: Multiple boxes per object (duplicates)
- Example: Single person gets 3 bounding boxes
Paper uses 0.5:
- Balances duplicate suppression and multi-object detection
- Works well for most scenarios
NMS Intuition: Keep the best box, kill everything that overlaps too much with it. Repeat until nothing’s left to suppress.
NMS implementation
def nms(detections, nms_threshold=0.5):
"""
Apply Non-Maximum Suppression to remove duplicate detections.
Args:
detections: list of [x1, y1, x2, y2, score]
nms_threshold: IOU threshold for suppression
Returns:
keep_detections: filtered list of detections
"""
# Sort detections by score (descending)
sorted_det = sorted(detections, key=lambda k: -k[-1])
keep_detections = []
while len(sorted_det) > 0:
# Keep the highest confidence box
best_box = sorted_det[0]
keep_detections.append(best_box)
# Remove this box and all boxes with high overlap
sorted_det = [
box for box in sorted_det[1:]
if iou(best_box[:-1], box[:-1]) < nms_threshold
]
return keep_detections
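A quick usage example with made-up detections for a single class (coordinates and scores are hypothetical):
# [x1, y1, x2, y2, score] per detection
detections = [
    [100, 100, 200, 220, 0.92],   # strong box on the object
    [105,  98, 205, 218, 0.81],   # near-duplicate (IOU ≈ 0.88 with the first) -> suppressed
    [320,  60, 380, 150, 0.75],   # different object, no overlap -> kept
]
nms(detections, nms_threshold=0.5)
# -> keeps the 0.92 and 0.75 boxes, drops the 0.81 duplicate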
training process
dataset: Pascal VOC
Pascal VOC 2007 + 2012:
- 20 object classes
- Contains images with bounding box annotations
- Standard benchmark for object detection
two-stage training strategy
Why two stages? To help the network learn useful features first, then fine-tune for detection.
stage 1: pretraining (classification)
- Task: Image classification on ImageNet
- Input size: 224×224
- Network: First 20 convolutional layers + 1 FC layer
- Goal: Learn general visual features
- Why? Reduces training time and improves convergence
stage 2: detection training
- Task: Object detection on Pascal VOC
- Input size: 448×448 (doubled for finer localization)
- Network: Full 24 conv layers + 2 FC layers
- Why higher resolution? Detection requires fine-grained spatial information
training hyperparameters
Epochs: ~135 epochs
Batch size: 64
Optimizer: SGD with momentum
- Momentum: 0.9
- Weight decay: 0.0005
Learning rate schedule:
- Epochs 1-5: Warm-up from 10⁻³ to 10⁻² (prevents divergence from unstable gradients)
- Epochs 6-75: 10⁻² (main training)
- Epochs 76-105: 10⁻³ (fine-tuning)
- Epochs 106-135: 10⁻⁴ (final refinement)
Why warm-up? Starting with high learning rate causes unstable gradients and divergence. Gradual warm-up stabilizes training.
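As a sketch, the paper's schedule can be expressed with PyTorch's LambdaLR (the training loop at the end of this post uses a simpler MultiStepLR without warm-up; the factors below assume the optimizer's base learning rate is 10⁻²):
from torch.optim.lr_scheduler import LambdaLR

def yolo_lr_factor(epoch):
    """Multiplier on the base LR (1e-2), stepped once per epoch."""
    if epoch < 5:                     # warm-up: ramp from ~1e-3 to 1e-2
        return 0.1 + 0.9 * epoch / 5
    elif epoch < 75:                  # main training at 1e-2
        return 1.0
    elif epoch < 105:                 # fine-tuning at 1e-3
        return 0.1
    else:                             # final refinement at 1e-4
        return 0.01

# scheduler = LambdaLR(optimizer, lr_lambda=yolo_lr_factor)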
data augmentation
To prevent overfitting and improve generalization:
- Random scaling and translation: up to 20% of original image size
- Random HSV adjustment:
- Exposure and saturation: up to 1.5× factor
- Helps with lighting variations
- Dropout: Rate = 0.5 after first FC layer
annotation format conversion
Original annotation: [x_min, y_min, x_max, y_max, class_id]
YOLO format conversion:
# Calculate center and size
center_x = (x_min + x_max) / 2
center_y = (y_min + y_max) / 2
width = x_max - x_min
height = y_max - y_min
# Normalize to 0-1 range
center_x_norm = center_x / image_width
center_y_norm = center_y / image_height
width_norm = width / image_width
height_norm = height / image_height
# Find which grid cell this object belongs to
grid_x = int(center_x_norm * 7)
grid_y = int(center_y_norm * 7)
# Calculate offsets relative to grid cell
x_offset = (center_x_norm * 7) - grid_x # 0 to 1
y_offset = (center_y_norm * 7) - grid_y # 0 to 1
# Create target: [x_offset, y_offset, width_norm, height_norm, 1.0, class_one_hot]
going back to loss function
why regression loss?
Object detection in YOLO is formulated as a regression problem: we’re predicting continuous values (box coordinates, confidence scores, class probabilities).
The loss function uses Mean Squared Error (MSE) for all components, but with careful weighting to handle class imbalance and scale differences.
three components of loss
- Localization Loss: Ensures predicted box coordinates match ground truth
- Confidence Loss: Trains confidence scores to reflect object presence and fit quality
- Classification Loss: Ensures correct class probabilities for cells with objects
Total Loss = λ_coord × L_box + L_conf + L_class
the class imbalance problem
Problem: In most images, most grid cells don’t contain objects.
- ~2-5 cells with objects
- ~44-47 cells without objects
Impact:
- Gradients from “no object” cells dominate training
- Overwhelms gradients from “object” cells
- Network learns to predict “no object” everywhere
Solution: Weight the losses differently
- λ_coord = 5: Increase weight of box coordinate loss
- λ_noobj = 0.5: Decrease weight of “no object” confidence loss
Intuition: Why λ_coord = 5 and λ_noobj = 0.5?
The Numbers Game:
- Typical image: 2-3 cells with objects, 46-47 cells without objects
- Ratio: ~20:1 (no-object : object)
Without weighting (all λ = 1):
Total confidence loss = 2 × (object conf loss) + 47 × (no-object conf loss) ≈ 2 × 0.5 + 47 × 0.1 = 5.7, and ~82% of that comes from no-object cells!
The “no object” gradient overwhelms “object” gradient → network learns to predict “no object” everywhere.
With weighting (λ_coord=5, λ_noobj=0.5):
Box coord loss: 5 × (2 cells × coord error) = 5 × 1.0 = 5.0
Object conf loss: 1 × (2 cells × conf error) = 1 × 0.5 = 0.5
No-obj conf loss: 0.5 × (47 cells × conf error) = 0.5 × 4.7 = 2.35
Total ≈ 7.85, where localization now matters most.
Why these specific values?
- λ_coord = 5: Emphasizes getting boxes right (5× more important than classification)
- λ_noobj = 0.5: De-emphasizes empty cells (2× less important than object cells)
- Balance found empirically through ablation studies in the paper
λ_coord=5 says “getting box positions right matters a lot.” λ_noobj=0.5 says “empty cells, chill out, you’re not that important.”
Wrong Lambda Values
Mistake #1: Using λ_coord = 1 (same as others)
- Result: Network struggles with localization. Boxes are detected but poorly positioned.
- Symptom: Low IOU scores even when objects are detected
Mistake #2: Using λ_noobj = 1 (same as object cells)
- Result: No-object gradient dominates, network predicts low confidence for everything
- Symptom: Very few detections, even on objects that are clearly visible
Mistake #3: Using λ_coord = 10 (too high)
- Result: Network focuses only on boxes, ignores classification
- Symptom: Good IOU but wrong class labels
How to tune: Start with paper values (5, 0.5). Only adjust if you have domain-specific reasons (e.g., small dataset → increase λ_coord to 7).
Figure: Visualization of class imbalance problem - most grid cells (gray) contain no objects while only a few (colored) contain objects, showing why we need λ weighting factors
1. localization loss (box regression)
Goal: Make predicted box coordinates close to ground truth.
Formula:
L_box = λ_coord × Σ Σ 𝟙ᵢⱼᵒᵇʲ [(xᵢ - x̂ᵢ)² + (yᵢ - ŷᵢ)²]
+ λ_coord × Σ Σ 𝟙ᵢⱼᵒᵇʲ [(√wᵢ - √ŵᵢ)² + (√hᵢ - √ĥᵢ)²]
Where:
- 𝟙ᵢⱼᵒᵇʲ = 1 if cell i contains an object and box j is responsible for it
- (x, y, w, h) = ground truth box coordinates (YOLO format)
- (x̂, ŷ, ŵ, ĥ) = predicted box coordinates
- Responsible box = the box with highest IOU with the ground truth
Why square root of w and h?
Problem: MSE treats errors equally regardless of box size.
- 10-pixel error in 100×100 box = small mistake
- 10-pixel error in 20×20 box = huge mistake
Solution: Predict √w and √h instead of w and h directly.
- Small boxes: √w changes more for same absolute change
- Large boxes: √w changes less for same absolute change
- This makes errors more balanced across different box sizes
Example:
- Box 1: w=100, √w=10, error of 10 → √110 - 10 = 0.49
- Box 2: w=25, √w=5, error of 10 → √35 - 5 = 0.92 (penalized more)
Intuition: Why √w and √h?
The Problem: Absolute errors matter more for small objects.
Imagine two prediction errors:
- Large box (200×200 pixels): Predicted width = 210, GT = 200 → error = 10 pixels
- Small box (20×20 pixels): Predicted width = 30, GT = 20 → error = 10 pixels
Both have the same absolute error (10 pixels), but the small box error is much worse (50% error vs 5% error)!
Without √: MSE treats both equally → Loss = 10² = 100 for both.
With √: Square root scaling makes errors proportional:
Large box: (√210 - √200)² = (14.49 - 14.14)² ≈ 0.12
Small box: (√30 - √20)² = (5.48 - 4.47)² ≈ 1.02
Now the small box error is ~8× larger in the loss → the network learns precision for small objects!
Square root compresses large values more than small ones. Balances the loss across different object sizes.
Forgetting the Square Root
Mistake: Predicting w and h directly instead of √w and √h
Consequence: Network struggles to learn small objects accurately. Large objects dominate the loss, making the model biased toward bigger objects.
How to spot: If your model detects large objects well but completely misses small ones, check if you’re applying sqrt() to the width and height predictions!
In code: Always use torch.sqrt(w) in your target encoding and torch.square(w_pred) when converting back.
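A tiny sketch of why this matters numerically (same 10-pixel error on a large and a small box):
import torch

image_w = 448.0
gt_w   = torch.tensor([200.0, 20.0]) / image_w   # large box, small box (normalized widths)
pred_w = torch.tensor([210.0, 30.0]) / image_w   # both predictions off by 10 pixels

sqrt_loss = (torch.sqrt(pred_w) - torch.sqrt(gt_w)) ** 2   # small box penalized ~8x more
w_decoded = torch.square(torch.sqrt(pred_w)) * image_w     # square to recover pixel widths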
Figure: The three components of YOLO loss function - localization (box coordinates), confidence (objectness), and classification (class probabilities)
2. confidence loss
Goal: Train confidence scores to reflect:
- Whether an object is present
- How well the predicted box fits the object
Formula:
L_conf = Σ Σ 𝟙ᵢⱼᵒᵇʲ (Cᵢ - Ĉᵢ)²
+ λ_noobj × Σ Σ 𝟙ᵢⱼⁿᵒᵒᵇʲ (Cᵢ - Ĉᵢ)²
Where:
- 𝟙ᵢⱼᵒᵇʲ = 1 if an object exists and box j is responsible
- 𝟙ᵢⱼⁿᵒᵒᵇʲ = 1 if no object exists in cell i
- C = target confidence (1 if object, 0 if no object)
- Ĉ = predicted confidence
- λ_noobj = 0.5 = weight for "no object" cells
Two parts:
- Object cells: Push confidence toward 1.0
- No-object cells: Push confidence toward 0.0 (but weighted less)
Target confidence for object cells:
C = Pr(Object) × IOU_pred^truth = 1.0 × IOU
So if predicted box has IOU=0.9 with ground truth, target confidence = 0.9.
Figure: Detailed breakdown of how each loss component is computed for a single grid cell prediction
3. classification loss
Goal: Predict correct class probabilities for cells containing objects.
Formula:
L_class = Σ 𝟙ᵢᵒᵇʲ Σ (pᵢ(c) - p̂ᵢ(c))²
Where:
- 𝟙ᵢᵒᵇʲ = 1 if cell i contains an object
- pᵢ(c) = ground truth probability for class c (one-hot: 1 for the correct class, 0 otherwise)
- p̂ᵢ(c) = predicted probability for class c
- Sum over all C classes (20 for Pascal VOC)
Important: This loss is only computed for cells that contain objects.
Example:
- Cell contains a person (class 14)
- Target: [0, 0, ..., 1, ..., 0] (1 at position 14)
- Predicted: [0.1, 0.05, ..., 0.8, ..., 0.02]
- Loss: Σ(target - pred)² = (0-0.1)² + ... + (1-0.8)² + ... ≈ 0.05
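The same example as a tiny sketch (class index and probability values are hypothetical):
import torch

target = torch.zeros(20); target[14] = 1.0        # one-hot: "person" at index 14
pred = torch.full((20,), 0.01); pred[14] = 0.8     # predicted class probabilities

class_loss = torch.sum((target - pred) ** 2)       # sum-squared error over the 20 classes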
complete loss function
Figure: Classification loss computation showing how class probabilities are compared with one-hot encoded ground truth labels
Figure: The complete YOLO loss function combining all three components with their respective weighting factors
Figure: Mathematical formulation of the complete loss function with indicator functions and summations
Figure: All three loss components shown together with their mathematical expressions and weighting parameters
L = λ_coord × L_box + L_conf + L_class
Where:
- λ_coord = 5.0
- λ_noobj = 0.5

training process with loss
Forward pass:
- Image → Network → Predictions (7×7×30)
- For each grid cell, compare predictions to targets
- Compute three loss components
- Sum weighted losses
Backward pass:
- Compute gradients via backpropagation
- Update weights using SGD with momentum
Key insight: The indicator functions 𝟙ᵢⱼᵒᵇʲ ensure we only compute loss for relevant predictions:
- Box coordinates: only for responsible boxes
- Confidence: for all boxes (but weighted differently)
- Classification: only for cells with objects
evaluation: mAP
Model’s trained. Is it any good? The standard metric for object detection is mAP (mean Average Precision).
First, precision and recall. Say an image has 10 dogs. The model outputs 8 boxes claiming to be dogs.
- Of those 8 boxes, 6 are actually correct (they overlap a real dog with an IOU > 0.5). These are True Positives (TP).
- The other 2 boxes were mistakes (e.g., it drew a box around a bush). These are False Positives (FP)
- The model completely missed 4 of the real dogs. These are False Negatives (FN).
understanding detection metrics
Three types of predictions:
- True Positive (TP): Predicted box matches ground truth (IOU ≥ threshold)
- False Positive (FP): Predicted box doesn’t match any ground truth
- False Negative (FN): Ground truth object that wasn’t detected
precision and recall
With these numbers, we can ask two questions:
- Precision: “Of the answers you gave, how many were right?”
  Precision = TP / (TP + FP) = 6 / (6 + 2) = 75%
  This measures how trustworthy the model’s predictions are. A high precision model doesn’t make many silly mistakes.
- Recall: “Of all the things you should have found, how many did you find?”
  Recall = TP / (TP + FN) = 6 / (6 + 4) = 60%
  This measures how comprehensive the model is. A high recall model doesn’t miss much.
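In code, it's just counting (numbers from the 10-dog example above):
tp, fp, fn = 6, 2, 4
precision = tp / (tp + fp)   # 0.75 -> 75% of the boxes it drew were right
recall    = tp / (tp + fn)   # 0.60 -> it found 60% of the real dogs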
The trade-off
Precision and recall fight each other.
- Only predict when you’re 100% sure? High precision, low recall. You miss stuff.
- Predict aggressively on everything? High recall, low precision. Lots of garbage boxes.
so, what is mAP?
We want both. Average Precision (AP) captures this for a single class - it’s the area under the precision-recall curve. Higher AP means the model stays precise even as it tries to find more objects.
Precision-Recall Curve: Plot precision vs recall at different confidence thresholds.
Steps to calculate AP:
- Sort all predictions by confidence
- For each confidence threshold, compute precision and recall
- Plot the curve
- Compute area under curve
mAP = average AP across all classes. 20 classes? Compute AP for each, average them.
mAP = (AP_class1 + AP_class2 + ... + AP_class20) / 20
One number to rule them all.
mAP@0.5: Use IOU threshold of 0.5 to determine TP
mAP@0.75: Use IOU threshold of 0.75 (stricter)
mAP@[0.5:0.95]: Average mAP across IOU thresholds from 0.5 to 0.95 in steps of 0.05
mAP Implementation
def compute_map(pred_boxes, gt_boxes, iou_threshold=0.5):
"""
Calculate mean Average Precision for object detection.
Args:
pred_boxes: List of predictions per image
[{class: [[x1,y1,x2,y2,score], ...], ...}, ...]
gt_boxes: List of ground truths per image
[{class: [[x1,y1,x2,y2], ...], ...}, ...]
iou_threshold: IOU threshold to consider a detection as TP
Returns:
mean_ap: Mean average precision across all classes
all_aps: Dictionary of AP for each class
"""
# Get all class labels from ground truth
gt_labels = set()
for im_gt in gt_boxes:
for cls_key in im_gt.keys():
gt_labels.add(cls_key)
gt_labels = sorted(list(gt_labels))
all_aps = {}
aps = []
# Compute AP for each class
for label in gt_labels:
# Collect all predictions for this class across all images
cls_preds = []
for im_idx, im_pred in enumerate(pred_boxes):
if label in im_pred:
for box in im_pred[label]:
cls_preds.append((im_idx, box))
# Sort predictions by confidence (descending)
cls_preds = sorted(cls_preds, key=lambda k: -k[1][-1])
# Track which GT boxes have been matched
gt_matched = [[False for _ in im_gts.get(label, [])]
for im_gts in gt_boxes]
# Count total GT boxes for this class (for recall)
num_gts = sum([len(im_gts.get(label, [])) for im_gts in gt_boxes])
# Track TP and FP for each prediction
tp = [0] * len(cls_preds)
fp = [0] * len(cls_preds)
# For each prediction, determine if TP or FP
for pred_idx, (im_idx, pred_box) in enumerate(cls_preds):
# Get GT boxes for this image and class
im_gts = gt_boxes[im_idx].get(label, [])
# Find best matching GT box
max_iou_found = -1
max_iou_gt_idx = -1
for gt_box_idx, gt_box in enumerate(im_gts):
gt_box_iou = iou(pred_box[:-1], gt_box)
if gt_box_iou > max_iou_found:
max_iou_found = gt_box_iou
max_iou_gt_idx = gt_box_idx
# TP only if IOU >= threshold AND GT box hasn't been matched yet
if max_iou_found < iou_threshold or \
(max_iou_gt_idx >= 0 and gt_matched[im_idx][max_iou_gt_idx]):
fp[pred_idx] = 1
else:
tp[pred_idx] = 1
gt_matched[im_idx][max_iou_gt_idx] = True
# Compute cumulative TP and FP
tp = np.cumsum(tp)
fp = np.cumsum(fp)
# Compute precision and recall at each threshold
recalls = tp / num_gts
precisions = tp / (tp + fp)
# Smooth precision curve (ensures precision is monotonic)
# Add boundary values
recalls = np.concatenate(([0.0], recalls, [1.0]))
precisions = np.concatenate(([0.0], precisions, [0.0]))
# Make precision monotonically decreasing
for i in range(precisions.size - 1, 0, -1):
precisions[i - 1] = np.maximum(precisions[i - 1], precisions[i])
# Compute AP as area under curve
# Get points where recall changes
i = np.where(recalls[1:] != recalls[:-1])[0]
ap = np.sum((recalls[i + 1] - recalls[i]) * precisions[i + 1])
if num_gts > 0:
aps.append(ap)
all_aps[label] = ap
else:
all_aps[label] = np.nan
mean_ap = sum(aps) / len(aps) if len(aps) > 0 else 0.0
return mean_ap, all_aps
performance and results
speed: real-time detection
YOLOv1 Performance:
- 45 fps on Nvidia Titan X GPU
- 155 fps for Fast-YOLO (9 conv layers instead of 24)
- This is 9× faster than previous state-of-the-art (Faster R-CNN at ~5 fps)
Why so fast?
- Single forward pass: No region proposals, no batch processing
- Unified architecture: One network handles everything
- Efficient design: Optimized from the ground up for speed
accuracy comparison
YOLOv1 on Pascal VOC 2007:
- mAP: 63.4%
- Fast-YOLO: 52.7% mAP
Comparison with other methods:
- Faster R-CNN: ~70% mAP, but only 7 fps
- DPM (Deformable Part Model): ~30% mAP, slower than YOLO
- R-CNN: Higher accuracy, but 47 seconds per image
The trade-off:
- YOLO sacrifices ~7% mAP for 9× speed improvement
- Enables entirely new applications (real-time video, embedded systems)
key advantages over two-stage detectors
- Global reasoning: YOLO sees the entire image during prediction
- Understands context (less likely to classify background as object)
- Fewer background false positives than Fast R-CNN
- Generalization: Better performance on new domains
- Learns generalizable features
- Better transfer to artwork, sketches, etc.
- Simplicity: Single network, end-to-end training
- No separate proposal generation
- Unified loss function
- Easier to optimize
- Real-time performance: 45+ fps enables:
- Live video analysis
- Robotics and autonomous vehicles
- Interactive applications
Figure: Performance comparison of YOLOv1 with other detection methods on Pascal VOC 2007 dataset
Figure: Detailed accuracy and speed metrics showing YOLO’s superiority in real-time performance while maintaining competitive accuracy
limitations of YOLOv1
1. limited objects per grid cell
Constraint: Each grid cell can only predict one object (despite predicting 2 boxes).
- Maximum detections: 7×7 = 49 objects per image
Problem: Fails in crowded scenarios
- Flocks of birds
- Dense crowds
- Multiple small objects in one cell
Why? Each cell predicts only one set of class probabilities, shared by both boxes.
Example failure case:
- Grid cell contains centers of both a dog and bicycle
- Cell can only classify as one class
- One object will be missed
2. struggles with small objects
Problem: Small objects (especially in groups) are hard to detect.
Why?
- 7×7 grid is coarse (each cell = 64×64 pixels)
- Small objects may not strongly activate any single cell
- Multiple small objects may share a grid cell
Example: A flock of small birds in the sky.
3. unusual aspect ratios
Problem: Objects with uncommon shapes or aspect ratios are hard to detect.
Why?
- Network learns aspect ratio priors from training data
- If training data contains mostly cars that are 2:1 (width:height)
- Long, thin cars or unusually shaped objects will be missed
Example: Very long limousine, tall narrow doorway
4. localization errors
Problem: Box coordinates are less accurate than two-stage detectors.
Why?
- Coarse feature map (7×7) for final prediction
- No iterative refinement (unlike Faster R-CNN)
- Especially affects small objects
Impact: Lower IOU values, affects mAP at higher IOU thresholds
5. box size sensitivity
Problem: Despite using √w and √h, small vs large box errors still not perfectly balanced.
Why? MSE treats all errors equally in loss calculation.
6. class imbalance despite weighting
Problem: Even with λ_noobj = 0.5, “no object” cells still dominate.
Why?
- Typically 47 cells without objects vs 2-3 with objects
- Still creates ~20× more “no object” loss terms
Impact:
- Confidence scores may be systematically lower
- May miss objects in difficult scenarios
failure cases: when YOLO struggles
failure case #1: small grouped objects
┌─────────────────────────────────────┐
│ Image: Flock of Birds in Sky │
│ │
│ • • • • • • • • • • │ ← 15 birds
│ • • • • • • • • │
│ • • • • • • • • • │
│ │
│ YOLOv1 Detection: │
│ ✓ 3 birds detected │
│ ✗ 12 birds missed! │
└─────────────────────────────────────┘
Why it fails:
- 7×7 grid = only 49 possible detections maximum
- Small birds (5×5 pixels each) → multiple birds share grid cells
- Each cell can only detect ONE object (one set of class probs)
- Small objects have weak features at 7×7 resolution
YOLOv3 Fix: Multi-scale detection (13×13, 26×26, 52×52 grids) → 8,000+ possible detections
YOLOv10 Fix: NMS-free detection + multi-scale → perfect for dense small objects
failure case #2: unusual aspect ratios
┌────────────────────────────────────────┐
│ Image: Stretch Limousine │
│ ▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ │ ← Very long car (10:1 ratio)
│ ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀ │
│ │
│ YOLOv1 Prediction: │
│ ┌─────┐ ┌─────┐ │ ← Two separate "car" boxes
│ │ Car │ ...gap...│ Car │ │ (misses it's one object)
│ └─────┘ └─────┘ │
│ ✗ Poor fit, low confidence │
└────────────────────────────────────────┘
Why it fails:
- Training data: mostly ~2:1 aspect ratio cars
- Limousine: 10:1 aspect ratio → out of distribution
- Network hasn’t learned to predict such extreme shapes
- Two grid cells detect parts separately
YOLOv2 Fix: Anchor boxes with multiple aspect ratios (1:1, 1:2, 2:1, 1:3, 3:1)
YOLOv10 Fix: Anchor-free design adapts to any shape automatically
failure case #3: crowded scenes
┌─────────────────────────────────────────┐
│ Image: Dense Crowd (Concert) │
│ P P P P P P P <- 100+ people │
│ P P P P P P P │
│ P P P P P P P │
│ │
│ YOLOv1 Detection: │
│ ✓ 35 people detected │
│ ✗ 65 people missed! │
│ (7×7 = 49 max, but overlap reduces) │
└─────────────────────────────────────────┘
Why it fails:
- Hard limit: 7×7 = 49 grid cells
- Each cell: only 1 object
- Crowded scene: many people share cells
- Result: Systematic undercounting
YOLOv3 Fix: Finer grids (52×52 = 2,704 cells) + 3 predictions per cell
YOLOv10 Fix: No cell limit + NMS-free → handles 100s of objects
failure case #4: heavy occlusion
┌──────────────────────────────────┐
│ Image: Overlapping Objects │
│ │
│ ┌─────┐ │
│ │ A │ ← Person A │
│ ┌─┼─────┤ │
│ │B│ A │ ← Person B (behind) │
│ └─┴─────┘ │
│ Center of B is inside A │
│ │
│ YOLOv1 Detection: │
│ ✓ Person A detected │
│ ✗ Person B missed! │
│ (Both centers in same cell) │
└──────────────────────────────────┘
Why it fails:
- Both object centers fall in same grid cell
- YOLO chooses higher confidence box
- Occluded object has lower features → lower confidence → ignored
YOLOv2 Fix: Multiple anchors per cell → can detect both
YOLOv10 Fix: Dual heads (o2m + o2o) → better occlusion handling
failure rate summary
The following table shows illustrative estimates to convey qualitative trends - these are not actual benchmark results.
| Scenario | YOLOv1 Performance (illustrative) | YOLOv10 Performance (illustrative) |
|---|---|---|
| Small grouped objects (birds) | ~20% recall | ~90% recall |
| Unusual aspect ratios (limousine) | ~40% IOU | ~85% IOU |
| Crowded scenes (>50 objects) | ~35% recall | ~95% recall |
| Heavy occlusion (overlapping) | ~50% recall | ~85% recall |
YOLOv1 works great on sparse, medium-sized objects with normal aspect ratios. Anything else? Use v3+ or v10.
evolution: from v1 to v10
the YOLO family tree
Each version after v1 tried to fix its problems without killing the speed.
key improvements across versions
YOLOv2 (YOLO9000) - 2017:
- Anchor boxes: Instead of predicting boxes directly, predict offsets from predefined anchors
- Batch normalization: Added to all conv layers, improved convergence
- High-resolution classifier: Pretrain at 448×448 (not 224×224)
- Multi-scale training: Train on different input sizes
- Better backbone: Darknet-19 (19 conv layers + 5 maxpool)
- Performance: 76.8% mAP at 67 fps, 78.6% mAP at 40 fps (Pascal VOC 2007)
YOLOv3 - 2018:
- Multi-scale predictions: Detect at 3 different scales (better for small objects)
- Better backbone: Darknet-53 (53 conv layers, residual connections)
- Binary classification: Use logistic regression instead of softmax (allows multi-label)
- Performance: 57.9% mAP@0.5, comparable to RetinaNet but 3-4× faster
YOLOv4 - 2020:
- Bag of freebies: Techniques that improve accuracy without increasing inference cost
- Data augmentation (Mosaic, MixUp)
- Label smoothing
- DropBlock regularization
- Bag of specials: Techniques with small inference cost increase
- Mish activation
- CSPNet backbone
- SPP (Spatial Pyramid Pooling)
- PANet neck
- Performance: 43.5% mAP@0.5:0.95, state-of-the-art at the time
YOLOv5 - 2020 (Ultralytics):
- PyTorch implementation: Easier to use and customize
- Auto-learning anchors: Automatically cluster anchors from training data
- Model family: Nano, Small, Medium, Large, XLarge variants
- Better augmentations: Albumentations integration
- Export options: ONNX, TensorRT, CoreML, etc.
YOLOv6, v7, v8 (2022-2023): Mostly incremental improvements. v6 (Meituan) focused on industrial deployment with a hardware-aware design. v7 introduced re-parameterized convolutions and a more efficient training pipeline. v8 (Ultralytics) went anchor-free and decoupled the classification and regression heads. All three pushed accuracy up a few points on COCO, but the next real architectural jump is v10.
YOLOv10 - 2024:
- NMS-free detection: Eliminates need for NMS post-processing
- Consistent matching strategy during training
- Dual label assignment
- Significantly faster inference
- Efficiency optimizations:
- Compact inverted block design
- Partial self-attention
- Spatial-channel decoupled downsampling
- Model variants: N/S/M/B/L/X for different speed/accuracy trade-offs
- Performance:
- YOLOv10-X: ~54% mAP@0.5:0.95 on COCO
- YOLOv10-N: Real-time on edge devices
key shifts across versions
- v1→v2: Direct prediction → Anchor-based
- v2→v3: Single-scale → Multi-scale detection
- v3→v4: Plain training recipe → systematic bag-of-tricks (bag of freebies & specials)
- v4→v5: Darknet (C) → PyTorch (Python), better tooling
- v8→v10: Anchor-based → Anchor-free, NMS → NMS-free
what remained constant
Ten versions in, the core hasn’t changed:
- Single-stage detection: One network, one forward pass
- Real-time performance: Speed is a primary goal
- End-to-end training: Unified loss function
- Practical focus: Easy to deploy and use
YOLOv10: what’s actually new
Two things that actually matter in v10: NMS-free detection and a rethought architecture.
1. NMS-free training with dual heads
Traditional YOLO models generate multiple overlapping boxes per object, then rely on Non-Maximum Suppression (NMS) as post-processing. NMS adds latency and isn’t differentiable, so you can’t optimize it end-to-end. YOLOv10 gets rid of it entirely.
The trick is two detection heads during training:
- One-to-Many (o2m) Head: Matches each ground truth object with multiple predictions. Rich supervision signal - explores diverse locations for the same object.
- One-to-One (o2o) Head: Matches each ground truth object with exactly one prediction. Learns to output one clean box per object.
Both heads share the same matching metric so their supervision stays consistent. During training, the o2o head benefits from the rich signal of the o2m head (which explores multiple locations) while learning to produce clean, single predictions. At inference, the o2m head is discarded entirely - zero extra cost. You just use the o2o head, pick top-K class scores, filter by confidence, and you’re done. No NMS needed.
The result: about 40% faster inference than YOLOv8 since there’s no post-processing overhead, and cleaner predictions with one strong box per object.
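As a rough sketch (not YOLOv10's actual code), NMS-free post-processing reduces to a confidence cut plus top-K selection, assuming the o2o head outputs boxes of shape (N, 4) and class scores of shape (N, num_classes):
import torch

def nms_free_postprocess(boxes, scores, conf_threshold=0.25, top_k=300):
    """With one strong box per object, selection needs no overlap suppression."""
    class_scores, class_ids = scores.max(dim=1)              # best class per box
    keep = class_scores > conf_threshold                     # confidence filter
    boxes, class_scores, class_ids = boxes[keep], class_scores[keep], class_ids[keep]
    order = class_scores.argsort(descending=True)[:top_k]    # top-K by score
    return boxes[order], class_scores[order], class_ids[order]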
2. architectural efficiency
YOLOv10 also rethinks the model architecture for better speed-accuracy tradeoffs:
- Lightweight classification head: The classification head in YOLOv8 was 2.5x heavier than the regression head despite regression being more important for accuracy. YOLOv10 replaces it with depthwise separable convolutions - much cheaper.
- Decoupled downsampling: Instead of one expensive 3x3 stride-2 conv that handles both spatial reduction and channel expansion, YOLOv10 splits it into a pointwise conv (channels) and a depthwise conv (spatial). Significantly cheaper.
- Rank-guided block design: Not all network stages are equally important. YOLOv10 computes the intrinsic rank of each stage’s convolutions (via SVD), then replaces redundant stages with lightweight Compact Inverted Blocks while keeping high-rank stages strong with large kernel convolutions and Partial Self-Attention.
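To make the decoupled-downsampling idea from the list above concrete, here's a minimal sketch (not YOLOv10's actual module): one pointwise conv handles channel expansion, one depthwise stride-2 conv handles spatial reduction:
import torch
import torch.nn as nn

class DecoupledDownsample(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)              # channels only
        self.depthwise = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=2,
                                   padding=1, groups=out_ch)                   # spatial only
    def forward(self, x):
        return self.depthwise(self.pointwise(x))

DecoupledDownsample(128, 256)(torch.randn(1, 128, 56, 56)).shape  # (1, 256, 28, 28)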
comparison: YOLOv1 vs YOLOv10
| Metric | YOLOv1 (2016) | YOLOv10 (2024) |
|---|---|---|
| mAP50 | 63.4% (VOC) | 90.6% (ODN dataset) |
| mAP50-95 | Not reported | 76% |
| Speed (fps) | 45 fps | ~63 fps (40% faster than v8) |
| Grid Size | 7×7 (fixed) | Multi-scale, adaptive |
| Boxes per Cell | 2 | Dynamic (o2o/o2m) |
| Max Objects | 49 (7×7 grid limit) | Unlimited |
| Post-processing | NMS required | NMS-free |
| Small Objects | Struggles | Excellent (multi-scale detection) |
| Architecture | 24 conv + 2 FC | Efficient CSPNet + lightweight heads |
| Training | Single-head | Dual-head (o2m + o2o) |
v1 proved single-stage detection works. v10 makes it truly end-to-end - no post-processing, even faster. Same core idea, just way more refined.
implementation: PyTorch code examples
Full PyTorch implementation of YOLOv1 below.
1. model architecture
The YOLO network uses a ResNet34 backbone (pretrained on ImageNet) followed by detection layers:
import torch
import torch.nn as nn
import torchvision
class YOLOV1(nn.Module):
"""
YOLOv1 Implementation using ResNet34 backbone
Args:
img_size: Input image size (448x448)
num_classes: Number of classes (20 for Pascal VOC)
model_config: Configuration dict with S, B, and architectural params
Output:
Tensor of shape (batch_size, S, S, 5*B + C)
"""
def __init__(self, img_size, num_classes, model_config):
super(YOLOV1, self).__init__()
self.img_size = img_size
self.S = model_config['S'] # Grid size (7x7)
self.B = model_config['B'] # Boxes per cell (2)
self.C = num_classes # Number of classes (20)
# Load pretrained ResNet34 backbone (trained on ImageNet 224x224)
backbone = torchvision.models.resnet34(
weights=torchvision.models.ResNet34_Weights.IMAGENET1K_V1
)
# Feature extraction layers (before FC layers)
self.features = nn.Sequential(
backbone.conv1, # 7x7 conv, stride 2
backbone.bn1,
backbone.relu,
backbone.maxpool,
backbone.layer1, # ResNet blocks
backbone.layer2,
backbone.layer3,
backbone.layer4, # Output: 512 channels
)
# Detection head: 3 conv layers for feature refinement
yolo_conv_channels = model_config['yolo_conv_channels'] # 1024
leaky_relu_slope = model_config['leaky_relu_slope'] # 0.1
self.conv_layers = nn.Sequential(
nn.Conv2d(512, yolo_conv_channels, 3, padding=1, bias=False),
nn.BatchNorm2d(yolo_conv_channels),
nn.LeakyReLU(leaky_relu_slope),
nn.Conv2d(yolo_conv_channels, yolo_conv_channels, 3,
stride=2, padding=1, bias=False),
nn.BatchNorm2d(yolo_conv_channels),
nn.LeakyReLU(leaky_relu_slope),
nn.Conv2d(yolo_conv_channels, yolo_conv_channels, 3,
padding=1, bias=False),
nn.BatchNorm2d(yolo_conv_channels),
nn.LeakyReLU(leaky_relu_slope)
)
# Final 1x1 conv to get S*S*(5B+C) output
self.final_conv = nn.Conv2d(yolo_conv_channels, 5 * self.B + self.C, 1)
def forward(self, x):
# x: (batch, 3, 448, 448)
out = self.features(x) # (batch, 512, 14, 14)
out = self.conv_layers(out) # (batch, 1024, 7, 7)
out = self.final_conv(out) # (batch, 30, 7, 7)
# Permute to (batch, S, S, 5B+C)
out = out.permute(0, 2, 3, 1) # (batch, 7, 7, 30)
return out
2. loss function
The complete YOLOv1 loss with all three components:
import torch
import torch.nn as nn
def iou(box1, box2):
"""
Calculate Intersection over Union between two sets of boxes.
Args:
box1, box2: Tensors of shape (..., 4) in format (x1, y1, x2, y2)
Returns:
iou: Tensor of shape (...) with IOU values
"""
# Calculate areas
area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])
# Find intersection rectangle
x_topleft = torch.max(box1[..., 0], box2[..., 0])
y_topleft = torch.max(box1[..., 1], box2[..., 1])
x_bottomright = torch.min(box1[..., 2], box2[..., 2])
y_bottomright = torch.min(box1[..., 3], box2[..., 3])
# Calculate intersection area (clamp to handle non-overlapping boxes)
intersection = (x_bottomright - x_topleft).clamp(min=0) * \
(y_bottomright - y_topleft).clamp(min=0)
# Calculate union and IOU
union = area1.clamp(min=0) + area2.clamp(min=0) - intersection
iou = intersection / (union + 1e-6) # Add epsilon to avoid division by zero
return iou
class YOLOLoss(nn.Module):
"""
YOLOv1 Loss Function: Localization + Confidence + Classification
Loss = λ_coord × L_box + L_conf + L_class
"""
def __init__(self, S=7, B=2, C=20):
super(YOLOLoss, self).__init__()
self.S = S
self.B = B
self.C = C
self.lambda_coord = 5.0 # Increase weight for box coordinates
self.lambda_noobj = 0.5 # Decrease weight for no-object cells
def forward(self, preds, targets):
"""
Args:
preds: (batch, S, S, 5*B + C) - model predictions
targets: (batch, S, S, 5*B + C) - ground truth targets
Returns:
loss: Scalar tensor
"""
batch_size = preds.size(0)
# Create coordinate shift grids for converting relative → absolute coords
xshift = torch.arange(0, self.S, device=preds.device) / float(self.S)
yshift = torch.arange(0, self.S, device=preds.device) / float(self.S)
yshift, xshift = torch.meshgrid(yshift, xshift, indexing='ij')
xshift = xshift.reshape((1, self.S, self.S, 1)).repeat(1, 1, 1, self.B)
yshift = yshift.reshape((1, self.S, self.S, 1)).repeat(1, 1, 1, self.B)
# Reshape predictions and targets: (batch, S, S, B, 5)
pred_boxes = preds[..., :5*self.B].reshape(batch_size, self.S, self.S, self.B, 5)
target_boxes = targets[..., :5*self.B].reshape(batch_size, self.S, self.S, self.B, 5)
# Convert from (x_offset, y_offset, √w, √h) to (x1, y1, x2, y2) format
def boxes_to_x1y1x2y2(boxes, xshift, yshift):
x_center = boxes[..., 0] / self.S + xshift
y_center = boxes[..., 1] / self.S + yshift
width = torch.square(boxes[..., 2]) # w = (√w)²
height = torch.square(boxes[..., 3]) # h = (√h)²
x1 = (x_center - 0.5 * width).unsqueeze(-1)
y1 = (y_center - 0.5 * height).unsqueeze(-1)
x2 = (x_center + 0.5 * width).unsqueeze(-1)
y2 = (y_center + 0.5 * height).unsqueeze(-1)
return torch.cat([x1, y1, x2, y2], dim=-1)
pred_boxes_xyxy = boxes_to_x1y1x2y2(pred_boxes, xshift, yshift)
target_boxes_xyxy = boxes_to_x1y1x2y2(target_boxes, xshift, yshift)
# Calculate IOU between predicted and target boxes
iou_pred_target = iou(pred_boxes_xyxy, target_boxes_xyxy)
# Find responsible box (highest IOU with ground truth)
max_iou, max_iou_idx = iou_pred_target.max(dim=-1, keepdim=True)
max_iou_idx = max_iou_idx.repeat(1, 1, 1, self.B)
# Create mask for responsible boxes
box_indices = torch.arange(self.B, device=preds.device).reshape(1, 1, 1, self.B)
box_indices = box_indices.expand_as(max_iou_idx)
is_responsible_box = (max_iou_idx == box_indices).long()
# Object indicator: 1 if cell contains object, 0 otherwise
obj_indicator = targets[..., 4:5] # Shape: (batch, S, S, 1)
# Indicator for responsible boxes in cells with objects
responsible_obj_indicator = is_responsible_box * obj_indicator
# --- 1. LOCALIZATION LOSS (only for responsible boxes) ---
x_loss = (pred_boxes[..., 0] - target_boxes[..., 0]) ** 2
y_loss = (pred_boxes[..., 1] - target_boxes[..., 1]) ** 2
w_loss = (pred_boxes[..., 2] - target_boxes[..., 2]) ** 2 # √w loss
h_loss = (pred_boxes[..., 3] - target_boxes[..., 3]) ** 2 # √h loss
localization_loss = self.lambda_coord * (
(responsible_obj_indicator * x_loss).sum() +
(responsible_obj_indicator * y_loss).sum() +
(responsible_obj_indicator * w_loss).sum() +
(responsible_obj_indicator * h_loss).sum()
)
# --- 2. CONFIDENCE LOSS (for object cells) ---
# Target confidence = IOU for responsible boxes
obj_conf_loss = ((pred_boxes[..., 4] - max_iou) ** 2 *
responsible_obj_indicator).sum()
# --- 3. CONFIDENCE LOSS (for no-object cells) ---
no_obj_indicator = 1 - responsible_obj_indicator
noobj_conf_loss = self.lambda_noobj * (
(pred_boxes[..., 4] ** 2 * no_obj_indicator).sum()
)
# --- 4. CLASSIFICATION LOSS (only for cells with objects) ---
class_preds = preds[..., 5*self.B:]
class_targets = targets[..., 5*self.B:]
class_loss = ((class_preds - class_targets) ** 2 * obj_indicator).sum()
# Total loss
total_loss = (localization_loss + obj_conf_loss +
noobj_conf_loss + class_loss) / batch_size
return total_loss
3. dataset & target encoding
Pascal VOC annotations to YOLO format:
import torch
import albumentations as alb
import cv2
from torch.utils.data import Dataset
class VOCDataset(Dataset):
"""Pascal VOC Dataset with YOLO target encoding"""
def __init__(self, split='train', img_size=448, S=7, B=2, C=20):
self.split = split
self.img_size = img_size
self.S = S # Grid size
self.B = B # Boxes per cell
self.C = C # Number of classes
# Data augmentation for training
self.transforms = {
'train': alb.Compose([
alb.HorizontalFlip(p=0.5),
alb.Affine(scale=(0.8, 1.2),
translate_percent=(-0.2, 0.2)),
alb.ColorJitter(brightness=(0.8, 1.2),
saturation=(0.8, 1.2)),
alb.Resize(self.img_size, self.img_size)
], bbox_params=alb.BboxParams(format='pascal_voc',
label_fields=['labels'])),
'test': alb.Compose([
alb.Resize(self.img_size, self.img_size)
], bbox_params=alb.BboxParams(format='pascal_voc',
label_fields=['labels']))
}
# Load Pascal VOC annotations...
# (XML parsing code omitted for brevity)
def __getitem__(self, index):
# Load image and annotations
img_info = self.images_info[index]
img = cv2.imread(img_info['filename'])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
bboxes = [det['bbox'] for det in img_info['detections']] # (x1,y1,x2,y2)
labels = [det['label'] for det in img_info['detections']]
# Apply augmentations
transformed = self.transforms[self.split](
image=img, bboxes=bboxes, labels=labels
)
img = transformed['image']
bboxes = torch.tensor(transformed['bboxes'])
labels = torch.tensor(transformed['labels'])
        # Scale pixel values to [0, 1] (ImageNet mean/std normalization would go here if the backbone expects it)
img_tensor = torch.from_numpy(img / 255.0).permute(2, 0, 1).float()
# --- Create YOLO target tensor ---
target_dim = 5 * self.B + self.C
yolo_target = torch.zeros(self.S, self.S, target_dim)
h, w = img.shape[:2]
cell_size = h // self.S # Pixels per grid cell
if len(bboxes) > 0:
# Convert (x1, y1, x2, y2) → (x_center, y_center, width, height)
box_width = bboxes[:, 2] - bboxes[:, 0]
box_height = bboxes[:, 3] - bboxes[:, 1]
box_center_x = bboxes[:, 0] + 0.5 * box_width
box_center_y = bboxes[:, 1] + 0.5 * box_height
# Determine which grid cell each object belongs to
grid_i = torch.floor(box_center_x / cell_size).long()
grid_j = torch.floor(box_center_y / cell_size).long()
# Compute relative coordinates within grid cell (0 to 1)
box_x_offset = (box_center_x - grid_i * cell_size) / cell_size
box_y_offset = (box_center_y - grid_j * cell_size) / cell_size
# Normalize width and height to image size
box_w_norm = box_width / w
box_h_norm = box_height / h
# Fill YOLO target tensor
for idx in range(len(bboxes)):
# Assign same target to all B boxes (model picks responsible one)
for b in range(self.B):
s = 5 * b
yolo_target[grid_j[idx], grid_i[idx], s] = box_x_offset[idx]
yolo_target[grid_j[idx], grid_i[idx], s+1] = box_y_offset[idx]
yolo_target[grid_j[idx], grid_i[idx], s+2] = box_w_norm[idx].sqrt()
yolo_target[grid_j[idx], grid_i[idx], s+3] = box_h_norm[idx].sqrt()
yolo_target[grid_j[idx], grid_i[idx], s+4] = 1.0 # Confidence
# One-hot encode class
label = int(labels[idx])
yolo_target[grid_j[idx], grid_i[idx], 5*self.B + label] = 1.0
return img_tensor, yolo_target
4. training loop
A standard supervised loop: forward pass, YOLO loss, SGD step, and the paper's stepped learning-rate schedule:
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader
# Initialize model, loss, and dataset
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = YOLOV1(img_size=448, num_classes=20, model_config={
'S': 7, 'B': 2, 'yolo_conv_channels': 1024, 'leaky_relu_slope': 0.1
}).to(device)
criterion = YOLOLoss(S=7, B=2, C=20)
train_dataset = VOCDataset(split='train', img_size=448)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
# Optimizer: SGD with momentum (as per paper)
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
# Learning rate schedule: reduce at epochs [75, 105]
# Paper uses warm-up from 1e-3 to 1e-2 for first epochs, then steps down
scheduler = MultiStepLR(optimizer, milestones=[75, 105], gamma=0.1)
# Training loop
num_epochs = 135 # As per paper
model.train()
for epoch in range(num_epochs):
epoch_loss = 0.0
for images, targets in train_loader:
images = images.to(device)
targets = targets.to(device)
# Forward pass
predictions = model(images)
loss = criterion(predictions, targets)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
epoch_loss += loss.item()
scheduler.step()
print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}')
5. inference with NMS
Converting raw predictions to actual boxes:
import torch
import torchvision  # importing torchvision registers the NMS op used below

def convert_predictions_to_boxes(predictions, S=7, B=2, C=20,
conf_threshold=0.2, nms_threshold=0.5):
"""
Convert YOLO predictions to bounding boxes with NMS.
Args:
predictions: (S, S, 5*B + C) tensor
conf_threshold: Minimum confidence to keep box
nms_threshold: IOU threshold for NMS
Returns:
boxes: (N, 4) tensor in (x1, y1, x2, y2) format
scores: (N,) confidence scores
labels: (N,) class labels
"""
predictions = predictions.reshape(S, S, 5*B + C)
# Get class predictions (same for all boxes in a cell)
class_probs, class_labels = predictions[..., 5*B:].max(dim=-1)
# Create coordinate shift grid
shifts_x = torch.arange(S, device=predictions.device) / float(S)
shifts_y = torch.arange(S, device=predictions.device) / float(S)
shifts_y, shifts_x = torch.meshgrid(shifts_y, shifts_x, indexing='ij')
all_boxes = []
all_scores = []
all_labels = []
# Process each of B boxes per cell
for b in range(B):
# Extract box parameters
x_offset = predictions[..., b*5 + 0]
y_offset = predictions[..., b*5 + 1]
w = predictions[..., b*5 + 2]
h = predictions[..., b*5 + 3]
conf = predictions[..., b*5 + 4]
# Convert to absolute coordinates
x_center = (x_offset / S + shifts_x)
y_center = (y_offset / S + shifts_y)
width = torch.square(w) # w = √w_pred²
height = torch.square(h)
# Convert to (x1, y1, x2, y2) format
x1 = (x_center - 0.5 * width).reshape(-1, 1)
y1 = (y_center - 0.5 * height).reshape(-1, 1)
x2 = (x_center + 0.5 * width).reshape(-1, 1)
y2 = (y_center + 0.5 * height).reshape(-1, 1)
boxes = torch.cat([x1, y1, x2, y2], dim=-1)
# Compute class-specific confidence scores
scores = conf.reshape(-1) * class_probs.reshape(-1)
labels = class_labels.reshape(-1)
all_boxes.append(boxes)
all_scores.append(scores)
all_labels.append(labels)
# Concatenate all boxes
boxes = torch.cat(all_boxes, dim=0)
scores = torch.cat(all_scores, dim=0)
labels = torch.cat(all_labels, dim=0)
# Confidence thresholding
keep = scores > conf_threshold
boxes = boxes[keep]
scores = scores[keep]
labels = labels[keep]
# Apply NMS per class
keep_mask = torch.zeros_like(scores, dtype=torch.bool)
for class_id in torch.unique(labels):
class_indices = labels == class_id
class_boxes = boxes[class_indices]
class_scores = scores[class_indices]
# NMS (using torchvision)
keep_indices = torch.ops.torchvision.nms(
class_boxes, class_scores, nms_threshold
)
# Mark these boxes as kept
class_keep_indices = torch.where(class_indices)[0][keep_indices]
keep_mask[class_keep_indices] = True
final_boxes = boxes[keep_mask]
final_scores = scores[keep_mask]
final_labels = labels[keep_mask]
return final_boxes, final_scores, final_labels
# Example usage
model.eval()
with torch.no_grad():
img_tensor = ... # Load and preprocess image
predictions = model(img_tensor.unsqueeze(0))[0] # Remove batch dim
boxes, scores, labels = convert_predictions_to_boxes(
predictions, conf_threshold=0.2, nms_threshold=0.5
)
# boxes: (N, 4) in normalized 0-1 coordinates
# Multiply by image dimensions to get pixel coordinates
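To draw the final detections, the normalized boxes only need to be scaled back to pixel coordinates. A minimal visualization sketch with OpenCV; it assumes `img` is the original RGB image array the tensor was built from, and that `boxes`, `scores`, `labels` come from the call above:
import cv2
import torch

h, w = img.shape[:2]
scale = torch.tensor([w, h, w, h], dtype=torch.float32)
for box, score, label in zip(boxes, scores, labels):
    x1, y1, x2, y2 = (box * scale).int().tolist()
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    # label is the VOC class index; map it to a class-name list if you have one
    cv2.putText(img, f'{label.item()}: {score.item():.2f}', (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.6, (0, 255, 0), 2)
cv2.imwrite('detections.jpg', cv2.cvtColor(img, cv2.COLOR_RGB2BGR))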
key implementation details
- √w and √h: The model predicts the square root of width/height so the loss treats small and large boxes more evenly (see the numeric sketch after this list)
- Relative Coordinates: x,y offsets are relative to grid cell top-left corner
- Responsible Box Selection: During training, only the box with highest IOU with GT is penalized for coordinates
- Class Probabilities: Shared across all B boxes in a cell (limitation of v1)
- Lambda Weighting: λ_coord=5 to emphasize localization, λ_noobj=0.5 to de-emphasize empty cells
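A quick numeric check of the √w point above: a plain squared loss penalizes the same absolute width error identically for small and large boxes, while the square-root encoding penalizes the small box much more (purely illustrative numbers):
# Same ~10 px error (0.022 normalized) on a small box (w=0.05) and a large box (w=0.90)
err = 0.022
for w_gt in (0.05, 0.90):
    w_pred = w_gt + err
    plain_loss = (w_pred - w_gt) ** 2                 # identical for both boxes
    sqrt_loss = (w_pred ** 0.5 - w_gt ** 0.5) ** 2    # much larger for the small box
    print(f'w={w_gt:.2f}  plain={plain_loss:.5f}  sqrt={sqrt_loss:.5f}')
# w=0.05  plain=0.00048  sqrt=0.00200
# w=0.90  plain=0.00048  sqrt=0.00013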
The implementation follows the original paper and reaches roughly 63% mAP on Pascal VOC 2007 after 135 epochs of training.
optimization techniques
precision options
| Precision | Speed | Accuracy | Memory | Best For |
|---|---|---|---|---|
| FP32 (default) | 1× | 100% (baseline) | 4 bytes/param | Training, research |
| FP16 (half precision) | 2-3× | 99.5% | 2 bytes/param | Production (GPU) |
| INT8 (quantization) | 4× | 97-99% | 1 byte/param | Edge devices |
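For the FP16 row, autocast-based mixed precision is usually all that's needed on a GPU; a minimal inference sketch (assumes a CUDA device and the YOLOV1 model from earlier):
import torch

model = model.to('cuda').eval()
dummy = torch.rand(1, 3, 448, 448, device='cuda')

with torch.no_grad(), torch.autocast(device_type='cuda', dtype=torch.float16):
    predictions = model(dummy)
# Weights stay in FP32; convolutions and matmuls run in FP16 where numerically safe.
# model.half() plus a .half() input is the fully-FP16 alternative.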
INT8 Quantization (PyTorch):
import torch.quantization
model = YOLOV1(...)
model.train()  # prepare_qat requires a model in training mode
# Quantization-aware training (best accuracy)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)
# Fine-tune for a few epochs, then switch to eval before converting...
model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)
# Result: 4× smaller, 2-4× faster on CPU
layer fusion
TensorRT automatically fuses operations:
Conv → BatchNorm → ReLU → Single fused kernel
(3× fewer memory transfers)
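The same Conv → BN (→ ReLU) folding can be done ahead of time in PyTorch for eager/CPU inference. A small sketch with torch.ao.quantization.fuse_modules on a stand-in block (illustrative, not an exact YOLOv1 layer):
import torch.nn as nn
from torch.ao.quantization import fuse_modules

block = nn.Sequential(
    nn.Conv2d(512, 1024, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(1024),
    nn.ReLU(),
)
block.eval()  # fusion folds the BN statistics into the conv weights, so eval mode is required

# '0', '1', '2' are the Sequential's child module names
fused = fuse_modules(block, [['0', '1', '2']])
# fused[0] is now a single conv+ReLU module with BN folded in; the other slots become Identity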
dynamic tensor memory
- Reuses memory buffers across layers
- Reduces GPU memory usage by 30-50%
hardware-specific deployment
NVIDIA GPU (RTX 3060/4090) Recommendation: TensorRT with FP16
# Export PyTorch → ONNX → TensorRT
python export.py --weights best.pt --format onnx
trtexec --onnx=yolov1.onnx --saveEngine=yolov1_fp16.trt --fp16
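export.py here is a stand-in script; for the YOLOV1 model defined earlier, the PyTorch → ONNX step would look roughly like this (assumes best.pt stores the model's state_dict):
import torch

model = YOLOV1(img_size=448, num_classes=20, model_config={
    'S': 7, 'B': 2, 'yolo_conv_channels': 1024, 'leaky_relu_slope': 0.1
})
model.load_state_dict(torch.load('best.pt', map_location='cpu'))
model.eval()

dummy_input = torch.rand(1, 3, 448, 448)
torch.onnx.export(
    model, dummy_input, 'yolov1.onnx',
    input_names=['image'], output_names=['predictions'],
    opset_version=13,
)
# yolov1.onnx is then consumed by trtexec as shown above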
Expected Performance:
- RTX 4090: 180 fps (FP16), 90 fps (FP32)
- RTX 3060: 120 fps (FP16), 60 fps (FP32)
Edge Devices (Jetson, Raspberry Pi) Recommendation: TensorRT INT8 or TFLite
NVIDIA Jetson Orin:
# INT8 quantization for Jetson
trtexec --onnx=yolov1.onnx \
--int8 \
--workspace=2048 \
--saveEngine=yolov1_int8.trt
Performance: 120 fps (INT8), 45 fps (FP16)
conclusion
This post covered the full YOLO pipeline: how the grid system works, why the loss function is shaped the way it is, what NMS and mAP actually measure, and how the architecture evolved from v1 through v10. The core insight from the original 2016 paper - treat detection as a single regression problem instead of a multi-stage pipeline - turned out to be the right bet. Ten versions later, the models are faster and more accurate, but that basic idea hasn’t changed.
references
- Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
- Pascal VOC Dataset: http://host.robots.ox.ac.uk/pascal/VOC/
- ImageNet: http://www.image-net.org/
- PyTorch Implementation
extra
Training a phone detection model. What is happening at each epoch?
1. training
For each batch (32 images):
a. Data Loading
- Load 32 images from train set
- Apply augmentations (copy_paste, HSV, rotation, etc.)
- Resize to 640x640
- Normalize pixel values
b. Forward Pass
- Pass batch through YOLOv11m network
- Get predictions: bboxes (x,y,w,h), class probabilities, objectness scores
- Network has ~20M parameters (backbone + neck + head)
c. Loss Calculation
- Box loss (7.5x weight): IoU/GIoU loss for bbox coordinates
- Class loss (1.5x weight): Binary cross-entropy for classification
- DFL loss (1.5x weight): Distribution focal loss for bbox refinement
- Total loss = weighted sum of above
What do these mean?
- Box loss: How well the model draws a box around the phone
- Class loss: How sure the model is that the thing is a phone
- DFL loss: How precisely the model guesses the edges of the box
d. Backward Pass
- Calculate gradients using total loss
- Backpropagate the loss so every parameter receives a gradient (the weight update itself happens in the optimizer step below; I used AdamW)
e. Optimizer Step (AdamW)
- Update weights using gradients
- Apply learning rate (starts at 0.0005, decays to 0.01 × lr0)
- Apply weight decay (0.0005) for regularization
- Apply momentum (0.9)
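The description above follows the Ultralytics trainer; the hyperparameters map onto its train() arguments roughly like this (a sketch; the dataset YAML path and epoch count are assumptions):
from ultralytics import YOLO

model = YOLO('yolo11m.pt')          # YOLOv11m, ~20M parameters
model.train(
    data='phone.yaml',              # dataset config (assumed path)
    epochs=100,                     # assumed; not stated above
    imgsz=640,
    batch=32,
    optimizer='AdamW',
    lr0=0.0005,                     # initial learning rate
    lrf=0.01,                       # final LR = lr0 * lrf
    momentum=0.9,
    weight_decay=0.0005,
    box=7.5, cls=1.5, dfl=1.5,      # loss weights described above
)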
2. validation (every epoch)
After all training batches:
a. Switch to eval mode
- Disable dropout (the full network is used instead of randomly dropping units)
- Use batch normalization in eval mode (running statistics learned during training instead of per-batch statistics)
- No augmentations (no flips)
b. For each validation image
- Forward pass (no gradient calculation)
- Get predictions with conf threshold (default 0.001 for val)
- Apply NMS (non-maximum suppression: if the model draws many boxes on the same phone, keep only the best one and throw away the rest) with IoU threshold 0.45
c. Metric Calculation
- Match predictions to ground truth (IoU >= 0.5)
- Calculate per-class metrics:
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- mAP@50 = mean AP at IoU=0.5
- mAP@50-95 = mean AP averaged over IoU 0.5 to 0.95
- Calculate overall metrics (averaged across all classes)
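To make step (c) concrete, here is a minimal sketch of greedy IoU matching and the resulting precision/recall for a single class on a single image (simplified: no confidence ranking, so no AP curve):
import torch
from torchvision.ops import box_iou

def precision_recall(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Greedy one-to-one matching of predictions to ground truth at a fixed IoU threshold."""
    if len(pred_boxes) == 0:
        return 0.0, 0.0
    ious = box_iou(pred_boxes, gt_boxes)   # (num_preds, num_gt)
    matched_gt = set()
    tp = 0
    for p in range(len(pred_boxes)):
        best_iou, best_gt = ious[p].max(dim=0)
        if best_iou >= iou_thresh and best_gt.item() not in matched_gt:
            tp += 1
            matched_gt.add(best_gt.item())
    fp = len(pred_boxes) - tp
    fn = len(gt_boxes) - tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Two predictions, one real phone: one good box, one false positive elsewhere
preds = torch.tensor([[10., 10., 110., 210.], [300., 50., 400., 150.]])
gts   = torch.tensor([[12., 15., 108., 205.]])
print(precision_recall(preds, gts))   # (0.5, 1.0)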
Related reading: a separate post on how CNNs work, and the full PyTorch implementation of the YOLOv1 paper.