TL;DR: YOLO reframed object detection as a single regression problem: one forward pass through a CNN predicts bounding boxes and class probabilities for the entire image at once. This post covers the grid system, loss function, NMS, mAP, and traces the architecture evolution from YOLOv1 through v10.

You look at a street and instantly spot a car, a person, a traffic light, a dog. You don’t scan pixel by pixel. You just see objects. Getting a computer to do this was slow for a long time - scan a patch, classify it, move to the next patch, repeat thousands of times.

Then in 2015: You Only Look Once (YOLO).

One look. One forward pass. Done. This post covers how YOLO works, why it works, where it falls apart, and how it evolved from v1 to v10.

table of contents

  1. Pipeline
  2. Before YOLO
  3. Why YOLO: The Motivation
  4. YOLO High Level Overview
  5. YOLO Architecture
  6. Grid Cells & Predictions
  7. Bounding Box Format & Encoding
  8. Prediction Vector Breakdown
  9. Training Process
  10. Loss Function Explained
  11. Inference: From Predictions to Bounding Boxes
  12. Post-Processing: IOU & NMS
  13. Evaluation Metrics: mAP
  14. Performance & Results
  15. Limitations of YOLOv1
  16. Evolution: From v1 to v10
  17. Implementation: PyTorch Code
  18. Extra

pipeline

┌─────────────────────────────────────────────────────────────────────────┐
│                         YOLO OBJECT DETECTION                            │
└─────────────────────────────────────────────────────────────────────────┘

Input Image (448×448)
      │
      ├──► Step 1: Grid Division
      │          ┌───┬───┬───┬───┬───┬───┬───┐
      │          │   │   │   │   │   │   │   │     7×7 grid
      │          ├───┼───┼───┼───┼───┼───┼───┤     Each cell = 64×64 pixels
      │          │   │ P │   │   │   │   │   │     P = Person center
      │          ├───┼───┼───┼───┼───┼───┼───┤     D = Dog center
      │          │   │   │   │ D │   │   │   │
      │          └───┴───┴───┴───┴───┴───┴───┘
      │
      ├──► Step 2: CNN Forward Pass (24 conv layers)
      │          Features extracted: 7×7×1024
      │
      ├──► Step 3: Predictions per Cell
      │          Each cell outputs 30 values:
      │          • 2 boxes: (x,y,w,h,conf) × 2 = 10 values
      │          • 20 class probabilities = 20 values
      │          Total: 7×7×30 = 1,470 predictions
      │
      ├──► Step 4: Post-Processing
      │          • Filter by confidence (threshold = 0.2)
      │          • Apply Non-Maximum Suppression (NMS)
      │          • Keep top predictions per class
      │
      └──► Final Output
               ┌──────────────────┐
               │  Person: 95.3%   │  [Bounding boxes with labels]
               │  Dog: 87.6%      │
               └──────────────────┘

metrics

Metric       Value       Comparison
Speed        45 fps      9× faster than Faster R-CNN
Accuracy     63.4% mAP   -7% vs Faster R-CNN
Grid Size    7×7 cells   49 possible detections
Parameters   ~20M        Custom GoogLeNet-inspired backbone (the implementation below uses ResNet34)
Input Size   448×448     Fixed resolution

before YOLO

traditional computer vision

Earlier approaches extracted hand-crafted mathematical features from the image in matrix form - regions of high hue, contrast, or pixel density - and fed those features into a classifier; later, deep networks like ResNet learned the features automatically.

The Problem: This doesn’t work well for complex scenarios like CCTV footage with:

  • Dozens of objects
  • Heavy occlusion
  • Complex features
  • Real-time requirements (classifiers like ResNet are too slow at inference time for live video, especially when run over many patches)

two-stage detectors: R-CNN & Faster R-CNN

Two-stage detection pipeline:

Image → Backbone (CNN/ResNet) → Convolutional Feature Map
      → Stage 1: Generate Bounding Box Proposals
      → Stage 2: Classification + Box Refinement (Regression Head)

The feature map encodes what’s in the image. Stage 1 proposes where objects might be. Stage 2 refines those proposals and assigns class labels.

Why “Two-Stage”? The first pass generates a set of proposals (potential object locations), and the second pass refines these proposals to make final predictions.

Advantages:

  • More accurate than previous methods
  • Each component can be optimized separately

Disadvantages:

  • Multi-stage pipeline is complex
  • Each component trained separately
  • Not suitable for real-time applications (~5 fps on GPU)
  • Computationally expensive

comparison: object detection methods (2014-2016)

Method        Year  Speed (fps)  mAP (%)  Pipeline Type   Proposals                 Real-time?
R-CNN         2014  0.02         66.0     Two-stage       Selective Search (2000+)  No
Fast R-CNN    2015  0.5          70.0     Two-stage       Selective Search          No
Faster R-CNN  2015  7            73.2     Two-stage       RPN (300)                 No
DPM           2015  <1           30.4     Sliding window  Dense sampling            No
YOLOv1        2016  45           63.4     Single-stage    None                      Yes

why YOLO: the motivation

The core idea: treat object detection as a single regression problem.

Instead of:

  1. Generate proposals → 2. Classify proposals

Do: Single network → Box detection + Category prediction simultaneously

key advantages of single-stage detection:

  • Speed: One forward pass through the network
  • Simplicity: End-to-end training with a unified loss function
  • Global reasoning: The network sees the entire image, understanding context
  • Real-time performance: 45+ fps (compared to 5 fps for two-stage detectors)

the trade-off:

  • Slightly lower accuracy on metrics

Joseph Redmon et al. (2016) built exactly this: one network, one pass, boxes + classes out the other end.


YOLO high level overview

Object detection as a single regression problem:

concept

  1. Input: Image of any size (resized to 448×448)
  2. Process: Divide image into S×S grid cells (S=7 in the paper)
  3. Predict: For each grid cell, predict B bounding boxes and C class probabilities
  4. Output: Tensor of shape S×S×(B×5+C)

why “you only look once”?

Unlike sliding window approaches (like DPM) that run a classifier at every position, or two-stage detectors that process thousands of region proposals, YOLO looks at the image once in a single forward pass.

single pass, end-to-end

  • No separate proposal generation
  • No batch processing of regions
  • One network, one forward pass
  • Direct prediction from pixels to bounding boxes and classes

YOLO High-Level Pipeline Figure: YOLO’s single-stage detection pipeline - the entire image is processed once to produce bounding boxes and class predictions

YOLO Algorithm Overview Figure: The complete YOLO detection algorithm from input image to final predictions


the architecture

YOLO is a CNN. 24 convolutional layers for feature extraction, 2 fully connected layers at the end to produce the final output.

Early layers detect low-level stuff (edges, color patches). Deeper layers combine those into higher-level features (eyes, wheels, textures). The FC layers at the end take all of that and produce bounding box predictions.

Backbone (Feature Extraction):

  • 24 convolutional layers with alternating 1×1 and 3×3 filters
  • 4 max-pooling layers for spatial downsampling
  • Leaky ReLU activation (for all layers except the last)

Detection Head:

  • 2 fully connected layers (4096 neurons → 1470 neurons)
  • Linear activation on the final layer

CNN Visualization

Figure: Convolutional Neural Network layers and how they extract features from images

CNN Feature Extraction

Figure: Detailed view of convolutional operations and feature map generation

Convolution Operation

Figure: How convolution filters slide across the image to detect patterns

input → output flow

Input: 448×448×3 (RGB image)
   ↓
[24 Convolutional Layers with MaxPooling]
   ↓
Feature Map: 7×7×1024
   ↓
[Flatten: 7×7×1024 = 50,176 features]
   ↓
[FC Layer 1: 50,176 → 4096]
   ↓
[FC Layer 2: 4096 → 1470]
   ↓
[Reshape: 1470 → 7×7×30]
   ↓
Output: 7×7×30 tensor

why stack convolutional layers?

Purpose: Extract spatial features from the image while preserving structure.

The Problem: Simply stacking convolutions doesn’t increase non-linearity in the data.

The Solution:

  • Use non-linear activation functions (Leaky ReLU) between layers
  • Use 1×1 convolutions to reduce dimensionality and add non-linearity
  • Use 3×3 convolutions to extract spatial features and downsample

why leaky ReLU?

Standard ReLU: f(x) = max(0, x) → can cause “dying neurons” (always output 0)

Leaky ReLU: f(x) = x if x > 0 else 0.1*x → allows small negative gradients, preventing dead neurons
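The activation is a one-liner — a minimal NumPy sketch with YOLOv1's slope of 0.1:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """Leaky ReLU as used in YOLOv1: small negative slope instead of a hard zero."""
    return np.where(x > 0, x, slope * x)

leaky_relu(np.array([-2.0, 0.0, 0.5]))  # → [-0.2, 0.0, 0.5]
```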

regularization techniques:

  • Dropout (rate=0.5) after the first FC layer
  • Data augmentation (random scaling, translation, HSV adjustments)

YOLO Architecture Figure: Complete YOLOv1 network architecture with 24 convolutional layers and 2 fully connected layers

YOLO Architecture Detailed Figure: Layer-by-layer breakdown showing filter sizes, dimensions, and feature map transformations

YOLO Architecture Full Figure: The full network from input (448×448×3) to output (7×7×30) tensor


how YOLO works

  1. Take the image and lay a grid over it. The original YOLO uses a 7×7 grid. This divides the image into 49 cells.

  2. Each grid cell gets a job. It’s responsible for detecting any object whose center point falls within that cell.

  3. Make a prediction for every cell. A single, powerful neural network looks at the whole image and, for every single one of the 49 grid cells, it spits out a prediction. This prediction answers a few key questions:

    • “Is an object’s center in here? How confident am I?”
    • “If so, where is the bounding box for that object?”
    • “And what class is it (a dog, a person, a car)?”

That’s it. One look, one pass through the network, and out comes a flood of predictions from all 49 cells at once. No proposals, no second stage. Just a direct mapping from image pixels to bounding boxes and class probabilities.

the grid system

Step 1: Divide the 448×448 image into a 7×7 grid (S=7 in the paper)

  • Each grid cell: 64×64 pixels (448/7 = 64)

Step 2: Assign responsibility

  • Each cell is responsible for detecting one object
  • Which object? The one whose center point falls into that cell

Grid Cells and Center Points Figure: Image divided into 7×7 grid with object center points marked - each cell is responsible for objects whose centers fall within it

Object Centers

Figure: Identifying which grid cell is responsible for each object based on center point location

Center Point Details Figure: Close-up view of how object center points determine grid cell responsibility

Center of Cell Calculation

Figure: Computing the center point and assigning it to the appropriate grid cell

what does “responsible” mean?

If an object’s center point falls into a grid cell, that cell must:

  1. Predict the bounding box coordinates
  2. Predict the confidence that an object is present
  3. Predict the class probabilities

example

  • Person’s center at (100, 200) → falls into grid cell (1, 3)
  • Horse’s center at (300, 150) → falls into grid cell (4, 2)
  • Grid cell (1, 3) is responsible for detecting “person”
  • Grid cell (4, 2) is responsible for detecting “horse”
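The assignment above is just integer division by the cell size — a quick sketch for the 448×448 image with a 7×7 grid:

```python
CELL_SIZE = 448 // 7  # 64 pixels per cell

def responsible_cell(center_x, center_y, cell=CELL_SIZE):
    """Return the (column, row) of the grid cell that owns this center point."""
    return center_x // cell, center_y // cell

responsible_cell(100, 200)  # → (1, 3), the person's cell
responsible_cell(300, 150)  # → (4, 2), the horse's cell
```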

But there’s a catch… a big one.

What happens if the center of two objects falls into the same grid cell? Imagine a person standing directly in front of a car.

YOLOv1’s limitation: Each grid cell can only detect ONE object.

This is a fundamental rule of the original YOLO. Each cell proposes two bounding boxes (more on that below), but it can only output one set of class probabilities. Two objects in the same cell? Too bad, one gets ignored.

This is YOLOv1’s biggest weakness. Crowds, flocks of birds, anything with clustered small objects - it chokes. Later versions fix this. For now: one cell, one vote.

Fixed in YOLOv2+: Anchor boxes allow multiple detections per cell

Impact: Maximum 49 objects per image (7×7 grid). Crowded scenes (flocks of birds, dense crowds) will have missed detections.


bounding box format and encoding

Ground truth bounding box for person: (100, 200, 130, 202)

  • These are large absolute values
  • Hard for the network to predict directly
  • Solution: Make predictions relative to the grid cell

target encoding (ground truth → YOLO format)

For a bounding box with center (x, y), width w, height h:

Center Coordinates (relative to grid cell):

x' = (x - x_anchor) / cell_width
y' = (y - y_anchor) / cell_height

Where:

  • (x_anchor, y_anchor) = top-left corner of the grid cell
  • cell_width = cell_height = 64 (for 448/7 grid)

Basically: starting from the top-left corner of the grid cell, how far (as a fraction of the cell size) to the object’s center?

Width & Height (relative to entire image):

w' = w / image_width
h' = h / image_height

Where image_width = image_height = 448

Box size as a fraction of the total image.
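Putting both rules together — a sketch that assumes the person box from above, (100, 200, 130, 202), is in (center x, center y, width, height) pixel format:

```python
def encode_box(x, y, w, h, cell=64, img=448):
    """Encode a ground-truth box (center in pixels) into YOLO-format targets."""
    x_anchor = (x // cell) * cell  # top-left corner of the responsible cell
    y_anchor = (y // cell) * cell
    x_rel = (x - x_anchor) / cell  # fraction of the cell, 0..1
    y_rel = (y - y_anchor) / cell
    return x_rel, y_rel, w / img, h / img

encode_box(100, 200, 130, 202)  # → (0.5625, 0.125, ~0.290, ~0.451)
```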

Center Calculations Figure: Computing relative offsets (Δx, Δy) from grid cell top-left corner to object center

More on Center Encoding Figure: Detailed breakdown of how center coordinates are normalized relative to grid cell dimensions

XY Center Coordinates Figure: Visualizing the x,y center point encoding process with actual coordinate values

Normalized Values Figure: Final normalized ground truth values (x’, y’, w’, h’) ready for training

label encoding for training

For each grid cell, we create a target vector:

If cell contains an object:

  • Bounding box: (x', y', w', h', confidence=1.0)
  • Class: One-hot encoding [0, 0, 1, 0, ..., 0] (20 values for Pascal VOC)

If cell contains no object:

  • All zeros: (0, 0, 0, 0, 0, 0, 0, ..., 0) (30 values total)

YOLO Target Format Figure: 30-dimensional vector format with bounding box coordinates, confidence, and class probabilities


prediction vector

what does each grid cell predict?

Each grid cell outputs a 30-dimensional vector:

Bounding Boxes (B=2 boxes):

  • Box 1: (x₁', y₁', w₁', h₁', c₁) - 5 values
  • Box 2: (x₂', y₂', w₂', h₂', c₂) - 5 values

Where:

  • x', y' = center offset relative to top-left corner of grid cell (0 to 1)
  • w', h' = width/height relative to entire image (0 to 1)
  • c = objectness confidence = Pr(Object) × IOU_pred^truth

Class Probabilities (C=20 for Pascal VOC):

  • [p₁, p₂, ..., p₂₀] - 20 values
  • These are conditional probabilities: Pr(Class_i | Object)
  • “What is the probability of each class, given that an object exists in this cell?”

Total: 5×2 + 20 = 30 values per grid cell

For the entire image: 7×7×30 = 1,470 values
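The arithmetic, as a sanity check:

```python
S, B, C = 7, 2, 20               # grid size, boxes per cell, classes
values_per_cell = B * 5 + C      # 2 boxes × (x, y, w, h, conf) + 20 class probs
total_outputs = S * S * values_per_cell

print(values_per_cell, total_outputs)  # 30 1470
```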

why predict 2 boxes per cell?

The Problem: Some objects may have different aspect ratios or sizes.

The Solution: Each cell predicts 2 boxes, and during training, we assign each ground truth object to the box with the highest IOU with that object.

At inference: We keep only the box with the highest confidence score for each cell.

understanding confidence score

The confidence c encodes two things:

  1. Pr(Object): Probability that the box contains an object
  2. IOU_pred^truth: How well the predicted box fits the object
c = Pr(Object) × IOU_pred^truth

Three scenarios:

  • No object in cell: c = 0
  • Object present, poor fit: c = low (e.g., 0.3)
  • Object present, good fit: c = high (e.g., 0.9)

Class Confidence Scores Figure: How class probabilities are combined with objectness confidence to produce final class-specific confidence scores


inference: from predictions to bounding boxes

step 1: get raw predictions

Network outputs: 7×7×30 tensor

For grid cell (i, j):

  • Box 1: (x₁', y₁', w₁', h₁', c₁)
  • Box 2: (x₂', y₂', w₂', h₂', c₂)
  • Classes: [p₁, p₂, ..., p₂₀]

step 2: convert to image coordinates

Center coordinates:

x_anchor, y_anchor = i * 64, j * 64  # top-left of grid cell
x = x' * 64 + x_anchor
y = y' * 64 + y_anchor

Width and height:

w = w' * 448
h = h' * 448

Now we have absolute pixel coordinates for the bounding box.
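The decoding is the exact inverse of the target encoding — a sketch with (i, j) being the cell's column and row:

```python
def decode_box(i, j, x_rel, y_rel, w_rel, h_rel, cell=64, img=448):
    """Convert a cell-relative prediction back to absolute pixel coordinates."""
    x = x_rel * cell + i * cell  # i * cell is x_anchor, the cell's left edge
    y = y_rel * cell + j * cell
    return x, y, w_rel * img, h_rel * img

decode_box(1, 3, 0.5625, 0.125, 130 / 448, 202 / 448)
# → (100.0, 200.0, ≈130, ≈202)
```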

Converting Coordinates Figure: Converting relative YOLO format coordinates back to absolute pixel coordinates for visualization

step 3: compute class-specific confidence

For each box, multiply objectness by class probabilities:

class_confidence = c × max(p₁, p₂, ..., p₂₀)

Example:

  • Box 1 confidence: c₁ = 0.85
  • Highest class probability: p₁₄ = 0.9 (person class)
  • Final confidence: 0.85 × 0.9 = 0.765
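The same computation as a small sketch (the shortened class list here is hypothetical):

```python
def class_confidence(objectness, class_probs):
    """Class-specific confidence: objectness × highest class probability."""
    best_class = max(range(len(class_probs)), key=lambda k: class_probs[k])
    return objectness * class_probs[best_class], best_class

score, cls = class_confidence(0.85, [0.03, 0.9, 0.07])
# score ≈ 0.765, cls = 1
```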

step 4: select best box per cell

Each cell predicted 2 boxes → keep the one with highest class-specific confidence.

Example:

  • Box 1: c₁' = 0.765
  • Box 2: c₂' = 0.621
  • Keep Box 1, discard Box 2

Ground Truth Confidence Figure: Target confidence values and how they’re computed from IOU between predicted and ground truth boxes

step 5: filter by confidence threshold

Remove all boxes with confidence below a threshold (e.g., 0.5):

final_boxes = [box for box in boxes if box.confidence > 0.5]

step 6: apply NMS

Many grid cells may detect the same object → use Non-Maximum Suppression to remove duplicates.


post-processing: IOU and NMS

intersection over union (IOU)

What is IOU? A measure of overlap between two bounding boxes. It tells you how well two boxes align.

Formula:

IOU = Area of Intersection / Area of Union

Interpretation:

  • IOU = 0.0 → No overlap
  • IOU = 0.5 → Decent overlap
  • IOU = 0.8+ → Excellent overlap (detections are typically counted as correct at IOU ≥ 0.5)

IOU Visualization Figure: Intersection over Union (IOU) calculation showing overlap between predicted box (blue) and ground truth box (green), with intersection area highlighted

converting between box formats

Center format: (x_center, y_center, width, height)

Corner format: (x₁, y₁, x₂, y₂) where (x₁,y₁) = top-left, (x₂,y₂) = bottom-right

Conversion:

x₁ = x_center - width/2
y₁ = y_center - height/2
x₂ = x_center + width/2
y₂ = y_center + height/2

computing IOU

Step 1: Find intersection rectangle coordinates

x_left = max(box1_x1, box2_x1)
y_top = max(box1_y1, box2_y1)
x_right = min(box1_x2, box2_x2)
y_bottom = min(box1_y2, box2_y2)

Step 2: Compute intersection area

if x_right < x_left or y_bottom < y_top:
    intersection = 0  # Boxes don't overlap
else:
    intersection = (x_right - x_left) * (y_bottom - y_top)

Step 3: Compute union area

area_box1 = (box1_x2 - box1_x1) * (box1_y2 - box1_y1)
area_box2 = (box2_x2 - box2_x1) * (box2_y2 - box2_y1)
union = area_box1 + area_box2 - intersection

Step 4: Compute IOU

iou = intersection / union

Bounding Box Formats

Figure: Two common bounding box representations - corner format (x₁,y₁,x₂,y₂) vs center format (x_center, y_center, w, h) and how to convert between them

IOU implementation

def iou(pred, gt):
    """
    Calculate Intersection over Union between two bounding boxes.

    Args:
        pred: [x1, y1, x2, y2] predicted box
        gt: [x1, y1, x2, y2] ground truth box

    Returns:
        iou: float between 0 and 1
    """
    pred_x1, pred_y1, pred_x2, pred_y2 = pred
    gt_x1, gt_y1, gt_x2, gt_y2 = gt

    # Find intersection rectangle
    x_topleft = max(pred_x1, gt_x1)
    y_topleft = max(pred_y1, gt_y1)
    x_bottomright = min(pred_x2, gt_x2)
    y_bottomright = min(pred_y2, gt_y2)

    # Check if boxes overlap at all
    if x_bottomright < x_topleft or y_bottomright < y_topleft:
        return 0.0

    # Calculate areas
    intersection = (x_bottomright - x_topleft) * (y_bottomright - y_topleft)
    pred_area = (pred_x2 - pred_x1) * (pred_y2 - pred_y1)
    gt_area = (gt_x2 - gt_x1) * (gt_y2 - gt_y1)
    union = pred_area + gt_area - intersection

    iou = intersection / union
    return iou

non-maximum suppression (NMS)

The Problem: Multiple grid cells often detect the same object, creating duplicate/redundant boxes.

The Solution: Keep only the box with the highest confidence and remove all others that significantly overlap with it.

NMS Process

NMS algorithm

For each class separately:

  1. Sort all detections by confidence (descending)
  2. Select the box with highest confidence
  3. Remove all boxes with IOU > threshold (e.g., 0.5) with the selected box
  4. Repeat until no boxes remain

Why per-class? We want to suppress duplicate detections of the same object, but not suppress a person detection because it overlaps with a car detection.

Picking NMS Threshold

Choosing the wrong IOU threshold for NMS drastically affects performance.

Threshold too low (e.g., 0.3):

  • Suppresses too many boxes
  • Result: Misses objects that are close together
  • Example: Two people standing next to each other → only one detected

Threshold too high (e.g., 0.7):

  • Suppresses too few boxes
  • Result: Multiple boxes per object (duplicates)
  • Example: Single person gets 3 bounding boxes

Paper uses 0.5:

  • Balances duplicate suppression and multi-object detection
  • Works well for most scenarios

NMS Intuition: Keep the best box, kill everything that overlaps too much with it. Repeat until nothing’s left to suppress.

NMS implementation

def nms(detections, nms_threshold=0.5):
    """
    Apply Non-Maximum Suppression to remove duplicate detections.

    Args:
        detections: list of [x1, y1, x2, y2, score]
        nms_threshold: IOU threshold for suppression

    Returns:
        keep_detections: filtered list of detections
    """
    # Sort detections by score (descending)
    sorted_det = sorted(detections, key=lambda k: -k[-1])
    keep_detections = []

    while len(sorted_det) > 0:
        # Keep the highest confidence box
        best_box = sorted_det[0]
        keep_detections.append(best_box)

        # Remove this box and all boxes with high overlap
        sorted_det = [
            box for box in sorted_det[1:]
            if iou(best_box[:-1], box[:-1]) < nms_threshold
        ]

    return keep_detections

training process

dataset: Pascal VOC

Pascal VOC 2007 + 2012:

  • 20 object classes
  • Contains images with bounding box annotations
  • Standard benchmark for object detection

two-stage training strategy

Why two stages? To help the network learn useful features first, then fine-tune for detection.

stage 1: pretraining (classification)

  • Task: Image classification on ImageNet
  • Input size: 224×224
  • Network: First 20 convolutional layers + 1 FC layer
  • Goal: Learn general visual features
  • Why? Reduces training time and improves convergence

stage 2: detection training

  • Task: Object detection on Pascal VOC
  • Input size: 448×448 (doubled for finer localization)
  • Network: Full 24 conv layers + 2 FC layers
  • Why higher resolution? Detection requires fine-grained spatial information

training hyperparameters

Epochs: ~135 epochs

Batch size: 64

Optimizer: SGD with momentum

  • Momentum: 0.9
  • Weight decay: 0.0005

Learning rate schedule:

  • Epochs 1-5: Warm-up from 10⁻³ to 10⁻² (prevents divergence from unstable gradients)
  • Epochs 6-75: 10⁻² (main training)
  • Epochs 76-105: 10⁻³ (fine-tuning)
  • Epochs 106-135: 10⁻⁴ (final refinement)

Why warm-up? Starting with high learning rate causes unstable gradients and divergence. Gradual warm-up stabilizes training.
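The schedule above is easy to express as a piecewise function. A sketch — the exact shape of the warm-up isn't specified here beyond its endpoints, so linear interpolation is an assumption:

```python
def learning_rate(epoch):
    """Piecewise learning-rate schedule for YOLOv1 training (1-indexed epoch)."""
    if epoch <= 5:   # warm-up from 1e-3 to 1e-2 (linear shape assumed)
        return 1e-3 + (1e-2 - 1e-3) * (epoch - 1) / 4
    if epoch <= 75:  # main training
        return 1e-2
    if epoch <= 105:  # fine-tuning
        return 1e-3
    return 1e-4       # final refinement
```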

data augmentation

To prevent overfitting and improve generalization:

  1. Random scaling and translation: up to 20% of original image size
  2. Random HSV adjustment:
    • Exposure and saturation: up to 1.5× factor
    • Helps with lighting variations
  3. Dropout: Rate = 0.5 after first FC layer

annotation format conversion

Original annotation: [x_min, y_min, x_max, y_max, class_id]

YOLO format conversion:

# Calculate center and size
center_x = (x_min + x_max) / 2
center_y = (y_min + y_max) / 2
width = x_max - x_min
height = y_max - y_min

# Normalize to 0-1 range
center_x_norm = center_x / image_width
center_y_norm = center_y / image_height
width_norm = width / image_width
height_norm = height / image_height

# Find which grid cell this object belongs to
grid_x = min(int(center_x_norm * 7), 6)  # clamp so a center on the right edge stays in cell 6
grid_y = min(int(center_y_norm * 7), 6)  # same for the bottom edge

# Calculate offsets relative to grid cell
x_offset = (center_x_norm * 7) - grid_x  # 0 to 1
y_offset = (center_y_norm * 7) - grid_y  # 0 to 1

# Create target: [x_offset, y_offset, width_norm, height_norm, 1.0, class_one_hot]

the loss function

why regression loss?

Object detection in YOLO is formulated as a regression problem: we’re predicting continuous values (box coordinates, confidence scores, class probabilities).

The loss function uses Mean Squared Error (MSE) for all components, but with careful weighting to handle class imbalance and scale differences.

three components of loss

  1. Localization Loss: Ensures predicted box coordinates match ground truth
  2. Confidence Loss: Trains confidence scores to reflect object presence and fit quality
  3. Classification Loss: Ensures correct class probabilities for cells with objects

Total Loss = λ_coord × L_box + L_conf + L_class

the class imbalance problem

Problem: In most images, most grid cells don’t contain objects.

  • ~2-5 cells with objects
  • ~44-47 cells without objects

Impact:

  • Gradients from “no object” cells dominate training
  • Overwhelms gradients from “object” cells
  • Network learns to predict “no object” everywhere

Solution: Weight the losses differently

  • λ_coord = 5: Increase weight of box coordinate loss
  • λ_noobj = 0.5: Decrease weight of “no object” confidence loss

Intuition: Why λ_coord = 5 and λ_noobj = 0.5?

The Numbers Game:

  • Typical image: 2-3 cells with objects, 46-47 cells without objects
  • Ratio: ~20:1 (no-object : object)

Without weighting (all λ = 1):

Total confidence loss = 2 × (object conf loss) + 47 × (no-object conf loss)
                     ≈ 2 × 0.5 + 47 × 0.1 = 5.7
                     ≈ 82% from no-object cells!

The “no object” gradient overwhelms “object” gradient → network learns to predict “no object” everywhere.

With weighting (λ_coord=5, λ_noobj=0.5):

Box coord loss:     5 × (2 cells × coord error) = 5 × 1.0 = 5.0
Object conf loss:   1 × (2 cells × conf error) = 1 × 0.5 = 0.5
No-obj conf loss:   0.5 × (47 cells × conf error) = 0.5 × 4.7 = 2.35
Total ≈ 7.85, where localization now matters most.

Why these specific values?

  • λ_coord = 5: Emphasizes getting boxes right (5× more important than classification)
  • λ_noobj = 0.5: De-emphasizes empty cells (2× less important than object cells)
  • Balance found empirically through ablation studies in the paper

λ_coord=5 says “getting box positions right matters a lot.” λ_noobj=0.5 says “empty cells, chill out, you’re not that important.”

Wrong Lambda Values

Mistake #1: Using λ_coord = 1 (same as others)

  • Result: Network struggles with localization. Boxes are detected but poorly positioned.
  • Symptom: Low IOU scores even when objects are detected

Mistake #2: Using λ_noobj = 1 (same as object cells)

  • Result: No-object gradient dominates, network predicts low confidence for everything
  • Symptom: Very few detections, even on objects that are clearly visible

Mistake #3: Using λ_coord = 10 (too high)

  • Result: Network focuses only on boxes, ignores classification
  • Symptom: Good IOU but wrong class labels

How to tune: Start with paper values (5, 0.5). Only adjust if you have domain-specific reasons (e.g., small dataset → increase λ_coord to 7).

Loss Penalty Weighting Figure: Visualization of class imbalance problem - most grid cells (gray) contain no objects while only a few (colored) contain objects, showing why we need λ weighting factors

1. localization loss (box regression)

Goal: Make predicted box coordinates close to ground truth.

Formula:

L_box = λ_coord × Σ Σ 𝟙ᵢⱼᵒᵇʲ [(xᵢ - x̂ᵢ)² + (yᵢ - ŷᵢ)²]
        + λ_coord × Σ Σ 𝟙ᵢⱼᵒᵇʲ [(√wᵢ - √ŵᵢ)² + (√hᵢ - √ĥᵢ)²]

Where:

  • 𝟙ᵢⱼᵒᵇʲ = 1 if cell i contains an object and box j is responsible for it
  • (x, y, w, h) = ground truth box coordinates (YOLO format)
  • (x̂, ŷ, ŵ, ĥ) = predicted box coordinates
  • Responsible box = the box with highest IOU with ground truth

Why square root of w and h?

Problem: MSE treats errors equally regardless of box size.

  • 10-pixel error in 100×100 box = small mistake
  • 10-pixel error in 20×20 box = huge mistake

Solution: Predict √w and √h instead of w and h directly.

  • Small boxes: √w changes more for same absolute change
  • Large boxes: √w changes less for same absolute change
  • This makes errors more balanced across different box sizes

Example:

  • Box 1: w=100, √w=10, error of 10 → √110 - 10 ≈ 0.49
  • Box 2: w=25, √w=5, error of 10 → √35 - 5 ≈ 0.92 (penalized more)

Intuition: Why √w and √h?

The Problem: Absolute errors matter more for small objects.

Imagine two prediction errors:

  • Large box (200×200 pixels): Predicted width = 210, GT = 200 → error = 10 pixels
  • Small box (20×20 pixels): Predicted width = 30, GT = 20 → error = 10 pixels

Both have the same absolute error (10 pixels), but the small box error is much worse (50% error vs 5% error)!

Without √: MSE treats both equally → Loss = 10² = 100 for both.

With √: Square root scaling makes errors proportional:

Large box: (√210 - √200)² = (14.49 - 14.14)² = 0.12
Small box: (√30 - √20)² = (5.48 - 4.47)² = 1.02

Now the small box error is ~8× larger in the loss → Network learns precision for small objects!

Square root compresses large values more than small ones. Balances the loss across different object sizes.
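The numbers above can be checked directly — a quick sketch:

```python
import math

def wh_loss(w_pred, w_true, use_sqrt=True):
    """Squared width error, with or without YOLO's square-root trick."""
    if use_sqrt:
        return (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2
    return (w_pred - w_true) ** 2

wh_loss(210, 200, use_sqrt=False)  # 100 — identical to the small box
wh_loss(30, 20, use_sqrt=False)    # 100
wh_loss(210, 200)                  # ≈ 0.12 — large box, mild penalty
wh_loss(30, 20)                    # ≈ 1.01 — small box, ~8× larger penalty
```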

Forgetting the Square Root

Mistake: Predicting w and h directly instead of √w and √h

Consequence: Network struggles to learn small objects accurately. Large objects dominate the loss, making the model biased toward bigger objects.

How to spot: If your model detects large objects well but completely misses small ones, check if you’re applying sqrt() to the width and height predictions!

In code: Always use torch.sqrt(w) in your target encoding and torch.square(w_pred) when converting back.

All Loss Components Figure: The three components of YOLO loss function - localization (box coordinates), confidence (objectness), and classification (class probabilities)

2. confidence loss

Goal: Train confidence scores to reflect:

  1. Whether an object is present
  2. How well the predicted box fits the object

Formula:

L_conf = Σ Σ 𝟙ᵢⱼᵒᵇʲ (Cᵢ - Ĉᵢ)²
         + λ_noobj × Σ Σ 𝟙ᵢⱼⁿᵒᵒᵇʲ (Cᵢ - Ĉᵢ)²

Where:

  • 𝟙ᵢⱼᵒᵇʲ = 1 if object exists and box j is responsible
  • 𝟙ᵢⱼⁿᵒᵒᵇʲ = 1 if no object exists in cell i
  • C = target confidence (1 if object, 0 if no object)
  • Ĉ = predicted confidence
  • λ_noobj = 0.5 = weight for “no object” cells

Two parts:

  1. Object cells: Push confidence toward 1.0
  2. No-object cells: Push confidence toward 0.0 (but weighted less)

Target confidence for object cells:

C = Pr(Object) × IOU_pred^truth = 1.0 × IOU

So if predicted box has IOU=0.9 with ground truth, target confidence = 0.9.

Loss Again Figure: Detailed breakdown of how each loss component is computed for a single grid cell prediction

3. classification loss

Goal: Predict correct class probabilities for cells containing objects.

Formula:

L_class = Σ 𝟙ᵢᵒᵇʲ Σ (pᵢ(c) - p̂ᵢ(c))²

Where:

  • 𝟙ᵢᵒᵇʲ = 1 if cell i contains an object
  • pᵢ(c) = ground truth probability for class c (one-hot: 1 for correct class, 0 otherwise)
  • p̂ᵢ(c) = predicted probability for class c
  • Sum over all C classes (20 for Pascal VOC)

Important: This loss is only computed for cells that contain objects.

Example:

  • Cell contains a person (class 14)
  • Target: [0, 0, ..., 1, ..., 0] (1 at position 14)
  • Predicted: [0.1, 0.05, ..., 0.8, ..., 0.02]
  • Loss: Σ(target - pred)² = (0-0.1)² + ... + (1-0.8)² + ... ≈ 0.05

complete loss function

Classification Loss Figure: Classification loss computation showing how class probabilities are compared with one-hot encoded ground truth labels

Complete Loss Function Figure: The complete YOLO loss function combining all three components with their respective weighting factors

Loss Equation Final Figure: Mathematical formulation of the complete loss function with indicator functions and summations

All Losses Equation Figure: All three loss components shown together with their mathematical expressions and weighting parameters

L = λ_coord × L_box + L_conf + L_class

Where:
- λ_coord = 5.0
- λ_noobj = 0.5
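Putting the three components together — a simplified NumPy sketch that treats box 1 as always being the responsible box and ignores box 2 entirely (the real loss picks the predicted box with the higher IOU per cell):

```python
import numpy as np

LAMBDA_COORD, LAMBDA_NOOBJ = 5.0, 0.5

def yolo_v1_loss(pred, target):
    """Simplified YOLOv1 loss over (7, 7, 30) prediction and target tensors.

    Per-cell layout: [x1, y1, w1, h1, c1,  x2, y2, w2, h2, c2,  p1..p20].
    Simplification: box 1 is assumed responsible; box 2 is ignored.
    """
    obj = target[..., 4] == 1.0   # cells that contain an object
    noobj = ~obj

    # 1. localization loss: x/y directly, sqrt on w/h, responsible boxes only
    xy_err = np.sum((pred[obj][:, 0:2] - target[obj][:, 0:2]) ** 2)
    wh_err = np.sum((np.sqrt(pred[obj][:, 2:4]) -
                     np.sqrt(target[obj][:, 2:4])) ** 2)
    l_box = LAMBDA_COORD * (xy_err + wh_err)

    # 2. confidence loss: object cells pushed toward 1, empty cells toward 0
    l_obj = np.sum((pred[obj][:, 4] - target[obj][:, 4]) ** 2)
    l_noobj = LAMBDA_NOOBJ * np.sum(pred[noobj][:, 4] ** 2)

    # 3. classification loss: only for cells that contain an object
    l_class = np.sum((pred[obj][:, 10:] - target[obj][:, 10:]) ** 2)

    return l_box + l_obj + l_noobj + l_class
```

A perfect prediction (pred equal to target) drives every term to zero; any coordinate, confidence, or class error raises the loss with the weights described above.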

Recap of Loss

training process with loss

Forward pass:

  1. Image → Network → Predictions (7×7×30)
  2. For each grid cell, compare predictions to targets
  3. Compute three loss components
  4. Sum weighted losses

Backward pass:

  1. Compute gradients via backpropagation
  2. Update weights using SGD with momentum

Key insight: The indicator functions 𝟙ᵢⱼᵒᵇʲ ensure we only compute loss for relevant predictions:

  • Box coordinates: only for responsible boxes
  • Confidence: for all boxes (but weighted differently)
  • Classification: only for cells with objects

evaluation: mAP

Model’s trained. Is it any good? The standard metric for object detection is mAP (mean Average Precision).

First, precision and recall. Say an image has 10 dogs. The model outputs 8 boxes claiming to be dogs.

  • Of those 8 boxes, 6 are actually correct (they overlap a real dog with an IOU > 0.5). These are True Positives (TP).
  • The other 2 boxes were mistakes (e.g., it drew a box around a bush). These are False Positives (FP)
  • The model completely missed 4 of the real dogs. These are False Negatives (FN).

understanding detection metrics

Three types of predictions:

  1. True Positive (TP): Predicted box matches ground truth (IOU ≥ threshold)
  2. False Positive (FP): Predicted box doesn’t match any ground truth
  3. False Negative (FN): Ground truth object that wasn’t detected

precision and recall

With these numbers, we can ask two questions:

  1. Precision: “Of the answers you gave, how many were right?”
    • Precision = TP / (TP + FP) = 6 / (6 + 2) = 75%
    • This measures how trustworthy the model’s predictions are. A high precision model doesn’t make many silly mistakes.
  2. Recall: “Of all the things you should have found, how many did you find?”
    • Recall = TP / (TP + FN) = 6 / (6 + 4) = 60%
    • This measures how comprehensive the model is. A high recall model doesn’t miss much.
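The dog example as two one-liners:

```python
tp, fp, fn = 6, 2, 4          # 6 correct boxes, 2 mistakes, 4 dogs missed

precision = tp / (tp + fp)    # 6/8 = 0.75 — how trustworthy the predictions are
recall = tp / (tp + fn)       # 6/10 = 0.6 — how many real dogs were found
```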

The trade-off

Precision and recall fight each other.

  • Only predict when you’re 100% sure? High precision, low recall. You miss stuff.
  • Predict aggressively on everything? High recall, low precision. Lots of garbage boxes.

so, what is mAP?

We want both. Average Precision (AP) captures this for a single class - it’s the area under the precision-recall curve. Higher AP means the model stays precise even as it tries to find more objects.

Precision-Recall Curve: Plot precision vs recall at different confidence thresholds.

Steps to calculate AP:

  1. Sort all predictions by confidence
  2. For each confidence threshold, compute precision and recall
  3. Plot the curve
  4. Compute area under curve

mAP = average AP across all classes. 20 classes? Compute AP for each, average them.

mAP = (AP_class1 + AP_class2 + ... + AP_class20) / 20

One number to rule them all.

  • mAP@0.5: Use IOU threshold of 0.5 to determine TPs
  • mAP@0.75: Use IOU threshold of 0.75 (stricter)
  • mAP@[0.5:0.95]: Average mAP across IOU thresholds from 0.5 to 0.95 in steps of 0.05 (the COCO standard)

mAP Implementation

import numpy as np  # also relies on the iou() helper defined in the implementation section

def compute_map(pred_boxes, gt_boxes, iou_threshold=0.5):
    """
    Calculate mean Average Precision for object detection.

    Args:
        pred_boxes: List of predictions per image
                   [{class: [[x1,y1,x2,y2,score], ...], ...}, ...]
        gt_boxes: List of ground truths per image
                 [{class: [[x1,y1,x2,y2], ...], ...}, ...]
        iou_threshold: IOU threshold to consider a detection as TP

    Returns:
        mean_ap: Mean average precision across all classes
        all_aps: Dictionary of AP for each class
    """
    # Get all class labels from ground truth
    gt_labels = set()
    for im_gt in gt_boxes:
        for cls_key in im_gt.keys():
            gt_labels.add(cls_key)
    gt_labels = sorted(list(gt_labels))

    all_aps = {}
    aps = []

    # Compute AP for each class
    for label in gt_labels:
        # Collect all predictions for this class across all images
        cls_preds = []
        for im_idx, im_pred in enumerate(pred_boxes):
            if label in im_pred:
                for box in im_pred[label]:
                    cls_preds.append((im_idx, box))

        # Sort predictions by confidence (descending)
        cls_preds = sorted(cls_preds, key=lambda k: -k[1][-1])

        # Track which GT boxes have been matched
        gt_matched = [[False for _ in im_gts.get(label, [])]
                      for im_gts in gt_boxes]

        # Count total GT boxes for this class (for recall)
        num_gts = sum([len(im_gts.get(label, [])) for im_gts in gt_boxes])

        # Track TP and FP for each prediction
        tp = [0] * len(cls_preds)
        fp = [0] * len(cls_preds)

        # For each prediction, determine if TP or FP
        for pred_idx, (im_idx, pred_box) in enumerate(cls_preds):
            # Get GT boxes for this image and class
            im_gts = gt_boxes[im_idx].get(label, [])

            # Find best matching GT box
            max_iou_found = -1
            max_iou_gt_idx = -1
            for gt_box_idx, gt_box in enumerate(im_gts):
                gt_box_iou = iou(pred_box[:-1], gt_box)
                if gt_box_iou > max_iou_found:
                    max_iou_found = gt_box_iou
                    max_iou_gt_idx = gt_box_idx

            # TP only if IOU >= threshold AND GT box hasn't been matched yet
            if max_iou_found < iou_threshold or \
               (max_iou_gt_idx >= 0 and gt_matched[im_idx][max_iou_gt_idx]):
                fp[pred_idx] = 1
            else:
                tp[pred_idx] = 1
                gt_matched[im_idx][max_iou_gt_idx] = True

        # Compute cumulative TP and FP
        tp = np.cumsum(tp)
        fp = np.cumsum(fp)

        # Compute precision and recall at each threshold
        recalls = tp / num_gts
        precisions = tp / (tp + fp)

        # Smooth precision curve (ensures precision is monotonic)
        # Add boundary values
        recalls = np.concatenate(([0.0], recalls, [1.0]))
        precisions = np.concatenate(([0.0], precisions, [0.0]))

        # Make precision monotonically decreasing
        for i in range(precisions.size - 1, 0, -1):
            precisions[i - 1] = np.maximum(precisions[i - 1], precisions[i])

        # Compute AP as area under curve
        # Get points where recall changes
        i = np.where(recalls[1:] != recalls[:-1])[0]
        ap = np.sum((recalls[i + 1] - recalls[i]) * precisions[i + 1])

        if num_gts > 0:
            aps.append(ap)
            all_aps[label] = ap
        else:
            all_aps[label] = np.nan

    mean_ap = sum(aps) / len(aps) if len(aps) > 0 else 0.0
    return mean_ap, all_aps

performance and results

speed: real-time detection

YOLOv1 Performance:

  • 45 fps on Nvidia Titan X GPU
  • 155 fps for Fast-YOLO (9 conv layers instead of 24)
  • Roughly 6× faster than the previous state of the art (Faster R-CNN at ~7 fps)

Why so fast?

  1. Single forward pass: No region proposals, no per-region re-classification
  2. Unified architecture: One network handles everything
  3. Efficient design: Optimized from the ground up for speed

accuracy comparison

YOLOv1 on Pascal VOC 2007:

  • mAP: 63.4%
  • Fast-YOLO: 52.7% mAP

Comparison with other methods:

  • Faster R-CNN: ~70% mAP, but only 7 fps
  • DPM (Deformable Part Model): ~30% mAP, slower than YOLO
  • R-CNN: Higher accuracy, but 47 seconds per image

The trade-off:

  • YOLO trades ~7 mAP points for a ~6× speed improvement
  • Enables entirely new applications (real-time video, embedded systems)

key advantages over two-stage detectors

  1. Global reasoning: YOLO sees the entire image during prediction
    • Understands context (less likely to classify background as object)
    • Fewer background false positives than Fast R-CNN
  2. Generalization: Better performance on new domains
    • Learns generalizable features
    • Better transfer to artwork, sketches, etc.
  3. Simplicity: Single network, end-to-end training
    • No separate proposal generation
    • Unified loss function
    • Easier to optimize
  4. Real-time performance: 45+ fps enables:
    • Live video analysis
    • Robotics and autonomous vehicles
    • Interactive applications

YOLOv1 Results Figure: Performance comparison of YOLOv1 with other detection methods on Pascal VOC 2007 dataset

YOLOv1 Detailed Results Figure: Detailed accuracy and speed metrics showing YOLO’s superiority in real-time performance while maintaining competitive accuracy


limitations of YOLOv1

1. limited objects per grid cell

Constraint: Each grid cell can only predict one object (despite predicting 2 boxes).

  • Maximum detections: 7×7 = 49 objects per image

Problem: Fails in crowded scenarios

  • Flocks of birds
  • Dense crowds
  • Multiple small objects in one cell

Why? Each cell predicts only one set of class probabilities, shared by both boxes.

Example failure case:

  • Grid cell contains centers of both a dog and bicycle
  • Cell can only classify as one class
  • One object will be missed

2. struggles with small objects

Problem: Small objects (especially in groups) are hard to detect.

Why?

  • 7×7 grid is coarse (each cell = 64×64 pixels)
  • Small objects may not strongly activate any single cell
  • Multiple small objects may share a grid cell

Example: A flock of small birds in the sky.

3. unusual aspect ratios

Problem: Objects with uncommon shapes or aspect ratios are hard to detect.

Why?

  • Network learns aspect ratio priors from training data
  • If training data contains mostly cars that are 2:1 (width:height)
  • Long, thin cars or unusually shaped objects will be missed

Example: Very long limousine, tall narrow doorway

4. localization errors

Problem: Box coordinates are less accurate than two-stage detectors.

Why?

  • Coarse feature map (7×7) for final prediction
  • No iterative refinement (unlike Faster R-CNN)
  • Especially affects small objects

Impact: Lower IOU values, affects mAP at higher IOU thresholds

5. box size sensitivity

Problem: Despite using √w and √h, small vs large box errors still not perfectly balanced.

Why? MSE treats all errors equally in loss calculation.
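The √ trick can be checked numerically. A sketch (the 448-pixel image size is YOLOv1's; the specific box widths are arbitrary examples): the same 10-pixel width error costs far more on a small box than on a large one in √-space, whereas raw-width MSE would charge both identically.

```python
import math

img = 448  # image width in pixels

# squared error in √-space for a true vs predicted box width (in pixels)
sq_err = lambda w_true, w_pred: (math.sqrt(w_pred / img) - math.sqrt(w_true / img)) ** 2

small = sq_err(20, 30)    # 10px error on a 20px-wide box
large = sq_err(200, 210)  # the same 10px error on a 200px-wide box

# √ encoding charges the small box roughly 8× more;
# raw-width MSE would charge both exactly (10/448)²
```

So the √ encoding helps, but the penalty ratio is fixed by the parameterization; it isn't a perfect scale-invariant loss, which is the imbalance this section describes.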

6. class imbalance despite weighting

Problem: Even with λ_noobj = 0.5, “no object” cells still dominate.

Why?

  • Typically 47 cells without objects vs 2-3 with objects
  • Still creates ~20× more “no object” loss terms

Impact:

  • Confidence scores may be systematically lower
  • May miss objects in difficult scenarios

failure cases: when YOLO struggles

failure case #1: small grouped objects

┌─────────────────────────────────────┐
│  Image: Flock of Birds in Sky       │
│                                      │
│    • • •  • •  •  •  • • •          │  ← 15 birds
│   •  •  •  • •  •  •  •             │
│  •  •  •  • •   •  •  •   •         │
│                                      │
│  YOLOv1 Detection:                   │
│  ✓ 3 birds detected                  │
│  ✗ 12 birds missed!                  │
└─────────────────────────────────────┘

Why it fails:

  • 7×7 grid = only 49 possible detections maximum
  • Small birds (5×5 pixels each) → multiple birds share grid cells
  • Each cell can only detect ONE object (one set of class probs)
  • Small objects have weak features at 7×7 resolution

YOLOv3 Fix: Multi-scale detection (13×13, 26×26, 52×52 grids) → 8,000+ possible detections

YOLOv10 Fix: NMS-free detection + multi-scale → perfect for dense small objects
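The multi-scale arithmetic behind the v3 fix works out as:

```python
# YOLOv3 predicts 3 boxes per cell at each of three grid scales
cells = 13 * 13 + 26 * 26 + 52 * 52   # 3,549 cells across all scales
candidates = cells * 3                # 10,647 candidate boxes vs YOLOv1's 49
```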

failure case #2: unusual aspect ratios

┌────────────────────────────────────────┐
│  Image: Stretch Limousine              │
│  ▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂            │  ← Very long car (10:1 ratio)
│  ▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀            │
│                                        │
│  YOLOv1 Prediction:                    │
│  ┌─────┐        ┌─────┐                │  ← Two separate "car" boxes
│  │ Car │ ...gap...│ Car │              │     (misses it's one object)
│  └─────┘        └─────┘                │
│  ✗ Poor fit, low confidence            │
└────────────────────────────────────────┘

Why it fails:

  • Training data: mostly ~2:1 aspect ratio cars
  • Limousine: 10:1 aspect ratio → out of distribution
  • Network hasn’t learned to predict such extreme shapes
  • Two grid cells detect parts separately

YOLOv2 Fix: Anchor boxes with multiple aspect ratios (1:1, 1:2, 2:1, 1:3, 3:1)

YOLOv10 Fix: Anchor-free design adapts to any shape automatically

failure case #3: crowded scenes

┌─────────────────────────────────────────┐
│  Image: Dense Crowd (Concert)           │
│  P P P P P P P  <- 100+ people            │
│  P P P P P P P                           │
│  P P P P P P P                           │
│                                         │
│  YOLOv1 Detection:                      │
│  ✓ 35 people detected                   │
│  ✗ 65 people missed!                    │
│  (7×7 = 49 max, but overlap reduces)    │
└─────────────────────────────────────────┘

Why it fails:

  • Hard limit: 7×7 = 49 grid cells
  • Each cell: only 1 object
  • Crowded scene: many people share cells
  • Result: Systematic undercounting

YOLOv3 Fix: Finer grids (52×52 = 2,704 cells) + 3 predictions per cell

YOLOv10 Fix: No cell limit + NMS-free → handles 100s of objects

failure case #4: heavy occlusion

┌──────────────────────────────────┐
│  Image: Overlapping Objects      │
│                                  │
│      ┌─────┐                     │
│      │  A  │ ← Person A          │
│    ┌─┼─────┤                     │
│    │B│  A  │ ← Person B (behind) │
│    └─┴─────┘                     │
│      Center of B is inside A     │
│                                  │
│  YOLOv1 Detection:               │
│  ✓ Person A detected             │
│  ✗ Person B missed!              │
│  (Both centers in same cell)     │
└──────────────────────────────────┘

Why it fails:

  • Both object centers fall in same grid cell
  • YOLO chooses higher confidence box
  • Occluded object has lower features → lower confidence → ignored

YOLOv2 Fix: Multiple anchors per cell → can detect both

YOLOv10 Fix: Dual heads (o2m + o2o) → better occlusion handling

failure rate summary

The following table shows illustrative estimates to convey qualitative trends - these are not actual benchmark results.

| Scenario | YOLOv1 Performance (illustrative) | YOLOv10 Performance (illustrative) |
|---|---|---|
| Small grouped objects (birds) | ~20% recall | ~90% recall |
| Unusual aspect ratios (limousine) | ~40% IOU | ~85% IOU |
| Crowded scenes (>50 objects) | ~35% recall | ~95% recall |
| Heavy occlusion (overlapping) | ~50% recall | ~85% recall |

YOLOv1 works great on sparse, medium-sized objects with normal aspect ratios. Anything else? Use v3+ or v10.


evolution: from v1 to v10

the YOLO family tree

Each version after v1 tried to fix its problems without killing the speed.

key improvements across versions

YOLOv2 (YOLO9000) - 2017:

  • Anchor boxes: Instead of predicting boxes directly, predict offsets from predefined anchors
  • Batch normalization: Added to all conv layers, improved convergence
  • High-resolution classifier: Pretrain at 448×448 (not 224×224)
  • Multi-scale training: Train on different input sizes
  • Better backbone: Darknet-19 (19 conv layers + 5 maxpool)
  • Performance: 76.8% mAP at 67 fps (416×416 input); 78.6% mAP at 40 fps (544×544 input)
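The anchor-box idea in one function. A sketch of YOLOv2-style decoding (function and argument names are mine, not from the paper): the network predicts offsets (tx, ty, tw, th) relative to an anchor prior (pw, ph) at grid cell (cx, cy).

```python
import math

def decode_anchor(tx, ty, tw, th, cx, cy, pw, ph):
    """YOLOv2-style decode: bounded center offsets, multiplicative size offsets."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    bx = cx + sigmoid(tx)   # sigmoid keeps the center inside its own grid cell
    by = cy + sigmoid(ty)
    bw = pw * math.exp(tw)  # size is an exponential scaling of the anchor prior
    bh = ph * math.exp(th)
    return bx, by, bw, bh

# Zero offsets → box centered in cell (3, 2) with exactly the anchor's size
decode_anchor(0, 0, 0, 0, 3, 2, 1.5, 2.0)  # (3.5, 2.5, 1.5, 2.0)
```

Predicting small offsets from a sensible prior is a much easier regression target than predicting raw box shapes from scratch, which is why v2's anchors fixed many of v1's aspect-ratio failures.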

YOLOv3 - 2018:

  • Multi-scale predictions: Detect at 3 different scales (better for small objects)
  • Better backbone: Darknet-53 (53 conv layers, residual connections)
  • Binary classification: Use logistic regression instead of softmax (allows multi-label)
  • Performance: 57.9% mAP@0.5, comparable to RetinaNet but 3-4× faster

YOLOv4 - 2020:

  • Bag of freebies: Techniques that improve accuracy without increasing inference cost
    • Data augmentation (Mosaic, MixUp)
    • Label smoothing
    • DropBlock regularization
  • Bag of specials: Techniques with small inference cost increase
    • Mish activation
    • CSPNet backbone
    • SPP (Spatial Pyramid Pooling)
    • PANet neck
  • Performance: 43.5% mAP@0.5:0.95, state-of-the-art at the time

YOLOv5 - 2020 (Ultralytics):

  • PyTorch implementation: Easier to use and customize
  • Auto-learning anchors: Automatically cluster anchors from training data
  • Model family: Nano, Small, Medium, Large, XLarge variants
  • Better augmentations: Albumentations integration
  • Export options: ONNX, TensorRT, CoreML, etc.

YOLOv6, v7, v8 (2022-2023): Mostly incremental improvements. v6 (Meituan) focused on industrial deployment with a hardware-aware design. v7 introduced re-parameterized convolutions and a more efficient training pipeline. v8 (Ultralytics) went anchor-free and decoupled the classification and regression heads. All three pushed accuracy up a few points on COCO, but the next real architectural jump is v10.

YOLOv10 - 2024:

  • NMS-free detection: Eliminates need for NMS post-processing
    • Consistent matching strategy during training
    • Dual label assignment
    • Significantly faster inference
  • Efficiency optimizations:
    • Compact inverted block design
    • Partial self-attention
    • Spatial-channel decoupled downsampling
  • Model variants: N/S/M/B/L/X for different speed/accuracy trade-offs
  • Performance:
    • YOLOv10-X: ~54% mAP@0.5:0.95 on COCO
    • YOLOv10-N: Real-time on edge devices

key shifts across versions

  1. v1→v2: Direct prediction → Anchor-based
  2. v2→v3: Single-scale → Multi-scale detection
  3. v3→v4: Plain training recipe → systematic "bag of freebies/specials" tricks
  4. v4→v5: Darknet (C) → PyTorch (Python), better tooling
  5. v8→v10: Anchor-based → Anchor-free, NMS → NMS-free

what remained constant

Ten versions in, the core hasn’t changed:

  • Single-stage detection: One network, one forward pass
  • Real-time performance: Speed is a primary goal
  • End-to-end training: Unified loss function
  • Practical focus: Easy to deploy and use

YOLOv10: what’s actually new

Two things that actually matter in v10: NMS-free detection and a rethought architecture.

1. NMS-free training with dual heads

Traditional YOLO models generate multiple overlapping boxes per object, then rely on Non-Maximum Suppression (NMS) as post-processing. NMS adds latency and isn’t differentiable, so you can’t optimize it end-to-end. YOLOv10 gets rid of it entirely.

The trick is two detection heads during training:

  • One-to-Many (o2m) Head: Matches each ground truth object with multiple predictions. Rich supervision signal - explores diverse locations for the same object.
  • One-to-One (o2o) Head: Matches each ground truth object with exactly one prediction. Learns to output one clean box per object.

Both heads share the same matching metric so their supervision stays consistent. During training, the o2o head benefits from the rich signal of the o2m head (which explores multiple locations) while learning to produce clean, single predictions. At inference, the o2m head is discarded entirely - zero extra cost. You just use the o2o head, pick top-K class scores, filter by confidence, and you’re done. No NMS needed.

The result: about 40% faster inference than YOLOv8 since there’s no post-processing overhead, and cleaner predictions with one strong box per object.
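A sketch of what the NMS-free decode reduces to at inference. The function name, tensor shapes, and threshold values here are my assumptions for illustration, not YOLOv10's actual implementation:

```python
import torch

def nms_free_decode(boxes, scores, k=300, conf_thresh=0.25):
    """boxes: (N, 4), scores: (N, num_classes) from the o2o head."""
    cls_scores, cls_ids = scores.max(dim=-1)   # best class score per prediction
    k = min(k, cls_scores.numel())
    top_scores, top_idx = cls_scores.topk(k)   # keep top-k predictions overall
    keep = top_scores > conf_thresh            # confidence filter — no NMS step
    return boxes[top_idx][keep], top_scores[keep], cls_ids[top_idx][keep]
```

Compare with the traditional pipeline: no pairwise IOU computation between surviving boxes, no iterative suppression loop, just a sort and a threshold.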

2. architectural efficiency

YOLOv10 also rethinks the model architecture for better speed-accuracy tradeoffs:

  • Lightweight classification head: The classification head in YOLOv8 was 2.5x heavier than the regression head despite regression being more important for accuracy. YOLOv10 replaces it with depthwise separable convolutions - much cheaper.
  • Decoupled downsampling: Instead of one expensive 3x3 stride-2 conv that handles both spatial reduction and channel expansion, YOLOv10 splits it into a pointwise conv (channels) and a depthwise conv (spatial). Significantly cheaper.
  • Rank-guided block design: Not all network stages are equally important. YOLOv10 computes the intrinsic rank of each stage’s convolutions (via SVD), then replaces redundant stages with lightweight Compact Inverted Blocks while keeping high-rank stages strong with large kernel convolutions and Partial Self-Attention.
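The decoupled-downsampling saving is easy to verify with a parameter count. A sketch with arbitrary example channel sizes (256 → 512):

```python
import torch
import torch.nn as nn

c_in, c_out = 256, 512

# Coupled: one 3×3 stride-2 conv does spatial reduction AND channel expansion
coupled = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)

# Decoupled: pointwise conv handles channels, depthwise stride-2 conv handles space
decoupled = nn.Sequential(
    nn.Conv2d(c_in, c_out, 1),
    nn.Conv2d(c_out, c_out, 3, stride=2, padding=1, groups=c_out),
)

count = lambda m: sum(p.numel() for p in m.parameters())
# count(coupled) ≈ 1.18M parameters, count(decoupled) ≈ 0.14M — same output shape
```

Both paths map (256, H, W) to (512, H/2, W/2), but the decoupled version uses roughly 8× fewer parameters.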

comparison: YOLOv1 vs YOLOv10

| Metric | YOLOv1 (2016) | YOLOv10 (2024) |
|---|---|---|
| mAP50 | 63.4% (VOC) | 90.6% (ODN dataset) |
| mAP50-95 | Not reported | 76% |
| Speed (fps) | 45 fps | ~63 fps (40% faster than v8) |
| Grid Size | 7×7 (fixed) | Multi-scale, adaptive |
| Boxes per Cell | 2 | Dynamic (o2o/o2m) |
| Max Objects | 49 (7×7 grid limit) | Unlimited |
| Post-processing | NMS required | NMS-free |
| Small Objects | Struggles | Excellent (multi-scale detection) |
| Architecture | 24 conv + 2 FC | Efficient CSPNet + lightweight heads |
| Training | Single-head | Dual-head (o2m + o2o) |

v1 proved single-stage detection works. v10 makes it truly end-to-end - no post-processing, even faster. Same core idea, just way more refined.


implementation: PyTorch code examples

Full PyTorch implementation of YOLOv1 below.

1. model architecture

The YOLO network uses a ResNet34 backbone (pretrained on ImageNet) followed by detection layers:

import torch
import torch.nn as nn
import torchvision

class YOLOV1(nn.Module):
    """
    YOLOv1 Implementation using ResNet34 backbone

    Args:
        img_size: Input image size (448x448)
        num_classes: Number of classes (20 for Pascal VOC)
        model_config: Configuration dict with S, B, and architectural params

    Output:
        Tensor of shape (batch_size, S, S, 5*B + C)
    """
    def __init__(self, img_size, num_classes, model_config):
        super(YOLOV1, self).__init__()
        self.img_size = img_size
        self.S = model_config['S']  # Grid size (7x7)
        self.B = model_config['B']  # Boxes per cell (2)
        self.C = num_classes  # Number of classes (20)

        # Load pretrained ResNet34 backbone (trained on ImageNet 224x224)
        backbone = torchvision.models.resnet34(
            weights=torchvision.models.ResNet34_Weights.IMAGENET1K_V1
        )

        # Feature extraction layers (before FC layers)
        self.features = nn.Sequential(
            backbone.conv1,    # 7x7 conv, stride 2
            backbone.bn1,
            backbone.relu,
            backbone.maxpool,
            backbone.layer1,   # ResNet blocks
            backbone.layer2,
            backbone.layer3,
            backbone.layer4,   # Output: 512 channels
        )

        # Detection head: 3 conv layers for feature refinement
        yolo_conv_channels = model_config['yolo_conv_channels']  # 1024
        leaky_relu_slope = model_config['leaky_relu_slope']  # 0.1

        self.conv_layers = nn.Sequential(
            nn.Conv2d(512, yolo_conv_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(yolo_conv_channels),
            nn.LeakyReLU(leaky_relu_slope),

            nn.Conv2d(yolo_conv_channels, yolo_conv_channels, 3,
                     stride=2, padding=1, bias=False),
            nn.BatchNorm2d(yolo_conv_channels),
            nn.LeakyReLU(leaky_relu_slope),

            nn.Conv2d(yolo_conv_channels, yolo_conv_channels, 3,
                     padding=1, bias=False),
            nn.BatchNorm2d(yolo_conv_channels),
            nn.LeakyReLU(leaky_relu_slope)
        )

        # Final 1x1 conv to get S*S*(5B+C) output
        self.final_conv = nn.Conv2d(yolo_conv_channels, 5 * self.B + self.C, 1)

    def forward(self, x):
        # x: (batch, 3, 448, 448)
        out = self.features(x)      # (batch, 512, 14, 14)
        out = self.conv_layers(out)  # (batch, 1024, 7, 7)
        out = self.final_conv(out)   # (batch, 30, 7, 7)

        # Permute to (batch, S, S, 5B+C)
        out = out.permute(0, 2, 3, 1)  # (batch, 7, 7, 30)
        return out

2. loss function

The complete YOLOv1 loss with all three components:

import torch
import torch.nn as nn

def iou(box1, box2):
    """
    Calculate Intersection over Union between two sets of boxes.

    Args:
        box1, box2: Tensors of shape (..., 4) in format (x1, y1, x2, y2)

    Returns:
        iou: Tensor of shape (...) with IOU values
    """
    # Calculate areas
    area1 = (box1[..., 2] - box1[..., 0]) * (box1[..., 3] - box1[..., 1])
    area2 = (box2[..., 2] - box2[..., 0]) * (box2[..., 3] - box2[..., 1])

    # Find intersection rectangle
    x_topleft = torch.max(box1[..., 0], box2[..., 0])
    y_topleft = torch.max(box1[..., 1], box2[..., 1])
    x_bottomright = torch.min(box1[..., 2], box2[..., 2])
    y_bottomright = torch.min(box1[..., 3], box2[..., 3])

    # Calculate intersection area (clamp to handle non-overlapping boxes)
    intersection = (x_bottomright - x_topleft).clamp(min=0) * \
                   (y_bottomright - y_topleft).clamp(min=0)

    # Calculate union and IOU
    union = area1.clamp(min=0) + area2.clamp(min=0) - intersection
    iou = intersection / (union + 1e-6)  # Add epsilon to avoid division by zero
    return iou

class YOLOLoss(nn.Module):
    """
    YOLOv1 Loss Function: Localization + Confidence + Classification

    Loss = λ_coord × L_box + L_conf + L_class
    """
    def __init__(self, S=7, B=2, C=20):
        super(YOLOLoss, self).__init__()
        self.S = S
        self.B = B
        self.C = C
        self.lambda_coord = 5.0    # Increase weight for box coordinates
        self.lambda_noobj = 0.5    # Decrease weight for no-object cells

    def forward(self, preds, targets):
        """
        Args:
            preds: (batch, S, S, 5*B + C) - model predictions
            targets: (batch, S, S, 5*B + C) - ground truth targets

        Returns:
            loss: Scalar tensor
        """
        batch_size = preds.size(0)

        # Create coordinate shift grids for converting relative → absolute coords
        xshift = torch.arange(0, self.S, device=preds.device) / float(self.S)
        yshift = torch.arange(0, self.S, device=preds.device) / float(self.S)
        yshift, xshift = torch.meshgrid(yshift, xshift, indexing='ij')
        xshift = xshift.reshape((1, self.S, self.S, 1)).repeat(1, 1, 1, self.B)
        yshift = yshift.reshape((1, self.S, self.S, 1)).repeat(1, 1, 1, self.B)

        # Reshape predictions and targets: (batch, S, S, B, 5)
        pred_boxes = preds[..., :5*self.B].reshape(batch_size, self.S, self.S, self.B, 5)
        target_boxes = targets[..., :5*self.B].reshape(batch_size, self.S, self.S, self.B, 5)

        # Convert from (x_offset, y_offset, √w, √h) to (x1, y1, x2, y2) format
        def boxes_to_x1y1x2y2(boxes, xshift, yshift):
            x_center = boxes[..., 0] / self.S + xshift
            y_center = boxes[..., 1] / self.S + yshift
            width = torch.square(boxes[..., 2])   # w = (√w)²
            height = torch.square(boxes[..., 3])  # h = (√h)²

            x1 = (x_center - 0.5 * width).unsqueeze(-1)
            y1 = (y_center - 0.5 * height).unsqueeze(-1)
            x2 = (x_center + 0.5 * width).unsqueeze(-1)
            y2 = (y_center + 0.5 * height).unsqueeze(-1)
            return torch.cat([x1, y1, x2, y2], dim=-1)

        pred_boxes_xyxy = boxes_to_x1y1x2y2(pred_boxes, xshift, yshift)
        target_boxes_xyxy = boxes_to_x1y1x2y2(target_boxes, xshift, yshift)

        # Calculate IOU between predicted and target boxes
        iou_pred_target = iou(pred_boxes_xyxy, target_boxes_xyxy)

        # Find responsible box (highest IOU with ground truth)
        max_iou, max_iou_idx = iou_pred_target.max(dim=-1, keepdim=True)
        max_iou_idx = max_iou_idx.repeat(1, 1, 1, self.B)

        # Create mask for responsible boxes
        box_indices = torch.arange(self.B, device=preds.device).reshape(1, 1, 1, self.B)
        box_indices = box_indices.expand_as(max_iou_idx)
        is_responsible_box = (max_iou_idx == box_indices).long()

        # Object indicator: 1 if cell contains object, 0 otherwise
        obj_indicator = targets[..., 4:5]  # Shape: (batch, S, S, 1)

        # Indicator for responsible boxes in cells with objects
        responsible_obj_indicator = is_responsible_box * obj_indicator

        # --- 1. LOCALIZATION LOSS (only for responsible boxes) ---
        x_loss = (pred_boxes[..., 0] - target_boxes[..., 0]) ** 2
        y_loss = (pred_boxes[..., 1] - target_boxes[..., 1]) ** 2
        w_loss = (pred_boxes[..., 2] - target_boxes[..., 2]) ** 2  # √w loss
        h_loss = (pred_boxes[..., 3] - target_boxes[..., 3]) ** 2  # √h loss

        localization_loss = self.lambda_coord * (
            (responsible_obj_indicator * x_loss).sum() +
            (responsible_obj_indicator * y_loss).sum() +
            (responsible_obj_indicator * w_loss).sum() +
            (responsible_obj_indicator * h_loss).sum()
        )

        # --- 2. CONFIDENCE LOSS (for object cells) ---
        # Target confidence = IOU for responsible boxes
        # (detach: the IOU is a regression target, so no gradient should flow through it)
        obj_conf_loss = ((pred_boxes[..., 4] - max_iou.detach()) ** 2 *
                        responsible_obj_indicator).sum()

        # --- 3. CONFIDENCE LOSS (for no-object cells) ---
        no_obj_indicator = 1 - responsible_obj_indicator
        noobj_conf_loss = self.lambda_noobj * (
            (pred_boxes[..., 4] ** 2 * no_obj_indicator).sum()
        )

        # --- 4. CLASSIFICATION LOSS (only for cells with objects) ---
        class_preds = preds[..., 5*self.B:]
        class_targets = targets[..., 5*self.B:]
        class_loss = ((class_preds - class_targets) ** 2 * obj_indicator).sum()

        # Total loss
        total_loss = (localization_loss + obj_conf_loss +
                     noobj_conf_loss + class_loss) / batch_size

        return total_loss

3. dataset & target encoding

Pascal VOC annotations to YOLO format:

import torch
import albumentations as alb
import cv2
from torch.utils.data import Dataset

class VOCDataset(Dataset):
    """Pascal VOC Dataset with YOLO target encoding"""

    def __init__(self, split='train', img_size=448, S=7, B=2, C=20):
        self.split = split
        self.img_size = img_size
        self.S = S  # Grid size
        self.B = B  # Boxes per cell
        self.C = C  # Number of classes

        # Data augmentation for training
        self.transforms = {
            'train': alb.Compose([
                alb.HorizontalFlip(p=0.5),
                alb.Affine(scale=(0.8, 1.2),
                          translate_percent=(-0.2, 0.2)),
                alb.ColorJitter(brightness=(0.8, 1.2),
                               saturation=(0.8, 1.2)),
                alb.Resize(self.img_size, self.img_size)
            ], bbox_params=alb.BboxParams(format='pascal_voc',
                                          label_fields=['labels'])),
            'test': alb.Compose([
                alb.Resize(self.img_size, self.img_size)
            ], bbox_params=alb.BboxParams(format='pascal_voc',
                                          label_fields=['labels']))
        }

        # Load Pascal VOC annotations...
        # (XML parsing code omitted for brevity)

    def __getitem__(self, index):
        # Load image and annotations
        img_info = self.images_info[index]
        img = cv2.imread(img_info['filename'])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        bboxes = [det['bbox'] for det in img_info['detections']]  # (x1,y1,x2,y2)
        labels = [det['label'] for det in img_info['detections']]

        # Apply augmentations
        transformed = self.transforms[self.split](
            image=img, bboxes=bboxes, labels=labels
        )
        img = transformed['image']
        bboxes = torch.tensor(transformed['bboxes'])
        labels = torch.tensor(transformed['labels'])

        # Scale pixel values to [0, 1] (ImageNet mean/std normalization omitted for brevity)
        img_tensor = torch.from_numpy(img / 255.0).permute(2, 0, 1).float()

        # --- Create YOLO target tensor ---
        target_dim = 5 * self.B + self.C
        yolo_target = torch.zeros(self.S, self.S, target_dim)

        h, w = img.shape[:2]
        cell_size = h // self.S  # Pixels per grid cell

        if len(bboxes) > 0:
            # Convert (x1, y1, x2, y2) → (x_center, y_center, width, height)
            box_width = bboxes[:, 2] - bboxes[:, 0]
            box_height = bboxes[:, 3] - bboxes[:, 1]
            box_center_x = bboxes[:, 0] + 0.5 * box_width
            box_center_y = bboxes[:, 1] + 0.5 * box_height

            # Determine which grid cell each object belongs to
            # (clamp handles centers that land exactly on the bottom/right image edge)
            grid_i = torch.floor(box_center_x / cell_size).long().clamp(max=self.S - 1)
            grid_j = torch.floor(box_center_y / cell_size).long().clamp(max=self.S - 1)

            # Compute relative coordinates within grid cell (0 to 1)
            box_x_offset = (box_center_x - grid_i * cell_size) / cell_size
            box_y_offset = (box_center_y - grid_j * cell_size) / cell_size

            # Normalize width and height to image size
            box_w_norm = box_width / w
            box_h_norm = box_height / h

            # Fill YOLO target tensor
            for idx in range(len(bboxes)):
                # Assign same target to all B boxes (model picks responsible one)
                for b in range(self.B):
                    s = 5 * b
                    yolo_target[grid_j[idx], grid_i[idx], s] = box_x_offset[idx]
                    yolo_target[grid_j[idx], grid_i[idx], s+1] = box_y_offset[idx]
                    yolo_target[grid_j[idx], grid_i[idx], s+2] = box_w_norm[idx].sqrt()
                    yolo_target[grid_j[idx], grid_i[idx], s+3] = box_h_norm[idx].sqrt()
                    yolo_target[grid_j[idx], grid_i[idx], s+4] = 1.0  # Confidence

                # One-hot encode class
                label = int(labels[idx])
                yolo_target[grid_j[idx], grid_i[idx], 5*self.B + label] = 1.0

        return img_tensor, yolo_target
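To make the encoding above concrete, here is the same arithmetic by hand for a single made-up box on a 448×448 image with S=7 (so each cell is 64 px); the box coordinates are illustrative only:

```python
import math

S, img_size = 7, 448
cell = img_size // S  # 64 px per cell

# hypothetical box: center at pixel (100, 250), 80 px wide, 120 px tall
cx, cy, bw, bh = 100, 250, 80, 120

grid_i, grid_j = cx // cell, cy // cell        # cell column 1, row 3
x_off = (cx - grid_i * cell) / cell            # (100 - 64) / 64 = 0.5625
y_off = (cy - grid_j * cell) / cell            # (250 - 192) / 64 = 0.90625
w_norm, h_norm = bw / img_size, bh / img_size  # 80/448, 120/448

# the target stores sqrt(w), sqrt(h), matching the loss formulation
tw, th = math.sqrt(w_norm), math.sqrt(h_norm)
```

So this box's coordinates land in cell (row 3, column 1), with offsets measured from that cell's top-left corner.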

4. training loop

The full loop, following the paper's hyperparameters:

import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader

# Initialize model, loss, and dataset
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = YOLOV1(img_size=448, num_classes=20, model_config={
    'S': 7, 'B': 2, 'yolo_conv_channels': 1024, 'leaky_relu_slope': 0.1
}).to(device)

criterion = YOLOLoss(S=7, B=2, C=20)

train_dataset = VOCDataset(split='train', img_size=448)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

# Optimizer: SGD with momentum (as per paper)
optimizer = SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)

# Learning rate schedule: reduce at epochs [75, 105]
# Paper uses warm-up from 1e-3 to 1e-2 for first epochs, then steps down
scheduler = MultiStepLR(optimizer, milestones=[75, 105], gamma=0.1)

# Training loop
num_epochs = 135  # As per paper
model.train()

for epoch in range(num_epochs):
    epoch_loss = 0.0

    for images, targets in train_loader:
        images = images.to(device)
        targets = targets.to(device)

        # Forward pass
        predictions = model(images)
        loss = criterion(predictions, targets)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    scheduler.step()
    print(f'Epoch {epoch+1}/{num_epochs}, Loss: {epoch_loss/len(train_loader):.4f}')

5. inference with NMS

Converting raw predictions to actual boxes:

def convert_predictions_to_boxes(predictions, S=7, B=2, C=20,
                                conf_threshold=0.2, nms_threshold=0.5):
    """
    Convert YOLO predictions to bounding boxes with NMS.

    Args:
        predictions: (S, S, 5*B + C) tensor
        conf_threshold: Minimum confidence to keep box
        nms_threshold: IOU threshold for NMS

    Returns:
        boxes: (N, 4) tensor in (x1, y1, x2, y2) format
        scores: (N,) confidence scores
        labels: (N,) class labels
    """
    predictions = predictions.reshape(S, S, 5*B + C)

    # Get class predictions (same for all boxes in a cell)
    class_probs, class_labels = predictions[..., 5*B:].max(dim=-1)

    # Create coordinate shift grid
    shifts_x = torch.arange(S, device=predictions.device) / float(S)
    shifts_y = torch.arange(S, device=predictions.device) / float(S)
    shifts_y, shifts_x = torch.meshgrid(shifts_y, shifts_x, indexing='ij')

    all_boxes = []
    all_scores = []
    all_labels = []

    # Process each of B boxes per cell
    for b in range(B):
        # Extract box parameters
        x_offset = predictions[..., b*5 + 0]
        y_offset = predictions[..., b*5 + 1]
        w = predictions[..., b*5 + 2]
        h = predictions[..., b*5 + 3]
        conf = predictions[..., b*5 + 4]

        # Convert to absolute coordinates
        x_center = (x_offset / S + shifts_x)
        y_center = (y_offset / S + shifts_y)
        width = torch.square(w)   # model predicts √w, so square to recover w
        height = torch.square(h)  # likewise for √h

        # Convert to (x1, y1, x2, y2) format
        x1 = (x_center - 0.5 * width).reshape(-1, 1)
        y1 = (y_center - 0.5 * height).reshape(-1, 1)
        x2 = (x_center + 0.5 * width).reshape(-1, 1)
        y2 = (y_center + 0.5 * height).reshape(-1, 1)
        boxes = torch.cat([x1, y1, x2, y2], dim=-1)

        # Compute class-specific confidence scores
        scores = conf.reshape(-1) * class_probs.reshape(-1)
        labels = class_labels.reshape(-1)

        all_boxes.append(boxes)
        all_scores.append(scores)
        all_labels.append(labels)

    # Concatenate all boxes
    boxes = torch.cat(all_boxes, dim=0)
    scores = torch.cat(all_scores, dim=0)
    labels = torch.cat(all_labels, dim=0)

    # Confidence thresholding
    keep = scores > conf_threshold
    boxes = boxes[keep]
    scores = scores[keep]
    labels = labels[keep]

    # Apply NMS per class
    keep_mask = torch.zeros_like(scores, dtype=torch.bool)
    for class_id in torch.unique(labels):
        class_indices = labels == class_id
        class_boxes = boxes[class_indices]
        class_scores = scores[class_indices]

        # NMS via torchvision (requires `import torchvision`)
        keep_indices = torchvision.ops.nms(
            class_boxes, class_scores, nms_threshold
        )

        # Mark these boxes as kept
        class_keep_indices = torch.where(class_indices)[0][keep_indices]
        keep_mask[class_keep_indices] = True

    final_boxes = boxes[keep_mask]
    final_scores = scores[keep_mask]
    final_labels = labels[keep_mask]

    return final_boxes, final_scores, final_labels

# Example usage
model.eval()
with torch.no_grad():
    img_tensor = ... # Load and preprocess image
    predictions = model(img_tensor.unsqueeze(0))[0]  # Remove batch dim

    boxes, scores, labels = convert_predictions_to_boxes(
        predictions, conf_threshold=0.2, nms_threshold=0.5
    )

    # boxes: (N, 4) in normalized 0-1 coordinates
    # Multiply by image dimensions to get pixel coordinates

key implementation details

  1. √w and √h: The model predicts square root of width/height to balance loss across different box sizes
  2. Relative Coordinates: x,y offsets are relative to grid cell top-left corner
  3. Responsible Box Selection: During training, only the box with highest IOU with GT is penalized for coordinates
  4. Class Probabilities: Shared across all B boxes in a cell (limitation of v1)
  5. Lambda Weighting: λ_coord=5 to emphasize localization, λ_noobj=0.5 to de-emphasize empty cells

This follows the original paper and reaches roughly 63% mAP on Pascal VOC 2007 after the full 135-epoch schedule.
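Point 1 above is easy to verify numerically: the same 0.05 absolute width error costs far more in √-space for a small box than for a large one, which is exactly the balancing the loss is after (illustrative numbers only):

```python
import math

def sqrt_space_sq_error(w_true, w_pred):
    # squared error between true and predicted sqrt-widths, as in the loss
    return (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2

# same 0.05 absolute width error on a large box vs a small box
large = sqrt_space_sq_error(0.80, 0.85)
small = sqrt_space_sq_error(0.05, 0.10)

# the small box's error is roughly an order of magnitude larger in sqrt-space
```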


optimization techniques

precision options

Precision              Speed           Accuracy          Memory          Best For
FP32 (default)         1× (baseline)   100% (baseline)   4 bytes/param   Training, research
FP16 (half precision)  2-3×            ~99.5%            2 bytes/param   Production (GPU)
INT8 (quantization)    2-4×            97-99%            1 byte/param    Edge devices

INT8 Quantization (PyTorch):

import torch.quantization

model = YOLOV1(...)
model.train()  # QAT inserts fake-quant observers, so prep happens in train mode

# Quantization-aware training (best accuracy)
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Train for a few epochs...

model_prepared.eval()
model_quantized = torch.quantization.convert(model_prepared)

# Result: 4× smaller, 2-4× faster on CPU

layer fusion

TensorRT automatically fuses operations:

Conv → BatchNorm → ReLU  →  Single fused kernel
                            (3× fewer memory transfers)
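TensorRT does this automatically, but you can see the same idea in PyTorch with `torch.ao.quantization.fuse_modules`, which folds the frozen BatchNorm statistics into the conv weights and merges the ReLU. A minimal sketch; the three-layer Sequential is a stand-in, not the YOLO backbone:

```python
import torch
import torch.nn as nn
from torch.ao.quantization import fuse_modules

# a toy Conv → BN → ReLU stack
m = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
)
m.eval()  # fusion requires eval mode (BN stats are frozen, then folded)

# fuse modules '0' (conv), '1' (bn), '2' (relu) into one fused op
fused = fuse_modules(m, [['0', '1', '2']])

# outputs match, but the fused model runs as a single kernel
x = torch.randn(1, 3, 32, 32)
```

The fused model replaces the BN and ReLU slots with `nn.Identity`, so downstream code that indexes the Sequential keeps working.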

dynamic tensor memory

  • Reuses memory buffers across layers
  • Reduces GPU memory usage by 30-50%

3. hardware-specific deployment

NVIDIA GPU (RTX 3060/4090) Recommendation: TensorRT with FP16

# Export PyTorch → ONNX → TensorRT
python export.py --weights best.pt --format onnx
trtexec --onnx=yolov1.onnx --saveEngine=yolov1_fp16.trt --fp16

Expected Performance:

  • RTX 4090: 180 fps (FP16), 90 fps (FP32)
  • RTX 3060: 120 fps (FP16), 60 fps (FP32)

Edge Devices (Jetson, Raspberry Pi) Recommendation: TensorRT INT8 or TFLite

NVIDIA Jetson Orin:

# INT8 quantization for Jetson
trtexec --onnx=yolov1.onnx \
        --int8 \
        --workspace=2048 \
        --saveEngine=yolov1_int8.trt

Performance: 120 fps (INT8), 45 fps (FP16)


conclusion

This post covered the full YOLO pipeline: how the grid system works, why the loss function is shaped the way it is, what NMS and mAP actually measure, and how the architecture evolved from v1 through v10. The core insight from the original 2016 paper - treat detection as a single regression problem instead of a multi-stage pipeline - turned out to be the right bet. Ten versions later, the models are faster and more accurate, but that basic idea hasn’t changed.


references

  • Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You Only Look Once: Unified, Real-Time Object Detection. CVPR 2016.
  • Pascal VOC Dataset: http://host.robots.ox.ac.uk/pascal/VOC/
  • ImageNet: http://www.image-net.org/
  • PyTorch Implementation

extra

Training a phone detection model. What is happening at each epoch?

1. training

For each batch (32 images):

a. Data Loading

  • Load 32 images from train set
  • Apply augmentations (copy_paste, HSV, rotation, etc.)
  • Resize to 640x640
  • Normalize pixel values

b. Forward Pass

  • Pass batch through YOLOv11m network
  • Get predictions: bboxes (x,y,w,h), class probabilities, objectness scores
  • Network has ~20M parameters (backbone + neck + head)

c. Loss Calculation

  • Box loss (7.5x weight): IoU/GIoU loss for bbox coordinates
  • Class loss (1.5x weight): Binary cross-entropy for classification
  • DFL loss (1.5x weight): Distribution focal loss for bbox refinement
  • Total loss = weighted sum of above

What do these mean?

  • Box loss: How well the model draws a box around the phone
  • Class loss: How sure the model is that the thing is a phone
  • DFL loss: How precisely the model guesses the edges of the box
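The weighted total is just a linear combination of the three components; a sketch with made-up per-component values:

```python
def total_loss(box_loss, cls_loss, dfl_loss,
               w_box=7.5, w_cls=1.5, w_dfl=1.5):
    # weights taken from the loss-calculation step above
    return w_box * box_loss + w_cls * cls_loss + w_dfl * dfl_loss

# e.g. with hypothetical component losses:
total = total_loss(box_loss=0.4, cls_loss=0.2, dfl_loss=0.2)  # → 3.6
```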

d. Backward Pass

  • Backpropagate the total loss to compute a gradient for every parameter
  • The weight update itself happens in the optimizer step (I used AdamW)

e. Optimizer Step (AdamW)

  • Update weights using gradients
  • Apply learning rate (starts at lr0 = 0.0005, decays to 0.01 × lr0)
  • Apply weight decay (0.0005) for regularization
  • Apply momentum (β₁ = 0.9 in AdamW)
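A sketch of that optimizer setup in PyTorch. The `nn.Linear` is a stand-in for the real network, and the linear decay down to lrf × lr0 is one common way to realize the schedule described above (assumed, not the exact Ultralytics internals):

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(10, 4)  # stand-in for YOLOv11m

lr0, lrf = 0.0005, 0.01  # initial LR and final fraction from the text
epochs = 100

optimizer = AdamW(model.parameters(), lr=lr0,
                  betas=(0.9, 0.999),  # beta1 plays the momentum role
                  weight_decay=0.0005)

# linear decay: lr goes from lr0 at epoch 0 to lrf * lr0 at the last epoch
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda e: (1 - e / epochs) * (1 - lrf) + lrf)

for _ in range(epochs):
    optimizer.step()   # (real training would compute a loss first)
    scheduler.step()

# final learning rate ≈ lrf * lr0 = 5e-6
```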

2. validation (every epoch)

After all training batches:

a. Switch to eval mode

  • Disable dropout (use the full model instead of randomly skipping units)
  • Use batch normalization in eval mode (fixed running statistics instead of per-batch statistics)
  • No augmentations (no flips)

b. For each validation image

  • Forward pass (no gradient calculation)
  • Get predictions with conf threshold (default 0.001 for val)
  • Apply NMS (non-maximum suppression: if the model draws many boxes on the same phone, keep only the best one and throw away the rest) with IoU threshold 0.45

c. Metric Calculation

  • Match predictions to ground truth (IoU >= 0.5)
  • Calculate per-class metrics:
    • Precision = TP / (TP + FP)
    • Recall = TP / (TP + FN)
    • mAP@50 = mean AP at IoU=0.5
    • mAP@50-95 = mean AP averaged over IoU 0.5 to 0.95
  • Calculate overall metrics (averaged across all classes)
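The per-class precision and recall reduce to simple counting; a minimal sketch with made-up counts:

```python
def precision_recall(tp, fp, fn):
    # guard against division by zero when a class has no predictions/objects
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# hypothetical: 80 phones detected correctly, 20 false boxes, 10 phones missed
p, r = precision_recall(tp=80, fp=20, fn=10)  # p = 0.8, r ≈ 0.889
```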

You can see how a CNN works here

PyTorch implementation of YOLOV1 paper

Intro to YOLO

