Residual Networks

What

Neural networks with skip connections (shortcut connections) that let input bypass one or more layers. Instead of learning a full mapping H(x), the layers learn a residual F(x) = H(x) - x, and the output is y = F(x) + x.

The problem ResNets solved

Before 2015, simply stacking more layers made networks perform worse — and not because of overfitting: on CIFAR-10, a plain 56-layer network had higher training error than a 20-layer one. That shouldn't happen if depth only adds capacity, since the extra layers could in principle learn the identity.

Why skip connections work

  • Gradient highway: gradients flow directly through the skip connection, bypassing layers that might squash them; this mitigates Vanishing and Exploding Gradients
  • Easy to learn identity: if a block isn’t helpful, its weights can go to zero and the block passes the input through unchanged, so a deeper network is never forced to be worse than a shallower one
  • Residual is easier to learn: learning a small adjustment to the input is simpler than learning the full transformation from scratch

Input x ──┬──→ [Conv → BN → ReLU → Conv → BN] ──→ (+) ──→ ReLU ──→ Output
           │                                         ↑
           └─────────── skip connection ─────────────┘
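The identity-shortcut behaviour is easy to check numerically. A minimal NumPy sketch, with a fully connected residual branch standing in for the conv block in the diagram: when the branch’s weights are zero, the whole block reduces to the identity.

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = F(x) + x, where F(x) = W2 @ relu(W1 @ x)."""
    f = W2 @ np.maximum(W1 @ x, 0.0)  # residual branch F(x)
    return f + x                      # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.standard_normal(8)

# Zero weights: the block passes the input through unchanged (identity)
W_zero = np.zeros((8, 8))
assert np.allclose(residual_block(x, W_zero, W_zero), x)

# Small nonzero weights: the block learns an adjustment on top of x,
# so the output stays close to the input rather than replacing it
W1 = 0.1 * rng.standard_normal((8, 8))
W2 = 0.1 * rng.standard_normal((8, 8))
y = residual_block(x, W1, W2)
print(np.linalg.norm(y - x))  # small relative to |x|
```

(The weight shapes and scale here are illustrative, not from the paper.)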

ResNet architecture

Variant      Layers  Parameters  Top-1 accuracy (ImageNet)
ResNet-18    18      11M         ~69%
ResNet-34    34      21M         ~73%
ResNet-50    50      25M         ~76%
ResNet-101   101     44M         ~77%
ResNet-152   152     60M         ~78%

Bottleneck blocks

ResNet-50 and deeper variants use bottleneck blocks to reduce computation:

1x1 conv (reduce channels: 256 → 64)
3x3 conv (process at reduced dimension)
1x1 conv (expand channels: 64 → 256)

The 1x1 convolutions squeeze and expand the channel dimension, so the expensive 3x3 convolution works on fewer channels. This makes deeper networks practical.
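The savings are easy to count. Ignoring biases and batch norm, a k×k conv has k·k·c_in·c_out weights; a quick sketch comparing a basic block (two 3x3 convs at 256 channels, as in ResNet-18/34) against the 256 → 64 → 256 bottleneck above:

```python
def conv_params(k, c_in, c_out):
    # weights in a k x k convolution, ignoring bias/BN
    return k * k * c_in * c_out

# Basic block: two plain 3x3 convs at 256 channels
basic = 2 * conv_params(3, 256, 256)

# Bottleneck: 1x1 reduce, 3x3 at the reduced width, 1x1 expand
bottleneck = (conv_params(1, 256, 64)
              + conv_params(3, 64, 64)
              + conv_params(1, 64, 256))

print(basic)       # 1,179,648
print(bottleneck)  # 69,632 — roughly 17x fewer weights
```

The expensive 3x3 conv runs at 64 channels instead of 256, which is where almost all of the reduction comes from.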

Impact

  • Enabled training networks with 100+ layers (up to 1000+ in experiments)
  • Won ImageNet 2015 by a large margin
  • Skip connections became standard in almost every modern architecture: DenseNet, U-Net, Transformers (residual connections around attention and FFN layers)
  • Transfer Learning with pretrained ResNets is one of the most common starting points for vision tasks