VGG-16 Layer Shapes — Complete Shape Trace
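VGG-16 at a glance (spatial size after each of the 13 conv layers, then the final pool output, the flattened width, and the three FC widths): 224 → 224 → 112 → 112 → 56 → 56 → 56 → 28 → 28 → 28 → 14 → 14 → 14 → 7 → Flatten(25088) → 4096 → 4096 → 1000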
Complete Shape Trace
Input:                        (batch, 3, 224, 224)

# Block 1 (2 conv layers)
Conv3x3(64) + ReLU            (batch, 64, 224, 224)
Conv3x3(64) + ReLU            (batch, 64, 224, 224)
MaxPool2d(2, 2)               (batch, 64, 112, 112)

# Block 2 (2 conv layers)
Conv3x3(128) + ReLU           (batch, 128, 112, 112)
Conv3x3(128) + ReLU           (batch, 128, 112, 112)
MaxPool2d(2, 2)               (batch, 128, 56, 56)

# Block 3 (3 conv layers)
Conv3x3(256) + ReLU           (batch, 256, 56, 56)
Conv3x3(256) + ReLU           (batch, 256, 56, 56)
Conv3x3(256) + ReLU           (batch, 256, 56, 56)
MaxPool2d(2, 2)               (batch, 256, 28, 28)

# Block 4 (3 conv layers)
Conv3x3(512) + ReLU           (batch, 512, 28, 28)
Conv3x3(512) + ReLU           (batch, 512, 28, 28)
Conv3x3(512) + ReLU           (batch, 512, 28, 28)
MaxPool2d(2, 2)               (batch, 512, 14, 14)

# Block 5 (3 conv layers)
Conv3x3(512) + ReLU           (batch, 512, 14, 14)
Conv3x3(512) + ReLU           (batch, 512, 14, 14)
Conv3x3(512) + ReLU           (batch, 512, 14, 14)
MaxPool2d(2, 2)               (batch, 512, 7, 7)

# Classifier
Flatten                       (batch, 25088)    # 512 * 7 * 7
Linear(25088, 4096) + ReLU    (batch, 4096)
Linear(4096, 4096) + ReLU     (batch, 4096)
Linear(4096, 1000)            (batch, 1000)
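The trace above can be regenerated mechanically, which is a handy check when modifying the architecture. The sketch below is a minimal, framework-free illustration, assuming a compact config list in which an integer means a 3×3 conv (stride 1, padding 1) with that many output channels and "M" means MaxPool2d(2, 2). The names `VGG16_CFG`, `out_size`, and `trace_vgg` are hypothetical, chosen for this example. Every layer uses the standard output-size rule floor((H + 2p - k) / s) + 1.

```python
# Hypothetical helper (not from any library): walk a VGG-style config and record shapes.
# Assumed config convention: int = Conv3x3(out_channels) + ReLU, "M" = MaxPool2d(2, 2).
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a conv/pool layer: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_vgg(cfg, in_channels=3, size=224, num_classes=1000):
    """Return (layer_name, shape) pairs; the string "B" stands in for the batch dimension."""
    shapes = [("Input", ("B", in_channels, size, size))]
    channels = in_channels
    for item in cfg:
        if item == "M":
            # 2x2 window, stride 2, no padding: halves height and width
            size = out_size(size, kernel=2, stride=2, padding=0)
            shapes.append(("MaxPool2d(2, 2)", ("B", channels, size, size)))
        else:
            # 3x3 kernel, stride 1, padding 1: spatial size is unchanged
            size = out_size(size, kernel=3, stride=1, padding=1)
            channels = item
            shapes.append((f"Conv3x3({item}) + ReLU", ("B", channels, size, size)))
    features = channels * size * size  # 512 * 7 * 7 for VGG-16 at 224x224
    shapes.append(("Flatten", ("B", features)))
    for in_f, out_f in [(features, 4096), (4096, 4096), (4096, num_classes)]:
        shapes.append((f"Linear({in_f}, {out_f})", ("B", out_f)))
    return shapes


for name, shape in trace_vgg(VGG16_CFG):
    print(f"{name:<22} {shape}")
```

Changing the input resolution (the `size` argument) shows immediately how the flattened width, and therefore the first FC layer, would change.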
Key Observations
- All conv layers use 3×3 kernels with stride 1 and padding=1, which preserves spatial size: (H + 2·1 - 3)/1 + 1 = H
- Spatial reduction comes only from MaxPool2d(2, 2): 224 → 112 → 56 → 28 → 14 → 7
- Channels double from block to block until they reach 512, then stay there: 64 → 128 → 256 → 512 → 512
- The Flatten produces 512 × 7 × 7 = 25,088 features
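- The FC layers dominate the parameter count: the first FC alone has 25,088 × 4,096 = 102,760,448 weights (≈102.8M; 102,764,544 including biases)
- Total: ~138.4M parameters (138,357,544), of which ~123.6M (about 89%) sit in the three FC layers and only ~14.7M in the 13 conv layers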
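The parameter figures above follow from two per-layer formulas: a 3×3 conv from C_in to C_out channels has 3·3·C_in·C_out weights plus C_out biases, and a Linear(n_in, n_out) layer has n_in·n_out weights plus n_out biases; max pooling has none. The sketch below applies them, assuming every layer has a bias (as in the standard VGG-16) and reusing the same hypothetical config-list convention; `count_params` is an illustrative name, not a library function.

```python
# Hypothetical helper: count VGG-16 parameters (weights + biases), split into conv and FC.
# Assumed config convention: int = Conv3x3(out_channels), "M" = MaxPool2d(2, 2).
VGG16_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
             512, 512, 512, "M", 512, 512, 512, "M"]


def count_params(cfg, in_channels=3, num_classes=1000, final_size=7):
    conv = 0
    channels = in_channels
    for item in cfg:
        if item == "M":
            continue  # max pooling has no learnable parameters
        conv += 3 * 3 * channels * item + item  # kernel weights + one bias per output channel
        channels = item
    fc = 0
    flat = channels * final_size * final_size  # 512 * 7 * 7 = 25,088
    for n_in, n_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        fc += n_in * n_out + n_out  # weight matrix + bias vector
    return conv, fc


conv, fc = count_params(VGG16_CFG)
total = conv + fc
print(f"conv: {conv:,}  fc: {fc:,}  total: {total:,}  fc share: {fc / total:.1%}")
```

This should report 14,714,688 conv parameters and 123,642,856 FC parameters, for the 138,357,544 total quoted above; the first FC layer alone accounts for roughly three quarters of it.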