What Does Conv2d Output with 224×224 Input, Kernel 11, Stride 4?
Conv2d with 224×224 input, kernel_size=11, stride=4, padding=2 outputs 55×55. The formula gives: floor((224 + 2×2 - 11) / 4) + 1 = 55.
Formula Breakdown
With dilation=1 (the PyTorch default), the Conv2d output size along each spatial dimension is:
output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1
Plugging in the values for 224×224 input:
output = floor((224 - 11 + 2*2) / 4) + 1
output = floor((224 - 11 + 4) / 4) + 1
output = floor(217 / 4) + 1
output = floor(54.25) + 1
output = 55
So the spatial dimensions go from 224×224 to 55×55.
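The same arithmetic can be wrapped in a small helper. This is a minimal sketch rather than part of PyTorch: the name `conv_output_size` is hypothetical, and it assumes dilation=1 and the same padding on both sides of the input.

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Output length along one spatial dimension, assuming dilation=1."""
    # Integer floor division plays the role of floor() in the formula.
    return (input_size + 2 * padding - kernel_size) // stride + 1


size = conv_output_size(224, kernel_size=11, stride=4, padding=2)
print(f"{size}x{size}")  # 55x55
```

Because height and width are handled independently, the helper can be called once per dimension for non-square inputs.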
PyTorch Code Example
import torch
import torch.nn as nn
# Define the Conv2d layer
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=11, stride=4, padding=2)
# Create input tensor: (batch, channels, height, width)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(output.shape) # torch.Size([1, 64, 55, 55])
# Verify with formula
expected = (224 + 2 * 2 - 11) // 4 + 1
print(f"Expected output size: {expected}x{expected}") # 55x55
Architecture Context
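This configuration matches the first convolution layer of AlexNet as implemented in torchvision (the original paper used 96 filters in this layer rather than 64). The large 11×11 kernel with stride 4 aggressively reduces spatial dimensions from 224×224 to 55×55 in a single layer.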
Parameter Count
A Conv2d(3, 64, 11) layer has:
parameters = in_channels * out_channels * kernel_size^2 + out_channels (bias)
parameters = 3 * 64 * 11 * 11 + 64
parameters = 23,296
This layer has 23,296 trainable parameters (23,232 weights + 64 bias terms).
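The count can be reproduced in a few lines of plain Python. This is a sketch under stated assumptions: the helper name `conv2d_param_count` is hypothetical, and it assumes a square kernel and groups=1.

```python
def conv2d_param_count(in_channels: int, out_channels: int,
                       kernel_size: int, bias: bool = True) -> int:
    """Trainable parameters of a 2D convolution (square kernel, groups=1)."""
    # One kernel_size x kernel_size filter per (input, output) channel pair.
    weights = in_channels * out_channels * kernel_size * kernel_size
    # One bias term per output channel, if bias is enabled.
    biases = out_channels if bias else 0
    return weights + biases


total = conv2d_param_count(3, 64, 11)
print(total)  # 23296
```

In PyTorch, the same number should come out of `sum(p.numel() for p in conv.parameters())` for the `conv` layer defined earlier.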
Practical Tips
- Memory usage: The output feature map for a single image is 64 × 55 × 55 = 193,600 float values (774,400 bytes, or about 0.74 MiB, in float32).
- Batch dimension: Multiply memory by batch size. A batch of 32 uses about 23.6 MiB for this layer's output alone.
- Same padding rule: For any odd kernel_size, setting padding = (kernel_size - 1) / 2 with stride=1 and dilation=1 preserves spatial dimensions. For an even kernel_size, no symmetric integer padding does.
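The memory figures and the padding rule above can be checked with a short sketch. The helper names `feature_map_bytes` and `same_padding` are hypothetical; sizes are reported in MiB (2^20 bytes), and the padding helper assumes an odd kernel size with stride=1 and dilation=1.

```python
BYTES_PER_FLOAT32 = 4


def feature_map_bytes(channels: int, height: int, width: int,
                      batch_size: int = 1) -> int:
    """Bytes needed to store a float32 feature map of the given shape."""
    return batch_size * channels * height * width * BYTES_PER_FLOAT32


def same_padding(kernel_size: int) -> int:
    """Symmetric padding that preserves size for stride=1, dilation=1."""
    if kernel_size % 2 == 0:
        # 2 * padding would have to equal an odd number, which is impossible.
        raise ValueError("symmetric same-padding needs an odd kernel_size")
    return (kernel_size - 1) // 2


single = feature_map_bytes(64, 55, 55)
batch = feature_map_bytes(64, 55, 55, batch_size=32)
print(f"{single / 2**20:.2f} MiB")  # 0.74 MiB
print(f"{batch / 2**20:.1f} MiB")   # 23.6 MiB
print(same_padding(11))             # 5
```

These numbers cover only the stored output activations; they do not include the layer's weights or anything kept for the backward pass.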