What Does Conv2d Output with 224×224 Input, Kernel 7, Stride 2?
Conv2d with 224×224 input, kernel_size=7, stride=2, padding=3 outputs 112×112. This is ResNet's first conv layer. The stride=2 halves the spatial dimensions.
Formula Breakdown
output_size = floor((input - kernel + 2*padding) / stride) + 1
Plugging in the values:
output = floor((224 - 7 + 2*3) / 2) + 1
output = floor((224 - 7 + 6) / 2) + 1
output = floor(223 / 2) + 1
output = 111 + 1
output = 112
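The arithmetic above can be wrapped in a small helper for checking other layer configurations. This is a sketch; the function name `conv2d_out` is my own, not a library API:

```python
def conv2d_out(size, kernel, stride=1, padding=0):
    # output_size = floor((input - kernel + 2*padding) / stride) + 1
    return (size - kernel + 2 * padding) // stride + 1

print(conv2d_out(224, 7, stride=2, padding=3))  # 112
```

The same formula applies per-dimension for non-square inputs: call it once for height and once for width.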
PyTorch Code
import torch
import torch.nn as nn
# This is exactly ResNet's conv1 layer
conv = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
x = torch.randn(1, 3, 224, 224)
output = conv(x)
print(output.shape) # torch.Size([1, 64, 112, 112])
Why ResNet Uses This
ResNet uses a large 7×7 kernel with stride=2 as its first layer to quickly reduce spatial dimensions from 224×224 to 112×112 while capturing features over a large receptive field. This is followed by a MaxPool2d(kernel_size=3, stride=2, padding=1), which further reduces the feature map to 56×56 (note the padding=1 — without it the output would be 55×55). The aggressive early downsampling keeps the computational cost manageable for the deeper residual blocks that follow.
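The two downsampling steps can be verified together. Below is a sketch of the conv-plus-maxpool stem, assuming torchvision's standard ResNet configuration (the BatchNorm and ReLU between the two layers are omitted here since they don't change the spatial shape):

```python
import torch
import torch.nn as nn

# Sketch of ResNet's stem: conv1 (224 -> 112) then maxpool (112 -> 56).
# Note the maxpool's padding=1, which is required to land on 56×56.
stem = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
)
x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)  # torch.Size([1, 64, 56, 56])
```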