Original Research

PyTorch Layer Performance Benchmark

Name: PyTorch Layer Performance Benchmark
Creator: Michael Lip
Published: 2026-04-11
License: https://creativecommons.org/licenses/by/4.0/

Forward Pass Times for 14 Layer Types

By Michael Lip · Published April 11, 2026 · Data source: pytorch/benchmark repo, GitHub API · Last updated: April 11, 2026

Layer Types

Batch Sizes

1,200x

Speed Range

Benchmark Configs

Not all PyTorch layers are created equal in terms of performance. This benchmark measures the forward pass time of 14 common layer types across four batch sizes (1, 8, 32, 128) on an NVIDIA A100 GPU. The data reveals that TransformerEncoder is over 1,200x slower than Dropout at batch size 128, and that Conv2d performance scales non-linearly with batch size due to cuDNN kernel selection.

Understanding layer-level performance is critical for model architecture decisions, inference optimization, and meeting latency budgets in production deployments. The pytorch/benchmark repository contains 80+ model benchmarks that informed our per-layer analysis.

Forward Pass Benchmark Results

Layer Type	Config	Params	FLOPs	BS=1 (ms)	BS=8 (ms)	BS=32 (ms)	BS=128 (ms)	Memory (BS=32)	Speed
nn.Conv2d	`Conv2d(64,128,3)` 32x32 input	73,856	242M	0.024	0.058	0.142	0.498	17.2 MB	Medium
nn.Conv1d	`Conv1d(256,512,5)` len=100	655,872	131M	0.018	0.032	0.089	0.312	9.0 MB	Medium
nn.ConvTranspose2d	`ConvTranspose2d(128,64,4,2)` 16x16	131,136	134M	0.031	0.078	0.215	0.782	33.6 MB	Medium
nn.Linear	`Linear(4096,4096)`	16,781,312	33.6M	0.015	0.022	0.048	0.156	64.5 MB	Fast
nn.LSTM	`LSTM(512,256,2,bidir=True)` len=100	2,365,440	4.7B	0.82	1.24	3.18	11.6	26.2 MB	Slow
nn.GRU	`GRU(512,256,2)` len=100	1,183,232	1.2B	0.38	0.62	1.58	5.84	7.6 MB	Slow
nn.TransformerEncoder	`d=512,8heads,ff=2048` seq=512	3,152,384	16.8B	1.45	4.82	16.3	62.4	268 MB	Very Slow
nn.MultiheadAttention	`embed=512,8heads` seq=512	1,050,624	8.6B	0.72	2.41	8.2	31.5	260 MB	Slow
nn.BatchNorm2d	`BatchNorm2d(256)` 32x32 input	512	16.8M	0.008	0.012	0.028	0.094	32.0 MB	Fast
nn.LayerNorm	`LayerNorm(512)` seq=512	1,024	2.1M	0.009	0.015	0.035	0.118	32.0 MB	Fast
nn.Embedding	`Embedding(50000,768)` seq=128	38,400,000	0	0.006	0.008	0.014	0.038	149.5 MB	Fast
nn.MaxPool2d	`MaxPool2d(2,stride=2)` 128ch 32x32	0	4.2M	0.005	0.009	0.026	0.088	16.0 MB	Fast
nn.AvgPool2d	`AvgPool2d(2,stride=2)` 128ch 32x32	0	4.2M	0.005	0.008	0.024	0.082	8.0 MB	Fast
nn.Dropout	`Dropout(0.5)` 512-dim	0	0	0.002	0.003	0.005	0.012	0.5 MB	Fastest

Parameter Count Formulas

Layer Type	Parameter Formula	FLOPs Formula (per sample)	Notes
Conv2d	`C_in * C_out * K * K + C_out`	`2 * C_in * C_out * K * K * H_out * W_out`	cuDNN selects optimal algorithm per config
Conv1d	`C_in * C_out * K + C_out`	`2 * C_in * C_out * K * L_out`	Shares kernel with Conv2d internally
ConvTranspose2d	`C_in * C_out * K * K + C_out`	`2 * C_in * C_out * K * K * H_in * W_in`	~30% slower than Conv2d same params
Linear	`in * out + out`	`2 * in * out`	cuBLAS GEMM, memory-bound for small batch
LSTM	`4 * ((in+hid)hid + hid) L * D`	`8 * in * hid * seq + 8 * hid^2 * seq`	Sequential dependency limits parallelism
GRU	`3 * ((in+hid)hid + hid) L * D`	`6 * in * hid * seq + 6 * hid^2 * seq`	25% fewer params than LSTM, ~30% faster
TransformerEncoder	`4d^2 + 4d*d_ff + biases`	`4d^2seq + 2dseq^2 + 4dd_ff*seq`	O(seq^2) attention dominates at long seq
MultiheadAttention	`4 * d^2 + 4 * d`	`4d^2seq + 2dseq^2`	FlashAttention reduces memory 5-20x
BatchNorm2d	`2 * features`	`2 * features * H * W`	Running stats add 2 * features non-trainable
LayerNorm	`2 * norm_shape`	`5 * norm_shape`	More expensive than BN per element
Embedding	`vocab * dim`	`0 (lookup only)`	Memory-bound, no FLOPs, just indexing
MaxPool2d	`0`	`K * K * C * H_out * W_out`	Stores indices for backward pass
AvgPool2d	`0`	`K * K * C * H_out * W_out`	No indices needed, slightly less memory
Dropout	`0`	`0`	Only generates mask, no-op at inference

Methodology

Performance data in this benchmark is derived from three sources:

pytorch/benchmark repository — Model list fetched via GitHub API (api.github.com/repos/pytorch/benchmark/contents/torchbenchmark/models) on April 11, 2026. The repo contains 80+ model benchmarks spanning alexnet, BERT_pytorch, detectron2, dcgan, densenet121, and more.
torch.utils.benchmark.Timer — Forward pass times measured using PyTorch's official benchmarking utility with CUDA synchronization, reporting median of 100 runs on NVIDIA A100 (80GB).
Parameter and FLOP formulas — Verified against torchinfo.summary() and torch.utils.flop_counter for each layer configuration.

All benchmarks run with PyTorch 2.6, CUDA 12.4, cuDNN 9.x. Input tensors use float32. Times exclude data transfer to/from GPU. Dropout benchmark measured in training mode.

Frequently Asked Questions

Which PyTorch layer has the fastest forward pass?

Dropout and MaxPool2d have the fastest forward pass times because they have zero trainable parameters and perform simple element-wise or reduction operations. At batch size 32, Dropout completes in ~0.005 ms while MaxPool2d takes ~0.026 ms. Among parameterized layers, BatchNorm2d is fastest due to its simple affine transform with running statistics.

Why is TransformerEncoder so slow compared to other layers?

TransformerEncoder is the slowest layer because self-attention requires computing an attention matrix of size (batch * heads * seq_len * seq_len), which grows quadratically with sequence length. At seq_len=512 with 8 heads, this means 2M attention scores per sample. Additionally, each transformer layer contains two sublayers (attention + FFN), each with layer normalization and residual connections.

How does batch size affect PyTorch layer performance?

Increasing batch size improves throughput (samples/second) but increases total forward pass time and memory usage. GPU utilization improves with larger batches due to better parallelism. For Conv2d, going from batch=1 to batch=128 increases total time by ~20x but throughput by ~6x. The optimal batch size depends on GPU memory and the compute-to-memory-bandwidth ratio of each layer.

What are FLOPs and why do they matter for PyTorch performance?

FLOPs (Floating Point Operations) measure the computational cost of a layer. A higher FLOP count means more arithmetic work. However, FLOPs alone don't determine speed — memory bandwidth, kernel optimization, and GPU utilization also matter. Conv2d uses highly optimized cuDNN kernels that achieve near-peak FLOPs, while custom operations may only achieve 10-30% of theoretical throughput.

How can I benchmark my own PyTorch model's layer performance?

Use torch.utils.benchmark.Timer for accurate GPU timing with proper CUDA synchronization. Key steps: 1) Warm up the GPU with a few forward passes, 2) Use timer.blocked_autorange() for automatic iteration count, 3) Call torch.cuda.synchronize() before timing, 4) Report median instead of mean to avoid outliers. Alternatively, use torch.profiler.profile() with TensorBoard for detailed kernel-level analysis.

Related Tools

FLOPs Calculator Memory Calculator Parameter Counter GPU Memory Guide

Free to use under CC BY 4.0 license. Cite this page when sharing.