Enter your image dimensions and patch size to instantly see the patch grid, sequence length, embedding tensor shape, and parameter count for a ViT patch-embedding layer.

How the patch embedding is computed

A Vision Transformer (ViT) cannot read raw pixels the way it reads tokens, so it first slices the image into a grid of non-overlapping square patches and projects each patch into a D-dimensional embedding vector. This calculator reproduces the exact arithmetic of nn.Conv2d(C, D, kernel_size=P, stride=P) — the standard PyTorch implementation of patch embedding — followed by a flatten and transpose.

The patch grid along each axis is the floor of the image size divided by the patch size:

n_h = floor(H / P)
n_w = floor(W / P)
num_patches = n_h × n_w
seq_len = num_patches + (1 if CLS else 0)
output shape = [B, seq_len, D]
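The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the calculator's actual source: the function name patch_embed_shape and its argument names are hypothetical, and it assumes square patches of side P with an optional CLS token.

```python
def patch_embed_shape(H, W, P, D, B=1, cls_token=True):
    """Return the patch grid, patch count, sequence length and output shape.

    Hypothetical helper: mirrors the floor-division arithmetic of a
    stride = kernel = P convolution with no padding.
    """
    if P <= 0:
        raise ValueError("patch size must be positive")
    n_h = H // P  # floor(H / P)
    n_w = W // P  # floor(W / P)
    num_patches = n_h * n_w
    seq_len = num_patches + (1 if cls_token else 0)
    return {
        "grid": (n_h, n_w),
        "num_patches": num_patches,
        "seq_len": seq_len,
        "output_shape": (B, seq_len, D),
    }


# ViT-Base/16 at 224px: 14 x 14 grid, 196 patches, 197 tokens with CLS.
info = patch_embed_shape(H=224, W=224, P=16, D=768)
print(info)
```

For the ViT-Base/16 settings shown, this reports a 14 × 14 grid, 196 patches, a sequence length of 197, and an output shape of (1, 197, 768).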

Because the convolution uses stride = kernel = P with no padding, any pixels that don't fill a complete patch are silently dropped. The calculator flags this divisibility remainder so you catch the off-by-a-few-pixels bug that quietly shrinks your effective resolution.
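One way to check for the divisibility remainder yourself is sketched below. The helper name dropped_pixels and the returned keys are assumptions made for illustration; the logic is just the modulo of each image side by the patch size.

```python
def dropped_pixels(H, W, P):
    """Report how many pixel rows/columns a stride-P, no-padding patchify drops.

    Hypothetical helper for illustration only.
    """
    rem_h, rem_w = H % P, W % P          # leftover rows / columns
    eff_h, eff_w = H - rem_h, W - rem_w  # resolution the model actually sees
    return {
        "remainder": (rem_h, rem_w),
        "effective_size": (eff_h, eff_w),
        "divisible": rem_h == 0 and rem_w == 0,
    }


# A 230 x 230 image with 16 x 16 patches loses 6 rows and 6 columns.
report = dropped_pixels(230, 230, 16)
print(report)
```

In this example the effective resolution is 224 × 224, so the last 6 pixels along each axis never reach the model.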

The projection weight is a Conv2d kernel of shape [D, C, P, P], so the learnable parameter count of the patch-embed layer is D × C × P × P + D (the trailing + D is the bias). Each patch contributes C × P × P floating-point values that get linearly mapped to D. For the classic ViT-Base/16 at 224px that is 3×16×16 = 768 inputs per patch, 196 patches, and a projection with 768 × 768 = 589,824 weights plus 768 biases, for 590,592 parameters in total. Prepending the CLS token and adding learnable position embeddings of shape [seq_len, D] gives the full input tensor the transformer encoder consumes.
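The parameter arithmetic can be reproduced with a short sketch. The function names patch_embed_params and token_params are hypothetical, and the sketch assumes a single learnable CLS vector of length D and position embeddings of shape [seq_len, D].

```python
def patch_embed_params(C, D, P, bias=True):
    """Parameters of a Conv2d(C, D, kernel_size=P, stride=P) projection."""
    weights = D * C * P * P       # kernel of shape [D, C, P, P]
    biases = D if bias else 0     # one bias per output channel
    return weights, biases, weights + biases


def token_params(seq_len, D, cls_token=True):
    """Parameters of the position embeddings and the optional CLS token."""
    pos = seq_len * D             # position embeddings of shape [seq_len, D]
    cls = D if cls_token else 0   # assumed: one learnable vector of length D
    return pos, cls


# ViT-Base/16 at 224px: C=3, D=768, P=16, seq_len=197 (196 patches + CLS).
weights, biases, total = patch_embed_params(C=3, D=768, P=16)
pos, cls = token_params(seq_len=197, D=768)
print(weights, biases, total)  # 589824 768 590592
print(pos, cls)                # 151296 768
```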

Use the live readout to size GPU memory (activation footprint scales with B × seq_len × D), to estimate attention cost, which is quadratic in seq_len, and to plan resolution changes: doubling the image side quadruples the patch count and therefore raises the attention-matrix FLOPs by roughly 16×, while the per-token projection and MLP FLOPs grow about 4×.
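A rough sketch of how attention cost grows with resolution follows. It is an estimate under stated assumptions, not a profiler: it counts only the two seq_len × seq_len matrix products (QKᵀ and the attention-weighted sum of V) at about 2 × seq_len² × D FLOPs each per layer, and ignores softmax, the linear projections, and the CLS token. The function names are hypothetical.

```python
def seq_len_for(side, P, cls_token=False):
    """Token count for a square image of the given side length."""
    n = side // P
    return n * n + (1 if cls_token else 0)


def attention_matrix_flops(seq_len, D):
    """Approximate per-layer FLOPs of QK^T plus attention-times-V.

    Assumption: each product costs about 2 * seq_len^2 * D FLOPs,
    summed over all heads; everything else is ignored.
    """
    return 4 * seq_len ** 2 * D


base = seq_len_for(224, 16)     # 196 patches
doubled = seq_len_for(448, 16)  # 784 patches

patch_ratio = doubled / base
flops_ratio = attention_matrix_flops(doubled, 768) / attention_matrix_flops(base, 768)
print(patch_ratio, flops_ratio)  # 4.0 16.0
```

With the CLS token included the ratios shift slightly (785 versus 197 tokens), but the quadratic trend is the same.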

Vision Transformer Patch Embedding Tensor Shapes
