Transformer KV Cache VRAM Estimator

Calculate the GPU memory a transformer's key-value cache consumes during autoregressive inference. Enter your model dimensions, context length, batch size, and precision to size deployments before you load a single weight.

Number of layers (L)

KV heads (n_kv)

Use the grouped-query value (e.g. 8) for GQA models.

Head dimension (d_head)

Sequence length (tokens)

Batch size (B)

Cache dtype

Bytes per token

Total KV cache

Per-layer cache

Cached elements

How the KV cache size is computed

During generation a transformer stores one key vector and one value vector per KV head for every past token, at every layer, so it never recomputes the key and value projections for the prefix. Together these stored tensors form the KV cache, which grows linearly with context length. The calculator uses the exact element count below:

bytes = 2 × L × n_kv × d_head × seq_len × batch × bytes_per_element

The leading 2 accounts for both the key and the value tensors. L is the layer count, n_kv the number of key/value heads, and d_head the per-head width. The product n_kv × d_head is the KV hidden size, which is smaller than the model hidden size whenever grouped-query or multi-query attention is used. Multiply by sequence length and batch to get the total number of cached elements, then by the dtype width (2 bytes for FP16/BF16, 1 for FP8/INT8) to reach bytes.
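The formula above can be sketched in a few lines of Python. The function name `kv_cache_bytes` and the example dimensions (32 layers, 8 KV heads, head dimension 128, 8,192 tokens) are illustrative assumptions, not values taken from the calculator itself.

```python
def kv_cache_bytes(layers, n_kv, d_head, seq_len, batch=1, bytes_per_element=2):
    """KV cache size in bytes: 2 * L * n_kv * d_head * seq_len * batch * dtype width."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    elements = 2 * layers * n_kv * d_head * seq_len * batch
    return elements * bytes_per_element


# Hypothetical GQA model: 32 layers, 8 KV heads, d_head = 128, FP16 cache.
total = kv_cache_bytes(layers=32, n_kv=8, d_head=128, seq_len=8192)
print(f"{total / 2**30:.2f} GiB")  # prints "1.00 GiB"
```

With these assumed dimensions a single 8,192-token sequence needs exactly 1 GiB of cache in FP16. Setting batch to 2 doubles that.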

This is where the calculator adds information most overview pages omit: KV cache scales with n_kv, not the full attention-head count. A 32-head model retrofitted to 8 KV heads via GQA cuts cache memory by 4× with no change to layers or head dim, which is why long-context models adopt it. The output also reports bytes per token, the cache cost each additional token adds to a sequence. Multiply it by your target max context to forecast the worst-case footprint per sequence, then compare against free VRAM after weights and activations to find your true concurrency ceiling.
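As a rough illustration of the GQA saving and the concurrency estimate, the sketch below computes bytes per token for a 32-KV-head and an 8-KV-head configuration, then divides an assumed free-VRAM budget by the worst-case cache per sequence. The helper names, the 20 GiB budget, and the model dimensions are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def kv_bytes_per_token(layers, n_kv, d_head, bytes_per_element=2):
    # Cache cost of one token in one sequence: keys + values across all layers.
    return 2 * layers * n_kv * d_head * bytes_per_element


def max_concurrent_sequences(free_vram_bytes, per_token_bytes, max_context):
    # Worst case: every sequence fills its whole context window.
    return free_vram_bytes // (per_token_bytes * max_context)


mha = kv_bytes_per_token(layers=32, n_kv=32, d_head=128)  # one KV head per query head
gqa = kv_bytes_per_token(layers=32, n_kv=8, d_head=128)   # 8 shared KV heads
print(mha, gqa, mha // gqa)  # prints "524288 131072 4"

free_vram = 20 * 2**30  # assumed 20 GiB left after weights and activations
print(max_concurrent_sequences(free_vram, gqa, max_context=8192))  # prints "20"
```

Under these assumptions each full-context sequence costs 1 GiB with 8 KV heads, so 20 sequences fit. With 32 KV heads the same budget would hold only 5.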

Quantizing the cache to FP8 halves the footprint versus FP16, and INT4 (0.5 bytes per element) quarters it, ignoring the small overhead of quantization scales, at some accuracy cost on long sequences. Because the formula is linear, doubling batch size or doubling context exactly doubles the cache, making capacity planning a single multiplication once you know bytes per token.
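The effect of cache precision can be checked with a small lookup of element widths. This is a sketch under stated assumptions: the `DTYPE_BYTES` table and `kv_cache_gib` helper are hypothetical names, the model dimensions are illustrative, and INT4 is treated as exactly half a byte per element with no allowance for scale or zero-point storage.

```python
# Bytes per cached element. INT4 packs two values per byte; quantization
# scale/zero-point overhead is deliberately ignored (simplifying assumption).
DTYPE_BYTES = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def kv_cache_gib(layers, n_kv, d_head, seq_len, batch, dtype):
    # Same linear formula as above, reported in GiB.
    elements = 2 * layers * n_kv * d_head * seq_len * batch
    return elements * DTYPE_BYTES[dtype] / 2**30


for dtype in ("fp16", "fp8", "int4"):
    size = kv_cache_gib(layers=32, n_kv=8, d_head=128, seq_len=8192, batch=4, dtype=dtype)
    print(dtype, size)
```

For this assumed configuration the loop reports 4.0, 2.0, and 1.0 GiB, matching the halving and quartering described above.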

Lever	Effect on KV cache
Halve KV heads (GQA)	Halves cache
Double context length	Doubles cache
FP16 → FP8	Halves cache
Double batch	Doubles cache
