Transformer KV Cache VRAM Estimator

Calculate the GPU memory a transformer's key-value cache consumes during autoregressive inference. Enter your model dimensions, context length, batch size, and precision to size deployments before you load a single weight.

Use grouped-query value (e.g. 8) for GQA models.
Bytes per token
Total KV cache
Per-layer cache
Cached elements

How the KV cache size is computed

During generation a transformer stores one key vector and one value vector for every past token, at every layer, so it never recomputes attention over the prefix. That stored tensor is the KV cache, and it grows linearly with context length. The calculator uses the exact element count below:

bytes = 2 × L × n_kv × d_head × seq_len × batch × bytes_per_element

The leading 2 accounts for both the key and the value tensors. L is the layer count, n_kv the number of key/value heads, and d_head the per-head width — their product n_kv × d_head equals the KV hidden size, which is smaller than the model hidden size whenever grouped-query or multi-query attention is used. Multiply by sequence length and batch to get total cached vectors, then by the dtype width (2 bytes for FP16/BF16, 1 for FP8/INT8) to reach bytes.

This is where the calculator adds information most overview pages omit: KV cache scales with n_kv, not the full attention-head count. A 32-head model retrofitted to 8 KV heads via GQA cuts cache memory by 4× with no change to layers or head dim, which is precisely why long-context models adopt it. The output also reports bytes per token (the cache cost of every additional generated token) — multiply it by your target max context to forecast the worst-case footprint, and compare against free VRAM after weights and activations to find your true concurrency ceiling.

Quantizing the cache to FP8 halves the footprint versus FP16 and INT4 quarters it, at some accuracy cost on long sequences. Because the formula is linear, doubling batch size or doubling context exactly doubles the cache, making capacity planning a single multiplication once you know bytes per token.

LeverEffect on KV cache
Halve KV heads (GQA)Halves cache
Double context lengthDoubles cache
FP16 → FP8Halves cache
Double batchDoubles cache