Calculate the GPU memory a transformer's key-value cache consumes during autoregressive inference. Enter your model dimensions, context length, batch size, and precision to size deployments before you load a single weight.
During generation a transformer stores one key vector and one value vector for every past token, at every layer, so it never recomputes the keys and values of the prefix. That stored tensor is the KV cache, and it grows linearly with context length. The calculator uses the exact formula below:
KV cache bytes = 2 × L × n_kv × d_head × seq_len × batch × bytes_per_element
The leading 2 accounts for both the key and the value tensors. L is the layer count, n_kv the number of key/value heads, and d_head the per-head width. Their product n_kv × d_head is the KV hidden size, which is smaller than the model hidden size whenever grouped-query or multi-query attention is used. Multiply by sequence length and batch size to get the total number of cached elements, then by the dtype width (2 bytes for FP16/BF16, 1 for FP8/INT8) to reach bytes.
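As a minimal sketch of the arithmetic just described, the Python function below multiplies the terms in the same order. The function name `kv_cache_bytes`, its parameter names, and the example configuration (32 layers, 8 KV heads, head width 128, 4,096 tokens, batch 1, FP16) are illustrative assumptions, not the calculator's actual code.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_element=2):
    """Return the KV cache size in bytes.

    The leading 2 covers the key tensor and the value tensor.
    bytes_per_element is the dtype width: 2 for FP16/BF16, 1 for FP8/INT8.
    """
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return elements * bytes_per_element


# Hypothetical model: 32 layers, 8 KV heads, head width 128,
# 4,096-token context, batch size 1, FP16 cache.
size = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                      seq_len=4096, batch_size=1, bytes_per_element=2)
print(f"{size / 2**30:.2f} GiB")  # prints "0.50 GiB"
```

Keeping the dtype width as the last factor makes it easy to swap precisions without touching the element count.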
This is where the calculator adds information most overview pages omit: KV cache scales with n_kv, not the full attention-head count. A 32-head model retrofitted to 8 KV heads via GQA cuts cache memory by 4× with no change to layers or head dim, which is precisely why long-context models adopt it. The output also reports bytes per token, the cache cost of every additional token in a sequence. Multiply it by your target max context to forecast the worst-case footprint per sequence, then divide the VRAM left free after weights and activations by that footprint to find your true concurrency ceiling.
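The bytes-per-token figure and the concurrency ceiling can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the helper names `kv_bytes_per_token` and `max_concurrent_sequences`, the 32-layer model with head width 128, the 8,192-token max context, and the 20 GiB of free VRAM.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Cache bytes added by one token of one sequence (keys plus values)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_concurrent_sequences(free_vram_bytes, bytes_per_token, max_context):
    """How many full-length sequences fit in the VRAM left for the cache."""
    worst_case_per_sequence = bytes_per_token * max_context
    return free_vram_bytes // worst_case_per_sequence


# Same hypothetical 32-layer model with all 32 heads cached (MHA)
# versus only 8 KV heads cached (GQA), both in FP16.
mha_per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
gqa_per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(mha_per_token // gqa_per_token)  # prints 4

# Assumed budget: 20 GiB free after weights and activations.
free_vram = 20 * 2**30
print(max_concurrent_sequences(free_vram, mha_per_token, 8192))  # prints 5
print(max_concurrent_sequences(free_vram, gqa_per_token, 8192))  # prints 20
```

Under these assumptions the 4× smaller cache translates directly into 4× more full-length sequences in the same memory budget.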
Quantizing the cache from FP16 to FP8 halves the footprint, and INT4 cuts it to a quarter, at some accuracy cost on long sequences. Because the formula is linear, doubling batch size or doubling context exactly doubles the cache, making capacity planning a single multiplication once you know bytes per token.
| Lever | Effect on KV cache |
|---|---|
| Halve KV heads (GQA) | Halves cache |
| Double context length | Doubles cache |
| FP16 → FP8 | Halves cache |
| Double batch | Doubles cache |
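The levers above can be checked numerically by changing one input at a time and comparing against a baseline. The sketch below assumes a hypothetical baseline (32 layers, 8 KV heads, head width 128, 4,096 tokens, batch 1, FP16) and a nominal width of 0.5 bytes per element for INT4. Real quantized caches may carry a small extra overhead for scale factors, which this sketch ignores.

```python
# Assumed nominal storage widths in bytes per element.
BYTES_PER_ELEMENT = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, dtype):
    """KV cache size in bytes for the given configuration."""
    elements = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return int(elements * BYTES_PER_ELEMENT[dtype])


# Hypothetical baseline configuration.
baseline = dict(num_layers=32, num_kv_heads=8, head_dim=128,
                seq_len=4096, batch_size=1, dtype="fp16")

# Each lever overrides exactly one baseline field.
levers = {
    "Halve KV heads (GQA)": {"num_kv_heads": 4},
    "Double context length": {"seq_len": 8192},
    "FP16 -> FP8": {"dtype": "fp8"},
    "FP16 -> INT4": {"dtype": "int4"},
    "Double batch": {"batch_size": 2},
}

base = cache_bytes(**baseline)
ratios = {name: cache_bytes(**{**baseline, **change}) / base
          for name, change in levers.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:g}x")
```

Because every factor enters the formula as a plain product, the ratios do not depend on the baseline chosen.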