
Analyzing KV Cache Memory Use in LLMs with 32‑Layer Model Example

2 min read

Imagine a transformer churning through a prompt and, for each attention head, stashing key and value tensors so the next token doesn’t have to recompute everything. That stash, the KV cache, grows with layer depth, head count, head dimension, and batch size. In real setups we often bump into GPU memory ceilings once we push past a few hundred tokens.

If we know the exact footprint, we can decide whether to trim the context window, move some data to the CPU, or tweak the batch size. Below is a quick look at the four knobs that control that usage. Say we have a model with 32 layers, 32 heads per layer, a head dimension of 128, and a batch of one.

Most deployments use bigger batches, but this simple case gives a feel for the baseline demand. The math also hints why some services limit context windows to a few thousand tokens. Plug the numbers into the formula and you’ll see if your GPU can hold the cache without spilling into slower memory.

To understand how much memory the KV cache consumes for a given model, we need four variables. For our example, let's assume num_layers = 32, num_heads = 32, head_dim = 128, and batch_size = 1 (actual deployments usually run larger batches), with keys and values stored at 2 bytes each (FP16).

KV cache per token = 2 * num_layers * (num_heads * head_dim) * precision_in_bytes * batch_size

Why the 2? Because we store two matrices per token: K and V.

KV cache per token = 2 * 32 * (32 * 128) * 2 * 1 = 524288 B = 0.5 MB

So we need 0.5 MB to store K and V per token, across all layers and heads.
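
For a quick sanity check, here is a minimal Python sketch of that formula. The function name and default values are mine, chosen to mirror the article's example; it is an illustration, not a library API.

```python
# Minimal sketch of the per-token KV cache formula from the article.
# Defaults mirror the 32-layer example; the function name is hypothetical.
def kv_cache_bytes_per_token(num_layers=32, num_heads=32, head_dim=128,
                             precision_in_bytes=2, batch_size=1):
    # Factor of 2: we store both a K and a V entry for every token.
    return 2 * num_layers * num_heads * head_dim * precision_in_bytes * batch_size

per_token = kv_cache_bytes_per_token()
print(per_token, "bytes per token")               # 524288 B
print(per_token / (1024 ** 2), "MiB per token")   # 0.5 MiB
```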

Related Topics: #KV cache #LLM #transformer #attention head #GPU #context window #batch size #32-layer #head dimension

Those numbers tell the story. A model with 32 layers, 32 heads, and a 128-dimensional head already eats into a GPU’s memory at a batch of one once the KV cache grows past a few thousand tokens. Every new token adds a fresh set of key-value pairs in each layer, so memory use climbs linearly with sequence length, a direct consequence of the self-attention mechanism introduced in the 2017 “Attention Is All You Need” paper.
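
To make the linear growth concrete, here is a short sketch that scales the 0.5 MB-per-token figure by sequence length. The specific context lengths are illustrative assumptions, not values from the article.

```python
# Rough sketch of how the cache grows with sequence length, assuming the
# 0.5 MiB-per-token figure from the 32-layer example (batch size 1).
PER_TOKEN_BYTES = 524_288  # 0.5 MiB per token

for seq_len in (512, 2048, 4096, 8192):
    total_gib = PER_TOKEN_BYTES * seq_len / (1024 ** 3)
    print(f"{seq_len:>5} tokens -> {total_gib:.2f} GiB of KV cache")
# 512 -> 0.25 GiB, 2048 -> 1.00 GiB, 4096 -> 2.00 GiB, 8192 -> 4.00 GiB
```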

In the wild we usually see bigger batches, which means the actual memory hit is typically higher than this tiny example suggests. I’m not sure how much slack typical hardware has left for very long generations without pulling in off-loading tricks or quantization. The four knobs we highlighted, layer count, head count, head dimension and batch size, give a quick way to ballpark KV cache needs, but they leave out the precision format, and the cache sits alongside model weights and activations that also claim GPU memory.

Until we measure those extra pieces, the true ceiling for safe token lengths stays a bit fuzzy.

Common Questions Answered

How is the KV cache memory per token calculated for a 32‑layer model?

The KV cache per token is computed as 2 × num_layers × (num_heads × head_dim) × precision_in_bytes × batch_size. For the example with 32 layers, 32 heads, a head dimension of 128, batch size 1, and 2-byte (FP16) precision, this yields 524288 bytes, or about 0.5 MB, per token.

Why does the KV cache size grow linearly with sequence length?

Each new token generated adds its own key and value entries in every transformer layer. Consequently, as the sequence length increases, the total KV cache memory expands in direct proportion to the number of tokens, leading to linear growth.

What impact does a batch size greater than one have on KV cache memory usage?

Increasing the batch size multiplies the KV cache memory by the batch factor because the formula includes batch_size as a multiplier. Therefore, a batch of two would roughly double the memory consumption compared to a batch of one, quickly exhausting GPU memory limits.
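
A tiny sketch of that multiplier, reusing the 0.5 MB-per-token baseline and an assumed 4096-token context (both are illustrative, not measured values):

```python
# Batch size acts as a straight multiplier on the KV cache.
PER_TOKEN_BYTES = 524_288  # 0.5 MiB per token, from the 32-layer example
SEQ_LEN = 4096             # assumed context length for illustration

for batch_size in (1, 2, 4, 8):
    gib = PER_TOKEN_BYTES * SEQ_LEN * batch_size / (1024 ** 3)
    print(f"batch {batch_size}: {gib:.1f} GiB of KV cache")
# batch 1: 2.0 GiB, batch 2: 4.0 GiB, batch 4: 8.0 GiB, batch 8: 16.0 GiB
```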

Which factors besides the number of layers most significantly affect GPU memory consumption for KV caches?

The number of attention heads, the head dimension (head_dim), and the precision of stored tensors (precision_in_bytes) are the primary contributors alongside layer count. Larger head counts, higher head dimensions, and higher-precision formats all scale the per-token KV cache footprint proportionally; moving from FP16 to FP32, for example, doubles it.
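
To put rough numbers on the precision knob, here is a short sketch comparing assumed storage widths for the same 32-layer, 32-head, 128-dimension configuration (4 bytes for FP32, 2 for FP16/BF16, 1 for INT8):

```python
# Per-token KV cache cost under different storage precisions,
# for the 32-layer / 32-head / head_dim=128 example at batch size 1.
def per_token_bytes(precision_in_bytes):
    return 2 * 32 * (32 * 128) * precision_in_bytes

for name, nbytes in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1)]:
    print(f"{name}: {per_token_bytes(nbytes) / (1024 ** 2):.2f} MiB per token")
# FP32: 1.00 MiB, FP16/BF16: 0.50 MiB, INT8: 0.25 MiB
```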