Mar 8, 2025 - 6 min

What is KV-Cache?

LLMs

If you have been reading articles on LLMs, you have probably come across the term KV-Cache, along with all sorts of tricks developers use to speed up LLMs. That is what we are going to do today: talk about KV-Cache and understand it in detail!

Before we talk about KV-Cache, we need to talk about LLMs and attention. Let’s get cracking…

Next Token Generation in LLMs

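In LLMs, when we generate a new token, we look at the preceding tokens through self-attention. Self-attention is defined as

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \]

where \(d_k\) is the dimension of the key vectors.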

How do LLMs compute Q, K, and V?

The calculation for \(Q\), \(K\), and \(V\) is:

\[ Q = XW_Q, \quad K = XW_K, \quad V = XW_V \]

Here \(X\) is the context, that is, the words generated so far. The model forms this matrix by stacking the word/token embeddings on top of one another. Every time the LLM generates a word, it adds it to this stack and re-runs the computation to get the query (\(Q\)), key (\(K\)) and value (\(V\)) matrices. This happens for all the attention heads across all the layers. In GPT-3, for example, we have 96 layers, and each layer has 96 attention heads. This means we must compute the \(Q\), \(K\), and \(V\) matrices for every token in each head of every layer, making this operation computationally expensive.
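To make the projections and the attention formula concrete, here is a minimal NumPy sketch of a single attention head. The dimensions, the random weights, and the causal mask (each token only attends to itself and earlier tokens, as in decoder-style LLMs) are assumptions for illustration and are not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Single-head causal self-attention over X of shape (n_tokens, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Assumed causal mask: token i may not attend to tokens after i.
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                      # toy sizes, chosen arbitrarily
X = rng.normal(size=(3, d_model))        # three stacked token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)   # shape (3, d_k)
```

Note that every call recomputes Q, K, and V for all rows of X, which is exactly the repeated work discussed next.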

This is where developers noticed that we could reuse the previous calculations to speed up the process. Let us take an example where an LLM has generated the word "The" and look at the calculations.

image showing the matrix calculation of the Key transformation for "The" in LLMs while discussing KV-Cache

In the above image, we compute the Key (\(K\)) transformation for the word "The" by multiplying its embedding with the weight matrix \(W_K\). The multiplication projects the word embedding into the Key space, which the model uses for computing attention. Assume we generate the word "cat" next. Our Key computation now includes both "The" and "cat", because \(X\) now consists of the embeddings for both words.

image showing the matrix calculation of the Key transformation for "The" and "cat" words in LLMs while discussing KV-Cache

When we perform this step, we recompute the Key transformation for "The" along with the newly generated word. The model repeats this process across all attention heads in every layer. Given that large-scale models have many layers and numerous attention heads per layer, these redundant recalculations quickly become computationally expensive. Enter our star: KV-Cache!
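The redundancy is easy to see in a small sketch. The embeddings and the weight matrix below are random stand-ins (assumptions, not real model weights); the point is only that the Key row for "The" comes out identical in both steps, so recomputing it is wasted work.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                       # toy sizes, chosen arbitrarily
W_K = rng.normal(size=(d_model, d_k))     # hypothetical Key weight matrix
emb_the = rng.normal(size=d_model)        # stand-in embedding for "The"
emb_cat = rng.normal(size=d_model)        # stand-in embedding for "cat"

# Step 1: the context contains only "The".
X_step1 = np.stack([emb_the])
K_step1 = X_step1 @ W_K

# Step 2: the context contains "The" and "cat"; K is recomputed from scratch.
X_step2 = np.stack([emb_the, emb_cat])
K_step2 = X_step2 @ W_K

# The Key row for "The" is the same in both steps.
redundant = np.allclose(K_step1[0], K_step2[0])
```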

How does KV-Cache Work?

A visual representation of how KV-Cache works in a transformer model during inference. The image shows: A new query vector (labeled "Query new" in green), representing the latest token being processed. A key matrix (labeled "Key" in blue), where previously stored key vectors ("Key old") are stacked above the newly computed key vector ("Key new"). A value matrix (labeled "Value" in red), where previously stored value vectors ("Value old") are stacked above the newly added value vector ("Value new"). This diagram illustrates how new Key and Value vectors are appended to the existing KV-Cache, while the Query vector is recomputed dynamically for each new token during inference.

To predict the next word, LLMs rely on the preceding words. This process requires computing the \(Q\), \(K\), and \(V\) matrices. Without caching, we would recalculate these matrices each time the model generates a new word, making the process computationally expensive. So, to bypass all the recomputations, we simply store the \(K\) and \(V\) values for the preceding tokens in something called the KV-Cache.

To get an intuition, think of KV-Cache as locker-room storage that allows us to retrieve the \(K\) and \(V\) matrices/vectors for past tokens. The system organizes it as an array indexed by layer, head, and token position. Since each transformer layer has its own set of attention heads, we must store the \(K\) and \(V\) values for each layer and head. This significantly speeds up the inference of LLMs. Each new token appends its \(K\) and \(V\) vectors to the cache, while older tokens remain stored for retrieval.
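The layer/head/position layout can be sketched as a small Python class. This is a toy illustration under stated assumptions: the class name `KVCache` and its methods are hypothetical, and it uses growing Python lists for clarity, whereas production inference code would more likely preallocate tensors on the GPU.

```python
import numpy as np

class KVCache:
    """Toy cache holding one growing list of K and V vectors per (layer, head)."""

    def __init__(self, n_layers, n_heads):
        self.keys = [[[] for _ in range(n_heads)] for _ in range(n_layers)]
        self.values = [[[] for _ in range(n_heads)] for _ in range(n_layers)]

    def append(self, layer, head, k, v):
        # Store the new token's key and value vectors at the next position.
        self.keys[layer][head].append(k)
        self.values[layer][head].append(v)

    def get(self, layer, head):
        # Return stacked (n_tokens, d_k) matrices for all cached positions.
        return (np.stack(self.keys[layer][head]),
                np.stack(self.values[layer][head]))

rng = np.random.default_rng(0)
n_layers, n_heads, d_k = 2, 2, 4          # toy sizes, chosen arbitrarily
cache = KVCache(n_layers, n_heads)

for position in range(3):                 # three generated tokens
    for layer in range(n_layers):
        for head in range(n_heads):
            # Random vectors stand in for the real K and V projections.
            cache.append(layer, head, rng.normal(size=d_k), rng.normal(size=d_k))

K, V = cache.get(layer=1, head=0)         # each has shape (3, d_k)
```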

The model caches KV data in the GPU’s VRAM to speed up inference.

It’s important to note a tradeoff: large KV-Caches require more VRAM. If you want to use a model with a 32K+ token context size, you likely need to run it on a high-end GPU such as the NVIDIA A100 or H100, which have the necessary VRAM for the KV-Cache. Most consumer GPUs lack the VRAM required to store such large KV-Caches. The KV-Cache of a 32K-context model can easily swell to 48GB of VRAM (depending on the architecture of the LLM).
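A back-of-the-envelope estimate shows where the memory goes. The sketch below assumes standard multi-head attention with one K and one V vector per token, per head, per layer, stored in 16-bit precision. The function name and the model configuration (32 layers, 32 heads, head dimension 128) are hypothetical and do not describe any model named in this post.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_value=2, batch_size=1):
    """Estimated KV-Cache size in bytes; the factor 2 covers K and V."""
    return (2 * n_layers * n_kv_heads * head_dim
            * n_tokens * bytes_per_value * batch_size)

# Hypothetical model configuration with a 32K-token context in fp16.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                      n_tokens=32 * 1024)
size_gib = size / 2**30   # 16.0 GiB for this configuration
```

Larger models, bigger batches, or longer contexts scale this figure linearly, which is how the cache can grow to tens of gigabytes.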

Why not store Q in the KV-Cache?

Attention requires three matrices: \(Q\), \(K\), and \(V\). We have talked about caching the \(K\) and \(V\) matrices to speed up computation and inference. So why not go ahead and store the \(Q\) matrices as well and drive down latency even more?

Well, because we don’t reuse any of the older \(Q\) values. Whenever the model generates a new token, it projects that token into the query space, and we cannot get out of doing that for the new token. The model computes the Query (\(Q\)) from the latest token only. \(K\) and \(V\) remain unchanged for past tokens, which is what allows us to store them in the KV-Cache. Storing \(Q\) gives no added advantage.
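The sketch below illustrates this for a single attention head: at each step only the newest token's query is computed, the new key and value rows are appended to the cache, and the result for the last token matches a full recomputation. The sizes and random weights are assumptions for illustration only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                        # toy sizes, chosen arbitrarily
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
X = rng.normal(size=(5, d_model))          # five tokens, fed one at a time

K_cache = np.empty((0, d_k))
V_cache = np.empty((0, d_k))
cached_outputs = []
for x_new in X:
    q_new = x_new @ W_Q                               # query: new token only
    K_cache = np.vstack([K_cache, x_new @ W_K])       # append the new key row
    V_cache = np.vstack([V_cache, x_new @ W_V])       # append the new value row
    weights = softmax(K_cache @ q_new / np.sqrt(d_k))
    cached_outputs.append(weights @ V_cache)

# Reference: recompute everything from scratch for the last token.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
ref_last = softmax(K @ Q[-1] / np.sqrt(d_k)) @ V
matches = np.allclose(cached_outputs[-1], ref_last)
```

Earlier queries never appear in the loop after their own step, which is why caching them would only cost memory.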

Conclusion and Takeaways

  • KV-Cache plays a crucial role in optimizing the efficiency of Large Language Models (LLMs) by eliminating redundant computations during token generation. By storing previously computed Key (\(K\)) and Value (\(V\)) matrices, models can significantly reduce inference latency, making real-time applications of LLMs more feasible.

  • However, this optimization comes with trade-offs: larger context windows require substantial VRAM to store the KV-Cache. This is why models with 32K+ token contexts demand high-end GPUs like the A100 or H100.

  • The model stores the \(K\) and \(V\) values in the KV-Cache. \(Q\) is not stored because the model must compute it fresh for every new token. Storing \(Q\) would provide no performance gains and would unnecessarily increase memory usage.

Further Reading

If you liked this post, you might enjoy reading my post on fine-tuning embedding models.