\section{Introduction}\label{sec:intro}
Transformer-based LLMs with long-context capabilities have significantly enhanced real-world applications, including long-document analysis and personalized conversational agents \citep{gpt4,llama3,gemma3}. However, increasing the context length substantially raises both the memory consumed by KV caching and the computational cost of attention \citep{vllm}. For example, caching 120K tokens in Qwen2.5-14B with FP16 precision requires approximately 33 GB of memory, surpassing the 28 GB needed to store the model's parameters at the same precision \citep{qwen}.
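
As a rough guide to where such figures come from, the size of a standard KV cache can be estimated as follows. This is a generic back-of-the-envelope sketch rather than a formula taken from the cited works, and the symbols are notation assumed only for this estimate: $n$ is the number of cached tokens, $L$ the number of layers, $H_{\mathrm{kv}}$ the number of KV heads per layer, $d$ the head dimension, and $b$ the number of bytes per element ($b = 2$ for FP16).
\[
M_{\mathrm{KV}} \approx 2 \cdot n \cdot L \cdot H_{\mathrm{kv}} \cdot d \cdot b \quad \mathrm{bytes},
\]
where the factor of 2 accounts for storing both keys and values. The cache grows linearly in $n$ whereas parameter storage is fixed, so for sufficiently long contexts the KV cache dominates the memory budget.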

Recent approaches primarily aim to reduce the KV cache size while preserving inference accuracy. These methods include merging attention heads \citep{gqa}, compressing KV pairs into shorter sequences \citep{compressive}, and using sliding-window techniques to limit context windows \citep{mistral,streaming,duo}. Other studies exploit attention sparsity for dynamic KV eviction during the decoding \citep{dynamicpruning,scissorhands,h2o} and prefill \citep{pyramid,snapkv} stages. Existing eviction methods typically employ \textit{query-aware} importance scoring of KV pairs, computed online during inference \citep{pyramid,snapkv,h2o}, and selectively retain the KV pairs most relevant to the immediate query (\Crefsub{fig:intro}{a,b}). While effective in single-query scenarios, these methods degrade significantly in multi-query settings because the retained KV pairs overfit to the initial queries \citep{scbench}. We elaborate on these limitations in \Cref{sec:prelim_existing}.

In this work, we introduce \textit{KVzip}, a novel \textit{query-agnostic} KV cache eviction algorithm. KVzip optimizes a reusable compressed KV cache for a given context, enabling efficient inference across diverse future queries (\Crefsub{fig:intro}{c}). Our approach particularly benefits scenarios where KV caches are prepared offline, such as personalized conversational agents retaining user instructions and chat histories \citep{characterai,personal}, or enterprise systems utilizing precomputed document KV caches for retrieval \citep{cag}. 

Designing an effective query-agnostic eviction strategy is challenging because future queries are inherently uncertain. We demonstrate that a succinct set of KV pairs that is crucial for reconstructing the original context serves as an effective compressed representation. KVzip leverages the insight that a Transformer naturally functions as an encoder-decoder architecture by encoding context into KV pairs, analogous to traditional compression methods such as Zip \citep{zip}. Specifically, our method simulates context reconstruction via an LLM forward pass and assigns importance scores to KV pairs based on the maximum attention scores they receive during this process. This compression principle parallels self-supervised learning approaches that emphasize input reconstruction and demonstrate robust generalization across diverse downstream tasks \citep{bert,mae,gpt2}.
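
The scoring rule described above can be sketched as follows. The notation is illustrative and assumed only for this sketch: for a given layer and KV head, let $A \in [0,1]^{m \times n}$ denote the attention matrix obtained during the simulated reconstruction pass, where the $m$ rows correspond to the query positions of that pass and the $n$ columns to the cached context KV pairs. The importance of the $i$-th KV pair is then, roughly,
\[
s_i = \max_{1 \le j \le m} A_{j,i}, \qquad i = 1, \dots, n,
\]
that is, the largest attention weight the pair receives from any reconstruction query. Under a retention budget of $k$ pairs, one natural choice, assumed here for illustration, is to keep the $k$ pairs with the largest $s_i$ and evict the rest. Taking the maximum rather than an average favors pairs that matter strongly for even a single reconstruction step.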

After eviction, subsequent queries benefit from significantly reduced latency and memory usage. Specifically, KVzip achieves approximately $2\times$ latency reduction in FlashAttention \citep{flashattn} and $3$\minus$4\times$ reduction in KV cache size during decoding with negligible performance loss on diverse queries. KVzip supports both context-dependent eviction, which achieves higher compression ratios but incurs per-context compression overhead \citep{adakv}, and context-independent eviction, which incurs no overhead after deployment while achieving moderate compression ratios \citep{duo}.

\Cref{sec:exp} empirically demonstrates KVzip's robustness and effectiveness on multiple benchmarks, including document question-answering, mathematical reasoning, retrieval, and code comprehension tasks, with contexts up to 170K tokens. Unlike existing eviction methods, which degrade significantly even at 10\% KV eviction in multi-query settings \citep{snapkv,h2o}, KVzip consistently maintains inference accuracy when evicting up to 70\% of the KV cache. Experiments encompass 12 benchmark datasets, including SQuAD \citep{squad}, GSM8K \citep{gsm}, and SCBench \citep{scbench}, and involve various models such as LLaMA3.1 \citep{llama3}, Qwen2.5 \citep{qwen}, and Gemma3 \citep{gemma3}, ranging from 3B to 14B parameters. Furthermore, KVzip integrates seamlessly with existing optimizations such as KV cache quantization \citep{qserve} and structured head-level KV eviction \citep{duo}. Notably, our method replaces DuoAttention's head-score optimization, which originally requires tens of GPU hours, with only a few forward passes that complete within a minute, highlighting its practical effectiveness.

\begin{figure}[t]
    \vspace{-0.3em}
    \centering
    \input{figure/intro}
    \vspace{-2.3em}
    \caption{
    \textbf{Overview of KV eviction strategies in multi-query scenarios.} An LLM processes an input context (\textit{CTX}) and queries ($Q_i$) to generate answers ($A_i$). Existing approaches, such as SnapKV \citep{snapkv} and PyramidKV \citep{pyramid}, evict context KV pairs based on immediate query information. (a) Query-aware KV eviction performs prefill and eviction independently for each query, incurring repeated prefill overhead. (b) Reusing a query-dependent compressed cache degrades performance on subsequent queries (\Cref{fig:prelim}). (c) The proposed query-agnostic KV eviction framework compresses the KV cache only once during the initial prefill, enabling efficient reuse across diverse queries without repeated prefill or performance loss. Adapting existing methods to the query-agnostic framework still results in suboptimal performance due to a mismatch with their original designs (\Cref{sec:exp}).}
    % \vspace{-0.5em}
    \label{fig:intro}
\end{figure}
