\section{Experiment}\label{sec:exp}

\subsection{Setup}\label{sec:setup}
\paragraph{Eviction Structure.} 
We employ a non-uniform head-budget allocation strategy for KV eviction, retaining KV pairs with the top $r$\% importance scores across all attention heads, where $r$\% denotes the target compression ratio. KV pairs of the initial system prompt remain intact. To ensure fairness, we apply the same non-uniform allocation to baseline methods, given its demonstrated superiority over uniform allocation \citep{adakv}. This compressed KV cache, combined with FlashAttention, improves inference speed (\Cref{fig:complexity}). Additionally, we evaluate KVzip with context-independent eviction in \Cref{sec:exp_benchmark} and uniform-budget allocation in \Cref{appendix:uniform}.
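As an illustration of this selection rule, with notation introduced here only for exposition, let $s_{h,i}$ denote the importance score of the $i$-th KV pair in attention head $h$, where $h$ ranges over all heads subject to eviction, and let $\tau_r$ be the score threshold that a fraction $r$\% of all scored pairs meet or exceed. Assuming a single threshold shared by all heads, as the description above suggests, the retained set is
\[
  \mathcal{K}_r = \bigl\{ (h,i) \,:\, s_{h,i} \ge \tau_r \bigr\},
\]
together with the KV pairs of the system prompt, which are always kept. The per-head budget $\lvert \{ i : (h,i) \in \mathcal{K}_r \} \rvert$ may therefore differ from head to head, whereas a uniform allocation would assign the same budget to every head.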

\vspace{-0.2em}
\paragraph{Evaluation.} 
Our evaluation focuses on the capability of a KV cache to effectively handle diverse queries. Given the inherent limitations of query-aware frameworks discussed in \Cref{sec:prelim_existing}, we adopt the query-agnostic framework from \Crefsub{fig:intro}{c}. Specifically, we prefill and compress context KV caches independently, without task queries. Existing eviction methods also support this independent prefilling of context \citep{h2o,snapkv}, enabling evaluation under the query-agnostic framework. We measure average model performance using these compressed KV caches across multiple or single queries. Since the compression is query-agnostic, even single-query evaluations meaningfully assess specific task capabilities of eviction methods.
Unlike prior methods that evict KV pairs from replicated caches for grouped queries \citep{snapkv}, we evict directly from the initially stored cache before replication, thus reducing the actual storage required for the KV cache. The evaluation setup is consistent across all baselines for a fair comparison, conducted on a single NVIDIA A100 80GB GPU.

\vspace{-0.2em}
\paragraph{Baselines, Datasets, and Models.}
We benchmark against state-of-the-art KV cache eviction methods, including $\text{H}_2\text{O}$ \citep{h2o}, SnapKV \citep{snapkv}, and PyramidKV \citep{pyramid}. We further compare DuoAttention \citep{duo} using head-level eviction for context-independent compression. Evaluations span diverse datasets: SQuAD \citep{squad}, GSM8K \citep{gsm}, needle-in-a-haystack (NIAH) \citep{needle}, and nine tasks from SCBench \citep{scbench}. SCBench provides comprehensive multi-query evaluations, including tasks from RULER \citep{ruler} and $\infty$Bench \citep{inftybench}.
Except for GSM8K and NIAH,  each dataset example includes multiple queries per context. Context lengths range from 100 to 170K tokens, tokenized with the Qwen tokenizer \citep{qwen}, covering domains such as long-document QA, retrieval, mathematical reasoning, in-context learning, and code comprehension. \Cref{appendix:implement} provides implementation details and dataset specifics.

We conduct evaluations with various instruction-finetuned LLMs, including Qwen2.5-7B-1M, LLaMA3.1-8B, and Gemma3-12B \citep{qwen,llama3,gemma3}. These models utilize GQA with group sizes varying from 4 (LLaMA3.1-8B) to 7 (Qwen2.5-7B-1M). Gemma3 employs hybrid attention mechanisms, combining global and sliding window strategies \citep{gemma3}. All evaluations use Bfloat16 precision. We use greedy decoding with these models to generate responses. Furthermore, we integrate KVzip with the QServe quantization framework, adopting 8-bit weights, 8-bit activations, and 4-bit KV cache \citep{qserve}. 


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Benchmarking}\label{sec:exp_benchmark}

\begin{figure}[t]
    \centering
    \input{figure/g-qwen7b}
    \vspace{-1em}
    \caption{\textbf{Benchmark results} using Qwen2.5-7B-1M across varying KV cache budget ratios from 0.1 to 1.0. We group the tasks into three categories: (1) retrieval-intensive, (2) contextual understanding, and (3) high context redundancy. \Cref{appendix:individual} presents additional results on the SCBench multi-task datasets and RULER, where KVzip consistently outperforms the baselines.}
    \label{fig:benchmark}
\end{figure}


\paragraph{Task Generalization.}
\Cref{fig:benchmark} presents multi-query evaluation results for Qwen2.5-7B-1M across 12 benchmark datasets, grouped into three categories.
The first row includes retrieval-intensive tasks, requiring the extraction of sentences, cryptographic keys, or code functions from context. Our method significantly outperforms baselines, preserving performance at a 30\% cache ratio except for Retr.Prefix-Suffix, while baseline methods degrade notably at 90\% retention.
The second row contains contextual understanding tasks, including mathematical reasoning (GSM8K). Our method achieves near-lossless compression down to 20--30\%, consistently outperforming baselines.
In the last row, En.Summary requires high-level contextual information, whereas other tasks contain repetitive contextual information \citep{scbench}. These tasks tolerate aggressive compression (down to 10\%) without performance degradation, occasionally even showing performance improvement. We hypothesize that this improvement results from reduced attention distractions following KV eviction \citep{differential}. Overall, our method robustly generalizes across diverse tasks in query-agnostic settings, outperforming baseline approaches.


\begin{figure}[t]
    \centering
    \input{figure/g-architecture}
    \vspace{-1.5em}
    \caption{\textbf{Performance on various models} averaged over 12 benchmark datasets. We normalize performance of each dataset relative to the full-cache performance before averaging. \Cref{appendix:individual} provides detailed results per dataset, including results for LLaMA3.1-3B.}
    \label{fig:architecture}
\end{figure}

\paragraph{Model Scale and Architecture.}
\Cref{fig:architecture} shows performance across larger models (Qwen2.5-14B-1M), distinct model families (LLaMA3.1-8B), and hybrid attention architectures (Gemma3-12B). Gemma3 employs global and sliding-window attention layers in a 1:5 ratio \citep{gemma3}. We apply KV eviction exclusively to global attention layers, as these layers dominate the cache size at a 100K context length with a 1K sliding-window size. To compare methods comprehensively, we average performance over the 12 benchmark tasks. \Cref{fig:architecture} confirms KVzip's generalizability and superior compression performance across various models compared to baseline methods.
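A rough estimate illustrates why the global layers dominate. Assuming, for illustration only, that global and sliding-window layers store the same number of bytes per cached token, each group of one global and five sliding-window layers holds
\[
  \underbrace{1 \times 100\text{K}}_{\text{global}} + \underbrace{5 \times 1\text{K}}_{\text{sliding window}} = 105\text{K}
\]
cached token entries, of which the global layer accounts for $100/105 \approx 95\%$.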


\paragraph{KV Quantization.}
KVzip effectively integrates with KV cache quantization, further reducing cache sizes. \Cref{fig:architecture} evaluates KV eviction methods on a 4-bit KV quantized model (LLaMA3-8B-W8A8KV4) from QServe \citep{qserve}. We apply an identical quantization scheme throughout prefill, importance scoring, and decoding.
The results confirm that KVzip remains robust under quantization and indicate that the base LLaMA3-8B model exhibits greater contextual sparsity than the improved version, LLaMA3.1-8B.
Specifically, the 16-bit KV cache occupies \textbf{16.3GB} at a 124K input length. Integrating 4-bit quantization with our 70\% eviction ratio effectively reduces the cache size to \textbf{1.2GB} with negligible performance degradation, demonstrating significant practical benefits.
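These figures are consistent with a back-of-the-envelope estimate. Assuming the standard LLaMA3-8B configuration of 32 layers, 8 KV heads, and a head dimension of 128 (values not stated above), taking 124K as $124{,}000$ tokens, and ignoring quantization metadata such as scales and zero points, the 16-bit cache stores keys and values at
\[
  2 \times 32 \times 8 \times 128 \times 2~\text{bytes} = 131{,}072~\text{bytes per token},
\]
which amounts to roughly $16.3$\,GB for the full input. Retaining 30\% of the KV pairs at 4 bits per element then gives
\[
  16.3~\text{GB} \times \frac{4}{16} \times 0.3 \approx 1.2~\text{GB}.
\]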


\paragraph{Context-Independent Eviction.}
KVzip also supports context-independent eviction strategies, requiring only a one-time importance scoring per model and incurring no compression overhead after deployment \citep{duo}. 
Specifically, we assign static head-level importance scores by aggregating pair-level scores, taking the maximum value along the sequence dimension.  
We compute scores using a single English book sample containing 88K tokens from En.QA in SCBench \citep{scbench} and apply DuoAttention's head-level KV eviction strategy \citep{duo}. \Cref{fig:visual_head} in the Appendix visualizes the resulting head-score distribution and compares it with scores derived from other data sources.
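In symbols, with notation introduced here only for illustration, if $s_{h,i}$ denotes the pair-level importance score of the $i$-th KV pair in head $h$ over a scored context of $n$ tokens, the static head-level score is
\[
  \bar{s}_h = \max_{1 \le i \le n} s_{h,i}.
\]
Under this reading, a head receives a high score as soon as a single KV pair in it is important for the scored context, and the head-level eviction strategy then operates on the scores $\bar{s}_h$ alone.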

\begin{wrapfigure}[22]{r}{0.4\textwidth} 
    \vspace{-1.8em}
    \centering
    \input{figure/g-duo}
    \vspace{-0.3em}
    \caption{Average relative performance across 12 benchmark datasets with head-level eviction. The lowest KV cache ratio is set to 0.4 due to DuoAttention's lower limit of 0.32.}
    \label{fig:pruning_structure}

    \vspace{1.2em}
    \centering
    \input{figure/g-format}
    \vspace{-0.3em}
    \caption{Performance across various inputs for KV importance scoring on SQuAD (LLaMA3.1-8B).}
    \label{fig:format}
\end{wrapfigure}

\Cref{fig:pruning_structure} compares KVzip against DuoAttention \citep{duo}, using the publicly released official head scores on LLaMA3-8B-Instruct-Gradient-1048K \citep{gradientai}. Whereas DuoAttention optimizes head scores to retrieve a synthetic passkey, KVzip derives head scores by performing the more general task of context reconstruction on a natural language textbook. Moreover, DuoAttention demands several hours of optimization on an 8-GPU node for importance scoring. In contrast, KVzip achieves superior performance using only a \textbf{few forward passes within one minute} for scoring. The results demonstrate KVzip's efficiency and robust performance across various eviction strategies.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Analysis}\label{sec:exp_analysis}

\paragraph{Necessity of Context Reconstruction.}
KVzip employs an input that concatenates the repeat prompt and the context for importance scoring (\Cref{fig:method}). \Cref{fig:format} demonstrates the necessity of full context reconstruction by comparing scoring performance across various inputs: using the repeat prompt combined with either the first 10\% of context (\textit{First}), the last 10\% (\textit{Last}), or the repeat prompt alone (\textit{Prompt}). The results indicate that reconstructing the full context (\textit{Recon}) is essential to prevent performance degradation caused by KV eviction.


\paragraph{Behavior Analysis Beyond Task Solving.}
Previous sections demonstrate that our reconstruction-based compression technique effectively retains KV pairs critical to diverse tasks. Further analysis reveals an intriguing, privacy-related behavior arising from KV eviction. \Cref{tab:behavior} compares generated responses for queries involving private context information before and after KV cache compression. Specifically, the LLaMA3.1-8B instruction-finetuned model refuses to respond when using the full KV cache but responds after we apply our compression method. This behavior emerges naturally because KVzip prioritizes KV pairs necessary for context reconstruction and discards others, consistent with \citet{notoken}. Although practical implications may be limited, since cached contexts typically imply permission for utilization, this observation suggests intersections between KV eviction techniques and shallow-alignment concerns \citep{shallowalignment}, motivating further research.

\begin{table}[t]
\caption{\textbf{Behavior analysis.} Generation results on a privacy-related example from DecodingTrust \citep{decodingtrust}, using LLaMA3.1-8B with full KV cache and a 40\% compressed cache via KVzip.}
\vspace{0.2em}
\label{tab:behavior}
\input{table/behavior}
\end{table}

