\section{Related Work}\label{sec:related}
\paragraph{KV Cache Compression.}
Compressing KV caches of Transformer-based models is crucial for efficient inference \citep{transformer}. Sparse Transformer methods explicitly train models to utilize sparse or localized KV caches, reducing memory requirements during inference \citep{sparsetransformer,mistral,ltp}. Compressive Transformer approaches further compress caches by merging KV pairs during training \citep{gqa,ccm,compressive}.
\citet{dejavu} show that Transformer-based LLMs exhibit contextual sparsity during inference, motivating dynamic KV eviction methods such as H2O and FastGen that operate during decoding without additional training \citep{dynamicpruning,nacl,fastgen,kim2024infinipot,scissorhands,tova,yang2024pyramidinfer,h2o}. SnapKV, PyramidKV, and Finch specifically target KV eviction during long-context prefill \citep{pyramid,adakv,snapkv,corallo2024finch}, while DuoAttention profiles and selectively replaces attention heads with sliding-window attention prior to deployment \citep{streaming,duo}.
Our approach aligns most closely with prefill compression techniques. Unlike existing methods that perform query-dependent KV compression, we propose query-agnostic compression, which allows a compressed KV cache to be reused across diverse queries. Concurrently, \citet{corallo2025beyond} propose a query-agnostic KV compression method for the retrieval-augmented generation scenario. Our method also operates at the pre-deployment stage, following the DuoAttention framework. A separate line of work compresses KV caches via quantization \citep{qserve,kivi}; these techniques are complementary to our eviction strategy and can further improve the overall efficiency of cache compression.


\paragraph{Efficient LLM Inference.} 
Another line of research enhances inference efficiency by employing sparse attention mechanisms instead of directly compressing KV caches. BigBird achieves efficiency by training models with sparse attention structures, reducing inference-time attention costs \citep{bigbird}. MInference leverages attention sparsity at inference without additional training \citep{minference}. Approaches including Quest reduce attention computations during decoding by leveraging KV cache offloading and retrieval techniques \citep{magicpig,infinigen,retrieval,quest}. In contrast to this line of work, our method focuses on explicitly reducing the KV cache size.



\section{Conclusion}
We introduce KVzip, a query-agnostic KV cache eviction algorithm that optimizes reusable compressed KV caches by reconstructing the original context from KV pairs. In extensive evaluations in multi-query settings across diverse tasks, models, and long-context benchmarks, KVzip demonstrates robust compression performance, reducing KV cache sizes by up to 70\% with negligible performance loss and reducing decoding attention latency by approximately $2\times$ with FlashAttention. KVzip consistently outperforms existing KV eviction methods, which suffer performance degradation even at a 10\% eviction ratio. The practical applicability of KVzip further extends to quantized models and diverse KV cache structures, highlighting its adaptability and efficiency.


\newpage
\begin{figure}[!ht]
    \vspace{1.5em}
    \centering
    \input{figure/v-kv}
    \vspace{-0.5em}
    \caption{\textbf{Visualization of maximum attention scores.}
    Each heatmap visualizes the maximum attention scores received by KV pairs in $\kvc$ (\Cref{eq2}) for a SQuAD example, computed using LLaMA3.1-8B. \Cref{tab:task-inputs} in the Appendix describes the text inputs for each task. Each row corresponds to a specific layer and has dimensions $H\times n_c$, where the number of KV heads is $H=8$ and the context length is $n_c=163$. (a) Importance scores from KVzip obtained using the repeat task. (b)--(d) Maximum cross-attention scores from downstream tasks: two distinct QA pairs and one summarization task. These illustrate varied attention patterns across downstream tasks, while the repeat task's attention pattern encompasses all these patterns (see also \Cref{fig:observation-heat}). (e) Maximum self-attention scores during the prefill stage exhibit denser attention patterns than cross-attention scores and do not overlap with downstream task patterns, indicating that prefill-based profiling such as $\text{H}_2\text{O}$ does not effectively reflect the KV cache utilization by downstream tasks.}
    \label{fig:visual_kv}
\end{figure}


\begin{ack}
This work was supported by Samsung Electronics Co., Ltd. (IO250418-12669-01), Mobile eXperience (MX) Business, Samsung Electronics Co., Ltd., Institute of Information \& Communications Technology Planning \& Evaluation (IITP) grant funded by the Korea government (MSIT) [No. RS2020-II200882, (SW STAR LAB) Development of deployable learning intelligence via self-sustainable and trustworthy machine learning], the Air Force Office of Scientific Research under award number FA2386-25-1-4013, and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2024-00354036). Hyun Oh Song is the corresponding author.
\end{ack}

