\section{Preliminaries}\label{sec:prelim}

\subsection{Notation and Problem Formulation}
Consider the text domain $\mathcal{T}$ and an autoregressive Transformer-based LLM $\lm: \mathcal{T} \rightarrow \mathcal{T}$ that generates sequences via greedy decoding \citep{gpt,transformer}. The model comprises $L$ layers and uses Grouped-Query Attention (GQA) \citep{gqa} with $H$ KV heads, each attended by a group of $G$ query heads. During inference, $\lm$ caches hidden representations as KV pairs to avoid recomputing them at every decoding step \citep{vllm}.

Given an input context $c \in \mathcal{T}$ tokenized into $n_c$ tokens, the prefill stage produces a cache of $L \times H \times n_c$ KV pairs, denoted $\kvc$ \citep{prefill}. We write $\lm(\cdot \mid \kvc)$ for generation conditioned on this cache. Our objective is to derive a compact pruned cache $\kvp \subseteq \kvc$ satisfying

\vspace{-2em}
\begin{align}\label{eq1}
\lm(q \mid \kvp) \approx \lm(q \mid \kvc),\ \forall q \in \mathcal{T}.
\end{align}
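
As a rough illustration of the scale involved, consider the publicly documented LLaMA3.1-8B configuration, with $L = 32$ layers, $H = 8$ KV heads, and $G = 4$ query heads per KV head. The head dimension $d = 128$ is a symbol introduced here for the example, and the context length $n_c = 10^{5}$ and 16-bit storage are illustrative assumptions rather than settings taken from the text. Each KV pair stores one key and one value vector, i.e., $2d$ numbers of $2$ bytes each:
\begin{align*}
|\kvc| &= L \times H \times n_c = 32 \times 8 \times 10^{5} = 2.56 \times 10^{7}\ \text{KV pairs},\\
\text{memory}(\kvc) &\approx |\kvc| \times 2d \times 2\ \text{bytes} = 2.56 \times 10^{7} \times 512\ \text{bytes} \approx 13.1\ \text{GB}.
\end{align*}
Under these assumptions, a pruned cache $\kvp$ that keeps, say, a hypothetical $30\%$ of the pairs would occupy roughly $3.9$ GB, provided it still satisfies \cref{eq1}.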


\subsection{Analysis of Existing Approaches}\label{sec:prelim_existing}
Existing KV eviction methods, such as SnapKV \citep{snapkv} and PyramidKV \citep{pyramid}, compress the KV cache using information available during prefill. These methods compute attention-based importance scores for KV pairs using the queries within a trailing context window, and retain only the KV pairs most relevant to these queries. While effective on single-query benchmarks such as needle-in-a-haystack \citep{needle} and LongBench \citep{longbench}, these methods require a separate prefill for each new query, as shown in \Crefsub{fig:intro}{a}.\looseness=-1
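
A simplified sketch of such a score, for a single layer and a single KV head, may help fix ideas. The symbols below are introduced here for illustration only: $w$ is the window length, $d$ the head dimension, $\mathbf{q}_{g,i}$ the query vector of query head $g$ at position $i$, $\mathbf{k}_j$ the key vector at position $j$, and $k$ the number of retained pairs. The sketch omits method-specific details such as score pooling and per-layer budget allocation, and summing over the $G$ query heads of a group is one possible aggregation choice rather than a description of any particular implementation. With causal attention weights
\begin{align*}
A_{g,i,j} = \frac{\exp\!\left(\mathbf{q}_{g,i}^{\top}\mathbf{k}_{j}/\sqrt{d}\right)}{\sum_{j' \le i}\exp\!\left(\mathbf{q}_{g,i}^{\top}\mathbf{k}_{j'}/\sqrt{d}\right)},
\end{align*}
the importance of the KV pair at position $j \le n_c - w$ can be written as
\begin{align*}
s_j = \sum_{g=1}^{G}\ \sum_{i=n_c-w+1}^{n_c} A_{g,i,j},
\end{align*}
and the compressed cache keeps the $k$ pairs with the largest $s_j$ together with the $w$ window positions. Because the window queries are the only signal entering $s_j$, the retained set reflects what those particular queries attend to.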

\begin{wrapfigure}[12]{r}{0.42\textwidth} 
    \centering
    \vspace{-1.5em}
    \input{figure/g-snap}
    \vspace{-0.5em}
    \caption{
    Accuracy on SQuAD using LLaMA3.1-8B. We compare SnapKV with repeated per-query \textit{prefill}, SnapKV with \textit{reuse} of the cache compressed for the first question of each data sample, and \textit{KVzip} with a single prefill and query-agnostic compression.}
    \label{fig:prelim}
\end{wrapfigure}

Alternatively, reusing a previously compressed KV cache for subsequent queries reduces the computational overhead, as depicted in \Crefsub{fig:intro}{b}. However, existing methods typically retain only the context KV pairs relevant to the initial query, so the compressed cache does not generalize to different queries. \Cref{fig:prelim} illustrates this issue on the SQuAD multi-QA dataset \citep{squad}. SnapKV attains high accuracy when prefill and compression are executed separately for each query, but its accuracy declines significantly when the cache compressed for the initial query is reused. This shortcoming motivates our \textit{query-agnostic} KV eviction strategy, which enables effective reuse of a single compressed cache across multiple queries.