\section{Method}\label{sec:method}
The primary objective of our algorithm is to assign an importance score to each KV pair, determining eviction priorities, following prior studies \citep{h2o}. 
Given a context length $n_c$, KVzip assigns importance scores $S\in\R^{L \times H \times n_c}$ to KV pairs in $\kvc$, subsequently evicting pairs with the lowest scores. Our method supports both non-uniform and uniform head budget allocations \citep{adakv,snapkv}. KVzip further accommodates a head-level eviction strategy by computing head-level scores using the maximum pair-level scores across the sequence dimension, $n_c$ \citep{duo}. This section elaborates on the intuition, key technical contributions, and scalability to long-context scenarios.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Intuition}
\begin{wrapfigure}[8]{r}{0.42\textwidth}
\centering
\vspace{-3.2em}
\input{figure/intuition}
\vspace{-1.3em}
\caption{Transformer LLM viewed as a context encoder-decoder. Each matrix cell indicates a KV pair. We use the prompt ``Repeat the previous context:''.}
\label{fig:intuition}
\end{wrapfigure}

To effectively answer arbitrary queries, the compressed cache $\kvp$ and $\lm$ should retain complete contextual information. Our intuition is that we can verify this completeness by explicitly prompting $\lm$ to reconstruct the previous context from $\kvp$ (\Cref{fig:intuition}). If $\kvp$ enables $\lm$ to accurately reconstruct the original context $c$ using the \textit{repeat prompt}, we can re-prefill the original cache $\kvc$ and conduct accurate inference.

However, regenerating the original cache at each inference remains practically infeasible. Encouragingly, our empirical studies indicate that the compressed cache demonstrates strong generalization capabilities even without reconstructing the original cache (\Cref{sec:exp_benchmark}), empirically achieving \Cref{eq1}. This finding resonates with principles from reconstruction-based self-supervised learning, which demonstrates strong generalization across diverse downstream tasks \citep{bert,mae,gpt2}.

\begin{figure}[b]
\centering
\vspace{-1.5em}
\input{figure/method}
\vspace{-0.3em}
\caption{\textbf{Method overview.} KVzip evicts KV pairs with the lowest importance scores, accommodating both KV pair-level and head-level eviction \citep{adakv,duo}. System prompts are omitted for clarity.}
\label{fig:method}
\end{figure}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{KV Importance Scoring}
KVzip quantifies KV pair importance based on their contribution in context reconstruction. Specifically, we simulate reconstruction through teacher-forced decoding \citep{teacherforcing}, parallelized via a single forward pass with an input sequence comprising a repeat prompt followed by the original context (\Cref{fig:method}). We define importance scores to be the maximum attention score each KV pair receives during this forward pass, leveraging the insight that KV pairs receiving minimal attention contribute little to Transformer computations \citep{h2o}.

Formally, given a context of length $n_c$, we construct an input sequence of length $n_\text{in} = n_\text{prompt} + n_c$ by concatenating the repeat prompt of length $n_\text{prompt}$ with the context. Forwarding this input through $\lm$ with $\kvc$ generates $d$-dimensional grouped-query features {\small$Q_{l,h}\in  \R^{G\times n_\text{in}\times d}$} and key features {\small$K_{l,h}\in \R^{(n_c + n_\text{in})\times d}$} for the $h$-th KV head in layer $l$ \citep{gqa}. Grouped-attention between these features produces an attention matrix {\small$A_{l,h}= \text{Softmax}(Q_{l,h}K_{l,h}^\intercal)\in \mathbb{R}_+^{G \times n_\text{in} \times (n_c + n_\text{in})}$}. Extracting entries corresponding to keys in $\kvc$ gives a sliced attention matrix {\small$\bar{A}_{l,h} \in \mathbb{R}_+^{G \times n_\text{in} \times n_c}$}. Finally, we compute importance scores {\small$S_{l,h} \in \mathbb{R}^{n_c}$} for the $h$-th KV head in layer $l$ by taking the maximum over grouped queries as
\vspace{-0.2em}
\begin{align}\label{eq2}
S_{l,h} = \max_{g =1,\ldots,G;\ i = 1,\ldots,n_\text{in}} \bar{A}_{l,h}[g,i].
\end{align}

\vspace{-0.8em} 
We refer to the aggregated scores $S$ across all KV heads as the \textit{maximum cross-attention scores}. \Cref{fig:visual_kv} provides a visualization of these scores.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Observation}\label{sec:observation}
The cross-attention pattern from the repeated context onto the prefilled context exhibits significant sparsity, indicating substantial opportunities for compressing $\kvc$. Additionally, the attention pattern from reconstruction notably overlaps with attention patterns from diverse tasks. Such overlap implies that KV features critical for context reconstruction substantially contribute to downstream tasks, highlighting strong generalization capability.

\begin{wrapfigure}[12]{r}{0.4\textwidth}
\vspace{-2em}
\centering
\input{figure/g-observation}
\vspace{-0.5em}
\caption{Histogram comparing max attention scores received by KV pairs in $\kvc$ during prefill versus reconstruction stages, measured on SQuAD with LLaMA3.1-8B.}
\label{fig:observation}
\end{wrapfigure}


\vspace{-0.3em}
\paragraph{Attention Sparsity in Reconstruction.}
Cross-attention patterns obtained during context reconstruction exhibit greater sparsity compared to self-attention patterns computed during the initial prefill of $\kvc$ (\Cref{fig:observation}). During prefill, the model densely interacts among tokens to encode comprehensive contextual information \citep{elmo}. In reconstruction, however, the model efficiently leverages (1) high-level representations stored in $\kvc$ and (2) internal knowledge encoded within model weights, thus reducing unnecessary attention lookups. This cross-attention sparsity effectively identifies and removes redundant KV pairs, outperforming prior methods such as $\text{H}_2\text{O}$ \citep{h2o} that rely on attention scores obtained during prefill (\Cref{sec:exp_benchmark}).


\begin{figure}[b]
    \vspace{-1.0em}
    \input{figure/g-scatter}
    \vspace{-1.6em}
    \caption{\textbf{Attention comparison across tasks.}
    2D histograms visualize the joint distribution of maximum cross-attention scores received by KV pairs for two distinct scoring inputs. Each input consists of a task query and the generated response (\Cref{tab:task-inputs}).
    Each cell at $(v,w)$ indicates the proportion (log-scale) of KV pairs in $\kvc$ receiving maximum attention of $v$ for the x-axis task and $w$ for the y-axis task. Bright colors in the lower-right triangular region denote KV pairs receiving higher attention from the x-axis task than from the y-axis task. We compute scores using LLaMA3.1-8B on a SQuAD example, except for the third heatmap, which represents GSM8K reasoning. QA-1 and QA-2 denote distinct QA pairs. \Cref{fig:visual_kv} visualizes the attention patterns for each task.\looseness=-1}
    \label{fig:observation-heat}
\end{figure}


\paragraph{Attention Overlap Across Tasks.}
\Cref{fig:observation-heat} compares max cross-attention scores across various tasks: repeat, question-answering (QA), summarization, and reasoning. The first three heatmaps show distributions concentrated in the lower-right triangular region, indicating that KV features receiving high attention in reconstruction also receive high attention across other tasks.
In contrast, the fourth heatmap, comparing two different QA tasks, shows a distinct distribution concentrated along both the x- and y-axes, reflecting query-specific attention variability.
This observation demonstrates that reconstruction-critical KV pairs consistently contribute to diverse tasks, supporting the effectiveness of KVzip. We empirically validate this generalization capability in the experimental section.


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\begin{figure}[t]
    \centering
    \input{figure/chunk}
    \vspace{-1.7em}
    \caption{
    \textbf{Chunked scoring} for the $i$-th chunk in $\kvc$. We compute attention scores by multiplying queries with subsampled keys of length $m+n_\text{in}$, followed by softmax normalization. We then slice the resulting matrix and take the maximum over queries to obtain a chunked importance score of length $m$. We set the grouped-query size to $G=1$ for clarity. This procedure repeats per chunk. For chunks with $i \geq 2$, we formulate the repeat prompt as: ``Repeat the previous context starting with $\langle$\texttt{last 8 tokens of preceding chunk}$\rangle$:''. \Cref{appendix:prompt} demonstrates that the design choice of a repeat prompt negligibly affects performance. Pseudo-code is provided in \Cref{appendix:implement}, \Cref{algo}.}
    \label{fig:chunk}
    \vspace{-0.2em}
\end{figure}


\subsection{Technical Challenge and Solution}\label{sec:method_complexity}
Our method concatenates a repeat prompt with context tokens, processing this input through $\lm$ to obtain attention matrices. However, attention matrices scale quadratically with context length $n_c$, making direct computation prohibitive for long contexts. 
While fused attention kernels like FlashAttention reduce memory overhead by computing attention scores block-wise without storing full matrices \citep{flashattn}, our method uniquely requires a maximization along the query dimension following Softmax normalization along the key dimension. This cross-dimensional dependency prevents direct integration of \Cref{eq2} into existing block-wise attention algorithms.


\paragraph{Chunked Scoring.} 
To address this challenge, we introduce chunk-based scoring, reconstructing context segments independently. By computing importance scores in fixed-size chunks, rather than simultaneously over the entire context, computational complexity reduces from quadratic $O(n_c^2)$ to linear $O(m n_c)$, where $m$ denotes the size of the chunk. Specifically, we partition the context tokens into fixed-length chunks of size $m$, concatenate each chunk with the repeat prompt, and process the resulting input of length $n_\text{in} = n_\text{prompt} + m$ through $\lm$ (\Cref{fig:chunk}). For each Transformer layer, we subsample keys in $\kvc$ corresponding to each chunk, obtaining a smaller attention matrix of size $n_\text{in} \times (m + n_\text{in})$. As in \Cref{eq2}, slicing the attention matrix and maximizing over grouped queries yields chunk-wise importance scores. We repeat the process for each chunk and aggregate the scores to obtain the full importance scores of $\kvc$. We set the chunk size to $m = \text{2K}$, constant across context lengths, models, and tasks, as the size has negligible impact on performance (\Cref{appendix:chunk_size}).


\paragraph{Complexity Analysis.} 
Computational complexity per chunk is $O(m^2)$, assuming a negligible repeat prompt length, \textit{i.e.}, $n_\text{prompt} \ll m$, thus $n_\text{in}\approx m$. Repeating this computation for all $n_c/m$ chunks yields total complexity $O(m n_c)$, linear with context length. Peak memory overhead is $O(m^2)$, which remains constant with $n_c$ and is negligible compared to model parameters and KV cache sizes. Additionally, we propose a softmax-free variant in \Cref{appendix:logit} utilizing a custom CUDA kernel integrated into FlashAttention, further reducing computational costs at a performance trade-off.

Importance scoring introduces additional overhead from computing attention queries and keys for chunked inputs through $\lm$ with $\kvc$. Given $n_\text{in}\approx m$, FlashAttention incurs $O(n_c m + m^2/2)$ causal-attention FLOPs per chunk, resulting in a total complexity of $O(n_c^2 + n_c m/2)$ across all $n_c/m$ chunks. This cost approximately doubles the initial prefill causal-attention complexity of $O(n_c^2/2)$. Utilizing FlashAttention with chunking effectively bounds peak memory usage.
For efficiency, KVzip also supports context-independent eviction by assigning static head-level importance scores per model (\Cref{sec:exp_benchmark}--\Cref{fig:pruning_structure}), incurring no compression overhead after deployment.

\begin{figure}[t]
    \centering
    \input{figure/g-profile}
    \vspace{-1.3em}
    \caption{
    \textbf{Computational analysis} using LLaMA3.1-8B with 124K context tokens on an NVIDIA A100 GPU in FP16 precision. We apply non-uniform head budget allocation with variable-length FlashAttention-2 \citep{adakv}.
    (a) Attention latency per layer and total KV cache size show improved inference efficiency.
    (b) KV importance scoring overhead aggregated over all chunks. Dashed horizontal lines indicate initial prefill cost for reference, with 2K chunk size limiting peak memory for a fair comparison \citep{prefill}. KVzip also supports context-independent eviction \citep{duo}, incurring a scoring overhead per model prior to deployment and removing runtime compression overhead (\Cref{fig:pruning_structure}).
    }
    \label{fig:complexity}
    \vspace{-0.2em}
\end{figure}


\paragraph{Empirical Efficiency Analysis.}
Empirical evaluations on an NVIDIA A100 GPU in \Cref{fig:complexity} confirm approximately twice the computational overhead of standard prefill during compression, with minimal additional memory (under 2\%). Importantly, compression occurs once per context or per model. \Crefsub{fig:complexity}{a} shows that our approach achieves significant reduction in inference latency and KV cache size. Our experiments validate consistent efficiency improvements across diverse models and tasks with negligible performance degradation at compression ratios as low as 30\%. 
