\section{Methodology}

In this section, we formalize KV cache compression as a combinatorial optimization problem. We first demonstrate how existing heuristic strategies implicitly optimize a modular function, and then introduce a monotone submodular objective to model semantic redundancy. Finally, we present a GPU-parallel score correction that uses this objective as a theoretical guide while avoiding the sequential bottleneck of submodular optimization.

\subsection{KV Cache Eviction with Capacity Constraints}
\label{sec:formulation}

Let the ground set $V = \{1, 2, \dots, N\}$ represent all tokens in the historical context during LLM inference. The task of KV cache compression is to select a representative subset $S \subseteq V$ to be retained in GPU memory. Given hardware constraints, we require $|S| \le B$, where $B$ is the KV cache budget. If we define a set function $F: 2^V \rightarrow \mathbb{R}$ to measure the utility of the retained cache $S$, the eviction problem can be formalized as:
\begin{equation}
\label{eq:problem1}
S^* = \arg\max_{S \subseteq V} F(S) \quad \text{subject to:} \quad |S| \le B.
\end{equation}

In the general paradigm of score-based KV cache compression, an evaluation metric, such as accumulated attention weights \cite{li2024snapkv} or predictions from a lightweight neural module \cite{kim2026kvzip}, assigns an importance score $s_i \in [0,1]$ to each token. Standard Top-K eviction then selects tokens with the highest scores. Mathematically, this is equivalent to defining the utility function as $F_{mod}(S) = \sum_{i \in S} s_i$. This objective is \textit{modular}, assuming that the marginal gain of adding a token $k$ to the cache is strictly $s_k$, independent of the existing elements in $S$.

However, in natural language, semantic information is intrinsically clustered. If a set $S$ already contains several tokens from a specific semantic phrase, adding another adjacent token yields negligible new information. Consequently, a purely modular $F_{mod}(S)$ often leads to highly redundant context summaries. To capture this semantic overlap, we argue that the ideal utility $F(S)$ is better formulated as a \textit{monotone submodular} function, which naturally incorporates the principle of diminishing returns. Although maximizing a submodular function subject to a cardinality constraint is NP-hard, sequential greedy selection provides a useful theoretical oracle with a $(1 - 1/e)$ approximation factor \cite{nemhauser1978analysis}. HubKV uses this oracle to motivate the form of its score correction, but does not claim to inherit this global guarantee.

\subsection{Score-Weighted Local Coverage Objective}
\label{sec:objective}

To properly evaluate the quality of a KV cache subset, the objective function should encourage preserving critical tokens while simultaneously penalizing local semantic redundancy. Drawing inspiration from facility location problems in operations research \cite{lin2011documentsummarization}, we model the total utility of the retained KV set $S$ as a score-weighted local coverage function:
\begin{equation}
\label{eq:facility}
F_{sub}(S) = \sum_{i \in V} \max_{j \in S} w_{i,j}
\end{equation}
where $w_{i,j}$ represents the semantic coverage provided by token $j$ to token $i$. To bridge the gap between abstract optimization and our empirical importance scores, we define $w_{i,j}$ using a normalized importance-weighted kernel:
\begin{equation}
w_{i,j} = s_j \cdot \frac{\mathcal{K}(|i-j|)}{Z_j}, \quad Z_j = \sum_{k \in V} \mathcal{K}(|j-k|)
\end{equation}
where $s_j$ is the importance score of token $j$, and $\mathcal{K}$ is a non-negative distance-based kernel (e.g., a window or Gaussian kernel). In the absence of an explicit semantic distance metric, we utilize spatial proximity within the sequence as a robust proxy for semantic overlap. The normalization factor $Z_j$ ensures that the singleton utility of a token remains stable, satisfying $F_{sub}(\{j\}) = s_j$. 

Crucially, as a facility location function, $F_{sub}(S)$ is inherently monotone submodular \cite{cornuejols1983uncapacitatedfacilitylocationproblem}, satisfying the property of diminishing returns. The marginal gain of adding a token $k$ to a set $S$, defined as $\Delta(k \mid S) = \sum_{i \in V} \max(0, w_{i,k} - \max_{j \in S} w_{i,j})$, decreases as $S$ grows, particularly when $S$ already contains tokens that are spatially close to $k$.

\begin{figure*}[!t]
    \centering
    \includegraphics[width=\textwidth]{pics/hubkv_prefill_main_4x4.pdf}
    \caption{\textbf{Main prefill results on Qwen3-8B.} We compare HubKV-refined KVZip/FastKVZip scores against the corresponding base scorers, SnapKV, Expected Attention, and the full-cache reference across 16 tasks. The x-axis reports the true KV compression ratio, ordered from aggressive compression on the left to full cache on the right.}
    \label{fig:prefill_main_results}
\end{figure*}

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{pics/longbench_average_model_comparison_1x4.pdf}
    \caption{\textbf{LongBench average across model backbones.} We report LongBench average results for Qwen3-8B, Qwen2.5-7B-Instruct-1M, Qwen3-14B, and Llama-3.1-8B-Instruct. Each panel reports the compression ratios available for that backbone and includes the full-cache reference at $r=0.00$.}
    \label{fig:longbench_average_model_comparison}
\end{figure*}

\begin{figure}[t]
    \centering
    \includegraphics[width=\columnwidth]{pics/decoding_qwen3_8b_singlecol.pdf}
    \caption{\textbf{Decoding-stage results on Qwen3-8B.} We evaluate AIME25 and MATH with a cache update interval of 128 decoding steps and target KV lengths of 1024, 2048, 4096, and 6144. HubKV denotes HubKV(+FastKVZip), which refines FastKVZip scores before the same target-length pruning step is applied.}
    \label{fig:decoding_results}
\end{figure}

\subsection{One-Pass Marginal-Gain Proxy via HubKV}
\label{sec:smd}

The objective in Eq.~\ref{eq:facility} defines an ideal redundancy-aware utility, but optimizing it directly requires repeatedly evaluating marginal gains against the current selected set. This sequential dependency is undesirable for KV-cache inference, where practical methods must preserve a simple parallel scoring and Top-K-style pruning path. We therefore propose \textbf{HubKV} as a one-pass marginal-gain proxy rather than a global submodular optimizer. HubKV keeps the original importance scorer unchanged, applies a bounded redundancy-aware correction to its scores, and then uses the same compression-ratio-based pruning interface as the underlying score-based method.

For clarity, the submodular objective above is written at the token level. In implementation, importance scores are maintained for every layer and attention head. Let $\mathbf{s} \in [0,1]^{L \times H \times N}$ denote the raw score tensor produced by a base scorer such as FastKVZip, and let $P_{\ell,h,i}$ indicate protected tokens, including attention sinks and the recent local window. HubKV transforms only the non-protected content scores and assigns protected tokens the maximum retention score; Appendix~\ref{app:protected_tokens} details the protected-token mask, its interaction with hub detection and head calibration, and the associated budget accounting.

\begin{enumerate}
    \item \textbf{Local Hub Detection:} For each layer $\ell$ and head $h$, we define a local window $\mathcal{W}_i = [i-k, i+k]$ and identify whether token $i$ is a local hub:
    \begin{equation}
    M_{\ell,h,i} = \mathbb{I}\!\left(i = \arg\max_{j \in \mathcal{W}_i} s_{\ell,h,j}\right),
    \end{equation}
    where deterministic index-based tie-breaking is used for score plateaus. This operation is implemented with a 1D max-pooling primitive over the sequence dimension.
    \item \textbf{Submodular Marginal Discounting:} For non-hub tokens, we apply a discount factor $\gamma \in (0,1)$ to approximate the residual marginal utility after a nearby hub has already covered the local region:
    \begin{equation}
    \tilde{s}_{\ell,h,i}
    = M_{\ell,h,i}s_{\ell,h,i}
    + (1-M_{\ell,h,i})\gamma s_{\ell,h,i}.
    \end{equation}
    This step is the local analogue of submodular diminishing returns: tokens near a stronger local representative are not discarded outright, but their scores are pushed downward before global pruning.
    \item \textbf{Bounded Head-Wise Selectivity Calibration:} Some attention heads exhibit sharp, spiky score distributions, while others remain flatter and less discriminative. We measure head selectivity on the non-protected content region using the coefficient of variation
    \begin{equation}
    c_{\ell,h} = \frac{\sigma_{\ell,h}}{\mu_{\ell,h}+\epsilon},
    \end{equation}
    where $\mu_{\ell,h}$ and $\sigma_{\ell,h}$ are computed over unprotected tokens. The relative head weight is then clipped to a bounded interval:
    \begin{equation}
    \begin{aligned}
    \beta_{\ell,h}
    &= \mathrm{clip}\!\left(
    \left(\frac{c_{\ell,h}}{\bar{c}_{\ell}}\right)^{\tau},
    \beta_{\min}, \beta_{\max}
    \right), \\
    u_{\ell,h,i} &= \beta_{\ell,h}\tilde{s}_{\ell,h,i},
    \end{aligned}
    \end{equation}
    where $\bar{c}_{\ell}$ is the mean selectivity across heads at layer $\ell$, $\tau$ controls calibration strength, and clipping prevents head calibration from overwhelming the base scorer.
    \item \textbf{Compression-Ratio Gating:} The redundancy correction should be weak when the cache is only lightly compressed and stronger when cache pressure is high. Given compression ratio $r$ and gate exponent $p$, HubKV mixes the raw and corrected scores as
    \begin{equation}
    \lambda = r^{p}, \qquad
    z_{\ell,h,i}
    = (1-\lambda)s_{\ell,h,i} + \lambda u_{\ell,h,i}.
    \end{equation}
    For protected tokens, we set $z_{\ell,h,i}=1$ so that sink and recent-window tokens remain available to the underlying pruning rule.
\end{enumerate}

Finally, HubKV passes the calibrated scores $\mathbf{z}$ to the same compression operator used by the base scorer: KV pairs with the lowest calibrated scores are evicted until the target compression ratio is reached, either globally or layer-wise depending on the base configuration. Equivalently, if $B$ KV entries are retained under a given budget, HubKV keeps the $B$ highest entries under $\mathbf{z}$. This keeps HubKV compatible with existing Top-K-style pruning implementations while changing the ranking criterion from purely modular scores to a redundancy-aware, ratio-gated correction. The additional score-refinement computation consists of local max-pooling, content-region head statistics, and element-wise score mixing, yielding $\mathcal{O}(LHNk)$ work with a small kernel size $k$ and no sequential marginal-gain loop. Because the cache interface and retained budget are unchanged, Section~\ref{sec:experiments} reports downstream evaluations under matched cache budgets.

We formally analyze the SMD core of HubKV in Appendix~\ref{app:theory}. Specifically, we establish the \textit{Local Hub Dominance} of the SMD-induced ranking and prove a \textit{Local Marginal Sandwich} (Prop.~\ref{prop:sandwich}), showing that the refined score $\tilde{s}_{\ell,h,i}$ serves as a local proxy for true marginal gains under localized overlap. We then characterize the ratio-gated head-wise calibration as a bounded perturbation of the base scores. This analysis intentionally separates three claims: the ideal coverage objective is submodular, sequential greedy is the theoretical oracle for that objective, and HubKV is a parallel local proxy designed for efficient inference rather than a sequential submodular optimizer.
