\section{Introduction}

The transformative success of Transformer-based Large Language Models (LLMs) is fundamentally anchored in the self-attention mechanism, which enables the model to contextualize information across vast sequences~\cite{vaswani2017attentionisallyouneed}. To maintain high throughput during autoregressive generation, modern inference systems rely on Key-Value (KV) caching to store historical context~\cite{dai2019transformer}. However, the memory footprint of the KV cache scales with sequence length, becoming the primary bottleneck for deploying LLMs in long-context scenarios~\cite{kwon2023efficient}.

To mitigate this burden, KV cache compression has emerged as a critical research direction. The goal of KV cache compression is to find effective ways to choose critical tokens to retain, from a massive sequence of past tokens, in order to maximize long-context inference performance. Although predicting the exact future attention of LLMs is highly complex, there exist many heuristics, such as attention sparsity~\cite{li2024snapkv}, adaptive head budgets~\cite{feng2026ada}, and query-agnostic reconstruction~\cite{kim2026kvzip}, which have been shown to be effective in practice. KV cache compression is typically a ranking process in which token importance scores are computed during the prefill stage, and a subset of tokens is chosen to be preserved using the aforementioned heuristics.

While these methods have demonstrated strong performance in accelerating LLM inference, they rely on a greedy Top-K selection strategy. In their standard formulation, these algorithms implicitly optimize a \textit{modular} objective, where the utility of each token is evaluated in complete isolation. However, this independence assumption ignores the semantic reality of language, as adjacent tokens (e.g., subwords of a single term or entities within a local phrase) often share highly overlapping information. Consequently, retaining clustered tokens provides diminishing marginal returns to the global context. Because of this, a pure Top-K approach tends to allocate the cache budget heavily on local high-scoring clusters, resulting in locally redundant samples that dilute the overall context coverage.

In order to tailor a compression method for highly correlated text sequences, we first conduct a systematic empirical study to characterize the spatial distribution of token importance indicators. 
Our analysis reveals two critical phenomena: (1) a sharp decline in the efficiency of standard Top-K selection in covering diverse context segments, and (2) high spatial autocorrelation in importance scores, suggesting that informational utility is statistically localized. 

This naturally suggests viewing cache retention as a coverage problem: a selected token should be valuable by itself, but its value should decrease when nearby tokens have already been retained.
Motivated by these findings, we re-examine KV cache eviction through the lens of \textit{Submodular Maximization}. 
Specifically, we model the cache utility as a monotone submodular function to naturally incorporate the principle of diminishing returns: the marginal gain of a token decreases if its semantic neighbors are already present in the cache~\cite{lin2011documentsummarization}. 
While sequential greedy solvers provide theoretical guarantees for such objectives, they introduce unacceptable sequential bottlenecks for real-time LLM inference. 

To bridge this gap, we propose \textbf{HubKV}, a hardware-parallel first-order ranking proxy based on Submodular Marginal Discounting (SMD). 
Operating directly on the temporal sequence, HubKV uses 1D max-pooling to identify local ``hub tokens'' and induces a local repulsion field by applying a soft penalty to their neighbors. 
Furthermore, HubKV incorporates a compression-ratio-gated head-wise selectivity mechanism that calibrates the strength of this redundancy correction according to both cache pressure and the discriminative power of individual attention heads.
The resulting calibrated scores can be passed directly to existing Top-K-style pruning operators, making HubKV a plug-in score refinement layer rather than a replacement cache policy.

Our main Qwen3-8B evaluation spans 16 long-context and reasoning tasks; appendix LongBench results cover three additional backbones. At 95\% prefill compression, HubKV improves the corresponding FastKVZip and KVZip baselines by 3.23 and 6.53 average score points on Qwen3-8B, respectively. Averaged across evaluated prefill budgets, the paired gains are +1.71/+1.87 points. Decoding-stage experiments further show consistent improvements over FastKVZip on AIME25 and MATH.

Our contributions are summarized as follows:
\begin{itemize}
    \item We empirically expose the modularity flaw in current Top-K eviction strategies through spatial profiling and autocorrelation analysis, and reveal head-wise functional diversity via raw attention visualization.
    \item We model KV cache compression as a score-weighted local coverage problem using a monotone submodular facility-location objective, and use sequential greedy selection as a conceptual oracle rather than as the deployed inference algorithm.
    \item We design HubKV, a hardware-parallel marginal-gain proxy that combines local hub detection, soft neighbor discounting, compression-ratio gating, and bounded head-wise selectivity calibration.
    \item We evaluate HubKV on four models across 16 tasks and decoding-stage reasoning benchmarks. The results show the strongest gains in the aggressive-compression regime while preserving the same pruning interface as modular baselines.
\end{itemize}
