\section{Related Work}
\label{sec:related_works}

\paragraph{KV Cache Eviction and Sparsification.}
Efficient LLM inference is increasingly bottlenecked by the memory footprint of the Key-Value (KV) cache~\cite{kwon2023efficient}. StreamingLLM~\cite{xiao2024efficient} and LM-Infinite~\cite{han2024lm} preserve attention sinks and local context for stable long generation, while H2O~\cite{zhang2024h2o}, Scissorhands~\cite{liu2024scissorhands}, and FastGen~\cite{ge2023model} evict tokens using accumulated attention or observed attention patterns. These methods are effective, but they primarily operate during decoding and rely on fixed positional heuristics or past attention as a proxy for future utility.

\paragraph{Prefill-based and Query-Agnostic Compression.}
Prefill-time methods move compression earlier. SnapKV~\cite{li2024snapkv} and PyramidKV~\cite{zhang2024pyramidkv} profile prefill attention to choose persistent KV pairs, and DuoAttention~\cite{zhao2024duoattention} restricts a subset of heads to sliding-window attention while the remaining heads keep the full cache. Expected Attention~\cite{devoto2025expected} estimates token utility by approximating how a distribution of future queries will attend to cached keys, while KVZip~\cite{kim2026kvzip} estimates query-agnostic importance through context reconstruction. HubKV follows this score-based prefill interface, but changes the ranking criterion by discounting locally redundant high-score neighbors before pruning.

Recent work also studies structure inside the KV cache. ChunkKV~\cite{liu2026chunkkv} retains semantic chunks rather than isolated tokens, while FastKVZip~\cite{fastkv2026} predicts KV utility with a hardware-friendly gating unit. HubKV is complementary to such scorers: it does not replace their importance estimates, but refines them with a submodularity-motivated local redundancy penalty before the same Top-K-style eviction step.
Across these methods, token utility may be derived from past attention, future-query estimation, reconstruction, or learned prediction, yet the final retention rule is usually still an independent Top-K-style ranking. HubKV targets this shared final step, keeping the score source and cache layout unchanged while making the ranking locally redundancy-aware.

This distinction also separates HubKV from methods that change the unit of allocation or the attention pattern itself. Rather than assigning a new budget to each layer, head, or semantic chunk, HubKV operates after the base utility scores have already been computed and before the standard pruning operator is invoked. Because it consumes only per-token scores, it composes directly with existing scorers, including query-agnostic ones.
