\section{Experiments}
\label{sec:experiments}

This section evaluates whether the redundancy-aware score correction introduced by \textbf{HubKV} improves KV cache compression in practice. Because HubKV is a plug-in refinement rather than a standalone scorer, the central comparison is paired: we apply the same HubKV correction on top of KVZip and FastKVZip scores and compare each refined variant against its own base scorer under the same compression budget. We first report the main prefill results on Qwen3-8B across a diverse suite of long-context and reasoning tasks, and then test whether the same correction remains useful when the cache is compressed during decoding.

\subsection{Experimental Setup}

\textbf{Models, tasks, and baselines.} The main evaluation uses Qwen3-8B~\cite{yang2025qwen3} on a 16-task suite spanning contextual QA, retrieval and code use, reasoning, summarization, and in-context learning. HubKV is applied as a score-refinement layer on top of KVZip and FastKVZip, and each refined variant is compared with its own base scorer under the same cache budget. We also include SnapKV and Expected Attention~\cite{devoto2025expected} as attention-based baselines. Additional cross-backbone LongBench evaluations on Qwen2.5-7B-Instruct-1M, Llama-3.1-8B-Instruct, and Qwen3-14B are reported in Appendices~\ref{app:qwen25_longbench},~\ref{app:llama31_longbench}, and~\ref{app:qwen3_14b_longbench}. Detailed task groupings, metrics, and baseline descriptions are provided in Appendix~\ref{app:experimental_details}.

\textbf{Compression and implementation.} For prefill compression, the ratio $r$ denotes the fraction of KV entries removed, so a fraction $1-r$ of the cache is retained. We evaluate seven ratios, $r \in \{0.50, 0.75, 0.80, 0.85, 0.88, 0.90, 0.95\}$, and focus on paired differences at the same budget. Unless otherwise stated, HubKV uses a local kernel of size 5, discount factor $\gamma=0.5$, head-selectivity exponent $\tau=0.5$, clipping range $[0.8,1.2]$, and gate exponent $p=2$. HubKV changes only the scores: pruning then uses the same Top-K-style selection step as the base method. Full implementation and decoding-stage settings are deferred to Appendix~\ref{app:experimental_details}.
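
As a small worked illustration of this budget convention, consider a cache of $N$ KV entries with refined scores $\tilde{s}_1,\dots,\tilde{s}_N$. The symbols $N$, $k$, $\tilde{s}_i$, and $\mathcal{K}$ are introduced here for exposition only, and the rounding rule is an assumption rather than a setting stated in this section; whether the budget is allocated per head, per layer, or globally is likewise left to the base method. Under these assumptions, the retained set at ratio $r$ is
\[
    k(r) = \lfloor (1-r)\,N \rceil,
    \qquad
    \mathcal{K}(r) = \mathrm{TopK}\bigl(\{\tilde{s}_i\}_{i=1}^{N},\; k(r)\bigr).
\]
For instance, with $N = 32{,}000$ entries, $r=0.95$ retains about $1{,}600$ entries and $r=0.50$ retains $16{,}000$. Because a refined variant and its base scorer share the same $k(r)$, a paired difference at a given ratio reflects only which entries are ranked into $\mathcal{K}(r)$, not how many are kept.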

\subsection{Main Results}

Figure~\ref{fig:prefill_main_results} shows that HubKV helps most consistently under aggressive compression, where the retained budget is small and redundant local clusters are most damaging. This trend matches the design of the compression-ratio gate: when cache pressure is high, HubKV applies a stronger local marginal-gain correction; when the budget is looser, the refined score stays closer to that of the base scorer.

Figure~\ref{fig:longbench_average_model_comparison} isolates LongBench performance across model backbones. The averaged curves show the same pattern as the full Qwen3-8B suite: HubKV tracks or improves on the corresponding base scorer at moderate compression and separates more clearly under stronger compression. The paired summary in Table~\ref{tab:paired_prefill_gain} quantifies the gains by task family: the overall gain shrinks from $+3.23$ (FastKVZip) and $+6.53$ (KVZip) points at $r=0.95$ to roughly zero ($-0.02$ and $+0.03$) at $r=0.50$. Additional interpretation is provided in Appendix~\ref{app:detailed_results_analysis}.

\begin{table}[t]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{1.6pt}
    \begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}lrrrrrrrr@{}}
        \hline
        Family & 0.95 & 0.90 & 0.88 & 0.85 & 0.80 & 0.75 & 0.50 & Avg. \\
        \hline
        \multicolumn{9}{c}{\textit{HubKV(+FastKVZip) $-$ FastKVZip}} \\
        \hline
        Contextual QA & +3.91 & +4.56 & +4.00 & +2.65 & +1.14 & +0.90 & +0.06 & +2.46 \\
        Long-Context Use & +1.32 & +1.60 & +1.08 & $-$1.09 & +1.22 & +2.39 & +0.33 & +0.98 \\
        Reasoning QA & +4.60 & +3.80 & +3.71 & +4.22 & +1.32 & +0.35 & $-$0.24 & +2.54 \\
        Generation / ICL & +3.10 & +1.66 & +0.91 & +0.85 & $-$0.05 & $-$0.06 & $-$0.24 & +0.88 \\
        Overall & +3.23 & +2.90 & +2.42 & +1.66 & +0.91 & +0.89 & $-$0.02 & +1.71 \\
        \hline
        \multicolumn{9}{c}{\textit{HubKV(+KVZip) $-$ KVZip}} \\
        \hline
        Contextual QA & +7.29 & +3.47 & +1.81 & +0.70 & +0.21 & +0.06 & +0.48 & +2.00 \\
        Long-Context Use & +6.02 & +2.12 & +3.59 & +2.17 & +1.70 & +0.75 & $-$0.23 & +2.30 \\
        Reasoning QA & +8.48 & +2.46 & +1.45 & +2.54 & +1.78 & $-$0.43 & $-$0.06 & +2.32 \\
        Generation / ICL & +4.32 & +1.24 & +0.19 & $-$0.02 & +0.21 & +0.15 & $-$0.06 & +0.86 \\
        Overall & +6.53 & +2.32 & +1.76 & +1.35 & +0.98 & +0.13 & +0.03 & +1.87 \\
        \hline
    \end{tabular*}
    \caption{\textbf{Paired average gain by task family and compression ratio.} Columns are compression ratios $r$. Each value is the mean score-point difference between the HubKV-refined variant and its base scorer, averaged over all tasks in the family at the given compression ratio (results in Figure~\ref{fig:prefill_main_results}). The Avg.\ column is the mean of each row over the seven ratios.}
    \label{tab:paired_prefill_gain}
\end{table}
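
The quantities in Table~\ref{tab:paired_prefill_gain} can be written compactly as follows. The notation ($\mathcal{T}_f$, $s_t$, $\Delta_f$, $\mathcal{R}$) is introduced here only for illustration. Let $\mathcal{T}_f$ be the tasks in family $f$, let $s_t^{m}(r)$ be the score of method $m$ on task $t$ at compression ratio $r$, and let $\mathcal{R}$ be the set of seven evaluated ratios. Then each table entry and each Avg.\ value correspond to
\[
    \Delta_f(r) = \frac{1}{|\mathcal{T}_f|} \sum_{t \in \mathcal{T}_f}
        \bigl( s_t^{\mathrm{HubKV}}(r) - s_t^{\mathrm{base}}(r) \bigr),
    \qquad
    \overline{\Delta}_f = \frac{1}{|\mathcal{R}|} \sum_{r \in \mathcal{R}} \Delta_f(r).
\]
As a check on the Avg.\ column, the Contextual QA row for FastKVZip gives
\[
    \frac{3.91 + 4.56 + 4.00 + 2.65 + 1.14 + 0.90 + 0.06}{7} = \frac{17.22}{7} \approx 2.46,
\]
which matches the reported value.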

\subsection{Decoding-Stage Results}
\label{sec:decoding_results}

We further evaluate HubKV in a decoding-stage setting, where the cache is compressed repeatedly during generation. Figure~\ref{fig:decoding_results} compares HubKV(+FastKVZip) with FastKVZip and SnapKV on AIME25 and MATH~\cite{hendrycks2021MATH} using target KV lengths of 1024, 2048, 4096, and 6144. HubKV(+FastKVZip) improves over FastKVZip at every target length: on AIME25, accuracy rises from 6.67\% to 13.33\% at length 1024 and from 16.67\% to 20.00\% at larger lengths; on MATH, it gains 4.0 points at length 1024 and 1.1--1.4 points at the other lengths. These results suggest that the same local redundancy correction remains useful beyond one-shot prefill pruning.
