\clearpage
\appendix

\section{Experimental Details}
\label{app:experimental_details}

The reported main prefill evaluation uses Qwen3-8B as the backbone model. The benchmark suite contains 16 tasks drawn from LongBench~\cite{bai2023longbench}, SCBench~\cite{li2025scbench}, RULER~\cite{hsieh2024ruler}, SQuAD~\cite{rajpurkar2016squad}, and GSM8K~\cite{cobbe2021gsm8k}. The selected tasks cover contextual question answering (En.QA, MultiFieldQA, Qasper, SQuAD), long-context retrieval and code use (Retr.KV, RULER-4K, Code.RepoQA, Code.LCC), multi-hop and mathematical reasoning (2WikiMQA, HotpotQA, MuSiQue, GSM8K), and generation or in-context learning (GovReport, MultiNews, SAMSum, ICL.ManyShot). We report the standard task metric used by each benchmark, including accuracy, F1, ROUGE-L, Pass@1, and code similarity. Because these metrics have different natural scales, we primarily interpret paired differences between a base scorer and its HubKV-refined counterpart on the same task and compression ratio.

Following KVZip and FastKVZip, the main experiment uses the prefill-stage compression setting, where a score is assigned to historical KV entries before downstream generation. The compression ratio $r$ denotes the fraction of KV entries removed, so $r=0$ corresponds to the full-cache reference and larger values correspond to more aggressive eviction. We evaluate $r \in \{0.50,0.75,0.80,0.85,0.88,0.90,0.95\}$, with particular attention to the high-compression regime where local redundancy is most likely to waste the remaining cache budget. The full-cache score is included as an uncompressed reference, but it should not be interpreted as an oracle upper bound: on some tasks, removing distracting context can improve the final task score.

We compare two deployed HubKV variants, HubKV(+FastKVZip) and HubKV(+KVZip), against their corresponding base scorers FastKVZip~\cite{fastkv2026} and KVZip~\cite{kim2026kvzip}. We additionally include SnapKV~\cite{li2024snapkv} and Expected Attention as representative attention-based baselines. This evaluation is query-agnostic for KVZip, FastKVZip, and their HubKV-refined variants, matching the intended use case where a compressed shared cache may be reused across downstream queries.

Unless otherwise stated, HubKV uses a 1D local max-pooling kernel of size 5, a non-hub discount factor $\gamma=0.5$, a bounded head-wise selectivity exponent $\tau=0.5$, clipping range $[0.8,1.2]$, and compression-ratio gate exponent $p=2$. Protected sink tokens and the recent local window follow the underlying base scorer and bypass the HubKV correction. HubKV does not alter the cache data structure or the final Top-K-style pruning interface; it only refines the layer-head token scores before the same budgeted eviction step is applied.

\subsection{Protected-Token Handling}
\label{app:protected_tokens}

HubKV follows the protected-token convention of the underlying KV compression method. The protected set contains attention sink tokens and the recent local window, which are retained to preserve stable long-context generation and short-range continuity. Let $T$ denote the current sequence length. We define the protected mask for each layer $\ell$, head $h$, and token position $i$ as
\begin{equation}
P_{\ell,h,i}
=
\mathbb{I}\!\left[
i \in \mathcal{P}_{\mathrm{sink}} \cup \mathcal{P}_{\mathrm{recent}}
\right],
\end{equation}
where $\mathcal{P}_{\mathrm{sink}}$ denotes the prefix attention-sink positions and $\mathcal{P}_{\mathrm{recent}}$ denotes the trailing recency window. In our default FastKVZip/KVZip-based configuration, we inherit the base scorer's protection rule; for the score-trace analysis in Appendix~\ref{app:observation_details}, this corresponds to four sink tokens and a recent window of $\lfloor 0.02T \rfloor$ tokens.

HubKV applies its redundancy-aware correction only to the unprotected content region
\begin{equation}
\mathcal{C}_{\ell,h} = \{i : P_{\ell,h,i}=0\}.
\end{equation}
Protected tokens are excluded from the content-region head statistics used for head-wise calibration, including the mean, standard deviation, and coefficient of variation. They are also masked out of local hub competition: for a content token $i$, local max-pooling is evaluated over unprotected positions in the local window, so protected sink or recency tokens do not suppress nearby content tokens.

For $i \in \mathcal{C}_{\ell,h}$, HubKV computes the local-hub mask, SMD-discounted score, head-calibrated score, and ratio-gated score as described in Section~\ref{sec:smd}. For protected positions, the final score is overwritten after score refinement:
\begin{equation}
z_{\ell,h,i} =
\begin{cases}
1, & P_{\ell,h,i}=1,\\
(1-\lambda)s_{\ell,h,i}+\lambda u_{\ell,h,i}, & P_{\ell,h,i}=0.
\end{cases}
\end{equation}
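Here $s_{\ell,h,i}$ is the base score, $u_{\ell,h,i}$ is the head-calibrated SMD score, and $\lambda = r^p$ is the compression-ratio gate (Appendix~\ref{app:theory}). This overwrite ensures that protected tokens bypass local discounting, head-wise calibration, and compression-ratio gating.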

Protected tokens are not an additional cache allocation introduced by HubKV. They are retained under the same budget accounting as the underlying compression method. Equivalently, if the base method retains $B_{\ell,h}$ KV entries for a given layer and head, protected tokens occupy part of this retained budget, and the remaining budget is assigned to the highest-scoring unprotected content tokens under $z_{\ell,h,i}$. When the base method uses a global or layer-wise pruning rule rather than a per-head rule, HubKV follows the same pruning granularity and only changes the scores passed to that pruning operator. The reported compression ratio is computed after final retention: the final retained cache includes both protected positions and selected content positions and is matched to the target budget whenever the target budget exceeds the number of protected positions, which holds for our evaluated settings.
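
As a worked illustration of this accounting, consider a hypothetical context of $T=4096$ tokens at $r=0.95$. The sketch assumes the default protection rule above (four sink tokens and a recent window of $\lfloor 0.02T \rfloor$ tokens) and a per-head retained budget of $B=\lfloor (1-r)T \rfloor$; the rounding convention is our assumption and may differ from the base scorer's implementation.
\begin{align*}
B &= \lfloor 0.05 \times 4096 \rfloor = 204, \\
|\mathcal{P}_{\mathrm{sink}} \cup \mathcal{P}_{\mathrm{recent}}| &= 4 + \lfloor 0.02 \times 4096 \rfloor = 4 + 81 = 85, \\
B - 85 &= 119 \ \text{content slots for } 4096 - 85 = 4011 \text{ content tokens}.
\end{align*}
Ignoring rounding, the target budget exceeds the protected set whenever $(1-r)T > 4 + 0.02T$, that is, $T > 4/(0.98-r)$. At $r=0.95$ this requires only $T > 4/0.03 \approx 133.3$ tokens.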

This design preserves the stability safeguards already used by the base compressor while preventing the local redundancy correction from being driven by tokens retained for architectural or generation-stability reasons. As a result, the paired comparisons in the main experiments isolate the effect of HubKV on the rank ordering of non-protected content tokens while keeping the protected-token policy fixed.

For the decoding-stage experiment, the cache is repeatedly compressed during generation rather than compressed only once after prefill. We use Qwen3-8B, set the cache update interval to 128 decoding steps, and evaluate fixed target KV lengths of 1024, 2048, 4096, and 6144. HubKV refines FastKVZip scores before applying the same target-length pruning operation, allowing the decoding-stage comparison to isolate the effect of the score correction under an identical cache budget.

The current evaluation isolates HubKV through paired comparisons against the same base scorer under the same cache budget. Table~\ref{tab:hubkv_ablation} complements the main evaluation with Qwen3-8B A100 ablations that contrast the final ratio-gated correction with an ungated variant and with several head-budget allocation alternatives.

\section{Ablation Results}
\label{app:ablation_results}

Table~\ref{tab:hubkv_ablation} summarizes Qwen3-8B A100 ablation results. The component block shows that the final ratio-gated HubKV variant performs best for both base scorers in this setting; for KVZip, the ungated variant falls below the base scorer ($-0.21$) while the gated variant stays positive ($+0.13$). The head-budget block tests a different design choice: reallocating per-head budgets by waterfilling, entropy, selectivity, or a hybrid score is not a reliable substitute for bounded score correction. HubKV is the only variant with clear positive gains for both FastKVZip ($+5.51$) and KVZip ($+2.11$) on the same task subset.

\begin{table}[!hbp]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{2.0pt}
    \begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}lrrrrrr@{}}
        \hline
        Branch & Base & Ungated & HubKV & $\Delta$ Ungated & $\Delta$ HubKV & $n$ \\
        \hline
        \multicolumn{7}{c}{\textit{Component ablation at $r=0.75$}} \\
        \hline
        FastKVZip & 58.62 & 59.99 & 60.50 & +1.37 & +1.88 & 6 \\
        KVZip & 56.28 & 56.07 & 56.41 & -0.21 & +0.13 & 6 \\
        \hline
        Branch & Base & HubKV & Waterfill & Entropy & Select. & Hybrid \\
        \hline
        \multicolumn{7}{c}{\textit{Head-budget alternatives at $r=0.88$}} \\
        \hline
        FastKVZip & 34.99 & +5.51 & -3.50 & -7.57 & -16.77 & -7.46 \\
        KVZip & 35.82 & +2.11 & +0.18 & -7.33 & -19.32 & -9.07 \\
        \hline
    \end{tabular*}
    \caption{\textbf{Ablation of HubKV design choices on Qwen3-8B.} The component block reports average scores and gains over the paired base scorer on six shared ablation tasks; ``Ungated'' applies the local hub discounting and head-wise calibration without the compression-ratio gate. The head-budget block reports gains over the paired base scorer on six shared ablation tasks.}
    \label{tab:hubkv_ablation}
\end{table}

\section{Detailed Results Analysis}
\label{app:detailed_results_analysis}

The paired summary in Table~\ref{tab:paired_prefill_gain} confirms that the improvement in Figure~\ref{fig:prefill_main_results} is not merely a visual artifact of overlapping curves. Across all 112 task-ratio pairs, HubKV(+FastKVZip) improves over FastKVZip by an average of 1.71 points, while HubKV(+KVZip) improves over KVZip by an average of 1.87 points. The gains are strongest under aggressive compression: at $r=0.95$, the average improvements are +3.23 and +6.53 points, respectively. In absolute terms, the average score at $r=0.95$ rises from 21.42 to 24.65 for the FastKVZip branch, and from 17.69 to 24.22 for the KVZip branch.

The task-family breakdown shows that the improvement is broadly distributed rather than concentrated in a single benchmark group. For the FastKVZip branch, contextual QA and reasoning QA show the largest average gains, while the KVZip branch benefits most on long-context use and reasoning QA. This pattern is consistent with the motivation of HubKV: tasks that require retaining evidence from several separated parts of the context are more likely to benefit when the cache budget is not spent on locally redundant high-score neighbors.

The averaged LongBench comparison in Figure~\ref{fig:longbench_average_model_comparison} provides an additional model-level view. The Qwen backbones retain both FastKVZip and KVZip branches, and HubKV generally improves or closely tracks the paired base scorer. On Llama-3.1-8B-Instruct, the available data contain the KVZip branch, where the HubKV-refined curve separates most clearly at high compression. The exact set of compression ratios differs across the exported runs, so the figure should be read as a backbone-level robustness check rather than as a strict model ranking.

The results also expose an expected limitation of the method. At moderate compression, especially $r=0.50$, the paired gains largely vanish and can become slightly negative. This is consistent with HubKV's role as a bounded correction rather than a replacement scorer: when the cache budget is already sufficient, suppressing neighboring high-score tokens is less necessary and may occasionally perturb a useful local evidence chain. The main benefit therefore lies in the aggressive compression regime, where retaining multiple near-duplicate tokens is most expensive and a local marginal-gain proxy can redirect the budget toward broader context coverage.

\section{Code and Math Locality Stress Test}
\label{app:code_math_stress}

HubKV uses positional locality as a redundancy prior: nearby high-scoring tokens are often overlapping representatives of the same local evidence. This prior is useful on average, but it is not a semantic-equivalence oracle. Code and mathematical expressions provide a natural stress test because adjacent symbols can be complementary rather than redundant, and useful dependencies may also be non-contiguous.

We therefore isolate code and math tasks from the Qwen3-8B A100 snapshot and report paired gains against the corresponding base scorer under the same cache budget. The prefill tasks are Code.RepoQA, Code.LCC, and GSM8K. We also include the decoding-stage MATH result from the fixed-target-KV experiment in Section~\ref{sec:decoding_results}. Table~\ref{tab:code_math_gain_loss} reports both positive and negative entries instead of only aggregate averages.

\begin{table*}[t]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{2.4pt}
    \begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}llrrrrrrrrl@{}}
        \hline
        Task & Branch & 0.95 & 0.90 & 0.88 & 0.85 & 0.80 & 0.75 & 0.50 & Avg. & Worst \\
        \hline
        Code.RepoQA & H(+FastKVZip) & +0.00 & +0.23 & -0.68 & -0.68 & +2.05 & +1.82 & -0.46 & +0.33 & -0.68 @ 0.88 \\
        Code.RepoQA & H(+KVZip) & +0.69 & +0.00 & +1.14 & +0.68 & +0.23 & +1.14 & -0.46 & +0.49 & -0.46 @ 0.50 \\
        Code.LCC & H(+FastKVZip) & +2.86 & +0.99 & +0.81 & +0.66 & +0.02 & +1.10 & -0.07 & +0.91 & -0.07 @ 0.50 \\
        Code.LCC & H(+KVZip) & +10.10 & +3.43 & +2.55 & +2.73 & +1.69 & +0.76 & +0.23 & +3.07 & +0.23 @ 0.50 \\
        GSM8K & H(+FastKVZip) & +1.00 & +5.00 & +17.00 & +10.00 & +4.00 & +1.00 & -2.00 & +5.14 & -2.00 @ 0.50 \\
        GSM8K & H(+KVZip) & +1.00 & +1.00 & +1.00 & +5.00 & +3.00 & -1.00 & -1.00 & +1.29 & -1.00 @ 0.75 \\
        \hline
    \end{tabular*}
    \caption{\textbf{Code and math gain-loss analysis on Qwen3-8B.} Each value is the paired score difference between HubKV-refined scores and the corresponding base scorer under the same prefill compression ratio. ``H'' denotes HubKV. Positive values indicate improvements; negative values identify cases where the locality prior can perturb useful local evidence chains.}
    \label{tab:code_math_gain_loss}
\end{table*}

The aggregate results remain positive on average for all six prefill branches, especially on Code.LCC and GSM8K. At the same time, the negative entries are informative. For example, HubKV(+FastKVZip) loses 0.68 points on Code.RepoQA at $r=0.88$ and $r=0.85$, and loses 2.00 points on GSM8K at $r=0.50$. These pockets support the limitation that local discounting can occasionally demote adjacent tokens that jointly encode a code fragment, formula, or reasoning chain.

The decoding-stage MATH result is positive at all tested target KV lengths. Relative to FastKVZip, HubKV(+FastKVZip) improves MATH accuracy by +4.00, +1.40, +1.10, and +1.10 points at target KV lengths 1024, 2048, 4096, and 6144, respectively, for an average gain of +1.90 points. This suggests that the locality correction remains useful in math reasoning overall, despite the failure modes exposed by individual prefill budgets.

We also ran a retained-token chain diagnostic on 32 short examples, split evenly between code snippets and math snippets. For each token inside the code or formula span, we measured the fraction of layer/head positions in which that token was retained by the base FastKVZip scorer and by HubKV(+FastKVZip). Table~\ref{tab:retained_chain_diagnostic} summarizes the mean retention fraction over the annotated spans.

\begin{table}[t]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{2.0pt}
    \begin{tabular*}{\linewidth}{@{\extracolsep{\fill}}rrrrrrr@{}}
        \hline
        $r$ & Ex. & Tok. & Base & HubKV & $\Delta$ & Changed \\
        \hline
        0.50 & 32 & 415 & .4298 & .4184 & -.0114 & 82.2\% \\
        0.75 & 32 & 415 & .1870 & .1677 & -.0193 & 90.4\% \\
        0.80 & 32 & 415 & .1353 & .1155 & -.0197 & 91.8\% \\
        0.85 & 32 & 415 & .0849 & .0641 & -.0207 & 89.6\% \\
        0.88 & 32 & 415 & .0532 & .0303 & -.0228 & 91.8\% \\
        0.90 & 32 & 415 & .0307 & .0100 & -.0207 & 60.7\% \\
        0.95 & 32 & 415 & .0131 & .0000 & -.0131 & 26.3\% \\
        \hline
    \end{tabular*}
    \caption{\textbf{Retained-token chain diagnostic for code and math spans.} Values are mean per-token retention fractions over annotated code/formula spans, averaged across examples and positions. The diagnostic is qualitative: lower span retention under HubKV is consistent with the small negative pockets in Table~\ref{tab:code_math_gain_loss}, but downstream benchmark deltas remain the primary evidence.}
    \label{tab:retained_chain_diagnostic}
\end{table}

Taken together, these results support a conservative interpretation. HubKV is generally beneficial on code and math groups in the tested setting, but the locality prior should be described as a redundancy heuristic rather than a guarantee that adjacent tokens are interchangeable. This is why the deployed method uses soft discounting, bounded head-wise calibration, and compression-ratio gating instead of hard non-maximum suppression.

\section{Efficiency Analysis}
\label{app:efficiency}

We evaluate the computational cost of HubKV in two ways. First, a microbenchmark isolates the score-refinement path using synthetic score tensors with the Qwen3-8B score shape $[L, B, H_{KV}, N]$, where $L=36$ and $H_{KV}=8$. Second, an end-to-end benchmark loads Qwen3-8B in \texttt{bfloat16} and compares FastKVZip with HubKV(+FastKVZip) at compression ratio $r=0.95$. The HubKV configuration is the same mild ratio-gated local-hub variant used in the main experiments.

Table~\ref{tab:efficiency_microbenchmark} reports the score-stage microbenchmark. The measured operation ``HubKV refine+select'' applies local hub detection, bounded head-wise calibration, ratio-gated score mixing, and then the same bottom-score selection plan as the base method. The relative overhead is visible because the base selection stage is already small; however, the absolute latency remains in the millisecond range.

\begin{table*}[t]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{3.0pt}
    \begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}lrrrrrr@{}}
        \hline
        Scorer & $N$ & Batch & Base select ms & HubKV refine+select ms & Overhead & Peak MB \\
        \hline
        FastKVZip & 4096 & 1 & 1.50 & 2.51 & 67.7\% & 47.32 \\
        FastKVZip & 4096 & 4 & 6.74 & 10.19 & 51.3\% & 182.15 \\
        FastKVZip & 8192 & 1 & 3.10 & 4.76 & 53.7\% & 91.98 \\
        FastKVZip & 8192 & 4 & 11.86 & 17.40 & 46.8\% & 362.37 \\
        FastKVZip & 16384 & 1 & 6.75 & 10.18 & 50.7\% & 182.15 \\
        FastKVZip & 16384 & 4 & 15.35 & 20.96 & 36.5\% & 722.80 \\
        FastKVZip & 32768 & 1 & 11.38 & 17.90 & 57.3\% & 362.37 \\
        FastKVZip & 32768 & 4 & 22.15 & 26.73 & 20.7\% & 1445.37 \\
        KVZip & 4096 & 1 & 1.48 & 2.49 & 68.0\% & 47.32 \\
        KVZip & 4096 & 4 & 6.77 & 10.20 & 50.6\% & 182.15 \\
        KVZip & 8192 & 1 & 3.09 & 4.76 & 54.1\% & 91.98 \\
        KVZip & 8192 & 4 & 11.88 & 18.40 & 55.0\% & 362.37 \\
        KVZip & 16384 & 1 & 6.76 & 10.08 & 49.1\% & 182.15 \\
        KVZip & 16384 & 4 & 14.47 & 21.01 & 45.2\% & 722.80 \\
        KVZip & 32768 & 1 & 11.88 & 17.40 & 46.4\% & 362.37 \\
        KVZip & 32768 & 4 & 20.61 & 26.87 & 30.4\% & 1445.37 \\
        \hline
    \end{tabular*}
    \caption{\textbf{Score-stage efficiency microbenchmark.} Synthetic score tensors match the Qwen3-8B layer/head shape and use \texttt{bfloat16}. The table reports median latency over the measured repeats. Across all rows, the maximum relative overhead is 68.0\%, but the absolute refine-plus-select latency remains below 27 ms even at $N=32768$ and batch size 4.}
    \label{tab:efficiency_microbenchmark}
\end{table*}
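
For reference, the Overhead column is consistent with the relative increase in median latency, $(t_{\mathrm{HubKV}} - t_{\mathrm{base}})/t_{\mathrm{base}}$. This definition is our reading of the table rather than a stated formula. Using the two FastKVZip rows at batch size 4 and the largest sequence lengths:
\begin{align*}
N=16384:&\quad \frac{20.96 - 15.35}{15.35} = \frac{5.61}{15.35} \approx 36.5\%, \\
N=32768:&\quad \frac{26.73 - 22.15}{22.15} = \frac{4.58}{22.15} \approx 20.7\%.
\end{align*}
In these two rows the absolute increase stays near 5 ms while the base latency grows, which is why the relative overhead shrinks. Other rows can differ from this recomputation by a few tenths of a point, presumably because the displayed latencies are rounded.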

Table~\ref{tab:efficiency_e2e} reports the end-to-end timing result for Qwen3-8B with batch size 1 and 32 decoded tokens. In this full inference path, the score-refinement overhead is largely hidden by the model prefill and decode computation. The maximum measured TTFT overhead is +1.26\%, and the median TTFT overhead is -0.12\%. Prefill throughput stays between 0.984$\times$ and 1.012$\times$ of FastKVZip, and decode throughput is effectively unchanged.

\begin{table*}[t]
    \centering
    \scriptsize
    \setlength{\tabcolsep}{3.2pt}
    \begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}lrrrrr@{}}
        \hline
        Method & $N$ & TTFT s & Prefill tok/s & Decode tok/s & Peak MB \\
        \hline
        FastKVZip & 4096 & 0.549 & 8672.9 & 13.5 & 52.7 \\
        HubKV(+FastKVZip) & 4096 & 0.552 & 8658.5 & 13.2 & 61.7 \\
        FastKVZip & 8192 & 1.006 & 8906.6 & 12.0 & 104.7 \\
        HubKV(+FastKVZip) & 8192 & 0.997 & 9013.5 & 11.9 & 123.6 \\
        FastKVZip & 16384 & 1.967 & 8821.0 & 9.3 & 207.5 \\
        HubKV(+FastKVZip) & 16384 & 1.992 & 8681.9 & 9.6 & 244.4 \\
        FastKVZip & 32768 & 4.370 & 7754.1 & 7.0 & 415.6 \\
        HubKV(+FastKVZip) & 32768 & 4.339 & 7816.6 & 7.0 & 487.6 \\
        \hline
    \end{tabular*}
    \caption{\textbf{End-to-end efficiency on Qwen3-8B.} We compare FastKVZip and HubKV(+FastKVZip) at compression ratio $r=0.95$, batch size 1, and 32 generated tokens. TTFT denotes prefill plus the first decode step. The measured TTFT overheads for HubKV are +0.49\%, -0.89\%, +1.26\%, and -0.72\% for $N=4096,8192,16384,32768$, respectively.}
    \label{tab:efficiency_e2e}
\end{table*}
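
The summary statistics quoted above can be recovered from the table and its caption. In this sketch, the median is taken over the four reported TTFT overheads, and each throughput ratio divides the HubKV(+FastKVZip) prefill rate by the FastKVZip rate at the same $N$:
\begin{align*}
\mathrm{median}\{-0.89,\,-0.72,\,+0.49,\,+1.26\}\% &= \tfrac{1}{2}(-0.72 + 0.49)\% = -0.115\%, \\
\frac{8681.9}{8821.0} &\approx 0.984 \quad (N=16384), \\
\frac{9013.5}{8906.6} &\approx 1.012 \quad (N=8192).
\end{align*}
The first value matches the reported median of $-0.12\%$ after rounding, and the two ratios are the extremes of the prefill-throughput range.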

The efficiency results qualify the lightweight claim. HubKV adds measurable work to the score-selection stage, but this work is small in absolute terms and does not produce a meaningful end-to-end TTFT regression in the tested Qwen3-8B setup. Peak memory increases with sequence length because HubKV materializes temporary score-refinement tensors, from +9.0 MB at 4k to +72.0 MB at 32k relative to FastKVZip, which is modest compared with the full model inference footprint.
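
The peak-memory gap can be read off Table~\ref{tab:efficiency_e2e} row by row:
\begin{align*}
N=4096:&\quad 61.7 - 52.7 = 9.0\ \text{MB}, \\
N=8192:&\quad 123.6 - 104.7 = 18.9\ \text{MB}, \\
N=16384:&\quad 244.4 - 207.5 = 36.9\ \text{MB}, \\
N=32768:&\quad 487.6 - 415.6 = 72.0\ \text{MB}.
\end{align*}
Each doubling of $N$ roughly doubles the gap (ratios of about $2.10$, $1.95$, and $1.95$), which is what one would expect if the temporary refinement tensors scale linearly with sequence length.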

\section{Additional LongBench Results on Qwen2.5-7B-Instruct-1M}
\label{app:qwen25_longbench}

We further evaluate the same prefill-compression setting on Qwen/Qwen2.5-7B-Instruct-1M and report LongBench average results. The full-cache reference is taken from FastKVZip at compression ratio $r=0.00$.
Averaged over LongBench and five compressed ratios, HubKV improves FastKVZip by +1.14 average points and KVZip by +1.16 average points. The gains grow with compression pressure, reaching +1.63 points for the FastKVZip branch and +2.29 points for the KVZip branch at $r=0.88$.

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{pics/qwen25_longbench_appendix_2x5.pdf}
    \caption{\textbf{Additional LongBench results on Qwen2.5-7B-Instruct-1M.} We compare LongBench average scores from HubKV-refined KVZip/FastKVZip against the corresponding base scorers, SnapKV, Expected Attention, and the full-cache reference. The x-axis reports the true KV compression ratio, ordered from aggressive compression on the left to full cache on the right.}
    \label{fig:qwen25_longbench_appendix}
\end{figure*}

\section{Additional LongBench Results on Llama-3.1-8B-Instruct}
\label{app:llama31_longbench}

We also evaluate the same LongBench average setting on meta-llama/Llama-3.1-8B-Instruct, aligned with Appendix~\ref{app:qwen25_longbench}. The exported Llama-3.1 data contains the KVZip branch, Expected Attention, SnapKV, and the full-cache reference from SnapKV at compression ratio $r=0.00$.
Averaged over LongBench and seven compressed ratios, HubKV improves KVZip by +1.03 average points. The average gain is small near moderate compression but increases in the high-compression regime, reaching +5.56 points at $r=0.95$.

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{pics/llama31_longbench_appendix_2x5.pdf}
    \caption{\textbf{Additional LongBench results on Llama-3.1-8B-Instruct.} We compare LongBench average scores from HubKV-refined KVZip against KVZip, SnapKV, Expected Attention, and the full-cache reference. The x-axis reports the true KV compression ratio, ordered from aggressive compression on the left to full cache on the right, covering compressed ratios $r \in \{0.50,0.75,0.80,0.85,0.88,0.90,0.95\}$ plus the full-cache point $r=0.00$.}
    \label{fig:llama31_longbench_appendix}
\end{figure*}

\section{Additional LongBench Results on Qwen3-14B}
\label{app:qwen3_14b_longbench}

We further evaluate the same LongBench average setting on Qwen/Qwen3-14B, aligned with Appendix~\ref{app:llama31_longbench}. This run includes both FastKVZip and KVZip branches, together with Expected Attention, SnapKV, and the full-cache reference from SnapKV at compression ratio $r=0.00$.
Averaged over LongBench and five compressed ratios, HubKV improves FastKVZip by +0.90 average points and KVZip by +1.26 average points. The gains are again larger under stronger compression, reaching +1.60 points for the FastKVZip branch and +1.89 points for the KVZip branch at $r=0.88$.

\begin{figure*}[t]
    \centering
    \includegraphics[width=\textwidth]{pics/qwen3_14b_longbench_appendix_2x5.pdf}
    \caption{\textbf{Additional LongBench results on Qwen3-14B.} We compare LongBench average scores from HubKV-refined KVZip/FastKVZip against the corresponding base scorers, SnapKV, Expected Attention, and the full-cache reference. The x-axis reports the true KV compression ratio, ordered from aggressive compression on the left to full cache on the right, covering compressed ratios $r \in \{0.50,0.75,0.80,0.85,0.88\}$ plus the full-cache point $r=0.00$.}
    \label{fig:qwen3_14b_longbench_appendix}
\end{figure*}

\section{Details of Observation Experiments}
\label{app:observation_details}

The observation study in Section~\ref{sec:empirical} is a diagnostic analysis of score locality rather than a task-level evaluation. We therefore separate score capture from eviction. The score traces are produced by \texttt{insights/capture\_scores.py}, which wraps the KVZip~\cite{kim2026kvzip} scoring module with a capture-only press. The wrapper runs Qwen3-8B on 500 RULER-4096 test contexts~\cite{hsieh2024ruler}, truncates each input to 2,048 tokens, triggers the prefill scoring phase through a one-token greedy generation call, and saves the resulting score tensor without evicting any KV entry. Each saved tensor has shape $36 \times 8 \times 2048$, corresponding to model layers, KV heads, and token positions.

All three score-based observation experiments use the same content-token preprocessing in \texttt{insights/analyze.py}. We remove the four protected sink tokens and the trailing local recency window used by the base scorer. With the 2,048-token inputs used here, the recency window is $\lfloor 0.02T \rfloor = 40$ tokens, leaving 2,004 content positions per sample. Unless otherwise stated, a token's scalar importance is the mean score over all layers and KV heads.

For Observation 1, we sort content tokens by descending importance and divide the content region into 64 equal-width segments. At each Top-K selection step, the marginal spatial gain is $1/64$ if the selected token falls in a segment not previously covered, and 0 otherwise. The curve in Figure~\ref{fig:empirical_analysis}(a) reports this marginal gain over the first 80 selections, averaged over 500 samples, with a fixed random-order baseline used only to contextualize the coverage saturation behavior.
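
One way to write this marginal-gain curve explicitly is sketched below. The notation is ours and only illustrative: $\pi_k$ denotes the content position selected at step $k$, $n=2004$ is the number of zero-indexed content positions, and $g(\cdot)$ maps a position to one of $G=64$ segments.
\begin{align*}
g(i) &= \left\lfloor \frac{G\, i}{n} \right\rfloor, \qquad i \in \{0,\ldots,n-1\}, \\
m_k &= \frac{1}{G}\, \mathbb{I}\!\left[ g(\pi_k) \notin \{g(\pi_1),\ldots,g(\pi_{k-1})\} \right], \\
C_K &= \sum_{k=1}^{K} m_k \le \frac{\min(K,G)}{G}.
\end{align*}
Each segment spans about $2004/64 \approx 31.3$ positions, and the cumulative coverage $C_K$ saturates at $1$ once all 64 segments have been hit. As a rough reference point, if selections were independent uniform draws over segments (an idealization of a random order), the expected coverage after $K=80$ selections would be $1-(1-1/64)^{80} \approx 0.72$.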

For Observation 2, we measure locality through the normalized autocorrelation of the mean importance sequence. For each sample, we center the importance vector and compute $R(d)=\mathbb{E}_i[(s_i-\bar{s})(s_{i+d}-\bar{s})]/\mathbb{E}_i[(s_i-\bar{s})^2]$ for token distances $d=1,\ldots,150$. Figure~\ref{fig:empirical_analysis}(b) reports the mean and sample standard deviation across the 500 score traces.

For Observation 3, the 1D-NMS diagnostic uses a max-pooling neighborhood of size 7. Tokens that are local maxima keep their original importance score, while non-maximal neighbors are multiplied by 0.3. We then rank tokens by the NMS-adjusted score and compare its segment coverage against ordinary Top-K under budgets $K=1,\ldots,80$. The uniform-spread line in Figure~\ref{fig:empirical_analysis}(c) is an oracle-style reference that places selected tokens evenly across the content sequence; it is not a deployable eviction baseline.
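
The NMS adjustment can be written compactly as follows. The centered radius-3 window (for a kernel of size 7) and truncation at the sequence boundaries are our assumptions about the implementation.
\[
\hat{s}_i =
\begin{cases}
s_i, & s_i = \max_{|j-i| \le 3} s_j,\\
0.3\, s_i, & \text{otherwise}.
\end{cases}
\]
For a toy sequence of eight scores $s = (0.9,\, 0.8,\, 0.7,\, 0.1,\, 0.1,\, 0.1,\, 0.1,\, 0.6)$, ordinary Top-3 selects the three adjacent positions 1, 2, and 3. After the adjustment, $\hat{s} = (0.9,\, 0.24,\, 0.21,\, 0.03,\, 0.03,\, 0.03,\, 0.03,\, 0.6)$, so Top-3 becomes positions 1, 8, and 2, which reaches both ends of the sequence under the same budget.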

The raw-attention validation uses \texttt{insights/capture\_attention\_hooks.py}. It runs Qwen3-8B with eager attention and forward hooks, then summarizes the content-only causal attention maps instead of materializing all full maps in Python. The saved validation trace used for the appendix figures is from LongBench 2WikiMQA, sample index 47, at 2,048 tokens. For each layer $\ell$ and head $h$, the key-side attention density is computed as $a_{\ell,h}(j)=|\mathcal{Q}|^{-1}\sum_{i \in \mathcal{Q}} A_{\ell,h}(i,j)$ over content query positions. We also store layer-wise distance profiles up to distance 128 and 192-token local crops from layers 4, 18, and 32 around the strongest key-mass region in the probe layer. The additional raw-attention autocorrelation plot trims the earliest 100 content tokens to reduce prompt-prefix bias before computing $R(d)$.

\section{Theoretical Analysis of the SMD Ranking Proxy}
\label{app:theory}

This appendix analyzes the Submodular Marginal Discounting (SMD) core of HubKV and then characterizes the compression-ratio-gated head-wise calibration as a bounded perturbation of the base importance scores. We formalize the facility location objective, establish local properties of the SMD proxy, and discuss its global approximation characteristics.

\subsection{Properties of the Score-Weighted Coverage Objective}

We first define the theoretical utility function that SMD aims to approximate.
\begin{definition}[Score-Weighted Local Coverage]
Let $w_{x,j} = s_j \frac{\mathcal{K}(|x-j|)}{Z_j}$ be the importance-weighted coverage kernel, where $Z_j = \sum_{x \in V} \mathcal{K}(|x-j|)$ is the normalization factor. The utility of a set $S$ is defined as $F_{sub}(S) = \sum_{x \in V} \max_{j \in S} w_{x,j}$.
\end{definition}

\begin{proposition}[Monotone Submodularity]
The normalized objective $F_{sub}(S)$ is monotone non-decreasing and submodular, with the singleton utility property $F_{sub}(\{j\}) = s_j$ for all $j \in V$.
\end{proposition}
\textit{Proof.} Monotonicity and submodularity are standard properties of facility location functions with non-negative weights, which holds here whenever $s_j \ge 0$ and $\mathcal{K} \ge 0$. The singleton property follows from normalization: $F_{sub}(\{j\}) = \sum_x w_{x,j} = \frac{s_j}{Z_j} \sum_x \mathcal{K}(|x-j|) = s_j$. \hfill $\blacksquare$

\subsection{Local Properties of SMD}

Unlike iterative greedy solvers, SMD is a one-shot parallel algorithm. We first show that it strictly preserves local importance peaks.

\begin{theorem}[Local Hub Dominance]
For any local window $\mathcal{W}_h$ with non-negative scores and a unique local maximum (hub) $h$ such that $s_h > s_i$ for all $i \in \mathcal{W}_h \setminus \{h\}$, the SMD algorithm ensures that the hub remains ranked above all its discounted neighbors: $\tilde{s}_h > \tilde{s}_i$.
\end{theorem}
\textit{Proof.} By the SMD refined score definition, $\tilde{s}_h = s_h$ and $\tilde{s}_i = \gamma s_i$. Since $s_i \ge 0$ and $\gamma \in (0, 1)$, we have $\gamma s_i \le s_i < s_h$, and therefore $\tilde{s}_h > \tilde{s}_i$. \hfill $\blacksquare$

We now establish the ``Marginal Sandwich,'' which relates the refined score $\tilde{s}_i$ to the true submodular marginal gain $\Delta(i \mid \{h\})$.

\begin{proposition}[Local Marginal Sandwich]
\label{prop:sandwich}
Let $h$ be a selected hub and $i \in \mathcal{W}_h$ be a neighbor. Define the coverage overlap ratio as $\rho(i,h) = \frac{1}{s_i} \sum_{x \in V} \min(w_{x,i}, w_{x,h})$. The exact marginal gain is $\Delta(i \mid \{h\}) = (1 - \rho(i,h))s_i$. 
Furthermore, if $\rho(i,h) \in [1-\gamma-\eta, 1-\gamma+\eta]$ for some error $\eta \ge 0$, then:
\begin{equation}
(\gamma - \eta) s_i \le \Delta(i \mid \{h\}) \le (\gamma + \eta) s_i,
\end{equation}
and the refined score $\tilde{s}_i = \gamma s_i$ satisfies $|\Delta(i \mid \{h\}) - \tilde{s}_i| \le \eta s_i$.
\end{proposition}
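\textit{Proof.} For the facility location objective, $\Delta(i \mid \{h\}) = \sum_x \max(0, w_{x,i} - w_{x,h})$. Using $\max(0, A-B) = A - \min(A,B)$ and the singleton property $\sum_x w_{x,i} = s_i$, we have $\Delta(i \mid \{h\}) = s_i - \Omega(i,h) = (1 - \rho(i,h))s_i$, where $\Omega(i,h) = \sum_x \min(w_{x,i}, w_{x,h}) = \rho(i,h)\,s_i$ is the overlap mass. The bounds follow by substituting the assumed interval for $\rho(i,h)$, and the final claim holds because $\tilde{s}_i = \gamma s_i$ is the midpoint of the resulting interval. \hfill $\blacksquare$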

\textit{Interpretation.} This result indicates that when the local kernel overlap $\rho$ is approximately $1-\gamma$, SMD's discounted score acts as a first-order proxy for the true marginal utility. By the submodularity of $F_{sub}$, for any selected set $S$ that contains $h$ we further have $\Delta(i \mid S) \le \Delta(i \mid \{h\}) \le (\gamma + \eta) s_i$, implying that SMD serves as an approximate upper proxy for the marginal gain in highly redundant clusters.
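
\textit{Worked example.} As a numerical illustration of Proposition~\ref{prop:sandwich}, consider the purely hypothetical values $\gamma = 0.5$, $\eta = 0.05$, $s_i = 0.8$, and $\rho(i,h) = 0.52$. These numbers are chosen for arithmetic convenience and are not the deployed hyperparameters. The overlap assumption holds because $\rho(i,h) \in [1-\gamma-\eta,\, 1-\gamma+\eta] = [0.45, 0.55]$. The exact marginal gain and the refined score are
\[
\Delta(i \mid \{h\}) = (1 - 0.52)\cdot 0.8 = 0.384,
\qquad
\tilde{s}_i = \gamma s_i = 0.4 ,
\]
so the sandwich reads
\[
(\gamma-\eta)s_i = 0.36 \;\le\; 0.384 \;\le\; 0.44 = (\gamma+\eta)s_i ,
\qquad
|\Delta(i \mid \{h\}) - \tilde{s}_i| = 0.016 \;\le\; \eta s_i = 0.04 .
\]
In this instance the discounted score overestimates the marginal gain by $0.016$, which is within the $\eta s_i$ tolerance.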

\subsection{Bounded Effect of Ratio-Gated Head Calibration}

The deployed HubKV variant mixes the raw base score with the SMD-corrected score. For a non-protected content token, write the SMD head-calibrated score as $u_i = \beta_i q_i s_i$, where $q_i \in \{\gamma,1\}$ indicates whether token $i$ is a non-hub or a hub, and $\beta_i \in [\beta_{\min}, \beta_{\max}]$ is the clipped head-wise selectivity weight. Given compression ratio $r$ and gate exponent $p$, the final score is
\begin{equation}
z_i = (1-\lambda)s_i + \lambda u_i, \qquad \lambda = r^p.
\end{equation}

\begin{proposition}[Bounded Ratio-Gated Perturbation]
For every non-protected token with $s_i \ge 0$, the final HubKV score satisfies
\begin{equation}
\left(1-\lambda+\lambda\gamma\beta_{\min}\right)s_i
\le z_i \le
\left(1-\lambda+\lambda\beta_{\max}\right)s_i .
\end{equation}
Consequently, HubKV is a bounded multiplicative perturbation of the base scorer. Moreover, for any two non-protected tokens $a$ and $b$ with $s_b > 0$, if
\begin{equation}
\frac{s_a}{s_b}
>
\frac{1-\lambda+\lambda\beta_{\max}}
{1-\lambda+\lambda\gamma\beta_{\min}},
\end{equation}
then $z_a > z_b$ regardless of their local hub status or clipped head weights.
\end{proposition}
\textit{Proof.} Since $q_i \in \{\gamma,1\}$ and $\beta_i \in [\beta_{\min}, \beta_{\max}]$, we have $\gamma\beta_{\min}s_i \le u_i \le \beta_{\max}s_i$ for all non-protected tokens. Substituting this interval into $z_i=(1-\lambda)s_i+\lambda u_i$ gives the first inequality. The ranking-stability condition follows by lower-bounding $z_a$ and upper-bounding $z_b$. \hfill $\blacksquare$

\textit{Interpretation.} The ratio gate prevents HubKV from arbitrarily rewriting the base importance scores. When compression is mild, $\lambda=r^p$ is small and the method remains close to the base scorer. When compression is aggressive, the local redundancy correction becomes stronger, but its effect remains bounded by the clipped head weights and the SMD discount factor. This is the theoretical guarantee attached to the deployed parallel proxy; it is distinct from the global approximation guarantee of sequential submodular greedy.
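
\textit{Worked example.} To make the bound concrete, consider the hypothetical settings $p = 2$, $\gamma = 0.5$, $\beta_{\min} = 0.8$, and $\beta_{\max} = 1.2$. These values are illustrative assumptions only and are not the deployed hyperparameters. At aggressive compression, $r = 0.9$ gives $\lambda = r^p = 0.81$, so
\[
1-\lambda+\lambda\gamma\beta_{\min} = 0.19 + 0.324 = 0.514,
\qquad
1-\lambda+\lambda\beta_{\max} = 0.19 + 0.972 = 1.162 ,
\]
and the ranking-stability threshold is $1.162 / 0.514 \approx 2.26$. At mild compression, $r = 0.5$ gives $\lambda = 0.25$, so
\[
1-\lambda+\lambda\gamma\beta_{\min} = 0.75 + 0.10 = 0.85,
\qquad
1-\lambda+\lambda\beta_{\max} = 0.75 + 0.30 = 1.05 ,
\]
and the threshold drops to $1.05 / 0.85 \approx 1.24$. Under these assumed values, HubKV can reorder two tokens only if their base scores differ by less than a factor of about $1.24$ at $r=0.5$, and by less than a factor of about $2.26$ at $r=0.9$.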

\subsection{Global Robustness and Limitations}

SMD does not inherit the $(1-1/e)$ approximation ratio of sequential greedy algorithms. The following baseline robustness bound applies to the unweighted SMD core before head-wise calibration.

\begin{proposition}[Global Lower Bound]
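Let $S_{SMD}$ be the set selected by SMD with budget $B$, and $S^*$ be the optimal set of size at most $B$. Then $F_{sub}(S_{SMD}) / F_{sub}(S^*) \ge 1/B$.
\end{proposition}
\textit{Proof.} SMD always selects the token with the global maximum score $s_{\max}$, since it is a local hub and is therefore never discounted. By monotonicity and the singleton property, $F_{sub}(S_{SMD}) \ge s_{\max}$. Conversely, subadditivity of $F_{sub}$ and the singleton property give $F_{sub}(S^*) \le \sum_{j \in S^*} s_j \le B s_{\max}$. The ratio follows. \hfill $\blacksquare$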

\begin{remark}[Lack of Constant Approximation Guarantee]
In the absence of additional assumptions on the score distribution, SMD can still fall into local redundancy traps if the discount factor $\gamma$ is not small enough. Consider a cluster of $B$ highly redundant tokens with scores $\approx 1$ and $B-1$ distant isolated hubs with scores $\gamma(1-2\epsilon)$ for a small $\epsilon > 0$. Because the discounted cluster scores ($\approx \gamma$) can still exceed the hub scores, SMD may prefer the $B$ redundant tokens over the distant hubs, leading to a performance ratio that decays as $\mathcal{O}(1/B)$. This underscores that SMD is a \textbf{hardware-parallel proxy} intended for empirical efficiency in LLM inference, rather than a guaranteed submodular optimizer.
\end{remark}
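
\textit{Worked example.} The decay in the remark can be sketched under two idealizing assumptions that are not part of the formal statement: the cluster tokens cover essentially the same positions, so any subset of the cluster has value $\approx 1$, and the isolated hubs have coverage disjoint from the cluster and from each other. If SMD spends its whole budget on the cluster, then $F_{sub}(S_{SMD}) \approx 1$, whereas selecting one cluster token together with the $B-1$ hubs attains
\[
F_{sub}(S^*) \;\ge\; 1 + (B-1)\,\gamma\,(1-2\epsilon),
\qquad\text{so}\qquad
\frac{F_{sub}(S_{SMD})}{F_{sub}(S^*)} \;\lesssim\; \frac{1}{1 + (B-1)\,\gamma\,(1-2\epsilon)} .
\]
For the hypothetical values $B = 100$, $\gamma = 0.5$, and $\epsilon = 0.01$, the denominator is $1 + 99 \cdot 0.5 \cdot 0.98 = 49.51$, giving a ratio of about $0.02$. This is consistent with the $1/B = 0.01$ lower bound above and shows that, for fixed $\gamma$, the ratio shrinks in proportion to $1/B$.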

\section{Empirical Validation via Raw Attention Analysis}
\label{app:raw_attention}

To further test whether the observed local redundancy also appears in model-native attention, we conducted a supplemental experiment using raw attention weights from Qwen3-8B on the LongBench 2WikiMQA task~\cite{bai2023longbench}.

\begin{figure}[htbp]
    \centering
    \includegraphics[width=0.48\textwidth]{pics/raw_attention_locality.pdf}
    \caption{\textbf{Spatial correlation of raw attention weights.} The key-side attention density exhibits significant local autocorrelation at short token distances, providing additional evidence that model-native attention mass is spatially clustered.}
    \label{fig:raw_attention_locality}
\end{figure}

\textbf{Findings.} As illustrated in Figure~\ref{fig:raw_attention_locality}, the autocorrelation $R(d)$ of the key-side attention density is significant at short token distances $d$ and decays approximately exponentially as $d$ grows. This structural redundancy in model-native attention mass provides additional empirical evidence for the local repulsion mechanism employed by SMD. By suppressing local non-maxima, SMD encourages the selection of diverse representatives, aligning the one-shot ranking with the observed locality of LLM attention distributions.

\section{Additional Related Work and Baseline Scope}
\label{app:additional_related_work}

\paragraph{Token eviction and compression baselines.}
The main paper focuses on token-retention methods that expose a Top-K-style pruning interface. Decoding-time eviction methods such as H2O, Scissorhands, and FastGen use past attention or learned discard signals to remove KV entries during generation~\cite{zhang2024h2o,liu2024scissorhands,ge2023model}. Prefill-time methods such as SnapKV, PyramidKV, Expected Attention, KVZip, and FastKVZip estimate token utility before downstream generation~\cite{li2024snapkv,zhang2024pyramidkv,devoto2025expected,kim2026kvzip,fastkv2026}. HubKV is designed for this second interface: it does not replace the base scorer, but modifies the final ranking so that locally redundant high-score neighbors are softly discounted before the same budgeted pruning step is applied.

\paragraph{Semantic and chunk-level baselines.}
Semantic-chunk methods address a related but different unit-of-selection question. ChunkKV, for example, retains semantically coherent chunks rather than treating every token as an independent item~\cite{liu2026chunkkv}. HubKV keeps the token-level cache layout of existing score-based systems, but its local discounting plays a complementary role: it discourages spending a tight cache budget on multiple neighboring tokens when a local representative already covers the region. Combining chunk-level grouping with HubKV-style local marginal discounting is a natural follow-up direction.

\paragraph{KV-cache quantization.}
KV-cache quantization reduces the bytes used by retained keys and values rather than changing which positions are retained. KIVI studies asymmetric low-bit KV-cache quantization and KVQuant develops low-precision KV-cache quantization for very long contexts~\cite{liu2024kivi,hooper2024kvquant}. These methods are therefore orthogonal to HubKV's token-selection objective. A quantized cache can still require an eviction policy under extreme context lengths or tight memory budgets, while HubKV can in principle be applied before or together with quantization to choose which entries remain in the compressed cache.

\section{Additional Discussion}
\label{app:additional_discussion}

HubKV is designed as a score refinement layer for cache eviction methods that already expose a Top-K-style pruning interface. If the underlying scorer assigns low importance to genuinely useful tokens, the redundancy correction cannot recover those tokens after ranking. Conversely, if a local region contains several adjacent tokens that are jointly necessary, the discounting step may move some useful non-hub tokens below the retention threshold. This behavior is most likely when the remaining budget is already large enough to preserve local evidence chains, which is consistent with the smaller or slightly negative gains observed under moderate compression.

The submodular formulation should also be interpreted as a modeling guide rather than as the deployed optimization algorithm. Sequential greedy maximization of the ideal coverage objective would introduce a dependency chain that is undesirable for LLM inference. HubKV instead uses a one-pass local proxy that preserves the parallel score-ranking interface of existing systems. The theoretical analysis therefore characterizes local marginal behavior and bounded score perturbation, but it does not imply the global approximation guarantee of greedy submodular maximization.

\section{Ethical Considerations}
\label{app:ethical_considerations}

HubKV is an infrastructure method for reducing the memory cost of long-context LLM inference, rather than an application-facing system. More efficient KV cache compression can make long-context inference less expensive and more accessible, but it can also lower the cost of large-scale document analysis, surveillance, or other privacy-sensitive applications. HubKV does not detect, filter, anonymize, or protect sensitive information in the input context. If a prompt or document contains personal, confidential, copyrighted, or otherwise sensitive content, the compressed KV cache should still be treated as derived model state from that content rather than as an anonymized representation. Because HubKV changes which historical tokens remain represented in the cache, practitioners should validate compressed systems on the intended data distribution before using them in high-stakes settings.

\section{Reproducibility Statement}
\label{app:reproducibility_statement}

The experimental sections report the backbone models, task families, compression ratios, protected-token handling, and HubKV hyperparameters used in the current evaluation. Appendix~\ref{app:experimental_details} provides additional implementation details for the prefill-stage and decoding-stage experiments. We will release code, plotting scripts, and result snapshots with the camera-ready version, subject to the licensing terms of the evaluated models and benchmarks.
