\documentclass{article} % For LaTeX2e
% \PassOptionsToPackage{numbers, compress}{natbib}
% \usepackage{neurips_2025}
% \PassOptionsToPackage{numbers,square}{natbib}

\usepackage[nonatbib,final]{neurips_2025}
\usepackage[numbers,sort&compress]{natbib}
% \setcitestyle{numbers,square}

\usepackage{xcolor}
\usepackage{colortbl}
\usepackage{float}
% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}
\input{macros.tex}


% For theorems and such
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage{mathtools}
\usepackage{amsthm}
\usepackage{enumitem}

% Recommended, but optional, packages for figures and better typesetting:
\usepackage{microtype}
% \usepackage{subfigure}
% \usepackage[table]{xcolor}
% \usepackage{colortbl}
% \usepackage[table]{xcolor} % Enable table coloring
\usepackage{subcaption}
\usepackage{tabularx}
% \usepackage{xcolor} 
\usepackage{pifont}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{array}     % For column width control

\usepackage{comment}
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc}    % use 8-bit T1 fonts
\usepackage{url}            % simple URL typesetting
\usepackage{amsfonts}       % blackboard math symbols
\usepackage{nicefrac}       % compact symbols for 1/2, etc.
\usepackage{xcolor}         % colors

\usepackage{tabularx}
\usepackage{makecell}

\usepackage{hyperref}
\usepackage{url}
\usepackage{graphicx}
% \usepackage{natbib}
\usepackage{booktabs}
\usepackage{multirow}
\usepackage{wrapfig}
\usepackage{subcaption}
\usepackage{amsmath}

\usepackage{bookmark}
\usepackage{makecell}
\usepackage{amssymb}
\usepackage{array}
\usepackage{booktabs}
\usepackage{pifont}
% \usepackage{algorithm}
% \usepackage{algpseudocode}
\usepackage{soul}
\usepackage{xcolor}

\usepackage{etoc}
\etocdepthtag.toc{mtchapter}
\etocsettagdepth{mtchapter}{subsection}
\etocsettagdepth{mtappendix}{none}

% \newcommand{\theHalgorithm}{\arabic{algorithm}}

% \usepackage{geometry}
% \geometry{a4paper, margin=1in}
\newcommand{\method}{ChunkKV}
\sethlcolor{yellow}
% \NewDocumentCommand\emojicheck{}{\includegraphics[scale=0.9]{emoji/check.pdf}}
\newcommand{\cmark}{\textcolor{blue}{\ding{51}}}

% \NewDocumentCommand{\xl}
% { mO{} }{\textcolor{blue}{\textsuperscript{\textit{xl}}\textsf{\textbf{\small[#1]}}}}

\newcommand{\rebuttal}[1]{\textcolor{red}{#1}}


% \title{ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference}

\title{ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference}


\begin{document}

% \twocolumn[
% \icmltitle{ChunkKV: Semantic-Preserving KV Cache Compression for \\ Efficient Long-Context LLM Inference}

% % It is OKAY to include author information, even for blind
% % submissions: the style file will automatically remove it for you
% % unless you've provided the [accepted] option to the icml2025
% % package.

% % List of affiliations: The first argument should be a (short)
% % identifier you will use later to specify author affiliations
% % Academic affiliations should list Department, University, City, Region, Country
% % Industry affiliations should list Company, City, Region, Country

% % You can specify symbols, otherwise they are numbered in order.
% % Ideally, you should not use this facility. Affiliations will be numbered
% % in order of appearance and this is the preferred way.
% \icmlsetsymbol{equal}{*}


\author{
Xiang LIU$^{\heartsuit*}$ $\qquad$ Zhenheng TANG$^{\clubsuit*}$ $\qquad$ Peijie DONG$^{\heartsuit}$ $\qquad$ Zeyu LI$^{\heartsuit}$ \\ 
\textbf{Yue LIU}$^{\diamondsuit\dagger}$ $\qquad$ \textbf{Bo LI}$^{\spadesuit\clubsuit}$ $\qquad$ \textbf{Xuming HU$^{\heartsuit\dagger}$} $\qquad$ \textbf{Xiaowen CHU$^{\heartsuit\dagger}$} \\
$^{\heartsuit}$  The Hong Kong University of Science and Technology (Guangzhou)\\ 
$^{\clubsuit}$  CSE, The Hong Kong University of Science and Technology\\
$^{\spadesuit} $Guangzhou HKUST Fok Ying Tung Research Institute \\
$^{\diamondsuit}$ Terminus Technologies \\ 
 \texttt{\{xliu886,pdong212,zli755\}@connect.hkust-gz.edu.cn} \\
 \texttt{\{zhtang.ml, bli\}@cse.ust.hk} \quad \texttt{\{xuminghu, xwchu\}@hkust-gz.edu.cn} 
  }


\maketitle

% Although recent KV cache compression methods show strong performance, all use discrete tokens to maintain the KV cache, leading to a loss of chunk semantic information. 


\begin{abstract}
   Large Language Models (LLMs) require significant GPU memory when processing long texts, with the key value (KV) cache consuming up to 70\% of total memory during inference. Although existing compression methods reduce memory by evaluating the importance of individual tokens, they overlook critical semantic relationships between tokens, resulting in fragmented context and degraded performance. We introduce \method{}, which fundamentally reimagines KV cache compression by treating semantic chunks - rather than isolated tokens - as basic compression units. This approach preserves complete linguistic structures and contextual integrity, ensuring that essential meaning is retained even under aggressive compression. Our innovation includes a novel layer-wise index reuse technique that exploits the higher cross-layer similarity of preserved indices in \method{}, reducing computational overhead and improving throughput by 26.5\%. Comprehensive evaluations on challenging benchmarks: LongBench, Needle-In-A-HayStack, GSM8K, and JailbreakV demonstrate that \method{} outperforms state-of-the-art methods by up to 8.7\% in precision while maintaining the same compression ratio. These results confirm that semantic-aware compression significantly enhances both efficiency and performance for long-context LLM inference, providing a simple yet effective solution to the memory bottleneck problem. \emph{The code is available at \href{https://github.com/NVIDIA/kvpress}{link}.}


\end{abstract}
\def\thefootnote{$*$}\footnotetext{Equal Contribution.}

\def\thefootnote{$\dagger$}\footnotetext{Corresponding Author.}


\section{Introduction}

Large Language Models (LLMs) have become essential for addressing various downstream tasks of natural language processing (NLP), including summarization and question answering, which require the interpretation of a long context from sources such as books, reports, and documents, often encompassing tens of thousands of tokens~\citep{brown2020language, tay2022unifying,tang2025the,wang2025agenttaxo,touvron2023llama2}. Recent advances in long-context technology within the field of machine learning (ML) systems~\citep{flash-attn2, jacobs2023deepspeed, xiao2024efficient} have significantly improved computational throughputs and reduced latency of LLMs to process increasingly large input context lengths~\citep{liu2024world, young2024yi} with historical KV cache (key value attentions). However, the memory requirement of the KV cache in serving super-long contexts becomes a new bottleneck~\citep{zhang2024h2o,zhu2025oraclekv,wang2025all,geminiteam2024gemini}. For instance, the KV cache for a single token in a 7B-parameter model requires approximately 0.5 MB of GPU memory, resulting in a 10,000-token prompt consuming around 5 GB of GPU memory. 

\begin{figure*}[h]
   \centering
   \includegraphics[width=1\textwidth]{figs/main.pdf}
      \vspace{-10pt}

   \caption{Illustration of the impact of the token discrete method and the chunk method on semantic preservation. The discrete method preserves words related to the question but often omits the subject. In contrast, the chunk method retains the subject of the words, maintaining more accurate semantic information. For the equation: $S$ is the score function, and $c$ is a chunk of tokens.}
   \label{fig:main}
   \vspace{-10pt}
\end{figure*}

To address the substantial GPU memory consumption caused by KV caching, recent studies consider compressing the KV cache by pruning non-important discrete parts from the prompt tokens~\citep{zhang2024h2o,zhu2025oraclekv,li2024snapkv, ge2023model, zhang2024pyramidkv, fu2024lazyllm, yang2024pyramidinfer,liu2024scissorhands, tang2024quest}. H2O~\citep{zhang2024h2o} and SnapKV~\citep{li2024snapkv} have shown that retaining less than 50\% of the discrete KV cache can significantly reduce GPU memory usage with minimal impact on performance. However, we identify that the previous KV cache compression methods~\citep{zhang2024h2o,zhang2024pyramidkv} measure token importance isolatedly, neglecting the dependency between different tokens on the characteristics of the real-world language. For example, as shown in Figure~\ref{fig:main}, focusing on the importance of the token level can excessively focus on words about subjects ``turaco'' in the question while omitting crucial information about objects (foods) in the documents, resulting in the loss of essential semantic information. This motivates us to rethink the following question:


\emph{How to avoid isolated token importance measurement and preserve the semantic information in KV cache?}


\begin{table*}[h]
   \centering
   \vspace{-5pt}
   \caption{Comparison of Methods on KV Cache Compression.}
   % \vspace{-10pt}

\scriptsize
% \setlength{\tabcolsep}{8pt}
\resizebox{0.9\textwidth}{!}{
   \begin{tabular}{lccccc}
       \toprule
       \multirow{2}{*}{\textbf{Method}} & \multirow{2}{*}{\makecell{ \textbf{KV Cache} \\ \textbf{Compression}}} & \multirow{2}{*}{\makecell{ \textbf{Dynamic} \\ \textbf{Policy}}}  & \multirow{2}{*}{\makecell{ \textbf{Layer-Wise} \\ \textbf{Policy}}} & \multirow{2}{*}{\makecell{ \textbf{Semantic} \\ \textbf{Information}}}  & \multirow{2}{*}{\makecell{ \textbf{Efficient} \\ \textbf{Index Reuse}}} \\
       \\
       \midrule 
       StreamingLLM~\citep{xiao2024efficient} & \cmark &  & &  &  \\
       H2O~\citep{zhang2024h2o} & \cmark & \cmark &  & &  \\
       SnapKV~\citep{li2024snapkv} & \cmark & \cmark &  & &  \\
       PyramidInfer~\citep{yang2024pyramidinfer} & \cmark & \cmark & \cmark &  &  \\
       PyramidKV~\citep{zhang2024pyramidkv} & \cmark & \cmark & \cmark &  &  \\ \midrule
       \rowcolor{red!20}\method{}(Ours) & \cmark & \cmark & \cmark & \cmark & \cmark \\
       \bottomrule
   \end{tabular}
   }
   \label{tab:comparison}
   \vspace{-5pt}
\end{table*}


In light of this, we observe that complete semantic information usually appears in a continuous sequence~\citep{miller1956information,wang2025agenttaxo,ramshaw1999text,xie2022an}. Thus, we introduce a straightforward yet effective \method{}, grouping the tokens in a chunk as a basic compressing unit, which should be preserved or discarded as a whole. Thus, it retains the most informative \textbf{semantic chunks} from the original KV cache. As shown in Figure~\ref{fig:main}, preserving a chunk helps catch the subject, predicate, and object. In Table~\ref{tab:quantitative_metrics_detailed}, we quantify the minimal loss and higher recovery of the attention score achieved by the chunk-based approach at the inference stage. Furthermore, we investigate that \textit{the preserved KV cache indices by \method{} exhibit a higher similarity} compared to previous methods. Consequently, we develop a technique called layer-wise index reuse, which reduces the additional computational time introduced by the KV cache compression method. As outlined in Table \ref{tab:comparison}, recent highly relevant KV cache compression methods \textit{lack the ability to retain semantic information and efficiently reuse indices}. 


To evaluate \method{}'s performance, we conduct comprehensive experiments across multiple cutting-edge long-context benchmarks: long-context tasks including LongBench~\citep{bai2023longbench} and Needle-In-A-HayStack (NIAH)~\citep{needle}, in-context learning tasks such as GSM8K~\citep{gsm8k} and JailbreakV~\citep{jailbreakv}. And also different models including DeepSeek-R1-Distill-Llama-8B~\citep{deepseekr1},LLaMA-3-8B-Instruct~\citep{meta2024llama3}, Mistral-7B-Instruct~\citep{jiang2023mistral7b}, and Qwen2-7B-Instruct~\citep{qwen2}. Our experimental results demonstrate that \method{} surpasses existing KV cache compression methods in both efficiency and accuracy, primarily due to its ability to preserve essential information through selective chunk retention. These findings establish \method{} as a simple yet effective approach to KV cache compression.

We summarize our key contributions as follows:
\begin{itemize}[leftmargin=*, topsep=0pt, itemsep=2pt, parsep=0pt]
   \item \noindent We identify the phenomenon in which discrete KV cache compression methods inadvertently prune the necessary semantic information.
   \item \noindent We propose \method{}, a simple KV cache compression method that uses the fragmentation method that retains semantic information, and design the layer-wise index reuse technique to reduce the additional computational time.
   \item \noindent We evaluate \method{} on cutting-edge long-context benchmarks including LongBench and Needle-In-A-HayStack, as well as the GSM8K, many-shot GSM8K and JailbreakV in-context learning benchmark, and multi-step reasoning (O1 and R1) LLMs, achieving state-of-the-art performance.
   \end{itemize}


\section{Related Work}
\textbf{KV Cache Compression.} KV cache compression technology has developed rapidly in the era of LLM, with methods mainly focused on evicting unimportant tokens. The compression process occurs before the attention blocks, optimizing both the prefilling time and GPU memory. \citet{xiao2024efficient} and \citet{han2024lm} propose that initial and recent tokens consistently have high attention scores between different layers and attention heads. As a result, retaining these tokens in the KV cache is more likely to preserve important information. Furthermore, FastGen~\citep{ge2023model} evicts tokens based on observed patterns. H2O~\citep{zhang2024h2o} and SnapKV~\citep{li2024snapkv} employ dynamic KV cache compression methods, evaluating the importance of tokens based on attention scores and then evicting the less important ones. As inference scenarios become increasingly complex, dynamic KV cache compression methods demonstrate powerful performance. Recently, \citet{yang2024pyramidinfer} and \citet{zhang2024pyramidkv} have closely examined the distributions of attention scores during the pre-filling stage of the Retrieval-Augmented Generation (RAG) task, discovering a pyramidal KV cache compression pattern in different transformer layers. In contrast, FlowKV~\citep{liu2025flowkv} employs a novel multi-turn isolation mechanism that preserves the accumulated compressed KV cache from past turns and only compresses the KV pairs from the most recent turn, effectively mitigating the re-compression of older context and the problem of catastrophic forgetting.

Although these KV cache compression methods have explored efficient GPU memory management while maintaining original performance, our study focuses more on the semantic information of the prompt. We find that chunks of the original KV cache are more important than discrete tokens.
   
% \vspace{-0.8em}


% More related work can be found in the Appendix~\ref{appendix:related_work}.
\section{ChunkKV}
\subsection{Preliminary Study of KV Cache Compression}

With the increasing long-context capabilities of LLMs, the KV cache has become crucial for improving the inference efficiency. However, it can consume significant GPU memory when handling long input contexts. The GPU memory cost of the KV cache for the decoding stage can be calculated as follows:
   % \vspace{-1.8em}
\begin{equation}
   \label{eq:kv_cache_cost}
    M_{KV} = 2 \times \textit{B} \times \textit{S}  \times \textit{L} \times \textit{N} \times \textit{D} \times 2
\end{equation}
   
% % \vspace{-1.8em}
% \begin{equation}
%    \label{eq:kv_cache_cost}
%     \text{GPU Cost of KV Cache} = 2 \times \textit{B} \times \textit{S}  \times \textit{L} \times \textit{N} \times \textit{D} \times 2
% \end{equation}
% % \vspace{-1.8em}

where $B$ is the batch size, $S$ is the sequence length of prompt and decoded length, $L$ is the number of layers, $N$ is the number of attention heads, $D$ is the dimension of each attention head, and the first $2$ accounts for the KV matrices, while the last $2$ accounts for the precision when using float16. Table \ref{appendix:config_models} shows the configuration parameters for LLaMA-3-8B-Instruct~\citep{meta2024llama3}. With a batch size $B=1$ and a sequence length of prompt $S=2048$, the GPU memory cost of the KV cache is nearly $1$ GB. If the batch size exceeds 24, the GPU memory cost of the KV cache will exceed the capacity of an RTX 4090 GPU. To address this issue, KV cache compression methods have been proposed, with the aim of retaining only a minimal amount of KV cache while preserving as much information as possible. For more details on the LLM configuration parameters, refer to Appendix~\ref{appendix:config_models}.

\subsection{Chunk Based KV Compression}
To address the limitations of existing KV cache compression methods, we propose \method{}, a novel KV cache compression method that retains the most informative semantic chunks. The key idea behind \method{} is to group tokens in the KV cache into chunks that preserve more semantic information, such as a chunk containing a subject, verb, and object. As illustrated in Figure~\ref{fig:main}, \method{} preserves the chunks of the KV cache that contain more semantic information.
First, we define a chunk as a group of tokens that contain related semantic information. By retaining the most informative chunks from the original KV cache, \method{} can effectively reduce the memory usage of the KV cache while preserving essential information. For more information on ChunkKV, please refer to Section~\ref{appendix:chunkkv_analysis}.


\begin{algorithm}
\caption{\method{}}
\label{alg:chunkkv}
\begin{algorithmic}
\State \textbf{Input:} $\mathbf{Q} \in \mathbb{R}^{T_q \times d}$, $\mathbf{K} \in \mathbb{R}^{T_k \times d}$, $\mathbf{V} \in \mathbb{R}^{T_v \times d}$, observe window size $w$, chunk size $c$, compressed KV cache max length $L_{\text{max}}$
\State \textbf{Output:} Compressed KV cache $\mathbf{K}'$, $\mathbf{V}'$
\State \textbf{Observe Window Calculation:}
\State $\mathbf{A} \gets \mathbf{Q}_{T_q - w:T_q} \mathbf{K}^T$ \Comment{Attention scores for the observe window}
\State $C \gets \left\lceil \frac{T_k}{c} \right\rceil$ \Comment{Calculate the number of chunks}
\State \textbf{Chunk Attention Score Calculation}:
\For {$i = 1$ to $C$}
      \State $\mathbf{A}_i \gets \sum_{j=(i-1)c+1}^{ic} \mathbf{A}_{:,j}$ \Comment{Sum of attention scores for each chunk}
\EndFor
\State \textbf{Top-K Chunk Selection}:
\State $k \gets \left\lfloor \frac{L_{\text{max}}}{c} \right\rfloor$
\State $\textit{Top\_K\_Indices} \gets \text{indices of Top-}k \text{ chunks based on } \mathbf{A}_i $ 
\State \textbf{Compression}:
\State $\mathbf{K}', \mathbf{V}' \gets \text{index\_select}(\mathbf{K}, \mathbf{V}, \textit{Top\_K\_Indices})$
\State \textbf{Concatenation}:
\State $\mathbf{K}' \gets \text{concat}(\mathbf{K}'_{0:L_{\text{max}}-w}, \mathbf{K}_{T_k-w:T_k})$
\State $\mathbf{V}' \gets \text{concat}(\mathbf{V}'_{0:L_{\text{max}}-w}, \mathbf{V}_{T_v-w:T_v})$
\State  $\mathbf{K}'$, $\mathbf{V}'$
\end{algorithmic}
\end{algorithm}
   
The algorithm \ref{alg:chunkkv} shows the pseudocode for \method{}. First, following the approach of H2O~\citep{zhang2024h2o} and SnapKV~\citep{li2024snapkv}, we set the observe window by computing the attention scores $\mathbf{A} \gets \mathbf{Q}_{T_q - w:T_q} \mathbf{K}^T$, where $\mathbf{Q}_{T_q - w:T_q}$ is the observe window, $\mathbf{K}$ is the Key matrix and the window size $w$ is usually set to $\{4,8,16,32\}$. Next, the number of chunks $C$ is calculated as $C = \left\lceil \frac{T_k}{c} \right\rceil$, where $T_k$ is the length of the Key matrix and $c$ is the chunk size. The attention scores for each chunk are then computed as $\mathbf{A}_i = \sum_{j=(i-1)c+1}^{ic} \mathbf{A}_{:,j}$ for $i = 1, 2, \ldots, C$. We use the top-$k$ algorithm as the sampling policy for \method{}. The top-$k$ chunks are selected based on their attention scores, where $k = \left\lfloor \frac{L_{\text{max}}}{c} \right\rfloor$, and $L_{\text{max}}$ is the maximum length of the compressed KV cache. The size of the last chunk is set to $\text{min}(c, L_{\text{max}} - (k-1) \times c)$. The indices of the top-$k$ chunks preserve the original sequence order. In the compression step, only the key and value matrices corresponding to the selected indices are retained, resulting in the compressed KV cache. Finally, the observe window of the original KV cache will be concatenated to the compressed KV cache by replacing the last $w$ tokens to keep important information. The compressed KV cache is then used for subsequent attention computations. For implementation, we add vectorized operations, memory optimizations, etc., to optimize the code. Refer to Appendix~\ref{appendix:chunkkv_implementation} for more details.


\begin{figure*}[h]
   \centering
   \includegraphics[scale=0.212, trim=0 0 0 0]{./figs/llama_snapkv_chunkkv_layer_similarity_heatmap.pdf}
      \vspace{-10pt}

   \caption{ Layer-wise similarity heatmaps of the preserved KV \textbf{cache indices} by SnapKV (left) and ChunkKV (right) on LLaMA-3-8B-Instruct. Deep colors indicate higher similarity. More visualization can be found in Appendix \ref{appendix:index_reuse_similarity}.}
   
   \label{fig:index_reuse_heatmap}
   \vspace{-20pt}
\end{figure*}

% \subsubsection{Layer-Wise Index Reuse}
\subsection{Layer-Wise Index Reuse}
\label{sec:layer_wise_index_reuse}


\begin{wraptable}{r}{0.5\textwidth}
   % \begin{table}[H]
      \centering
      \vspace{-20pt}
      \caption{Retained KV Cache Indices Similarity of Adjacent Layers for Different Models.}
         \vspace{-5pt}
      \resizebox{0.5\textwidth}{!}{
      \begin{tabular}{lccc}
      \toprule
      \textbf{Method} & \textbf{H2O} & \textbf{SnapKV} & \textbf{\method{}}  \\
      \midrule
      LLaMA-3-8B      & 25.31\%& 27.95\% & \textbf{57.74\%} \\
      Qwen2-7B        & 14.91\%& 16.50\% & \textbf{44.26\%} \\
      Mistral-7B      & 15.15\% & 15.78\% & \textbf{52.16\%} \\
      \bottomrule
      \end{tabular}
      }
      \vspace{-20pt}
      \label{tab:jaccard_similarity_models}
   % \end{table}
   \end{wraptable}

   
Furthermore, we investigated the KV cache indices preserved by \method{} and found that they exhibit higher similarity compared to previous methods. Figure~\ref{fig:index_reuse_heatmap} shows the layer-wise similarity heatmaps of SnapKV and \method{}. Each cell represents the similarity between the preserved KV cache indices of two layers, with deeper colors indicating higher similarity. The results demonstrate that the KV cache indices preserved by \method{} are more similar to those in neighboring layers. 


\begin{wrapfigure}{r}{0.52\textwidth}
   \vspace{-20pt}

   \begin{minipage}{0.52\textwidth}     
\begin{algorithm}[H]
   \caption{Layer-wise Index Reuse for \method{}}
   \label{alg:layer_wise_index_reuse}
   \begin{algorithmic}
   \State \textbf{Input:} Number of layers in LLMs $N_{\text{layers}}$, number of reuse layers $N_{\text{reuse}}$
   \State \textbf{Initialize:} Dictionary to store indices $\mathcal{I}_{\text{reuse}} = \{\}$
   \For {$l = 0$ to ($N_{\text{layers}}-1$)}
       \If {$l \mod N_{\text{reuse}} == 0$}
           \State $\mathbf{K}'_{l}, \mathbf{V}'_{l}, \mathcal{I}_l \gets \text{\method{}}(\mathbf{K}_l, \mathbf{V}_l)$ 
           \State $\mathcal{I}_{\text{reuse}}[l] \gets \mathcal{I}_l$ 
       \Else
           \State $\mathcal{I}_l \gets \mathcal{I}_{\text{reuse}}[ \left\lfloor \frac{l}{N_{\text{reuse}}} \right\rfloor \times N_{\text{reuse}} ]$ 
       \EndIf
       \State $\mathbf{K}'_{l} \gets \text{index\_select}(\mathbf{K}_l, \mathcal{I}_l)$ 
       \State $\mathbf{V}'_{l} \gets \text{index\_select}(\mathbf{V}_l, \mathcal{I}_l)$ 
   \EndFor
   
\end{algorithmic}
\end{algorithm}
\end{minipage}
\vspace{-10pt}
\end{wrapfigure}
   
As shown in Table~\ref{tab:jaccard_similarity_models}, \method{} consistently achieves a higher average Jaccard similarity between adjacent layers compared to SnapKV in different model architectures, indicating that the retained token index in \method{} is more similar to each other. For a more detailed visualization, please refer to Appendix \ref{appendix:index_reuse_similarity}. 


Based on the above findings of the KV cache index, we propose a training-free \textit{layer-wise index reuse} method to further reduce the additional cost of KV cache compression, which reuses compressed token indices across multiple layers. We evaluate the efficiency and effectiveness of the layer-wise index reuse method in Section~\ref{sec:index_reuse}. Reuse of the layer-wise index reduces the KV cache compression time by 20\% compared to the FullKV baseline, with a performance drop of only 0.5\%.  

\begin{wraptable}{r}{0.52\textwidth}
   \vspace{-10pt}
   \caption{\centering GSM8K Performance Comparison. \\ \# SLM=StreamingLLM, SKV=SnapKV, PKV=PyramidKV}
   \vspace{-5pt}
   \centering
   \resizebox{0.52\columnwidth}{!}{
   \begin{tabular}{l|ccccc}
   % \specialrule{1pt}{0pt}{2pt}
   \specialrule{1pt}{0pt}{2pt}
   \multirow{2}{*}{\textbf{\makecell{ Ratio}}} & \multirow{2}{*}{\makecell{SLM}} & \multirow{2}{*}{H2O} & \multirow{2}{*}{SKV} & \multirow{2}{*}{PKV} & \multirow{2}{*}{\makecell{\textbf{ChunkKV} \\ 
   \textbf{(Ours)}}} \\
   & & & & & \\
   \midrule
   \multicolumn{6}{c}{DeepSeek-R1-Distill-Llama-8B FullKV: 69.4\% $\uparrow$} \\
   \midrule
   10\% & 51.6\% & 55.6\% & 57.6\% & 62.6\% & \cellcolor{red!20}\textbf{65.7\%} \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3.1-8B-Instruct FullKV: 79.5\% $\uparrow$} \\
   \midrule
   30\% & 70.5\% & 72.2\% & 76.1\% & 77.1\% & \cellcolor{red!20}\textbf{77.3\%} \\
   20\% & 63.8\% & 64.0\% & 68.8\% & 71.4\% & \cellcolor{red!20}\textbf{77.6\%} \\
   10\% & 47.8\% & 45.0\% & 50.3\% & 48.2\% & \cellcolor{red!20}\textbf{65.7\%} \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3-8B-Instruct FullKV: 76.8\% $\uparrow$} \\
   \midrule
   30\% & 70.6\% & 73.6\% & 70.2\% & 68.2\% & \cellcolor{red!20}\textbf{74.6\%} \\
   \midrule
   \multicolumn{6}{c}{Qwen2-7B-Instruct FullKV: 71.1\% $\uparrow$} \\
   \midrule
   30\% & 70.8\% & 61.2\% & 70.8\% & 64.7\% & \cellcolor{red!20}\textbf{73.5\%} \\
   \midrule
   \specialrule{1pt}{0pt}{2pt}
   % \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
     \vspace{-10pt}

   \label{tab:GSM8K}
\end{wraptable}


This \textit{layer-wise index reuse} method is formally described in Algorithm \ref{alg:layer_wise_index_reuse}. The \method{} compression process returns the compressed KV cache and their respective token indices, denoted as $\mathcal{I}_l$. For layer-wise index reuse, we define a grouping of layers such that all $N_{\text{reuse}}$ layers share the same token indices for \method{}. Specifically, for a group of layers $\left\{l, l+1, \ldots, l+N_{\text{reuse}}-1\right\}$, we perform \method{} on the first layer $l$ to obtain the token indices $\mathcal{I}_l$ and reuse $\mathcal{I}_l$ for the subsequent layers $l+1, l+2, \ldots, l+N_{\text{reuse}}-1$. The notation $\mathbf{K}_l[\mathcal{I}_l]$ and $\mathbf{V}_l[\mathcal{I}_l]$ indicates the selection of key and value caches based on the indices in $\mathcal{I}_l$. The efficiency analysis for layer-wise index reuse is provided in Appendix~\ref{appendix:index_reuse_efficiency}.


% \subsection{Theoretical Understanding}
% \label{sec:theory}
\textbf{Theoretical Understanding.} We provide a theoretical understanding from the in-context learning (ICL)~\citep{xie2022an} to interpret why maintaining the KV cache according to a continuous sequence in \method{} is better than according to sparse tokens. 
Informally speaking, the continuously chunk-level KV cache preserves the whole examples (semantic information) in ICL, thus reducing the requirement on distinguishability, i.e., lower bound of KL divergence between the example and the question (Equation~\ref{eq:noise_dinsting} in condition~\ref{cond:2}). The complete analysis is provided in Appendix~\ref{appendix:theory}.


\section{Experiment Results}
\label{sec:experiment_results}
In this section, we conduct experiments to evaluate the effectiveness of \method{} on KV cache compression in two benchmark fields, with a chunk size set to 10 even for various model architectures. The first is the In-Context Learning benchmark, for which we select GSM8K~\citep{gsm8k} and Jailbreakv~\citep{jailbreakv,tang2025ghost,weifan2025jailbreaklora} to evaluate the performance of \method{}, furthermore, we also include multi-step reasoning LLM DeepSeek-R1-Distill-Llama-8B~\citep{deepseekr1} to evaluate the performance of \method{}. The In-Context Learning scenario is a crucial capability for LLMs and has been adapted in many powerful technologies such as Chain-of-Thought~\citep{wei2022chain,lai2025mediatormemoryefficientllmmerging,diao2023active,pan2023plum}. The second is the Long-Context benchmark, which includes LongBench~\citep{bai2023longbench} and Needle-In-A-HayStack (NIAH)~\citep{needle}, both widely used for assessing KV cache compression methods.  All experiments were carried out three times, using the mean score to ensure robustness.


\vspace{-10pt}
\subsection{In-Context Learning}
\label{sec:icl}

The In-Context Learning (ICL) ability significantly enhances the impact of prompts on LLMs. For example, the Chain-of-Thought approach~\citep{wei2022chain} increases the accuracy of the GSM8K of the PaLM model~\citep{chowdhery2022palm} from 18\% to 57\% without additional training. In this section, we evaluate the performance of \method{} on the GSM8K, Many-Shot GSM8K \citep{agarwal2024many}, and JailbreakV \citep{jailbreakv} benchmarks.


\begin{wraptable}{r}{0.52\textwidth}
   \vspace{-5pt}

   \caption{Many-Shot (50-shot) GSM8K Performance Comparison.}
   \vspace{-5pt}

   \centering
   \resizebox{0.52\columnwidth}{!}{
   \begin{tabular}{l|ccccc}
   \specialrule{1pt}{0pt}{2pt}

   \multirow{2}{*}{\textbf{\makecell{ Ratio}}} & \multirow{2}{*}{\makecell{SLM}} & \multirow{2}{*}{H2O} & \multirow{2}{*}{SKV} & \multirow{2}{*}{PKV} & \multirow{2}{*}{\makecell{\textbf{ChunkKV} \\ 
   \textbf{(Ours)}}} \\

   & & & & & \\
   \midrule
   \multicolumn{6}{c}{DeepSeek-R1-Distill-Llama-8B FullKV: 71.2\% $\uparrow$} \\
   \midrule
   10\% & 63.2\% & 54.2\% & 54.1\% & 59.2\% & \cellcolor{red!20}\textbf{68.2\%} \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3.1-8B-Instruct FullKV: 82.4\% $\uparrow$} \\
   \midrule
   10\% & 74.3\% & 51.2\% & 68.2\% & 70.3\% & \cellcolor{red!20}\textbf{79.3\%} \\
   \midrule
   \specialrule{1pt}{0pt}{2pt}
   % \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
 \vspace{-0.2cm}
   \label{tab:many_shot_GSM8K}
% \end{table}
\end{wraptable}


% \subsubsection{GSM8K}  
% \label{sec:gsm8k}
\vspace{-5pt}
\textbf{GSM8K.}
In the in-context learning scenario, we evaluated multiple KV cache compression methods for GSM8K~\citep{gsm8k}, which contains more than 1,000 arithmetic questions on LLaMA-3-8B-Instruct, LLaMA-3.1-8B-Instruct~\citep{meta2024llama3}, Qwen2-7B-Instruct~\citep{qwen2} and DeepSeek-R1-Distill-Llama-8B~\citep{deepseekr1}. Following \citet{agarwal2024many}, we consider many-shot GSM8K as a long-context reasoning scenario, which is a more challenging task than long-context retrieval benchmark LongBench~\citep{bai2023longbench}. The CoT prompt settings for this experiment are the same as those used by~\citet{wei2022chain}, for many-shot GSM8K we set the number of shots to 50, where the prompt length is more than 4k tokens. For more details on the prompt settings, please refer to the APPENDIX \ref{appendix:prompt}.


% \textbf{GSM8K Results}
Table \ref{tab:GSM8K} presents the performance comparison. The results show that \method{} outperforms other KV cache compression methods on different models and compression ratios. Table \ref{tab:many_shot_GSM8K} presents the performance comparison of many-shot GSM8K, and also \method{} outperforms other KV cache compression methods. The consistent superior performance of \method{} in both models underscores its effectiveness in maintaining crucial contextual information for complex arithmetic reasoning tasks by chunk level KV cache rather than the discrete token level.


\begin{wraptable}{r}{0.52\textwidth}
   \vspace{-5pt}
   \caption{JailbreakV Performance Comparison.}
   \vspace{-5pt}
   \centering
   \resizebox{0.52\columnwidth}{!}{
   \begin{tabular}{l|ccccc}
   % \specialrule{1pt}{0pt}{2pt}
   \specialrule{1pt}{0pt}{2pt}

   \multirow{2}{*}{\textbf{\makecell{ Ratio}}} & \multirow{2}{*}{\makecell{SLM}} & \multirow{2}{*}{H2O} & \multirow{2}{*}{SKV} & \multirow{2}{*}{PKV} & \multirow{2}{*}{\makecell{\textbf{ChunkKV} \\ 
   \textbf{(Ours)}}} \\
   & & & & & \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3.1-8B-Instruct FullKV: 88.9\% $\uparrow$} \\
   \midrule
   20\% & 65.0\% & 71.7\% & 88.0\% & 87.5\% & \cellcolor{red!20}\textbf{89.0\%} \\
   10\% & 53.1\% & 65.4\% & 84.3\% & 85.5\% & \cellcolor{red!20}\textbf{87.9\%} \\
   \midrule
   \specialrule{1pt}{0pt}{2pt}
   % \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
   \vspace{-10pt}

   \label{tab:jailbreak}
\end{wraptable}


% \subsubsection{Jailbreak}
\vspace{-5pt}
\textbf{Jailbreak.} In this section, we evaluate the performance of \method{} on the JailbreakV benchmark~\citep{jailbreakv}, which is a safety Jailbreak benchmark for language models. The prompt settings are the same as those used by~\citet{jailbreakv}.

% \textbf{Jailbreak} 
Table \ref{tab:jailbreak} presents the performance comparison. The results demonstrate that \method{} outperforms other KV cache compression methods on different compression ratios. This shows that for safety benchmark, the chunk level KV cache is more effective than other discrete token level compression methods.


\subsection{Long-Context Benchmark}
\label{sec:long-context-overall}
LongBench and NIAH are two widely used benchmarks for KV cache compression methods. Both benchmarks have a context length that exceeds $10K$. NIAH requires retrieval capability, while LongBench is a meticulously designed benchmark suite that tests the capabilities of language models in handling extended documents and complex information sequences. For more details on LongBench, please refer to the APPENDIX \ref{appendix:evaluation}.


% \subsubsection{LongBench} 
% \label{sec:longbench}
\textbf{LongBench.} We use LongBench~\citep{bai2023longbench} to assess the performance of \method{} on tasks involving long-context inputs.  We evaluated multiple KV cache eviction methods using the LongBench benchmark with LLaMA-3-8B-Instruct~\citep{meta2024llama3}, Mistral-7B-Instruct-v0.3~\citep{jiang2023mistral7b}, and Qwen2-7B-Instruct~\citep{qwen2}, with a KV cache compression ratio of $10\%$. LongBench-ZH provides the Chinese subtask, and Qwen2-7B-Instruct also supports Chinese, so we tested Qwen2-7B-Instruct with different KV cache compression methods on the Chinese subtasks. 


\begin{wraptable}{r}{0.53\textwidth}
   \vspace{-12pt}
   
   \caption{KV cache compression methods on the LongBench benchmark. Results show performance gap compared to FullKV baseline (negative values indicate worse performance). }
   \vspace{-5pt}
   \centering
   \resizebox{0.53\columnwidth}{!}{
   \begin{tabular}{l|ccccc}
   % \specialrule{1pt}{0pt}{2pt}
   \specialrule{1pt}{0pt}{2pt}
   \multirow{2}{*}{\textbf{\makecell{ Ratio}}} & \multirow{2}{*}{\makecell{SLM}} & \multirow{2}{*}{H2O} & \multirow{2}{*}{SKV} & \multirow{2}{*}{PKV} & \multirow{2}{*}{\makecell{\textbf{ChunkKV} \\ 
   \textbf{(Ours)}}} \\
   & & & & & \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3-8B-Instruct FullKV: 41.46 $\uparrow$} \\
   \midrule
   10\% & -13.80\% & -10.61\% & -3.16\% & -3.33\% & \cellcolor{red!20}\textbf{-2.29\%} \\
   20\% & -6.42\% & -8.85\% & -2.24\% & -2.00\% & \cellcolor{red!20}\textbf{-1.74\%} \\
   30\% & -2.36\% & -5.38\% & -0.07\% & -0.22\% & \cellcolor{red!20}\textbf{+0.31\%} \\
   \midrule
   \multicolumn{6}{c}{Mistral-7B-Instruct-v0.3 FullKV: 48.08 $\uparrow$} \\
   \midrule
   10\% & -16.58\% & -9.30\% & -3.54\% & -3.52\% & \cellcolor{red!20}\textbf{-2.85\%} \\
   \midrule
   \multicolumn{6}{c}{Qwen2-7B-Instruct FullKV: 40.71 $\uparrow$} \\
   \midrule
   10\% & -5.28\% & -0.64\% & -0.39\% & -0.98\% & \cellcolor{red!20}\textbf{+0.42\%} \\
   \midrule

   \multicolumn{6}{c}{Qwen2-7B-Instruct on LongBench-ZH FullKV: 38.60 $\uparrow$} \\
   \midrule
   10\% & -15.95\% & -5.31\% & +0.18\% & -5.31\% & \cellcolor{red!20}\textbf{+2.20\%} \\
   \specialrule{1pt}{0pt}{2pt}
   % \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
      \vspace{-20pt}
   \label{table:longbench_averages}
\end{wraptable}

% \vspace{-0.8em}
% \textbf{LongBench Results.} 
Tables~\ref{table:longbench_averages} present the performance gap (in percentage) between each method and the FullKV baseline, where negative values indicate performance degradation compared to FullKV. The table is evaluated in both the LongBench English and Chinese subtasks, where \method{} outperforms other compression methods overall. This suggests that \method{}'s approach of retaining semantic chunks is more effective in preserving important information compared to other discrete token-based compression methods. For the 70B model and Chinese subtask results, please refer to Appendices \ref{appendix:longbench} and \ref{appendix:multilingual}.


\begin{wraptable}{r}{0.52\textwidth}
   \vspace{-13pt}
   \caption{NIAH Performance Comparison.}
   \vspace{-5pt}
   \centering
   \resizebox{0.52\columnwidth}{!}{
   \begin{tabular}{c|ccccc}
   % \specialrule{1pt}{0pt}{2pt}
   \specialrule{1pt}{0pt}{2pt}
   \multirow{2}{*}{\textbf{\makecell{KV cache \\ Size}}} & \multirow{2}{*}{\makecell{SLM}} & \multirow{2}{*}{H2O} & \multirow{2}{*}{SKV} & \multirow{2}{*}{PKV} & \multirow{2}{*}{\makecell{\textbf{ChunkKV} \\ 
   \textbf{(Ours)}}} \\
   & & & & & \\
   \midrule
   \multicolumn{6}{c}{LlaMa-3.1-8B-Instruct FullKV: 74.6\% $\uparrow$} \\
   \midrule
   512 & 32.0\% & 68.6\% & 71.2 \% & 72.6\% & \cellcolor{red!20}\textbf{74.5\%} \\
   256 & 28.0\% & 61.7\% & 68.8\% & 69.5\% & \cellcolor{red!20}\textbf{74.1\%} \\
   128 & 23.7\% & 47.9\% & 58.9\% & 65.1\% & \cellcolor{red!20}\textbf{73.8\%} \\
   96 & 21.5\% & 41.0\% & 56.2\% & 63.2\% & \cellcolor{red!20}\textbf{70.3\%} \\
   \midrule
   \multicolumn{6}{c}{Mistral-7B-Instruct FullKV: 99.8\% $\uparrow$} \\
   \midrule
   128 & 44.3\% & 88.2\% & 91.6\% & 99.3\% & \cellcolor{red!20}\textbf{99.8\%} \\
   \midrule
   \specialrule{1pt}{0pt}{2pt}
   % \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
   \label{tab:main_NIAH}
   \vspace{-15pt}
\end{wraptable}


\textbf{Needle-In-A-HayStack.} 
We use NIAH~\citep{needle} to evaluate the long-context retrieval capability of LLMs. NIAH assesses how well LLM extracts hidden tricked information from extensive documents, and following LLM-as-a-Judge~\citep{zheng2023judging} we apply GPT-4o-mini~\citep{openai2023gpt4omini} to assess the accuracy of the retrieved information. We evaluated multiple KV cache eviction methods using NIAH with LLaMA-3-8B-Instruct and Mistral-7B-Instruct-v0.2, setting benchmark context lengths to 8k and 32k tokens.


\begin{figure}[t]
   \centering
   \begin{subfigure}[b]{0.49\textwidth}
       \centering
       \includegraphics[width=1\textwidth]{./figs/NIAH/small_fig/llama-3-8B-Instruct_spankv_baseline_128.pdf}
       \caption{\method{}, accuracy 73.8\%}
       \label{fig:NIAH_llama3_spankv}
   \end{subfigure}
   \begin{subfigure}[b]{0.49\textwidth}
      \centering
      \includegraphics[width=1\textwidth]{./figs/NIAH/small_fig/llama-3-8B-Instruct_pyramidkv_baseline_128.pdf}
      \caption{PyramidKV, accuracy 65.1\%}
      \label{fig:NIAH_llama3_pyramidkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.49\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/small_fig/llama-3-8B-Instruct_snapkv_baseline_128.pdf}
      \caption{SnapKV, accuracy 58.9\%}
      \label{fig:NIAH_llama3_snapkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.49\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/small_fig/llama-3-8B-Instruct_streamingllm_baseline_128.pdf}
      \caption{StreamingLLM, accuracy 23.7\%}
      \label{fig:NIAH_llama3_streamingllm}
   \end{subfigure}
   \caption{ NIAH benchmark for LLaMA3-8B-Instruct  with KV cache size=128 under 8k context length.}
   \label{fig:NIAH_llama3}
   \vspace{-20pt}
\end{figure}


% \textbf{Needle-In-A-HayStack Results.}
Table~\ref{tab:main_NIAH} provides statistical results for different compression methods. These findings clearly indicate the effectiveness of ChunkKV in managing varying token lengths and depth percentages, making it a robust choice for KV cache management in LLMs. Figure~\ref{fig:NIAH_llama3} presents the NIAH benchmark results for LLaMA-3-8B-Instruct. The vertical axis represents the depth percentage, while the horizontal axis represents the token length, with shorter lengths on the left and longer lengths on the right. A cell highlighted in green indicates that the method can retrieve the needle at that length and depth percentage. The detailed visualization of the NIAH benchmark can be found in the Appendix \ref{appendix:niah}. The visualization results demonstrate that \method{} outperforms other KV cache compression methods. 


\vspace{-10pt}
\subsection{Index Reuse}
\label{sec:index_reuse}

This section will evaluate the performance of the layer-wise index reuse approach with \method{} from the two aspects of efficiency and performance. 

% \paragraph{Settings}
\textbf{Measuring Efficiency.}
We evaluated the latency and throughput of ChunkKV compared to FullKV using LLaMA3-8B-Instruct on an A40 GPU. All experiments were conducted with reuse layer is 2, batch size set to 1 and inference was performed using Flash Attention 2, each experiment was repeated 10 times and the average latency and throughput were reported.


\begin{wraptable}{r}{0.65\textwidth}
   \centering
   \vspace{-10pt}
   \caption{Latency and throughput comparison between ChunkKV and FullKV under different input-output configurations. Percentages in parentheses indicate improvements over FullKV baseline.}
   \vspace{-5pt}
   \label{tab:efficiency}
   \resizebox{0.65\columnwidth}{!}{
   \begin{tabular}{l|cc|cc}
   \toprule
   \multirow{2}{*}{Method} & \multicolumn{2}{c|}{Sequence Length} & \multicolumn{2}{c}{Performance Metrics} \\
   \cmidrule{2-5}
   & Input & Output & Latency(s) $\downarrow$ & Throughput(T/S) $\uparrow$ \\
   \midrule
   FullKV & 4096 & 1024 & 43.60 & 105.92  \\
   ChunkKV & 4096 & 1024 & 37.52 (13.9\%) & 118.85 (12.2\%) \\
   ChunkKV\_reuse & 4096 & 1024 & \textbf{37.35} (14.3\%) & \textbf{124.09} (17.2\%) \\
   \midrule
   FullKV & 4096 & 4096 & 175.50 & 37.73 \\
   ChunkKV & 4096 & 4096 & 164.55 (6.2\%) & 40.58 (7.6\%) \\
   ChunkKV\_reuse & 4096 & 4096 & \textbf{162.85} (7.2\%) & \textbf{41.12} (9.0\%) \\
   \midrule
   FullKV & 8192 & 1024 & 46.48 & 184.08 \\
   ChunkKV & 8192 & 1024 & 37.83 (18.6\%) & 228.96 (24.4\%) \\
   ChunkKV\_reuse & 8192 & 1024 & \textbf{36.85} (20.7\%) & \textbf{232.99} (26.5\%) \\
   \midrule
   FullKV & 8192 & 4096 & 183.42 & 55.93 \\
   ChunkKV & 8192 & 4096 & 164.78 (10.2\%) & 65.14 (16.5\%) \\
   ChunkKV\_reuse & 8192 & 4096 & \textbf{162.15} (11.6\%) & \textbf{66.05} (18.1\%) \\
   \bottomrule
   \end{tabular}
   }
   \vspace{-10pt}
\end{wraptable}


The results in Table \ref{tab:efficiency} show that the layer-wise index reuse strategy (ChunkKV\_reuse) further boosts performance, achieving up to a 20.7\% reduction in latency, and throughput improvements are particularly notable for longer input sequences, with ChunkKV\_reuse delivering up to a 26.5\% improvement over FullKV. For more detailed results, please refer to Appendix \ref{appendix:efficiency_results}.

% \paragraph{Settings}

\begin{wraptable}{r}{0.45\textwidth}
   \centering
   \vspace{-12pt}
   \caption{Reusing Indexing Performance Comparison on LongBench and GSM8K. $\triangle$ indicates the performance degradation of index reuse compared to the baseline.}
   \vspace{-5pt}
   \resizebox{0.45\textwidth}{!}{
   \begin{tabular}{c|cc}
   \toprule
   \multirow{3}{*}{Model} & \multicolumn{2}{c}{\method{}} \\
   \cmidrule(lr){2-3}
   & Baseline & Index Reuse$_\triangle$  \\

   \midrule
   \multicolumn{3}{c}{LongBench} \\
   \midrule
   LLaMA-3-8B-Inst & 40.51 & 40.27$_{-0.59\%}$  \\
   Mistral-7B-Inst & 46.71 & 46.43$_{-0.59\%}$  \\
   Qwen2-7B-Inst   & 40.88 & 40.76$_{-0.29\%}$  \\
   \midrule
   \multicolumn{3}{c}{GSM8K} \\
   \midrule
   LLaMA-3-8B-Inst & 74.5 & 74.6$_{+0.13\%}$  \\
   Qwen2-7B-Inst   & 71.2 & 71.2$_{+0.00\%}$  \\
   \bottomrule
   \end{tabular}
   }
   \vspace{-10pt}
   \label{tab:reuse_ablation_mini}
\end{wraptable}
\textbf{Measuring Task Performance.}
We evaluate the effectiveness of our proposed layer-wise index reuse technique on LongBench~\citep{bai2023longbench} and GSM8K~\citep{gsm8k} benchmarks. For these experiments, we use the same configuration as our main LongBench experiments in Section~\ref{sec:long-context-overall}, with index reuse applied to consecutive layers (reuse layers = 2).

As shown in Table~\ref{tab:reuse_ablation_mini}, layer-wise index reuse maintains the performance of the models while reducing computational requirements. In LongBench, performance degradation is minimal (less than 0.6\%) for both models, while GSM8K shows neutral or slightly positive effects. This validates that semantic chunks selected by \method{} remain consistently important across adjacent transformer layers, enabling efficient computation without sacrificing accuracy. An additional analysis of different reuse depths and their impact on throughput is provided in Appendix~\ref{appendix:index_reuse}.


Overall, these findings on efficiency and performance suggest that layer-wise index reuse can be an effective technique for optimizing the efficiency-performance trade-off in KV cache compression, with the potential for model-specific tuning to maximize benefits.


% \section{Ablation study}
\subsection{Chunk Size}
This section aims to investigate the impact of chunk size on the performance of \method{}. Different chunk sizes will lead to varying degrees of compression on the semantic information of the data. We used the same experiment setting as in LongBench and NIAH (Section \ref{sec:long-context-overall}). The chunk size is set from the range $\{3,5,10,20,30\}$ under the compression rate 10\% for LongBench and the 128 KV cache size for NIAH. For more experiments with different compression ratios, see Appendix \ref{appendix:chunk_size}.


\begin{table}[h]
   \vspace{-10pt}
   \centering
   % \small
   \caption{LongBench and NIAH Results with Chunk Size Ablation}
   \resizebox{1\textwidth}{!}{
   \begin{tabular}{lcccccccc}
   \toprule
   \multirow{2}{*}{\textbf{Model}} & \multirow{2}{*}{\textbf{Full KV}} & \multicolumn{5}{c}{\textbf{Chunk KV}} & \multirow{2}{*}{\textbf{SnapKV}} & \multirow{2}{*}{\textbf{H2O}} \\
   \cmidrule(lr){3-7}
    & & \textbf{size=3} & \textbf{size=5} & \textbf{size=10} & \textbf{size=20} & \textbf{size=30} & & \\
    \midrule
    \multicolumn{9}{c}{\textbf{LongBench $\uparrow$}} \\
    \midrule
   LLaMA-3-8B-Instruct & 41.46 & 40.49 & 40.47 & \textbf{40.51} & 40.05 & 39.57 & 40.15 & 37.06 \\
   Mistral-7B-Instruct & 48.08 & 46.45 & 46.51 & \textbf{46.71} & 46.42 & 45.98 & 46.38 & 43.61 \\
   % Qwen2-7B-Instruct & 40.71 & 40.38 & 40.33 & 40.66 & \textbf{40.88} & 40.73 & 40.55 & 40.45 \\
   \midrule
   \multicolumn{9}{c}{\textbf{NIAH $\uparrow$}} \\
   \midrule
   LLaMA-3-8B-Instruct & 74.6 & 65.6 & 69.1 & \textbf{73.8} & 72.0 & 71.2 & 58.9 & 47.9 \\
   Mistral-7B-Instruct & 99.8 & 98.1 & 99.2 & \textbf{99.8} & 99.8 & 99.1 & 91.6 & 88.2 \\
   \bottomrule
   \end{tabular}
   }
   \vspace{-10pt}
   \label{tab:ablation_chunk_size} 
   \end{table}


As shown in Table~\ref{tab:ablation_chunk_size}, the performance remains relatively stable when the chunk size is between 5 and 20, with the best results consistently achieved at chunk size 10. When the chunk size is too small (e.g., 3), the context is fragmented, leading to a slight drop in performance. Conversely, when the chunk size is too large (for example, 30), the semantic granularity becomes too coarse, and important fine-grained information may be lost, also resulting in performance degradation. This trend is consistent across both LongBench and NIAH benchmarks, as well as across different model architectures, indicating that the optimal chunk size is not highly sensitive to the specific task or model. We also provide a line graph in Appendix~\ref{appendix:chunk_size} to visually illustrate this trend. Based on these findings, we recommend using a chunk size of 10 as a robust default for most applications. For users with specific requirements, chunk size can be further tuned, but our results suggest that moderate values (5-20) generally offer a good trade-off between semantic preservation and compression efficiency.


\subsection{Comparing with KV Quantization}
For comprehensively evaluate the effectiveness of \method{}, we performed experiments comparing \method{} with the KIVI quantization methods~\citep{liu2024kivi}. Although both approaches aim to optimize LLM inference, they operate on fundamentally different principles: Quantization reduces KV matrix precision, whereas our eviction method reduces KV matrix size. For more detail, see Appendix~\ref{appendix:kv_cache_quantization}.

From an implementation perspective, quantization methods require the full KV cache during prefilling to produce quantized representations, which are then used during decoding. In contrast, \method{} employs token removal prior to prefilling, enabling operation with a compressed cache throughout the entire inference process. This distinction creates different efficiency profiles, and each method offers unique advantages.

 Due to the fact that KIVI requires an old Python version, the ChunkKV results are not aligned with Table~\ref{tab:efficiency}. The efficiency results in Table~\ref{table:longbench_kv_cache_quantization_efficiency} reveal ChunkKV's significant advantages in latency-critical metrics. Although both methods substantially reduce cache size (ChunkKV to 10\% and KIVI-2bits to 15.63\%), ChunkKV delivers superior metrics for Time to First Token (TTFT) and Token Processing Time (TPOT). Most remarkably, ChunkKV achieves a 164.66s total generation time compared to KIVI's 226.52s at 2-bit quantization, representing a 27.3\% improvement in overall inference speed. 

\begin{table}[!ht]
   \centering
   \small
   \caption{Efficiency Results for ChunkKV and KIVI on LlaMa-3-8B-Instruct}
   \resizebox{\textwidth}{!}{

   \begin{tabular}{lcccccccc}
   \toprule
   \textbf{Configuration} & \textbf{Prompt} & \textbf{Output} & \textbf{Compression} & \textbf{Prefilling} & \textbf{Cache} & \textbf{TTFT} & \textbf{TPOT} & \textbf{Total Gen.} \\
   & \textbf{Length} & \textbf{Length} & \textbf{Ratio / nbits} & \textbf{Time(s) $\downarrow$} & \textbf{Size(GB) $\downarrow$} & \textbf{(s) $\downarrow$} & \textbf{(ms) $\downarrow$} & \textbf{Time(s) $\downarrow$} \\
   \midrule
   FullKV & 8192 & 4096 & - & 1.5621 & 1.0000 & 1.6013 & 45.9421 & 184.2934 \\
   KIVI & 8192 & 4096 & 2bits & 1.4024 & 0.1563 & 1.4325 & 54.9561 & 226.5234 \\
   KIVI & 8192 & 4096 & 4bits & 1.3916 & 0.2813 & 1.4146 & 52.0510 & 214.5634 \\
   \rowcolor{red!20} ChunkKV & 8192 & 4096 & 10\% & 1.3653 & 0.1000 & 1.3914 & 39.8702 & 164.6600 \\
   \bottomrule
   \end{tabular}
   }
   \label{table:longbench_kv_cache_quantization_efficiency}
   \end{table}


\subsection{Analysis of Hybrid Compression: Chunk-level vs. Token-level at Different Layer Depths}
\label{sec:hybrid_analysis}

A critical question raised during the review process was whether the benefits of chunk-level compression diminish in deeper Transformer layers, where semantic information becomes more abstract and diffused. To investigate this, we designed a hybrid compression model that applies different strategies at varying network depths.

\paragraph{Experimental Setup}
We created a hybrid version of the LLaMA-3-8B-Instruct model. For the first 16 layers (bottom half), we applied our chunk-based \method{}. For the final 16 layers (top half), we applied SnapKV, a state-of-the-art token-level compression method. We then compared this hybrid model's performance on the diverse LongBench benchmark against pure \method{} and pure SnapKV at identical compression ratios.

\begin{table}[h!]
\centering
\caption{Performance of the hybrid compression model on LongBench. While the hybrid model shows strengths in global understanding tasks, pure \method{} achieves the best overall performance, validating the robustness of preserving semantic chunks even in deep layers.}
\label{tab:hybrid_experiment}
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|cccccc|c}
\toprule
\textbf{Method} & \textbf{Single-Doc QA} & \textbf{Multi-Doc QA} & \textbf{Summarization} & \textbf{Few-shot} & \textbf{Synthetic} & \textbf{Code} & \textbf{Avg. Score} $\uparrow$ \\
\midrule
FullKV & 32.19 & 34.59 & 24.96 & 68.48 & 36.96 & 54.41 & 41.46 \\
SnapKV (10\% Ratio) & 28.11 & 32.55 & 24.12 & 67.81 & 36.01 & 55.67 & 40.15 \\
\midrule
\rowcolor{blue!10}Hybrid Model (10\% Ratio) & 28.38 & 30.37 & \textbf{24.54} & \textbf{67.87} & 36.37 & 55.32 & 39.80 \\
\rowcolor{red!20}\method{} (10\% Ratio) & \textbf{28.50} & \textbf{33.46} & 22.20 & 67.62 & \textbf{37.47} & \textbf{58.98} & \textbf{40.51} \\
\bottomrule
\end{tabular}%
}
\end{table}

\paragraph{Analysis and Insights}
The results in Table~\ref{tab:hybrid_experiment} provide two key insights. First, the pure \method{} model achieves the highest overall average score. This empirically validates our core hypothesis: preserving local semantic integrity via chunking is a robust and effective strategy across all layers of the model, even where information is highly processed.

Second, the hybrid model reveals a fascinating, task-dependent performance trade-off.
\begin{itemize}[leftmargin=*, topsep=0pt, itemsep=2pt, parsep=0pt]
    \item \textbf{For local information retrieval tasks} (e.g., Single- and Multi-Document QA), pure \method{} is significantly superior. These tasks often require retrieving precise, intact text fragments. \method{}'s ability to preserve complete linguistic units prevents critical information from being fragmented, which is essential in these scenarios.
    \item \textbf{For global understanding tasks} (e.g., Summarization and Few-shot Learning), the hybrid model performs best. In these tasks, synthesizing information from across the entire context is key. In the deeper layers, where abstract representations are formed, the token-level SnapKV method may be more adept at retaining a broader, more diffuse set of globally important signals.
\end{itemize}
This analysis does not undermine our approach but rather enriches it, suggesting that the future of KV cache compression may lie in adaptive, task-aware strategies. However, for a general-purpose and robust solution, pure \method{} proves to be the most effective.
\section{Conclusion}
In this paper, we first indicate that current KV cache methods lack the semantic information of the data, and then we proposed a novel KV cache compression method that preserves semantic information by retaining more informative chunks. Through extensive experiments across multiple state-of-the-art LLMs (including DeepSeek-R1, LLaMA-3, Qwen2, and Mistral) and diverse benchmarks (GSM8K, LongBench, NIAH, and JailbreakV), we demonstrate that ChunkKV consistently outperforms existing methods while using only a fraction of the memory. Our comprehensive analysis shows that ChunkKV's chunk-based approach maintains crucial contextual information, leading to superior performance in complex reasoning tasks, long-context understanding, and safety evaluations. The method's effectiveness is particularly evident in challenging scenarios like many-shot GSM8K and multi-document QA tasks, where semantic coherence is crucial. Furthermore, our proposed layer-wise index reuse technique provides significant computational efficiency gains with minimal performance impact, achieving up to 20.7\% latency reduction and 26.5\% throughput improvement. These findings, supported by detailed quantitative analysis and ablation studies, establish ChunkKV as a significant advancement in KV cache compression technology, offering an effective solution for deploying LLMs in resource-constrained environments while maintaining high-quality outputs.

% Looking ahead, several promising directions emerge from this work. A key avenue for future research is the development of more sophisticated hybrid compression models. Inspired by the insights from our experiments and reviewer feedback, a two-stage hybrid strategy appears particularly compelling. Such a model would first apply ChunkKV for coarse-grained preservation of essential semantic blocks, followed by a token-level pruning method for fine-grained optimization within the less critical chunks. This multi-granularity approach could potentially offer the best of both worlds, achieving higher compression rates while robustly preserving information. Furthermore, exploring adaptive methods that can adjust the compression strategy based on the task type—shifting between chunk-level and token-level focus—presents another exciting path toward more intelligent and efficient LLM inference.

\section*{Acknowledge}
This work was supported by the NSFC grant 62432008; RGC RIF grant R6021-20; an RGC TRS grant T43-513/23N-2; RGC CRF grants C7004-22G, C1029-22G and C6015-23G; NSFC/RGC grant CRS\_HKUST601/24 and RGC GRF grants 16207922, 16207423 and 16203824; National Natural Science Foundation of China (Grant No.62506318); Guangdong Provincial Department of Education Project (Grant No.2024KQNCX028); Scientific Research Projects for the Higher-educational Institutions (Grant No.2024312096), Education Bureau of Guangzhou Municipality; Guangzhou-HKUST(GZ) Joint Funding Program (Grant No.2025A03J3957), Education Bureau of Guangzhou Municipality;the Guangzhou Municipal Joint Funding Project with Universities and Enterprises under Grant No. 2024A03J0616 and Guangzhou Municipality Big Data Intelligence Key Lab (2023A03J0012).

\bibliographystyle{unsrtnat} 
\bibliography{neurips_2025}
% \bibliographystyle{icml2025}
% \newpage
% \appendix
% \onecolumn

% \input{checklist.tex}


\clearpage
\onecolumn
\appendix 
\etocdepthtag.toc{mtappendix}
\etocsettagdepth{mtchapter}{none}
\etocsettagdepth{mtappendix}{subsection}
\renewcommand{\contentsname}{Appendix}
\tableofcontents 
\clearpage


\section{In-depth Analysis of ChunkKV vs. Discrete Token Methods}
\label{appendix:chunkkv_analysis}

\subsection{Quantitative Analysis}


To rigorously evaluate the effectiveness of ChunkKV compared to discrete token-based methods, we conducted systematic experiments using a LLaMA-3-8B-Instruct model. We randomly selected 100 sequences from the each sub-category of LongBench dataset and analyzed two key metrics across different model layers: KV cache L1 loss and attention cosine similarity. For each sequence, we:
1. Computed the full KV cache and attention patterns without compression as ground truth.
2. Applied ChunkKV, SnapKV, and H2O compression methods with a fixed 10\% compression ratio, and the parameters of the three methods are set the same as in Table \ref{table:longbench}.
3. Measured the differences between compressed and uncompressed versions.
\begin{figure*}[ht]
   \centering
   \includegraphics[width=1\textwidth]{./figs/attn_metrics_comparison.pdf}
   \caption{\centering Layer-wise comparison of L1 loss and attention cosine similarity between ChunkKV and discrete token-based methods in Single-Document QA sub-category of LongBench.}
   \label{fig:chunkkv_vs_discrete_token_quantitative}
\end{figure*}

\textbf{Results Analysis} As shown in Figure \ref{fig:chunkkv_vs_discrete_token_quantitative}, ChunkKV demonstrates superior performance across both metrics:

\begin{itemize}
\item \textbf{KV Cache L1 Loss:} ChunkKV achieves consistently lower L1 loss compared to SnapKV and H2O, particularly in the early and middle layers (layers 5-25). This indicates better preservation of the original KV cache information through the semantic chunk-based approach.

\item \textbf{Attention Cosine Similarity:} ChunkKV exhibits higher similarity scores across most layers, with notably strong performance in layers 0-5 and 20-30. This suggests better preservation of attention relationships between tokens, which is crucial for maintaining semantic understanding.
\end{itemize}

To quantify these improvements, we calculated average metrics across all layers, as shown in Table \ref{tab:quantitative_metrics_detailed}. ChunkKV achieves both the lowest L1 loss and highest attention cosine similarity, outperforming both baseline methods.


\begin{table*}[h]
   \caption{Detailed comparison of KV cache metrics across different task categories in LongBench.}
   \label{tab:quantitative_metrics_detailed}
   % \resizebox{1\textwidth}{!}{
   \centering
   \begin{tabular}{l|ccccc}
   \toprule
   \multirow{2}{*}{Method} & Single-Document & Multi-Document & \multirow{2}{*}{Summarization} & Few-shot & Synthetic \\
    & QA & QA & & Learning & \& Code \\
   \midrule
   \multicolumn{6}{c}{KV Cache L1 Loss $\downarrow$} \\
   \midrule
   \rowcolor{red!20}\textbf{ChunkKV} & \textbf{0.8741} & \textbf{0.8748} & \textbf{0.8770} & \textbf{0.8861} & \textbf{0.8726} \\
   SnapKV & 0.8921 & 0.8933 & 0.8930 & 0.8917 & 0.8938 \\
   H2O & 0.8905 & 0.8917 & 0.8913 & 0.8906 & 0.8915 \\
   \midrule
   \multicolumn{6}{c}{Attention Score Cosine Similarity $\uparrow$} \\
   \midrule
   \rowcolor{red!20}\textbf{ChunkKV} & \textbf{0.3567} & \textbf{0.3651} & \textbf{0.3841} & \textbf{0.4330} & \textbf{0.3805} \\
   SnapKV & 0.3513 & 0.3594 & 0.3771 & 0.4305 & 0.3759 \\
   H2O & 0.3491 & 0.3572 & 0.3750 & 0.4284 & 0.3740 \\
   \bottomrule
   \end{tabular}
   % }
   \end{table*}

% \begin{table}[h]
% \caption{Average KV cache L1 loss and attention cosine similarity across all layers.}
% \label{tab:quantitative_metrics}
% \resizebox{1\textwidth}{!}{
% \centering
% \begin{tabular}{l|cc}
% \toprule
% Method & KV Cache L1 Loss $\downarrow$ & Attention Cosine Similarity $\uparrow$ \\
% \midrule
% \rowcolor{red!20}ChunkKV & \textbf{0.8741} & \textbf{0.3567} \\
% SnapKV & 0.8921 & 0.3513 \\
% H2O & 0.8905 & 0.3491 \\
% \bottomrule
% \end{tabular}
% }
% \end{table}

\textbf{Significance of Results} While the improvements may appear modest in absolute terms (approximately 2\% in L1 loss and 1.5\% in cosine similarity), their practical significance is substantial. These metrics reflect the model's ability to maintain crucial semantic relationships and attention patterns, which are essential for complex reasoning tasks. The consistent improvements across different sequences demonstrate that preserving semantic chunks leads to better information retention than selecting individual tokens.

The enhanced performance is particularly evident in the middle layers of the model, which are typically responsible for higher-level semantic processing. This provides concrete evidence for why ChunkKV achieves superior performance on downstream tasks compared to discrete token-based methods.

\subsection{Hypothetical Scenario}

To provide a deeper understanding of ChunkKV's effectiveness compared to discrete token-based methods, we present a detailed analysis using a hypothetical scenario. This analysis aims to illustrate the fundamental differences between these approaches and explain why ChunkKV is more effective at preserving semantic information in long contexts.


Consider a comprehensive document that contains detailed information on various animals, including their habitats, diets, and behaviors. A user asks the question "What do pandas eat in the wild?"

Both ChunkKV and discrete token-based methods would use this question to calculate attention scores for the document. However, their approaches to selecting and retaining information differ significantly.

\subsubsection{Discrete Token-based Method}
A discrete token-based method might identify and retain individual tokens with high relevance scores, such as:
\begin{itemize}
    \item ``pandas",``eat", ``bamboo", ``wild", ``diet", ``food"
\end{itemize}

Although these tokens are relevant, they lack context and coherence. The method might discard other essential tokens that provide crucial context or complete the information.

\subsubsection{ChunkKV Method}
In contrast, ChunkKV would identify and retain semantically meaningful chunks, such as:
\begin{itemize}
    \item ``In the wild, pandas primarily eat bamboo shoots and leaves"
    \item ``Their diet consists of 99\% bamboo, but they occasionally consume other vegetation"
    \item ``Wild pandas may also eat small rodents or birds when available"
\end{itemize}

By preserving these chunks, ChunkKV maintains not only the relevant keywords but also their contextual relationships and additional pertinent information.

\subsection{Comparative Analysis}
The advantages of ChunkKV become evident when we consider how these retained pieces of information would be used in subsequent processing:

\begin{enumerate}
    \item \textbf{Contextual Understanding}: Discrete tokens require the model to reconstruct meaning from isolated words, which could lead to ambiguity. ChunkKV provides complete phrases or sentences, allowing for immediate and accurate comprehension.
    
    \item \textbf{Semantic Coherence}: ChunkKV preserves the semantic relationships within a chunk, crucial to understanding nuances such as the difference between primary and occasional food sources for pandas.
    
    \item \textbf{Information Density}: A single chunk can contain multiple relevant tokens in their proper context, potentially retaining more useful information within the same compressed cache size compared to discrete methods.
    
    \item \textbf{Reduced Ambiguity}: Discrete methods might retain the token ``eat" from various sentences about different animals. ChunkKV ensures that ``eat" is preserved specifically in the context of pandas in the wild.
    
    \item \textbf{Temporal and Logical Flow}: ChunkKV can maintain the sequence of ideas present in the original text, preserving any temporal or logical progression that may be crucial for understanding.
\end{enumerate}

\subsection{Implications for Model Performance}
This analysis suggests several key implications for model performance:

\begin{itemize}
    \item \textbf{Improved Accuracy}: By retaining contextually rich information, ChunkKV enables more accurate responses to queries, especially those requiring nuanced understanding.
    
    \item \textbf{Enhanced Long-context Processing}: Preservation of semantic chunks allows for better handling of long-range dependencies and complex reasoning tasks.
    
    \item \textbf{Reduced Computational Overhead}: Although both methods compress the KV cache, ChunkKV's approach may reduce the need for extensive context reconstruction, potentially improving inference efficiency.
    
    \item \textbf{Versatility}: The chunk-based approach is likely to be more effective across a wide range of tasks and domains as it preserves the natural structure of language.
\end{itemize}

This in-depth analysis demonstrates why ChunkKV is more effective in preserving semantic information in long contexts. By retaining coherent chunks of text, it provides language models with more contextually rich and semantically complete information, leading to improved performance in tasks that require deep understanding and accurate information retrieval from extensive documents.


\subsection{Implementation Details}
\label{appendix:chunkkv_implementation}
In our implementation of \method{}, we focus on maximizing computational efficiency to ensure practical deployment in real-world applications. In algorithm~\ref{alg:chunkkv_implementation}, we adopt several key optimization strategies to reduce both computational overhead and memory footprint during inference.

\begin{algorithm}
\caption{\method{} Implementation}
\label{alg:chunkkv_implementation}
\begin{algorithmic}[1]
\State \textbf{Input:} $\mathbf{Q} \in \mathbb{R}^{T_q \times d}$, $\mathbf{K} \in \mathbb{R}^{T_k \times d}$, $\mathbf{V} \in \mathbb{R}^{T_v \times d}$, observe window size $w$, chunk size $c$, compressed KV cache max length $L_{\text{max}}$
\State \textbf{Output:} Compressed KV cache $\mathbf{K}'$, $\mathbf{V}'$

\State $q_{\text{observe}} \gets \mathbf{Q}_{T_q - w:T_q}$ \Comment{Extract observe window queries}
\State $\mathbf{A} \gets q_{\text{observe}} \mathbf{K}^T$ \Comment{$\mathbb{R}^{w \times T_k}$ attention scores}

\State $C \gets \left\lceil \frac{T_k}{c} \right\rceil$ \Comment{Number of chunks}
\State Allocate $\text{chunk\_scores} \in \mathbb{R}^C$ to store chunk importance

\State \textbf{Vectorized Chunk Score Calculation:}
\For {$i = 0$ to $C-1$}
    \State $\text{start} \gets i \cdot c$
    \State $\text{end} \gets \min((i+1) \cdot c, T_k)$
    \State $\text{chunk\_scores}[i] \gets \sum_{j=\text{start}}^{\text{end}-1} \sum_{q=0}^{w-1} \mathbf{A}_{q,j}$ \Comment{Vectorized sum}
\EndFor

\State $k \gets \min(\left\lfloor \frac{L_{\text{max}}-w}{c} \right\rfloor, C)$ \Comment{Limit k to available chunks}
\State $\text{kept\_chunks} \gets \text{Top-K}(\text{chunk\_scores}, k)$ \Comment{Indices of top-k chunks}

\State \textbf{Build selection mask:}
\State $\text{mask} \gets \text{zeros}(T_k)$ \Comment{Binary mask for token selection}
\For {each $i$ in $\text{kept\_chunks}$}
    \State $\text{start} \gets i \cdot c$
    \State $\text{end} \gets \min((i+1) \cdot c, T_k)$
    \State $\text{mask}[\text{start}:\text{end}] \gets 1$ \Comment{Mark entire chunk as kept}
\EndFor

\State \textbf{Ensure recent context is preserved:}
\State $\text{mask}[T_k-w:T_k] \gets 1$ \Comment{Always keep most recent tokens}

\State \textbf{Efficient compression:}
\State $\text{indices} \gets \text{nonzero}(\text{mask})$ \Comment{Get indices of tokens to keep}
\State $\mathbf{K}' \gets \text{index\_select}(\mathbf{K}, \text{indices})$
\State $\mathbf{V}' \gets \text{index\_select}(\mathbf{V}, \text{indices})$

\State \textbf{Return} $\mathbf{K}'$, $\mathbf{V}'$
\end{algorithmic}
\end{algorithm}

The optimized implementation features several key improvements:

\textbf{Vectorized Operations.} We leverage batch matrix operations for attention score calculation instead of explicit loops, significantly reducing computational overhead. By computing chunk scores through optimized tensor operations (lines 7-11), we achieve substantial acceleration compared to token-by-token processing.

\textbf{Memory Optimization.} Our implementation employs a binary mask approach (lines 15-20) that avoids redundant memory allocations and reduces fragmentation. This single-pass token selection strategy consolidates both chunk preservation and recent window retention operations, leading to more efficient memory usage during inference.

\textbf{Boundary Handling.} We incorporate robust boundary checks throughout the algorithm (lines 10, 12, 18) to handle edge cases such as incomplete chunks or limited compressed length. This ensures stable performance across variable input lengths without requiring special case handling.

\textbf{Single-pass Compression.} Rather than performing multiple concatenation operations, our implementation builds a comprehensive selection mask that includes both semantically important chunks and recent context tokens. This approach (lines 21-25) minimizes memory copying operations and reduces computational overhead.

Our PyTorch implementation further extends these optimizations by leveraging GPU acceleration for all vector and matrix operations. The integration of CUDA kernels for critical operations like attention score calculation and top-k selection enables real-time performance even with long context windows.


\section{Additional Experiments}


% \subsection{Efficiency}
% \label{appendix:efficiency}

% We evaluated the latency and throughput of ChunkKV compared to FullKV using LLaMA3-8B-Instruct on an A40 GPU. All experiments were conducted with a batch size of 1 and inference was performed using Flash Attention 2, each experiment was repeated 10 times and the average latency and throughput were reported. The results demonstrate that ChunkKV not only maintains competitive performance but also achieves improved efficiency, which is further enhanced by layer-wise index reuse.

% \begin{table*}[ht]
%    \caption{Latency and throughput comparison between ChunkKV and FullKV under different input-output configurations. Percentages in parentheses indicate improvements over FullKV baseline.}
%    \label{tab:efficiency}
%    \resizebox{\textwidth}{!}{
%    \begin{tabular}{l|cc|cc}
%    \toprule
%    \multirow{2}{*}{Method} & \multicolumn{2}{c|}{Sequence Length} & \multicolumn{2}{c}{Performance Metrics} \\
%    \cmidrule{2-5}
%    & Input & Output & Latency(s) $\downarrow$ & Throughput(T/S) $\uparrow$ \\
%    \midrule
%    FullKV & 4096 & 1024 & 43.60 & 105.92  \\
%    ChunkKV & 4096 & 1024 & 37.52 (13.9\%) & 118.85 (12.2\%) \\
%    ChunkKV\_reuse & 4096 & 1024 & \textbf{37.35} (14.3\%) & \textbf{124.09} (17.2\%) \\
%    \midrule
%    FullKV & 4096 & 4096 & 175.50 & 37.73 \\
%    ChunkKV & 4096 & 4096 & 164.55 (6.2\%) & 40.58 (7.6\%) \\
%    ChunkKV\_reuse & 4096 & 4096 & \textbf{162.85} (7.2\%) & \textbf{41.12} (9.0\%) \\
%    \midrule
%    FullKV & 8192 & 1024 & 46.48 & 184.08 \\
%    ChunkKV & 8192 & 1024 & 37.83 (18.6\%) & 228.96 (24.4\%) \\
%    ChunkKV\_reuse & 8192 & 1024 & \textbf{36.85} (20.7\%) & \textbf{232.99} (26.5\%) \\
%    \midrule
%    FullKV & 8192 & 4096 & 183.42 & 55.93 \\
%    ChunkKV & 8192 & 4096 & 164.78 (10.2\%) & 65.14 (16.5\%) \\
%    ChunkKV\_reuse & 8192 & 4096 & \textbf{162.15} (11.6\%) & \textbf{66.05} (18.1\%) \\
%    \bottomrule
%    \end{tabular}
%    }
% \end{table*}
% \begin{table*}[ht]
%    \caption{Latency and throughput comparison between ChunkKV and FullKV under different input-output configurations. Percentages in parentheses indicate improvements over FullKV baseline.}
%    \label{tab:efficiency}
%    \centering
%    \resizebox{0.6\textwidth}{!}{
%    \begin{tabular}{l|cc|cc}
%    \toprule
%    \multirow{2}{*}{Method} & \multicolumn{2}{c|}{Sequence Length} & \multicolumn{2}{c}{Performance Metrics} \\
%    \cmidrule{2-5}
%    & Input & Output & Latency(s) $\downarrow$ & Throughput(T/S) $\uparrow$ \\
%    \midrule
%    FullKV & 4096 & 1024 & 43.60 & 105.92  \\
%    ChunkKV & 4096 & 1024 & 37.52 (13.9\%) & 118.85 (12.2\%) \\
%    ChunkKV\_reuse & 4096 & 1024 & \textbf{37.35} (14.3\%) & \textbf{124.09} (17.2\%) \\
%    \midrule
%    FullKV & 4096 & 4096 & 175.50 & 37.73 \\
%    ChunkKV & 4096 & 4096 & 164.55 (6.2\%) & 40.58 (7.6\%) \\
%    ChunkKV\_reuse & 4096 & 4096 & \textbf{162.85} (7.2\%) & \textbf{41.12} (9.0\%) \\
%    \midrule
%    FullKV & 8192 & 1024 & 46.48 & 184.08 \\
%    ChunkKV & 8192 & 1024 & 37.83 (18.6\%) & 228.96 (24.4\%) \\
%    ChunkKV\_reuse & 8192 & 1024 & \textbf{36.85} (20.7\%) & \textbf{232.99} (26.5\%) \\
%    \midrule
%    FullKV & 8192 & 4096 & 183.42 & 55.93 \\
%    ChunkKV & 8192 & 4096 & 164.78 (10.2\%) & 65.14 (16.5\%) \\
%    ChunkKV\_reuse & 8192 & 4096 & \textbf{162.15} (11.6\%) & \textbf{66.05} (18.1\%) \\
%    \bottomrule
%    \end{tabular}
%    }
% \end{table*}

% The results in Table \ref{tab:efficiency} highlight several key findings:

% \begin{itemize} 
%     \item ChunkKV consistently outperforms FullKV across all configurations, achieving latency improvements ranging from 6.2\% to 18.6\%. 
%     \item The layer-wise index reuse strategy (ChunkKV\_reuse) further boosts performance, achieving up to a 20.7\% reduction in latency. 
%     \item Throughput improvements are particularly notable for longer input sequences, with ChunkKV\_reuse delivering up to a 26.5\% improvement over FullKV.
% \end{itemize}

% These efficiency gains are even more pronounced with longer input sequences, demonstrating that ChunkKV is particularly well-suited for processing long-context inputs while maintaining minimal memory overhead.

\subsection{Layer-Wise Index Reuse}

\subsubsection{Efficiency Analysis}
\label{appendix:index_reuse_efficiency}
The layer-wise index reuse method significantly reduces the computational complexity of \method{}. Without index reuse, \method{} would be applied to all $N_{\text{layers}}$ layers, resulting in a total compression time of $N_{\text{layers}} \cdot T_{\text{compress}}$, where $T_{\text{compress}}$ is the time taken to compress one layer. With index reuse, \method{} is only applied to $\frac{N_{\text{layers}}}{N_{\text{reuse}}}$ layers, reducing the total time to $\frac{N_{\text{layers}}}{N_{\text{reuse}}} \cdot T_{\text{compress}} + (N_{\text{layers}} - \frac{N_{\text{layers}}}{N_{\text{reuse}}}) \cdot T_{\text{select}}$, where $T_{\text{select}}$ is the time taken to select indices, which is typically much smaller than $T_{\text{compress}}$. This results in a theoretical speedup factor of:

\[
\text{Speedup} = \frac{N_{\text{layers}} \cdot T_{\text{compress}}}{\frac{N_{\text{layers}}}{N_{\text{reuse}}} \cdot T_{\text{compress}} + (N_{\text{layers}} - \frac{N_{\text{layers}}}{N_{\text{reuse}}}) \cdot T_{\text{select}}}
\]

Assuming $T_{\text{select}}$ is negligible compared to $T_{\text{compress}}$, this simplifies to approximately $N_{\text{reuse}}$. In practice, the actual speedup may vary depending on the specific implementation and hardware, but it can still lead to substantial time savings, especially for models with a large number of layers. 


\subsubsection{Index Reuse Performance}
\label{appendix:index_reuse}
Figure~\ref{fig:index_reuse_performance_gsm8k} and Table \ref{tab:reuse_GSM8K} illustrates the performance of \method{} with varying index reuse layers on the GSM8K benchmark. Figure~\ref{fig:index_reuse_performance_longbench} and Table \ref{tab:reuse_longbench} illustrates the performance of \method{} with varying index reuse layers on the LongBench benchmark. The experiment reveals that math problems are more sensitive to index reuse layers compared to LongBench. Both LLaMA3-8B-Instruct and Qwen2-7B-Instruct exhibit significant performance degradation, with LLaMA3-8B-Instruct experiencing a steeper decline after two layers of index reuse than Qwen2-7B-Instruct. This suggests that the Qwen2-7B-Instruct model may be more robust to index reuse.

\begin{figure}[!ht]
   \centering
   \includegraphics[width=0.7\textwidth]{./figs/Index_Reuse_GSM8K_Performance.pdf}
   \caption{\centering GSM8K Performance Comparison with different index reuse layers}
   \label{fig:index_reuse_performance_gsm8k}
\end{figure}

\begin{table}[!ht]
   \centering
   \caption{Reusing Indexing Performance Comparison on GSM8K}
   % \resizebox{0.8\textwidth}{!}{
   \begin{tabular}{c|cccccccc}
   \toprule
   \multirow{3}{*}{Model} & \multicolumn{8}{c}{Number of Index Reuse Layers} \\
   \cmidrule(lr){2-9}
   & 1 & 2 & 3 & 5 & 8 & 10 & 20 & 28/32 \\
   \midrule
   LLaMA-3-8B-Instruct & 74.5 & 74.6 & 65.9 & 44.1 & 15.3 & 2.20 & 1.60 & 1.80 \\
   Qwen2-7B-Instruct   & 71.2 & 71.2 & 73.0 & 69.4 & 67.4 & 71.1 & 54.0 & 49.4 \\
   \bottomrule
   \end{tabular}
   % }
   \label{tab:reuse_GSM8K}
\end{table}


\begin{figure}[!ht]
   % \vspace{-10pt}
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/Index_Reuse_LongBench_Performance.pdf}
   \caption{ Comparison with different index reuse layers on LongBench.}
   \label{fig:index_reuse_performance_longbench}
   \vspace{-10pt}
\end{figure}

\begin{table}[!ht]
   \centering
   \caption{Reusing Indexing Performance Comparison on LongBench}
   % \resizebox{0.8\textwidth}{!}{
   \begin{tabular}{c|ccccccc}
   \toprule
   \multirow{3}{*}{Model} & \multicolumn{7}{c}{Number of Index Reuse Layers} \\
   \cmidrule(lr){2-8}
   & 1 & 2 & 3 & 5  & 10 & 20 & 28/32 \\
   \midrule
   LLaMA-3-8B-Instruct & 40.51 & 40.27 & 39.45 & 40.05 & 39.59 & 37.42 & 35.67 \\
   Mistral-7B-Instruct & 46.71 & 46.43 & 46.48 & 45.90 & 44.36 & 41.84 & 39.56 \\
   Qwen2-7B-Instruct   & 40.88 & 40.76 & 40.91 & 41.58 & 40.19 & 37.42 & 36.79 \\
   \bottomrule
   \end{tabular}
   % }
   \label{tab:reuse_longbench}
\end{table}


\subsubsection{Layer-Wise Index Similarity}
\label{appendix:index_reuse_similarity}


This section details the experiment of layer-wise index reuse similarity described in Section \ref{sec:layer_wise_index_reuse}. The inference prompt is randomly selected from the LongBench benchmark, and the preserved indices for H2O, SnapKV, and \method{} are saved in the log file. For multi-head attention, only the indices of the first head are saved. PyramidKV, which has varying preserved index sizes across different layers, is not applicable for this experiment. Then we calculate the Jaccard similarity of the preserved indices of adjacent layers for different models. Table \ref{tab:app_jaccard_similarity_models} shows the Jaccard similarity of the preserved indices of adjacent layers for different models.

\begin{table}[!ht]
   \centering
   \caption{Retained KV Cache Indices Similarity of Adjacent Layers for Different Models.}
   \resizebox{0.8\textwidth}{!}{
   \begin{tabular}{l|ccc}
   \toprule
   \textbf{Method} & \textbf{H2O} & \textbf{SnapKV} & \textbf{\method{}}  \\
   \midrule
   LLaMA-3-8B-Instruct     & 25.31\%& 27.95\% & \textbf{57.74\%} \\
   Qwen2-7B-Instruct        & 14.91\%& 16.50\% & \textbf{44.26\%} \\
   Mistral-7B-Instruct      & 15.15\% & 15.78\% & \textbf{52.16\%} \\
   \bottomrule
   \end{tabular}
   }
   \label{tab:app_jaccard_similarity_models}
\end{table}

Figures \ref{fig:app_index_reuse_heatmap_llama_h2o}-\ref{fig:app_index_reuse_heatmap_llama_chunkkv} (LLaMA-3-8B-Instruct), 
\ref{fig:app_index_reuse_heatmap_mistral_h2o}-\ref{fig:app_index_reuse_heatmap_mistral_chunkkv} (Mistral-7B-Instruct), and
\ref{fig:app_index_reuse_heatmap_qwen_h2o}-\ref{fig:app_index_reuse_heatmap_qwen_chunkkv} (Qwen2-7B-Instruct) display the heatmaps of layer-wise indices similarity of the preserved KV cache indices by H2O, SnapKV and \method{} on different models. 
The pattern of the layer-wise indices similarity heatmap is consistent across different models, aligning with our findings in Section \ref{sec:layer_wise_index_reuse}.
\clearpage
\begin{figure*}[!ht]
   \centering
      \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_llama_h2o_1example.pdf}
      \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices by H2O on LLaMA-3-8B-Instruct}
      \label{fig:app_index_reuse_heatmap_llama_h2o}
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_llama_snapkv_1example.pdf}
   \caption{Layer-wise similarity heatmaps of the preserved KV cache indices by SnapKV on LLaMA-3-8B-Instruct}
   \label{fig:app_index_reuse_heatmap_llama_snapkv}   
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_llama_spankv_1example.pdf}
   \caption{Layer-wise similarity heatmaps of the preserved KV cache indices by ChunkKV on LLaMA-3-8B-Instruct}
   \label{fig:app_index_reuse_heatmap_llama_chunkkv}   
\end{figure*}

\begin{figure*}[!ht]
   \centering
      \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_mistral_h2o_1example.pdf}
      \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices by H2O on Mistral-7B-Instruct}
      \label{fig:app_index_reuse_heatmap_mistral_h2o}
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_mistral_snapkv_1example.pdf}
   \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices  by SnapKV on Mistral-7B-Instruct}
   \label{fig:app_index_reuse_heatmap_mistral_snapkv}   
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_mistral_spankv_1example.pdf}
   \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices  by ChunkKV on Mistral-7B-Instruct}
   \label{fig:app_index_reuse_heatmap_mistral_chunkkv}   
\end{figure*}

\begin{figure*}[!ht]
   \centering
      \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_qwen_h2o_1example.pdf}
      \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices  by H2O on Qwen2-7B-Instruct}
      \label{fig:app_index_reuse_heatmap_qwen_h2o}
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_qwen_snapkv_1example.pdf}
   \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices  by SnapKV on Qwen2-7B-Instruct}
   \label{fig:app_index_reuse_heatmap_qwen_snapkv}   
\end{figure*}

\begin{figure*}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/layer_similarity_heatmap_qwen_spankv_1example.pdf}
   \caption{ Layer-wise similarity heatmaps of the preserved KV cache indices  by ChunkKV on Qwen2-7B-Instruct}
   \label{fig:app_index_reuse_heatmap_qwen_chunkkv}   
\end{figure*}
% \clearpage


\subsection{LongBench}
\label{appendix:longbench}

The Table \ref{table:longbench} shows the average performance of KV cache compression methods in the LongBench English subtask categories. The \method{} achieves the best performance on the overall average, and the Multi-Document QA category, which supports that chunk method is more effective for semantic preservation.
\begin{table*}[!ht]
   \caption{Comprehensive performance comparison of KV cache compression methods across LongBench English subtasks. Results are shown for various models and tasks, highlighting the effectiveness of different compression techniques.}
   \resizebox{\textwidth}{!}{
   \centering
   % \setlength{\tabcolsep}{8pt}  % Reduce column separation
   \begin{tabular}{l|c@{\hspace{1.5pt}}c@{\hspace{3pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{2pt}}c@{\hspace{7pt}}c@{\hspace{7pt}}c|c}
   \specialrule{1pt}{0pt}{2pt}
   \multirow{5}{*}{Method}  & \multicolumn{3}{c}{Single-Document QA} & \multicolumn{3}{c}{Multi-Document QA}& \multicolumn{3}{c}{Summarization}& \multicolumn{3}{c}{Few-shot Learning}& \multicolumn{2}{c}{Synthetic} & \multicolumn{2}{c}{Code} & \multirow{6}{*}{\textbf{Avg. $\uparrow$} } \\
   \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}\cmidrule(lr){11-13}\cmidrule(lr){14-15}\cmidrule(lr){16-17}
   & \rotatebox[origin=c]{30}{NrtvQA} & \rotatebox[origin=c]{30}{Qasper} & \rotatebox[origin=c]{30}{MF-en} & \rotatebox[origin=c]{30}{HotpotQA} & \rotatebox[origin=c]{30}{2WikiMQA} & \rotatebox[origin=c]{30}{Musique} & \rotatebox[origin=c]{30}{GovReport} & \rotatebox[origin=c]{30}{QMSum} & \rotatebox[origin=c]{30}{MultiNews} & \rotatebox[origin=c]{30}{TREC} & \rotatebox[origin=c]{30}{TriviaQA} & \rotatebox[origin=c]{30}{SAMSum} & \rotatebox[origin=c]{30}{PCount} & \rotatebox[origin=c]{30}{PRe} & \rotatebox[origin=c]{30}{Lcc} & \rotatebox[origin=c]{30}{RB-P} & \\
   \cmidrule(lr){1-17}
   Avg len &18,409&3,619&4,559&9,151&4,887&11,214&8,734&10,614&2,113&5,177&8,209&6,258&11,141&9,289&1,235&4,206& \\
   
   \midrule
   \multicolumn{18}{c}{LlaMa-3-8B-Instruct, KV Size = Full} \\
   \arrayrulecolor{black}\midrule
   FullKV &25.70 & 29.75 & 41.12 & 45.55 & 35.87 & 22.35 & 25.63 & 23.03 & 26.21 & 73.00 & 90.56 & 41.88 & 4.67 & 69.25 & 58.05 & 50.77 & 41.46 \\
   
   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{LlaMa-3-8B-Instruct, KV Size Compression Ratio = $10\%$} \\
   \arrayrulecolor{black}\midrule
   StreamingLLM &20.62 &13.09 &22.10 &36.31 &28.01 &15.61 &21.47 &21.05 &19.39 &62.00 &84.18 &40.27 &4.62 &69.10 &58.84 &55.26 & 35.74\\
   H2O &24.80 &  17.32 &31.80 &40.84 &33.28 &18.90 &22.29 &22.29 &21.82 &40.00 &90.51 &40.55 &5.79 &\textbf{69.50} &58.04 &55.26 & 37.06 \\
   SnapKV & 25.08 & 22.02 & \textbf{37.95} & 43.36 & 35.08 & 20.29 & 22.94 & 22.64 & 21.37 & 71.00 & 90.47 & 40.15 & 5.66 & 69.25 & 58.69 & 56.50 & 40.15 \\
   PyramidKV &\textbf{25.58} &20.77 &35.85 &\textbf{43.80} &33.03 &\textbf{21.45} &\textbf{23.68} &22.26 &\textbf{21.85} &71.50 &90.47 &\textbf{41.66} &5.84 &69.25 &58.52 &55.91 & 40.08 \\
   \rowcolor{red!20}\textbf{\method{}} & 
   24.89 & 
   \textbf{22.96} & 
   37.64 & 
   43.27 & 
   \textbf{36.45} & 
   20.65 & 
   22.80 & 
   \textbf{22.97} & 
   20.82 & 
   \textbf{71.50} & 
   \textbf{90.52} & 
   40.83 & 
   \textbf{5.93} & 
   69.00 & 
   \textbf{60.49} & 
   \textbf{57.48} & 
   \textbf{40.51} \\
   \arrayrulecolor{black}\midrule


   \multicolumn{18}{c}{LlaMa-3-8B-Instruct, KV Size Compression Ratio = $20\%$} \\
   \arrayrulecolor{black}\midrule
   StreamingLLM & 23.35 &18.97 &32.94 &42.39 &29.37 &18.76 &\textbf{25.78} &21.92 &\textbf{25.16} &71.00 &88.85 &40.82 &5.04 &69.00 &56.46 &51.12 & 38.80\\
   H2O & 25.60  &21.88 &35.36 &42.06 &32.68 &19.72 &23.54 &22.77 &22.72 &45.50 &\textbf{90.57} &\textbf{41.67} &\textbf{5.51} &69.25 &54.97 & 50.95 & 37.79 \\
   SnapKV & 25.50 &25.95 &38.43 &44.12 &35.38 &20.49 &24.85 &23.36 &23.51 &\textbf{72.50} &90.52 &40.91 &5.23 &69.25 &56.74 &51.75 & 40.53 \\
   PyramidKV &25.36 &26.88 &37.99 &44.21 &\textbf{35.65} &\textbf{21.43} &25.52 &\textbf{23.43} &23.47 &72.00 &90.56 &41.45 &5.26 &\textbf{69.50} &56.55 &50.93 & 40.63 \\
   \rowcolor{red!20}\textbf{\method{}} & \textbf{26.13} & \textbf{28.43} & \textbf{38.59} & \textbf{44.46} & 34.13 & 21.06 & 24.72 & 23.11 & 22.91 & 71.50 & 90.56 & 41.51 & 5.09 & 69.00 & \textbf{58.17} & \textbf{52.51} & \textbf{40.74}  \\

   \arrayrulecolor{black}\midrule

   \multicolumn{18}{c}{LlaMa-3-8B-Instruct, KV Size Compression Ratio = $30\%$} \\
   \arrayrulecolor{black}\midrule
   StreamingLLM & 24.49 & 22.53 & 35.30 & \textbf{44.33} & 32.81 & 19.00 & \textbf{27.12} & 22.19 & \textbf{25.93} & 72.50 & 89.84 & 41.75 & \textbf{5.41} & 69.00 & 60.40 & 55.13 & 40.48\\
   H2O & 25.87 & 23.03 & 37.06 &43.71 &33.68 &20.93 &24.56 &23.14 &23.58 &50.50 & \textbf{90.77} &41.96 &4.91 &69.25 &59.38 &55.39 & 39.23 \\
   SnapKV & 25.15 & 28.75 & \textbf{39.28} & 43.57 & 36.16 & 21.58 & 25.56 & \textbf{23.19} & 24.30 & \textbf{73.00} & 90.52 & 41.70 & 4.96 & 69.25 & 60.27 & 55.74 & 41.43 \\
   PyramidKV &25.42 &27.91 &38.81 &44.15 &\textbf{36.28} & \textbf{21.72} &26.50 &23.10 &24.28 &72.00 &90.56 &41.87 &4.67 &\textbf{69.50} &60.09 &55.19 & 41.37 \\
   \rowcolor{red!20}\textbf{\method{}} & \textbf{25.88} & \textbf{29.58} & 38.99 & 43.94 & 34.16 & 21.70 & 26.50 & 23.15 & 23.95 & 72.00 & 90.56 & \textbf{42.47} & 5.34 & 69.25 & \textbf{61.68} & \textbf{56.35} & \textbf{41.59}  \\

   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{Mistral-7B-Instruct-v0.3, KV Size = Full} \\
   \arrayrulecolor{black}\midrule
   FullKV &29.07&41.58&52.88&49.37&39.01&28.58&34.93&25.68&27.74&76.00&88.59&47.59&6.00&98.50&61.41&62.39 & 48.08 \\
   
   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{Mistral-7B-Instruct-v0.3, KV Size Compression Ratio = $10\%$} \\
   \arrayrulecolor{black}\midrule
   StreamingLLM & 25.15 & 25.47 & 30.08 & 44.39 & 32.49 & 19.40 & 24.11 & 20.85 & 19.55 & 65.00 & 88.21 & 44.83 & 4.50 & 79.50 & 59.48 & 58.82 & 40.11 \\
   H2O & 29.35 & 33.39 & 50.39 & 49.58 & 36.76 & 27.42 & 25.16 & 24.75 & 22.12 & 42.00 & 89.00 & \textbf{47.04} & 5.50 & 98.50 & 57.58 & 59.24 & 43.61 \\
   SnapKV & 28.54 & \textbf{36.88} & 53.42 & 50.15 & 38.17 & 27.99 & 26.67 & \textbf{25.21} & \textbf{22.33} & 72.00 & 89.36 & 45.44 & \textbf{5.50} & \textbf{99.00} & 59.79 & \textbf{61.63} & 46.38 \\
   PyramidKV & 29.40 & 35.39 & 52.96 & 49.93 & 38.67 & 28.63 & \textbf{27.59} & 24.99 & 22.77 & 74.00 & \textbf{90.02} & 46.07 & 4.00 & 98.50 & 58.54 & 60.88 & 46.39 \\
   \rowcolor{red!20}\textbf{\method{}} & \textbf{29.75} & 36.82 & \textbf{53.99} & \textbf{50.33} & \textbf{38.72} & \textbf{29.01} & 27.03 & 24.76 & 21.42 & \textbf{76.00} & 88.73 & 46.49 & 5.00 & 98.00 & \textbf{59.98} & 61.47 & \textbf{46.71} \\
   
   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{Qwen2-7B-Instruct, KV Size = Full} \\
   \arrayrulecolor{black}\midrule
   FullKV & 25.11 & 42.64 & 44.29 & 14.25 & 13.22 & 9.08 & 36.38 & 23.43 & 26.53 & 77.00 & 89.99 & 44.88 & 6.75 & 75.92 & 60.17 & 61.84 & 40.71\\
   
   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{Qwen2-7B-Instruct, KV Size Compression Ratio = $10\%$} \\
   \arrayrulecolor{black}\midrule
   StreamingLLM & 25.15 &45.42 &41.46 &13.66 &11.95 &8.72 &32.79 &21.49 &\textbf{26.24} & 77.50 &89.15 &44.54 &7.50 &50.50 &60.03 &60.91 &38.56 \\
   H2O & 26.17 & 44.33 & 42.54 & 12.81 & 12.46 & 9.15 & \textbf{33.24} & 22.69 & 25.94 & 76.50 & 89.44 & 44.32 & 8.00 & 76.00 & \textbf{61.28} & \textbf{62.39} & 40.45 \\
   SnapKV & 26.84 & \textbf{45.96} & \textbf{45.79} & 14.27 & \textbf{13.35} & 9.91 & 32.62 & 22.70 & 25.83 & 77.00 & 89.19 & 44.71 & 7.50 & 71.50 & 60.35 & 61.37 & 40.55 \\
   PyramidKV & \textbf{27.51} & 44.45 & 43.59 & 13.35 & 13.13 & 9.12 & 32.28 & 22.60 & 25.45 & 77.00 & \textbf{89.44} & 44.53 & 7.00 & 73.50 & 60.91 & 61.24 & 40.31 \\
   \rowcolor{red!20}\textbf{\method{}} & 26.48 & 44.19 & 45.04 & \textbf{15.94} & 12.60 & \textbf{10.52} & 32.38 & \textbf{22.87} & 25.91 & \textbf{77.50} & 89.22 & \textbf{44.78} & \textbf{8.50} & \textbf{76.50} & 60.64 & 61.32 & \textbf{40.88} \\
   
   \arrayrulecolor{black}\bottomrule
   \end{tabular}
   }
   
   \label{table:longbench}
   \vspace{-3mm}
   \end{table*}
   

Table~\ref{table:longbench_70B} shows additional experiments with LLaMA-3-70B-Instruct, comparing our \method{} method with baselines with KV cache size = 256 on the LongBench dataset:
   \begin{table*}[!ht]
      \caption{Comprehensive performance comparison of \method{} across LongBench English subtasks with 70B model.}
      \resizebox{\textwidth}{!}{
      \centering
      % \setlength{\tabcolsep}{8pt}  % Reduce column separation
      \begin{tabular}{l|c@{\hspace{1.5pt}}c@{\hspace{3pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{2pt}}c@{\hspace{7pt}}c@{\hspace{7pt}}c|c}
      \specialrule{1pt}{0pt}{2pt}
      \multirow{5}{*}{Method}  & \multicolumn{3}{c}{Single-Document QA} & \multicolumn{3}{c}{Multi-Document QA}& \multicolumn{3}{c}{Summarization}& \multicolumn{3}{c}{Few-shot Learning}& \multicolumn{2}{c}{Synthetic} & \multicolumn{2}{c}{Code} & \multirow{6}{*}{\textbf{Avg. $\uparrow$} } \\
      \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}\cmidrule(lr){11-13}\cmidrule(lr){14-15}\cmidrule(lr){16-17}
      & \rotatebox[origin=c]{30}{NrtvQA} & \rotatebox[origin=c]{30}{Qasper} & \rotatebox[origin=c]{30}{MF-en} & \rotatebox[origin=c]{30}{HotpotQA} & \rotatebox[origin=c]{30}{2WikiMQA} & \rotatebox[origin=c]{30}{Musique} & \rotatebox[origin=c]{30}{GovReport} & \rotatebox[origin=c]{30}{QMSum} & \rotatebox[origin=c]{30}{MultiNews} & \rotatebox[origin=c]{30}{TREC} & \rotatebox[origin=c]{30}{TriviaQA} & \rotatebox[origin=c]{30}{SAMSum} & \rotatebox[origin=c]{30}{PCount} & \rotatebox[origin=c]{30}{PRe} & \rotatebox[origin=c]{30}{Lcc} & \rotatebox[origin=c]{30}{RB-P} & \\
      \cmidrule(lr){1-17}
      Avg len &18,409&3,619&4,559&9,151&4,887&11,214&8,734&10,614&2,113&5,177&8,209&6,258&11,141&9,289&1,235&4,206& \\
      
      \midrule
      \multicolumn{18}{c}{LlaMa-3-70B-Instruct, KV Size = Full} \\
      \arrayrulecolor{black}\midrule
      FullKV &27.56 & 46.00 & 49.75 & 50.86 & 56.16 & 29.77 & 31.66 & 21.80 & 27.42 & 75.50 & 92.58 & 46.08 & 12.00 & 73.50 & 45.02 & 68.74 & 47.15 \\
      
      \arrayrulecolor{black}\midrule
      \multicolumn{18}{c}{LlaMa-3-70B-Instruct, KV Cache Size = 256} \\
      \arrayrulecolor{black}\midrule
      \textbf{H2O} & 
      26.42 & 
      37.52 & 
      36.23 & 
      47.82 & 
      49.34 & 
      24.30 & 
      26.96 & 
      20.67 & 
      \textbf{26.50} & 
      72.00 & 
      92.16 & 
      43.56 & 
      12.00 & 
      73.00 & 
      47.21 & 
      63.13 & 
      43.67  \\
      
      \textbf{SnapKV} & 
      26.79 & 
      42.23 & 
      48.78 & 
      \textbf{50.89} & 
      55.46 & 
      28.14 & 
      23.85 & 
      22.20 & 
      24.62 & 
      73.00 & 
      93.05 & 
      44.53 & 
      12.00 & 
      72.50 & 
      \textbf{54.24} & 
      65.60 & 
      46.11 \\ 
   

      \rowcolor{red!20}\textbf{ChunkKV} & 
      \textbf{27.37} & 
      \textbf{45.95} & 
      \textbf{49.23} & 
      50.51 & 
      \textbf{55.49} & 
      \textbf{30.28} & 
      \textbf{30.47} & 
      \textbf{22.63} & 
      26.30 & 
      \textbf{75.50} & 
      92.58 & 
      \textbf{46.51} & 
      \textbf{12.00} & 
      \textbf{73.50} & 
      49.51 & 
      \textbf{68.05} & 
      \textbf{47.24}  \\
   
  
      \arrayrulecolor{black}\bottomrule
      \end{tabular}
      }
      
      \label{table:longbench_70B}
      \vspace{-3mm}
      \end{table*}
  

\subsection{Needle-In-A-Haystack}
\label{appendix:niah}

Figure~\ref{fig:NIAH_llama} and  \ref{fig:NIAH_mistral} visualizes the performance of \method{} on the NIAH benchmark for LLaMA-3-8B-Instruct and Mistral-7B-Instruct with a KV cache size of 128 under 8k and 32k context length. The performance of \method{} is consistently better as the context length increases.
\begin{figure}[!ht]
   \centering
   \begin{subfigure}[b]{0.9\textwidth}
       \centering
       \includegraphics[width=\textwidth]{./figs/NIAH/llama-3-8B-Instruct_spankv_baseline_128.pdf}
       \caption{\method{}, accuracy 73.8\%}
       \label{fig:NIAH_llama_spankv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/llama-3-8B-Instruct_pyramidkv_baseline_128.pdf}
      \caption{PyramidKV, accuracy 65.1\%}
      \label{fig:NIAH_llama_pyramidkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/llama-3-8B-Instruct_snapkv_baseline_128.pdf}
      \caption{SnapKV, accuracy 58.9\%}
      \label{fig:NIAH_llama_snapkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/llama-3-8B-Instruct_h2o_baseline_128.pdf}
      \caption{H2O, accuracy 47.9\%}
      \label{fig:NIAH_llama_h2o}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/llama-3-8B-Instruct_streamingllm_baseline_128.pdf}
      \caption{StreamingLLM, accuracy 23.7\%}
      \label{fig:NIAH_llama_streamingllm}
   \end{subfigure}
   \caption{NIAH benchmark for LLaMA-3-8B-Instruct with KV cache size=128 under 8k context length}
   \label{fig:NIAH_llama}
\end{figure}

\begin{figure}[!ht]
   \centering
   \begin{subfigure}[b]{0.9\textwidth}
       \centering
       \includegraphics[width=\textwidth]{./figs/NIAH/Mistral-7B-Instruct-v0.2_spankv_0920_baseline_128.pdf}
       \caption{\method{}, accuracy 99.8\%}
       \label{fig:NIAH_mistral_spankv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/Mistral-7B-Instruct-v0.2_pyramidkv_0920_baseline_128.pdf}
      \caption{PyramidKV, accuracy 99.3\%}
      \label{fig:NIAH_mistral_pyramidkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/Mistral-7B-Instruct-v0.2_snapkv_0920_baseline_128.pdf}
      \caption{SnapKV, accuracy 91.6\%}
      \label{fig:NIAH_mistral_snapkv}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/Mistral-7B-Instruct-v0.2_h2o_0920_baseline_128.pdf}
      \caption{H2O, accuracy 88.2\%}
      \label{fig:NIAH_mistral_h2o}
   \end{subfigure}
   \begin{subfigure}[b]{0.9\textwidth}
      \centering
      \includegraphics[width=\textwidth]{./figs/NIAH/Mistral-7B-Instruct-v0.2_streamingllm_0920_baseline_128.pdf}
      \caption{StreamingLLM, accuracy 44.3\%}
      \label{fig:NIAH_mistral_streamingllm}
   \end{subfigure}
   \caption{NIAH benchmark for Mistral-7B-Instruct with KV cache size=128 under 32k context length}
   \label{fig:NIAH_mistral}
\end{figure}
\clearpage

% Table \ref{tab:NIAH} shows the performance of \method{} in the NIAH data set with different KV cache sizes on LLaMA-3-8B-Instruct.
% \begin{table}[!ht]
%    \centering
%    \caption{NIAH Performance Comparison with Different KV Cache Sizes}
%    \begin{tabular}{l|cccc}
%    \toprule
%    \textbf{Method} & \textbf{Size = 96} & \textbf{Size = 128} & \textbf{Size = 256} & \textbf{Size = 512} \\
%    \midrule
%    StreamingLLM      & 21.5 & 23.7 & 28.0 & 32.0 \\
%    H2O      & 41.0 & 47.9 & 61.7 & 68.6 \\
%    SnapKV      & 56.2 & 58.9 & 68.8 & 71.2 \\
%    PyramidKV      & 63.2 & 65.1 & 69.5 & 72.6 \\
%    \rowcolor{red!20} \textbf{ChunkKV}      & \textbf{70.3} & \textbf{73.8} & \textbf{74.1} & \textbf{74.5} \\
%    \rowcolor{gray!20} FullKV & 74.6 & 74.6 & 74.6 & 74.6 \\
%    \bottomrule
%    \end{tabular}
%    \label{tab:NIAH}
% \end{table}


\subsection{Chunk Size}
\label{appendix:chunk_size}

% Table \ref{tab:ablation_size_compression}  and \ref{tab:ablation_size_niah} show the performance of \method{} with different comperession ratios and different chunk sizes on the LongBench and NIAH. We conducted extensive experiments across different compression ratios and KV cache sizes to shows the effectiveness of \method{} and the chunk size is robust.

% \begin{table}[!ht]
%    \centering
%    \caption{LongBench Performance with Different Chunk Sizes and Compression Ratios for LLaMA-3-8B-Instruct}
%    % \resizebox{0.9\textwidth}{!}{%
%    \begin{tabular}{c|ccccccc}
%    \toprule
%    Compression & \multicolumn{7}{c}{Chunk Size} \\
%    \cmidrule{2-8}
%    Rate & 1 & 3 & 5 & 10 & 15 & 20 & 30 \\
%    \midrule
%    10\% & 37.32 & 40.49 & 40.47 & \textbf{40.51} & 40.21 & 40.05 & 39.57 \\
%    20\% & 38.80 & 40.66 & 40.57 & \textbf{40.74} & 40.53 & 40.46 & 40.04 \\
%    30\% & 39.23 & 41.02 & 41.29 & \textbf{41.59} & 41.38 & 41.33 & 41.02 \\
%    \bottomrule
%    \end{tabular}%
%    % }
%    \label{tab:ablation_size_compression}
% \end{table}

% \begin{table}[!ht]
%    \centering
%    \caption{NIAH Performance with Different Chunk Sizes and KV Cache Sizes for LLaMA-3-8B-Instruct}
%    % \resizebox{0.9\textwidth}{!}{%
%    \begin{tabular}{c|ccccccc}
%    \toprule
%    KV Cache & \multicolumn{7}{c}{Chunk Size} \\
%    \cmidrule{2-8}
%    Size & 1 & 3 & 5 & 10 & 15 & 20 & 30 \\
%    \midrule
%    96 & 41.0 & 63.2 & 65.2 & \textbf{70.3} & 67.2 & 65.3 & 53.1 \\
%    128 & 47.9 & 65.6 & 69.1 & \textbf{73.8} & 72.3 & 72.0 & 71.2 \\
%    256 & 61.7 & 70.3 & 71.2 & \textbf{74.1} & 73.2 & 72.3 & 71.1 \\
%    512 & 68.6 & 72.6 & 72.5 & \textbf{74.5} & 74.3 & 74.0 & 72.6 \\
%    \bottomrule
%    \end{tabular}%
%    % }
%    \label{tab:ablation_size_niah}
% \end{table}

Table \ref{tab:lablation_size_compression_appendix}  and \ref{tab:ablation_size_niah} show the performance of \method{} with different comperession ratios and different chunk sizes on the LongBench and NIAH. We conducted extensive experiments across different compression ratios and KV cache sizes to shows the effectiveness of \method{} and the chunk size is robust.


\begin{table}[!h]
   \centering
      % \vspace{-0.2cm}
   \caption{LongBench Performance with Different Chunk Sizes and Compression Ratios for LLaMA-3-8B-Instruct}
   \resizebox{0.88\columnwidth}{!}{
   \begin{tabular}{c|ccccccc}
   \toprule
   Compression & \multicolumn{7}{c}{Chunk Size} \\
   \cmidrule{2-8}
   Rate & 1 & 3 & 5 & 10 & 15 & 20 & 30 \\
   \midrule
   10\% & 37.32 & 40.49 & 40.47 & \textbf{40.51} & 40.21 & 40.05 & 39.57 \\
   20\% & 38.80 & 40.66 & 40.57 & \textbf{40.74} & 40.53 & 40.46 & 40.04 \\
   30\% & 39.23 & 41.02 & 41.29 & \textbf{41.59} & 41.38 & 41.33 & 41.02 \\
   \bottomrule
   \end{tabular}%
   }
  \vspace{-0.3cm}
   \label{tab:lablation_size_compression_appendix}
\end{table}


\begin{table}[!h]
   \centering
   \caption{NIAH Performance with Different Chunk Sizes and KV Cache Sizes for LLaMA-3-8B-Instruct}
   \resizebox{0.8\columnwidth}{!}{
   \begin{tabular}{c|ccccccc}
   \toprule
   KV Cache & \multicolumn{7}{c}{Chunk Size} \\
   \cmidrule{2-8}
   Size & 1 & 3 & 5 & 10 & 15 & 20 & 30 \\
   \midrule
   96 & 41.0 & 63.2 & 65.2 & \textbf{70.3} & 67.2 & 65.3 & 53.1 \\
   128 & 47.9 & 65.6 & 69.1 & \textbf{73.8} & 72.3 & 72.0 & 71.2 \\
   256 & 61.7 & 70.3 & 71.2 & \textbf{74.1} & 73.2 & 72.3 & 71.1 \\
   512 & 68.6 & 72.6 & 72.5 & \textbf{74.5} & 74.3 & 74.0 & 72.6 \\
   \bottomrule
   \end{tabular}%
   }
   \vspace{-0.3cm}
   \label{tab:ablation_size_niah}
\end{table}

\begin{figure}[!ht]
   \centering
   \includegraphics[width=0.8\textwidth]{./figs/LongBench_Performance_vs_Chunk_Size.pdf}
   \caption{LongBench Performance Comparison with different chunk size under 10\% compression rate.}
   \label{fig:chunk_size_performance}
\end{figure}

Figure~\ref{fig:chunk_size_performance} shows the performance of the \method{} with different chunk size on the LongBench and NIAH benchmarks. The three colorful curves represent three LLMs with different chunk sizes, and the colorful dashed line is the corresponding FullKV performance. From Figure~\ref{fig:chunk_size_performance}, we can observe that the LongBench performance of \method{} is not significantly affected by the chunk size, with performance variations less than 1\%. The three curves are closely aligned, indicating that chunk sizes in the range of $\{10,20\}$ exhibit better performance. 


Table \ref{tab:ablation_size_gsm8k} and Figure~\ref{fig:chunk_size_gsm8k} shows that the \method{} with different chunk sizes on GSM8K displays the same curve pattern as LongBench. The CoT prompt length for GSM8K is only 1K tokens, so the optimal chunk size range is smaller.

\begin{figure}[!ht]
   \centering
   \includegraphics[width=1\textwidth]{./figs/GSM8K_Performance_vs_Chunk_Size_2plots.pdf}
   \caption{ GSM8K Performance Comparison  with different chunk size}
   \label{fig:chunk_size_gsm8k}
\end{figure}

\begin{table}[!ht]
   \centering
   \caption{GSM8K Performance Comparison with different chunk sizes}
   % \resizebox{0.8\textwidth}{!}{%
   \begin{tabular}{c|cccc|c}
   \toprule
   \multirow{3}{*}{Model} & \multicolumn{4}{c|}{Chunk Size}  & \multirow{3}{*}{Full KV} \\
   \cmidrule{2-5}
          & 3     & 5     & 10    & 20     & \\
   \midrule
   LLaMA-3-8B-Instruct & \textbf{74.6} & 74.5 & 73.9 & 63.2 & 76.8 \\
   Qwen2-7B-Instruct   & \textbf{73.5} & 71.2 & 71.8 & 71.7  & 71.1 \\
   \bottomrule
   \end{tabular}%
   % }
   \label{tab:ablation_size_gsm8k}
\end{table}
\subsection{Multi-Lingual}
\label{appendix:multilingual}
Table~\ref{table:longbench_zh} is the Chinese support model Qwen2-7B-Instruct evaluated on the LongBench Chinese subtask, where \method{} achieves better performance than other compression methods and the full KV cache performance. Both the English and Chinese results indicate that ChunkKV is a promising approach for maintaining crucial information in the KV cache.
% \clearpage


\begin{table}[!ht]
   \caption{Performance comparison of Chinese subtask on LongBench for Qwen2-7B-Instruct.}
   \centering
   \resizebox{1\textwidth}{!}{
      \begin{tabular}{l|ccccc|c}
         \arrayrulecolor{black}\toprule

         \multirow{3}{*}{Method}  & Single-Document QA & Multi-Document QA & Summarization & Few-shot Learning & Synthetic  &  \multirow{4}{*}{\textbf{Avg. $\uparrow$}}   \\
 
         \cmidrule(lr){2-2}\cmidrule(lr){3-3}\cmidrule(lr){4-4}\cmidrule(lr){5-5}\cmidrule(lr){6-6}
           & MF-zh & DuReader & VCSum & LSHT & PR-zh \\
         \cmidrule(lr){1-6}
           Avg len &6,701&15,768&15,380&22,337&6,745& \\
         \arrayrulecolor{black}\midrule
         \multicolumn{7}{c}{Qwen2-7B-Instruct, KV Size = Full} \\
         \arrayrulecolor{black}\midrule
         FullKV & 39.17 & 23.63 & 16.21 & 43.50 & 70.50 & 38.60 \\
         
         \arrayrulecolor{black}\midrule
         \multicolumn{7}{c}{Qwen2-7B-Instruct, KV Size Compression Ratio = $10\%$} \\
         \arrayrulecolor{black}\midrule
         StreamingLLM & 38.05 & \textbf{23.24} & 15.92 & 40.50 & 44.50 & 32.44 \\
         H2O & 37.99 &19.58 & 16.16 & 41.67 & 67.35 & 36.55 \\
         SnapKV & 44.25 & 20.27 & 16.24 & \textbf{44.50} & 68.10 & 38.67 \\
         PyramidKV & 36.57 & 20.56 & 16.15 & 43.50 & 66.50 & 36.55 \\
         \rowcolor{red!20}\textbf{\method{}} & \textbf{45.92} & 20.15 & \textbf{16.37} & 43.75 & \textbf{71.10} & \textbf{39.45} \\
         \arrayrulecolor{black}\bottomrule
      \end{tabular}
   }
   
\label{table:longbench_zh}
\vspace{-3mm}
\end{table}


\subsection{KV Cache Quantization}
\label{appendix:kv_cache_quantization}

For comprehensively evaluate the effectiveness of \method{}, we conducted experiments comparing \method{} with quantization methods KIVI~\citep{liu2024kivi}. While both approaches aim to optimize LLM inference, they operate on fundamentally different principles: quantization reduces KV matrix precision, whereas our eviction method reduces KV matrix size.

From an implementation perspective, quantization methods require the full KV cache during prefilling to produce quantized representations, which are then used during decoding. In contrast, \method{} employs token removal prior to prefilling, enabling operation with a compressed cache throughout the entire inference process. This distinction creates different efficiency profiles, and each method offers unique advantages.

The following tables present our detailed analysis comparing \method{} with KIVI, across both performance and efficiency metrics on the LLaMA-3-8B-Instruct model. It should be noted that due to KIVI's dependency on older Python versions, there may be some discrepancies between the efficiency results reported here and those in Table~\ref{tab:efficiency} of our paper.

\begin{table}[!ht]
   \caption{\centering Comprehensive performance comparison of \method{} and quantization methods across LongBench English subtasks.}
   \resizebox{\textwidth}{!}{
   \centering
   % \setlength{\tabcolsep}{8pt}  % Reduce column separation
   \begin{tabular}{l|c@{\hspace{1.5pt}}c@{\hspace{3pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{4pt}}c@{\hspace{0pt}}c@{\hspace{0pt}}c@{\hspace{2pt}}c@{\hspace{7pt}}c@{\hspace{7pt}}c|c}
   \specialrule{1pt}{0pt}{2pt}
   \multirow{5}{*}{Method}  & \multicolumn{3}{c}{Single-Document QA} & \multicolumn{3}{c}{Multi-Document QA}& \multicolumn{3}{c}{Summarization}& \multicolumn{3}{c}{Few-shot Learning}& \multicolumn{2}{c}{Synthetic} & \multicolumn{2}{c}{Code} & \multirow{6}{*}{\textbf{Avg. $\uparrow$} } \\
   \cmidrule(lr){2-4}\cmidrule(lr){5-7}\cmidrule(lr){8-10}\cmidrule(lr){11-13}\cmidrule(lr){14-15}\cmidrule(lr){16-17}
   & \rotatebox[origin=c]{30}{NrtvQA} & \rotatebox[origin=c]{30}{Qasper} & \rotatebox[origin=c]{30}{MF-en} & \rotatebox[origin=c]{30}{HotpotQA} & \rotatebox[origin=c]{30}{2WikiMQA} & \rotatebox[origin=c]{30}{Musique} & \rotatebox[origin=c]{30}{GovReport} & \rotatebox[origin=c]{30}{QMSum} & \rotatebox[origin=c]{30}{MultiNews} & \rotatebox[origin=c]{30}{TREC} & \rotatebox[origin=c]{30}{TriviaQA} & \rotatebox[origin=c]{30}{SAMSum} & \rotatebox[origin=c]{30}{PCount} & \rotatebox[origin=c]{30}{PRe} & \rotatebox[origin=c]{30}{Lcc} & \rotatebox[origin=c]{30}{RB-P} & \\
   \cmidrule(lr){1-17}
   Avg len &18,409&3,619&4,559&9,151&4,887&11,214&8,734&10,614&2,113&5,177&8,209&6,258&11,141&9,289&1,235&4,206& \\
   
   \midrule
   \multicolumn{18}{c}{LlaMa-3-8B-Instruct, KV Size = Full} \\
   \arrayrulecolor{black}\midrule
   FullKV &25.70 & 29.75 & 41.12 & 45.55 & 35.87 & 22.35 & 25.63 & 23.03 & 26.21 & 73.00 & 90.56 & 41.88 & 4.67 & 69.25 & 58.05 & 50.77 & 41.46 \\
   
   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{ChunkKV with different compression ratios} \\
   \arrayrulecolor{black}\midrule
   \textbf{ChunkKV-10\%} & 
   24.89 & 
   22.96 & 
   37.64 & 
   43.27 & 
   36.45 & 
   20.65 & 
   22.80 & 
   22.97 & 
   20.82 & 
   71.50 & 
   90.52 & 
   40.83 & 
   5.93 & 
   69.00 & 
   60.49 & 
   57.48 & 
   40.51 \\ 

   \textbf{ChunkKV-20\%} & 
   26.13 & 
   28.43 & 
   38.59 & 
   44.46 & 
   34.13 & 
   21.06 & 
   24.72 & 
   23.11 & 
   22.91 & 
   71.50 & 
   90.56 & 
   41.51 & 
   5.09 & 69.00 & 58.17 & 52.51 & 40.74  \\

   \rowcolor{red!20}\textbf{ChunkKV-30\%} & 
   25.88 & 
   29.58 & 
   38.99 & 
   43.94 & 
   34.16 & 
   21.70 & 
   26.50 & 23.15 & 23.95 & 72.00 & 90.56 & 42.47 & 5.34 & 69.25 & 61.68 & 56.35 & \textbf{41.59}  \\

   \arrayrulecolor{black}\midrule
   \multicolumn{18}{c}{KIVI with different quantization nbits} \\
   \arrayrulecolor{black}\midrule

   \textbf{KIVI-2bits} & 
   25.45 & 
   29.45 & 
   40.85 & 
   45.25 & 
   35.55 & 
   22.05 & 
   25.48 & 
   22.95 & 
   26.06 & 
   70.25 & 
   89.00 & 
   41.73 & 
   4.25 & 
   64.00 & 
   59.72 & 
   46.70 & 
   40.54  \\

   \textbf{KIVI-4bits} & 
   25.57 & 
   29.60 & 
   41.00 & 
   45.40 & 
   35.70 & 
   22.20 & 
   28.78 & 
   23.00 & 
   26.49 & 
   72.50 & 
   89.84 & 
   42.16 & 
   4.35 & 
   64.52 & 
   59.91 & 
   46.84 & 
   41.11  \\

   \textbf{KIVI-8bits} & 
   25.70 & 
   29.73 & 
   41.11 & 
   45.52 & 
   35.81 & 
   22.34 & 
   25.62 & 
   23.01 & 
   26.19 & 
   72.97 & 
   90.54 & 
   41.83 & 
   4.62 & 
   69.22 & 
   58.02 & 
   50.74 & 
   41.43  \\


   \arrayrulecolor{black}\bottomrule
   \end{tabular}
   }
   
   \label{table:longbench_kv_cache_quantization}
   \vspace{-3mm}
   \end{table}
   
       
% \begin{table}[!ht]
%    \centering
%    \small
%    \caption{Efficiency Results for ChunkKV and KIVI on LlaMa-3-8B-Instruct}
%    \resizebox{\textwidth}{!}{

%    \begin{tabular}{lcccccccc}
%    \toprule
%    \textbf{Configuration} & \textbf{Prompt} & \textbf{Output} & \textbf{Compression} & \textbf{Prefilling} & \textbf{Cache} & \textbf{TTFT} & \textbf{TPOT} & \textbf{Total Gen.} \\
%    & \textbf{Length} & \textbf{Length} & \textbf{Ratio / nbits} & \textbf{Time(s) $\downarrow$} & \textbf{Size(GB) $\downarrow$} & \textbf{(s) $\downarrow$} & \textbf{(ms) $\downarrow$} & \textbf{Time(s) $\downarrow$} \\
%    \midrule
%    FullKV & 8192 & 4096 & - & 1.5621 & 1.0000 & 1.6013 & 45.9421 & 184.2934 \\
%    KIVI & 8192 & 4096 & 2bits & 1.4024 & 0.1563 & 1.4325 & 54.9561 & 226.5234 \\
%    KIVI & 8192 & 4096 & 4bits & 1.3916 & 0.2813 & 1.4146 & 52.0510 & 214.5634 \\
%    \rowcolor{red!20} ChunkKV & 8192 & 4096 & 10\% & 1.3653 & 0.1000 & 1.3914 & 39.8702 & 164.6600 \\
%    \bottomrule
%    \end{tabular}
%    }
%    \label{table:longbench_kv_cache_quantization_efficiency}
%    \end{table}

Table~\ref{table:longbench_kv_cache_quantization} demonstrates that ChunkKV with 30\% compression ratio achieves comparable performance (41.59) to KIVI with 8-bit quantization (41.43) across LongBench subtasks. Notably, ChunkKV shows particularly strong results in code-related tasks, while KIVI maintains better performance in single-document QA scenarios.

% The efficiency results in Table~\ref{table:longbench_kv_cache_quantization_efficiency} reveal ChunkKV's significant advantages in latency-critical metrics. While both methods reduce cache size substantially (ChunkKV to 10\% and KIVI-2bits to 15.63\%), ChunkKV delivers superior Time to First Token (TTFT) and Token Processing Time (TPOT) metrics. Most remarkably, ChunkKV achieves a 164.66s total generation time compared to KIVI's 226.52s at 2-bit quantization, representing a 27.3\% improvement in overall inference speed.

These results highlight the complementary strengths of eviction and quantization approaches. ChunkKV's semantic chunk-based method delivers particularly strong efficiency benefits while maintaining competitive performance with state-of-the-art quantization techniques.


\subsection{Comparison with Orthogonal and Training-Based Methods}
\label{appendix:comparison_advanced}

To better position ChunkKV in the broader landscape of inference optimization, we conducted additional experiments comparing our training-free eviction method against Palu~\citep{chang2024palu}, a recent training-based compression method. The experiments were conducted on the LongBench benchmark with the LLaMA-3-8B-Instruct model.

\begin{table}[h!]
\centering
\caption{Performance comparison of ChunkKV with  Palu (training-based) on the LongBench benchmark (LLaMA-3-8B-Instruct). ChunkKV demonstrates a superior performance-efficiency trade-off compared to training-based methods on this diverse benchmark and remains competitive with orthogonal quantization techniques.}
\label{tab:palu_comparison}
\resizebox{\textwidth}{!}{%
\begin{tabular}{lccccccc}
\toprule
\textbf{Method} & \textbf{Single-Doc QA} & \textbf{Multi-Doc QA} & \textbf{Summarization} & \textbf{Few-shot} & \textbf{Synthetic} & \textbf{Code} & \textbf{Average Score} $\uparrow$ \\
\midrule
FullKV (FP16) & 32.19 & 34.59 & 24.96 & 68.48 & 36.96 & 54.41 & 41.46 \\
\midrule
\multicolumn{8}{l}{\textit{Eviction Methods (Ours)}} \\
ChunkKV (10\% Ratio) & 28.50 & 33.46 & 22.20 & 67.62 & 37.47 & 58.98 & 40.51 \\
ChunkKV (30\% Ratio) & 31.48 & 33.27 & 24.53 & 68.34 & 37.30 & 59.02 & 41.59 \\
\midrule
\multicolumn{8}{l}{\textit{Training-Based Methods}} \\
Palu (50\% Ratio) & 8.77 & 7.43 & 20.71 & 61.62 & 8.59 & 18.43 & 22.48 \\
Palu (70\% Ratio) & 10.06 & 8.31 & 23.71 & 68.64 & 35.00 & 38.97 & 28.84 \\
\bottomrule
\end{tabular}%
}
\end{table}

The results in Table~\ref{tab:palu_comparison} highlight key differences. Palu, as a training-based method, experiences significant performance degradation on the diverse LongBench benchmark, which represents an Out-Of-Distribution (OOD) challenge relative to its training data. In contrast, our training-free method, ChunkKV, demonstrates robust performance.

\subsection{Efficiency results}
\label{appendix:efficiency_results}
Table~\ref{table:longbench_kv_cache_efficiency_appendix} shows the efficiency results for different KV cache strategies with varying output lengths and the metrics are Time to First Token (TTFT) and Token Processing Time (TPOT).
\begin{table}[!ht]
   \centering
   \small
   \caption{Efficiency Results for Different KV Cache Strategies with Varying Output Lengths}
   \begin{tabular}{lccccc}
   \toprule
   \multirow{2}{*}{\textbf{Method}} & \multirow{2}{*}{\textbf{Input}} & \multirow{2}{*}{\textbf{Output}} & \multirow{2}{*}{\textbf{TTFT(s) $\downarrow$}} & \multirow{2}{*}{\textbf{TPOT(ms/token) $\downarrow$}} \\
    & & & & \\
   \midrule
   \multicolumn{5}{c}{\textbf{Output Length = 1024}} \\
   \midrule
   FullKV & 4096 & 1024 & 1.17 & 42.58 \\
   StreamingLLM & 4096 & 1024 & 1.02 (12.8\%) & 35.94 (15.6\%) \\
   SnapKV & 4096 & 1024 & 1.04 (11.1\%) & 36.24 (14.9\%) \\
   ChunkKV & 4096 & 1024 & 1.03 (12.0\%) & 36.64 (13.9\%) \\
   ChunkKV\_reuse & 4096 & 1024 & 1.00 (14.5\%) & 36.47 (14.3\%) \\
   \midrule
   \multicolumn{5}{c}{\textbf{Output Length = 4096}} \\
   \midrule
   FullKV & 4096 & 4096 & 1.17 & 42.85 \\
   StreamingLLM & 4096 & 4096 & 1.09 (6.8\%) & 40.01 (6.6\%) \\
   SnapKV & 4096 & 4096 & 1.10 (6.0\%) & 40.24 (6.1\%) \\
   ChunkKV & 4096 & 4096 & 1.10 (6.0\%) & 40.17 (6.3\%) \\
   ChunkKV\_reuse & 4096 & 4096 & 1.08 (7.7\%) & 40.16 (6.3\%) \\
   \bottomrule
   \end{tabular}
   \label{table:longbench_kv_cache_efficiency_appendix}
   \end{table}


\begin{table}[h!]
\centering
\caption{Latency and throughput comparison on LLaMA-3-8B-Instruct, including long-context scenarios up to 16k tokens. Percentages in parentheses indicate improvements over the FullKV baseline. ChunkKV with layer-wise reuse (ChunkKV\_reuse) consistently delivers the best efficiency gains, especially in long-context settings.}
\label{tab:efficiency_long_context}
\resizebox{\textwidth}{!}{%
\begin{tabular}{l|cc|cc}
\toprule
\multirow{2}{*}{\textbf{Method}} & \multicolumn{2}{c|}{\textbf{Sequence Length}} & \multicolumn{2}{c}{\textbf{Performance Metrics}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5}
& \textbf{Input} & \textbf{Output} & \textbf{Latency(s) $\downarrow$} & \textbf{Throughput(T/S) $\uparrow$} \\
\midrule
\multicolumn{5}{l}{\textit{4k Context Length}} \\
FullKV & 4096 & 1024 & 43.60 & 105.92 \\
SnapKV & 4096 & 1024 & 37.92 (13.0\%) & 120.42 (13.7\%) \\
ChunkKV & 4096 & 1024 & 37.52 (13.9\%) & 118.85 (12.2\%) \\
\rowcolor{red!20}ChunkKV\_reuse & 4096 & 1024 & \textbf{37.35 (14.3\%)} & \textbf{124.09 (17.2\%)} \\
\midrule
\multicolumn{5}{l}{\textit{16k Context Length}} \\
FullKV & 16384 & 1024 & 49.60 & 323.24 \\
SnapKV & 16384 & 1024 & 39.20 (21.0\%) & 381.11 (17.9\%) \\
ChunkKV & 16384 & 1024 & 38.82 (21.7\%) & 381.61 (18.1\%) \\
\rowcolor{red!20}ChunkKV\_reuse & 16384 & 1024 & \textbf{36.96 (25.5\%)} & \textbf{389.21 (20.4\%)} \\
\bottomrule
\end{tabular}%
}
\end{table}

\section{Theoretical Understanding}
\label{appendix:theory}
In this section, we provide the theoretical interpretation from the perspective from the In-context learning (ICL) to further understand how ChunkKV outperforms token-level KV cache compression.

\textbf{Pretraining Data Distribution.} Given a set of concepts $\Theta$ and a concept $\theta \in \Theta$, we define the pretraining data is sampled from $p(\obs_1, \dots, \obs_T) = \int_{\theta \in \Theta} p(\obs_1, \dots, \obs_T \vert \theta)p(\theta)d \theta$~\citep{xie2022an}. Each token $\obs$ is sampled from a vocabulary $\obsset$. For simplicity, we write $\obs_{1:t} = \obs_1\dots\obs_{t}$.

\textbf{Language Modeling.}
Current LLMs~\citep{brown2020language,touvron2023llama2,xie2022an} usually utilize the next word prediction as the language modelling, which predicts the next token $\obs_{t}$ given the previous tokens $\obs_1\dots\obs_{t-1}$ for all $t=1,\dots, T$. Formally, a language modelling can be writen as the distribution $f(\obs_{t} \vert \obs_{1:t-1})$. And it is pretrained on a huge corpus sampled from the pretraining distribution $p(\obs_1,\dots,\obs_{t+1})$~\citep{xie2022an}. Considering the large scale of the model size and dataset size, it can be assumed that the $f(\obs_1\dots\obs_{t+1})$ has been aligned with the $p(\obs_1\dots\obs_{t+1})$~\citep{xie2022an}.

% \obs,\dots,\obs_{t+1}

\textbf{Prompt Distribution.}
Following~\citep{xie2022an}, a prompt is composed of an input token sequence $x$ followed by an output token $y$. Then, the $i$-th training example \footnote{Here, training example in prompts means happens during the prompt learning.} that can appear in any place in the whole prompt $o_{1:T}$ is defined as $O_i$ consisting of an input $x_i=O_i \left[1:k-1 \right]$ (the first $k-1$ tokens) followed by the output $y_i = O_i \left[k\right]$ at the end, where the length $k$ is fixed for simplicity. 

The $i$-th training example is independently generated as follows: 1) Generate a start hidden state $\hiddensegstart_i$ from a \emph{prompt start distribution} $\ppromptstart$;
2) Given $\hiddensegstart_i$, generate the example sequence $\obsseg_i=[\X_i,\y_i]$ from $p(\obsseg_i \vert \hiddensegstart_i, \theta^\star)$.
The test input $\Xtest = \X_{n+1}$ is sampled similarly. Then, the prompt consists of a sequence of training examples ($\promptseq$) followed by the example $\Xtest$:

\begin{align}
    [\promptseq, \Xtest] = [\X_1, \y_1, \X_2, \y_2, \dots, \X_n, \y_n, \Xtest] \sim \pprompt.
\end{align}


\textbf{In-context learning setups and Assumptions.} We follow other settings and assumptions in ~\citep{xie2022an}. With the greedy decoding~\citep{fubreak}, sampling the next token from the language modeling $f(o_t \vert o_{1:t-1})$ becomes the predictor as $y =\argmax_{o_t} f(o_t|o_{1:t-1})$. 

Thus, for $[\promptseq, \Xtest]$, the in-context learning predictor can be written as $f_{n}(\Xtest) := \argmax_y p(y|\promptseq, \Xtest)$, which outputs the most likely prediction over the \emph{pretraining distribution} conditioned on the \emph{prompt distribution}. Its expected 0-1 error with $n$ examples is $\Lzeroone(f_n) = \E_{\Xtest,\ytest \sim \pprompt}[\indicator[f_{n}(\Xtest) \neq \ytest]]$.


We define $p_\theta^i(o):=p(O[i]=o|O[1:i-1],\theta)$ of the $i$-th token with previous tokens and the analogous distribution $p^{i}_{prompt}:=p_{prompt}(O[i]=o|O[1:i-1])$ under the prompt distribution. Following~\citep{xie2022an}, there is a distinguishability condition formalizes when in-context learning occurs giving the concept $\theta$. 

The distinguishability condition is dependent on a KL divergence between the previous two distributions and the error terms $\epsilon_\theta$ resulting from the distribution mismatch between the prompt and the pertaining distributions for each example. Letting $p_{\theta}^i(o)$ and $p^{i}_{prompt}$ correspond to the concept $\theta$ and $\theta^\star$.

% the KL divergence are defined as follows,

\begin{condition}[distinguishability~\citep{xie2022an}] The $\theta^\star$ is distinguishable if for all $\theta\in\Omega$, $\theta \neq\theta^\star$,
\begin{align}\label{eq:distinguish1}
    \sum_{i=1}^k \text{KL}_{i}(\theta^\star||\theta)>\epsilon_\theta,
\end{align}
where the $\text{KL}_{i}(\theta^\star||\theta) :=\mathbb{E}_{O[1:i-1]\sim p_{prompt}}[\text{KL}(p^{i}_{prompt}||p_\theta^{i})].$
\label{cond:distinguish}
\end{condition}


\textbf{Noises from KV Cache Compression.} Naturally, because of the sparsified KV cache, some history tokens in $o_{1:t-1}$ at different layers lost its attention score calculation with respect to the next word prediction $o_t$. We can regard this as the noise added onto the $o_{1:t-1}$. Thus, distincting $\theta^\star$ from $\theta$ requires larger KL divergence. Following~\citep{zhoucan}, we provide the following second condition about the distinguishability with the KV cache sparsity.

\begin{condition}[distinguishability under sparsified KV cache] With the noise introduced by the sparsified KV cache of the sparse ratio $r$, the distribution mismatch between the prompt and the pretraining distribution that is approximated by LLM is enlarged, resulting in a varied requirement with error term $\xi_\theta(r)$ for $\theta^*$ being distinguishable if for all $\theta\in\Theta$, $\theta\neq\theta^*$,
    \begin{align}\label{eq:noise_dinsting}
        \sum_{i=1}^k \text{KL}_{i}(\theta^*||\theta)>\epsilon_\theta+\xi_\theta(r),\quad {\rm where}\quad \xi_\theta(r)\propto r.
    \end{align}
    \label{cond:2}
\end{condition}


\begin{lemma}[noisy-relaxed bound in ~\citep{xie2022an,zhoucan}]\label{lemma:1}
let $\mathcal{B}$ denotes the set of $\theta$ which does not satisfy Condition~\ref{cond:distinguish}. We assume that $\text{KL}(p_{prompt}(y_\text{test}|x_\text{test}))||p(y_\text{test}|x_\text{test},\theta)$ is bounded for all $\theta$ and that $\theta^\star$ minimizes the multi-class logistic risk as,
\begin{align}\label{eq:lemma:1:LCE}
\begin{split}
L_\text{CE}(\theta)=-\mathbb{E}_{x_\text{test}\sim p_{prompt}}[p_{prompt}(y_\text{test}|x_\text{test})\cdot\log p(y_\text{test}|x_\text{test},\theta)].
\end{split}
\end{align}
If
\begin{align}\label{eq:prompt_tau_upperbound_concrete_with_epsilon}
\mathbb{E}_{x_\text{test}\sim p_{prompt}}[\text{KL}(p_{prompt}(y_\text{test}|x_\text{test})
|| p(y_\text{test}|x_\text{test},\theta))]\leq (\epsilon_{\theta} + \xi_\theta(r)),\quad \forall \quad \theta\in\mathcal{B},
\end{align}
then
\begin{align}\label{eq:L01-inf}
\lim_{n\rightarrow\infty} L_{0-1}(f_n) \leq \inf_{f} L_{0-1}(f) + g^{-1}\bigg(\sup_{\theta\in\mathcal{B}}(\epsilon_\theta)\bigg),
\end{align}
where $g(\nu) = \frac{1}{2}\big((1-\nu)\log(1-\nu)+(1+\nu)\log(1+\nu)\big)$ is the calibration function~\citep{Steinwart2007HowTC,pires2016multiclass} for the multiclass logistic loss for $\nu\in[0,1]$.
\end{lemma}


Following~\citep{Kleijn2012TheBT,xie2022an}, KL divergence is assumed to haver the 2nd-order Taylor expansion with the concept $\theta$. Then, we have the following theorem and proof.

\begin{theorem}
\label{thm:continuity}~\citep{xie2022an,zhoucan}
Let the set of $\theta$ which does not satisfy Equation~\ref{eq:distinguish1} in Condition~\ref{cond:distinguish} to be $\mathcal{B}$.
Assume that KL divergences have a 2nd-order Taylor expansion around $\theta^\star$:
\begin{align}
    \forall j>1,~~\text{KL}_{i}(\theta^\star||\theta) = \frac{1}{2}(\theta - \theta^\star)^\top \fisherinfj (\theta - \theta^\star) + O(\|\theta - \theta^\star\|^3)
\end{align}
where $\fisherinfj$ is the Fisher information matrix of the $j$-th token distribution with respect to $\theta^\star$.
Let $\conditionnum = \frac{\max_{j}\lambdamax(\fisherinfj)}{\min{j}\lambdamin(\fisherinfj)}$ where $\lambdamax,\lambdamin$ return the largest and smallest eigenvalues.
Then for $k \geq 2$ and as $n\rightarrow \infty$, the 0-1 risk of the in-context learning predictor $f_n$ is bounded as
\begin{align}\label{eq:final-theorem-bound}
    \lim_{n\rightarrow \infty} \Lzeroone(f_n) \leq \inf_{f} \Lzeroone(f) + g^\minv\left(O\left(\frac{\conditionnum\sup_{\theta \in \badset}(\epsilon_{\theta} + \xi_\theta(r))}{k-1}\right)\right)
\end{align}
\end{theorem}

\begin{proof}~\citep{xie2022an}
By the Taylor expansion on $\theta$, we have for any $\theta$ in $\mathcal{B}$ that
\begin{align}
    \sum_{j=2}^k \text{KL}_{i}(\theta^\star||\theta)
    &\geq \frac{1}{2}\sum_{j=2}^k (\theta - \theta^\star)^\top \fisherinfj (\theta - \theta^\star) + (k-1)O(\|\theta - \theta^\star\|^3)\\
                      &\geq \frac{1}{2}(k-1)\lambdamin(\fisherinfj) \|\theta - \theta^\star\|^2\\
    \implies \|\theta - \theta^\star\|^2 &\leq \frac{(\epsilon_{\theta} + \xi_\theta(r))}{\frac{1}{2}(k-1)(\min_j~\lambdamin(\fisherinfj))}.
\end{align}
% We use this to bound the last KL term by plugging it in below:
% Using the above term to bound the last KL term ($k$-th token), we have:
We can bound the last KL term ($k$-th token) with the above term:
\begin{align}
    \text{KL}_{k}(\theta^\star||\theta) &= \frac{1}{2}(\theta - \theta^\star)^\top  \fisherinfk(\theta - \theta^\star) + O(\|\theta - \theta^\star\|^3)\\
         &\leq \frac{1}{2}(\max_j~\lambdamax(\fisherinfj))\|\theta - \theta^\star\|^2 + O(\|\theta - \theta^\star\|^2)\\
                 &\leq \frac{(\epsilon_{\theta} + \xi_\theta(r))(\max_j~\lambdamax(\fisherinfj) + O(1))}{(k-1)\min_j~\lambdamin(\fisherinfj)}.
\end{align}
Rearranging above equation, and with $\text{KL}_{k}(\theta^\star||\theta) = \E_{\Xtest \sim \pprompt} [KL(\pprompt(\ytest \vert \Xtest) \| p(\ytest \vert \Xtest, \theta)) ]$, there is
\begin{align}\label{eq:prompt_tau_upperbound_concrete}
\E_{\Xtest \sim \pprompt} [KL(\pprompt(\ytest \vert \Xtest) \| p(\ytest \vert \Xtest, \theta)) ] \leq \frac{(\epsilon_{\theta} + \xi_\theta(r))(\max_j~\lambdamax(\fisherinfj) + O(1))}{(k-1)\min_j~\lambdamin(\fisherinfj)}
\end{align}
Combining Equation~\ref{eq:prompt_tau_upperbound_concrete} with Equation~\ref{eq:prompt_tau_upperbound_concrete_with_epsilon} into Lemma~\ref{lemma:1} completes the proof.
\end{proof}

\textbf{KV Cache Sparsification.} Revisiting the Equation~\ref{eq:noise_dinsting} in Condition~\ref{cond:2}, the $\xi_\theta(r)$ is enlarged with the sparsity ratio $r$. The higher compression ratio $r$ (means that more KV cache are discarded), the more noise $\xi_\theta(r)$. Then it leads to the higher bound of the $\lim_{n\rightarrow\infty} L_{0-1}(f_n)$ in Equation~\ref{eq:lemma:1:LCE} in Lemma~\ref{lemma:1}. Next, we discuss how KV cache compression influences the Equation~\ref{eq:noise_dinsting}.

\textbf{Token-level Importance Measure.}
The token-level KV cache methods usually calculate the importance of different tokens. Then, the KV cache with indexes that have higher importance will be preserved. Such indexes are normaly choosed as the attention score. Considering that the case in Figure~\ref{fig:main}, where each token in the $i$-th training~\footnote{Here, training means prompt learning~\citep{xie2022an}.} example sequence ($\obsseg_i=[\X_i,\y_i]$) might be compressed, and tokens are sparsified concretely without out dependency to other tokens. However, in the generation process of the $i$-th training example, $\obsseg_i=[\X_i,\y_i]$ is sampled from $p(\obsseg_i \vert \hiddensegstart_i, \theta^\star)$ and $p_\theta^j(o):=p(O[j]=o|O[1:j-1],\theta)$ of the $j$-th token with previous tokens and the analogous distribution $p^{j}_{prompt}:=p_{prompt}(O[j]=o|O[1:j-1])$. And the KL divergence is defined as $\text{KL}_{j}(\theta^\star||\theta) :=\mathbb{E}_{O[1:j-1]\sim p_{prompt}}[\text{KL}(p^{j}_{prompt}||p_\theta^{j})]$, which means that in a training example $\obsseg_i=[\X_i,\y_i] = \obsseg_i[1:k]$, each token $\obsseg_i[j]$ has strong dependency with $\obsseg_i[1:j-1]$, noises on previous any $j$-th token will influence the distinguishability on the following tokens, i.e. requiring larger $\left\{ \text{KL}_{u}(\theta^\star||\theta) \right\}_{u>j}$.

On the other hand, the token-level sparsity enlarges the requirement on the distinguishability uniformly for each example $\obsseg_i$ (the case in Figure~\ref{fig:main}), which uniformly loses the bound of $\Lzeroone(f_n)$ as in Equation~\ref{eq:final-theorem-bound}.


\textbf{Chunk-level Importance Measure.} Different from token-level importance measure, \method{} regards tokens in a continuous window as a basic unit that should be left or discarded as a whole. The preserved window can be regarded as saving the complete $\obsseg_i=[\X_i,\y_i]$ without noise. Thus, \method{} reduces the noise $\xi_\theta(r)$ for the preserved $\obsseg_i$, which lowers the bound of $\Lzeroone(f_n)$. 

More intuitively, \method{} focus on reducing the noise on some complete training examples, but some other examples overall with low importance will be discarded. Then, the model identifies the $\Xtest$ from those clean and more related training examples $\obsseg_i$ and neglect those $\obsseg_i$ with less importance.

Note that here, we do not provide the rigorous proof on how KV cache sparsity enhances the requirement of the distinguishability and how different $\text{KL}_{j}(\theta^\star||\theta)$ on $\obsseg_i=[\X_i,\y_i]$ influences the bound $\Lzeroone(f_n)$. We left this as the future work to analyze how KV cache sparsity influences the in-context learning.


% complete $\obsseg_i=[\X_i,\y_i]$ helps 


% thus reducing the requirement of the distinguishability on the 


% Given $\hiddensegstart_i$, generate the example sequence $\obsseg_i=[\X_i,\y_i]$ from $p(\obsseg_i \vert \hiddensegstart_i, \theta^\star)$.


% $\text{KL}_{i}(\theta^\star||\theta) :=\mathbb{E}_{O[1:i-1]\sim p_{prompt}}[\text{KL}(p^{i}_{prompt}||p_\theta^{i})].$

% 


\section{Additional Related Work}
\label{appendix:related_work}

\textbf{Chunking Method.} 
The chunking methodology is widely used in the field of NLP due to its simplicity and effectiveness~\citep{sang1999representing}. In the era of LLMs, chunking is primarily applied in data pre-processing. For example, \citet{shicontext} suggest grouping related training data into chunks to achieve better convergence curves to pre-train LLMs. \citet{fei-etal-2024-extending} apply a topic-based chunking method to improve the semantic compression of prompts. Furthermore, chunking plays an important role in the Retrieval-Augmented Generation (RAG) field~\citep{yepes2024financialreportchunkingeffective, smith2024evaluating, anthropic_contextual_retrieval_2024}. It serves to divide documents into units of information with semantic content suitable for embedding-based retrieval and processing by LLMs.

\textbf{Layer-Wise Technique.}
The layer-wise technique is widely used in the training and inference of large language models (LLMs). LISA~\citep{pan2024lisa} is a layer-wise sampling method based on observations of the training dynamics of Low-Rank Adaptation (LoRA)\citep{hu2021lora,tang2025ghost,weifan2025jailbreaklora,lai2025mediatormemoryefficientllmmerging,tang2024fusionllmdecentralizedllmtraining} across layers. LAMB\citep{you2019lamb} is a layer-wise adaptive learning rate method that speeds up LLM training by stabilizing training convergence with large batch sizes. DoLa~\citep{chuang2023dola} employs layer-wise contrasting to reduce hallucinations during LLM inference.


\textbf{KV Cache Sharing.}
Recent work has explored various strategies for sharing KV caches across transformer layers. Layer-Condensed KV Cache (LCKV) \citep{wu2024layercondensedkvcacheefficient} computes KVs only for the top layer and pairs them with queries from all layers, while optionally retaining standard attention for a few top and bottom layers to mitigate performance degradation. Similarly, You Only Cache Once (YOCO) \citep{sun2024yoco} computes KVs exclusively for the top layer but pairs them with queries from only the top half of layers, employing efficient attention in the bottom layers to maintain a constant cache size. In contrast, Cross-Layer Attention (CLA) \citep{brandon2024reducing} divides layers into groups, pairing queries from all layers in each group with KVs from that group's bottom layer. MiniCache \citep{liu2024minicache} introduces a novel method that merges layer-wise KV caches while enabling recovery during compute-in-place operations, optimizing KV cache size. These methods illustrate various trade-offs between computation, memory usage, and model performance when sharing KV caches across transformer layers. 

\textbf{Long-Context Benchmarks.}
The landscape of long-context model benchmarks has evolved to encompass a wide range of tasks, with particular emphasis on retrieval and comprehension capabilities. Benchmarks for understanding have made significant strides, with $\infty$-Bench~\citep{zhang2024infty} pushing the boundaries by presenting challenges that involve more than 100,000 tokens. LongBench~\citep{bai2023longbench} has introduced bilingual evaluations, addressing tasks such as long-document question answering, summarization, and code completion. Complementing these efforts, ZeroSCROLLS~\citep{shaham2023zeroscrolls} and L-Eval~\citep{an2023eval} have broadened the scope to include a diverse array of practical natural language tasks, including query-driven summarization.

In parallel, retrieval benchmarks have largely relied on synthetic datasets, offering researchers precise control over variables such as the length of input tokens. This approach minimizes the impact of disparate parametric knowledge resulting from varied training methodologies. A significant body of recent work has concentrated on the development of synthetic tasks specifically for retrieval evaluation~\citep{needle, mohtashami2023landmark, longchat, liu2024lost, hsieh2024ruler}. In addition, researchers have explored the potential of extended contexts in facilitating various forms of reasoning~\citep{tay2020long}. \citet{liu2025can} explores the influence of the KV cache method beyond long-context benchmarks. LongGenBench~\citep{liu2024longgenbench} advances beyond traditional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer.

This dual focus on synthetic retrieval tasks and comprehensive understanding benchmarks reflects the field's commitment to rigorously assessing the capabilities of long-context models across diverse linguistic challenges.

\textbf{Prompting Compression.}
In the field of prompt compression, various designs effectively combine semantic information to compress natural language. \citep{wingate-etal-2022-prompt} utilize soft prompts to encode more information with fewer tokens. \citep{Chevalier2023AdaptingLM} present AutoCompressor, which uses soft prompts to compress the input sequence and extend the original length of the base model. Both \citep{zhou2023recurrentgpt} and \citep{wang2023recursively} recurrently apply LLMs to summarize input texts, maintaining long short-term memory for specific purposes such as story writing and dialogue generation. The LLMLingua series~\citep{jiang-etal-2023-llmlingua,jiang-etal-2024-longllmlingua,fei-etal-2024-extending} explores the potential of compressing LLM prompts in long-context, reasoning, and RAG scenarios. \citep{fei-etal-2024-extending} use pre-trained language models to chunk the long context and summarize semantic information, compressing the original context.

\textbf{KV Cache Quantization.}
Among quantization approaches for KV cache optimization, SmoothQuant~\citep{xiao2023smoothquant} stands out as a notable post-training technique. By implementing balanced transformations between activation and weight quantization difficulties, this method enables 8-bit KV cache compression with minimal performance degradation. However, research by ~\citep{zhao2024atom} has demonstrated that SmoothQuant encounters substantial accuracy deterioration when compressed beyond 8 bits. In parallel developments, the FlexGen system~\citep{flexgen} implements group-wise quantization at 4 bits for both key and value caches, while KIVI~\citep{liu2024kivi} introduces an innovative approach with asymmetric 2-bit per-channel quantization that requires no additional tuning phase. Additionally, AnTKV~\citep{li2025antkv} introduces a two-stage token-aware vector quantization framework, leveraging gradient-weighted centroid learning and anchor-token selection to achieve ultra-low-bit KV cache quantization with minimal accuracy loss.

\section{Statistics of Models}
\label{appendix:config_models}
Table~\ref{table:app_model_config} provides configuration parameters for LLMs that we evaluated in our experiments.
\begin{table}[ht]
   \resizebox{1\textwidth}{!}{
   \centering
   \begin{tabular}{c|c|c|c}
   \toprule

   \textbf{Model Name} & \textbf{LLaMA-3-8B-Instruct} & \textbf{Mistral-7B-Instruct-v0.2 \& 0.3} & \textbf{Qwen2-7B-Instruct} \\
   \midrule
   $L$ (Number of layers) & 32 & 32 & 28 \\
   \midrule
   $N$ (Number of attention heads) & 32 & 32 & 28\\
   \midrule
   $D$ (Dimension of each head) & 128  & 128 & 128 \\
   \bottomrule
   \end{tabular}
   }
   \caption{ \centering Models Configuration Parameters}
   \label{table:app_model_config}
\end{table}

\section{Statistics of Datasets}
\label{appendix:evaluation}
LongBench is a meticulously designed benchmark suite that evaluates the capabilities of language models in handling extended documents and complex information sequences. This benchmark was created for multi-task evaluation of long-context inputs and includes 17 datasets covering tasks such as single-document QA~\citep{kovcisky2018narrativeqa,dasigi2021dataset}, multi-document QA~\citep{yang2018hotpotqa,ho2020constructing,trivedi2022musique,he2017dureader}, summarization~\citep{huang2021efficient,zhong2021qmsum,fabbri2019multi,wu2023vcsum}, few-shot learning~\citep{li2002learning,gliwa2019samsum,joshi2017triviaqa}, synthetic tasks and code generation~\citep{guo2023longcoder,liu2023repobench}. The datasets feature an average input length ranging from $1K$ to $18K$ tokens, requiring substantial memory for KV cache management.

Table \ref{tab:dataset_statistic} shows the statistics of the datasets that we used in our experiments.
\begin{table}[!h]
   \centering
   % \small
   % \vskip 0.15in
   \begin{sc}
       % \resizebox{0.6\linewidth}{!}{
       \begin{tabular}{l|rr}
           \toprule
           Dataset  & \# Train & \# Test \\ \midrule
           
           GSM8K~\citep{gsm8k}    
           & 7,473     
           & 1,319   \\
           
           LongBench~\citep{bai2023longbench} 
           & - 
           & 4,750 \\
   
           NIAH*~\citep{needle} 
           & -
           & 800 \\

           JailBreakV~\cite{jailbreakv}
           & - 
           &28,000 \\
           
           \bottomrule
           \end{tabular}
       % }
   \end{sc}
   % \end{small}
   \caption{Dataset Statistics. \textsc{\# Train} and \textsc{\# Test} represent the number of training and test samples, respectively. *: The size of the NIAH test set varies based on the context length and step size, typically around 800 samples per evaluation.}

   \label{tab:dataset_statistic}
   % \vspace{-0.3cm}
   \end{table}


\section{Prompt}
\label{appendix:prompt}
Table \ref{tab:demo_prompt} shows the prompt for the Figure \ref{fig:main}.
\begin{table}[ht]
   \centering
   \footnotesize
   \resizebox{1\linewidth}{!}{
   \begin{tabular}{p{15cm}}
   \specialrule{1pt}{0pt}{2pt}
      \specialrule{1pt}{0pt}{2pt}
    \multicolumn{1}{c}{The prompt for demonstration}\\
   \midrule
   $\dots \dots$ \\
   $\dots \dots$ \\
   The purple-crested turaco (Gallirex porphyreolophus) or, in South Africa, the purple-crested loerie, (Khurukhuru in the Luven\d{d}a (Ven\d{d}a) language) is a species of bird in the clade Turaco with an unresolved phylogenetic placement. Initial analyses placed the purple-crested turaco in the family Musophagidae, but studies have indicated that these birds do not belong to this family and have been placed in the clade of Turacos with an unresolved phylogeny. It is the National Bird of the Kingdom of Eswatini, and the crimson flight feathers of this and related turaco species are important in the ceremonial regalia of the Swazi royal family.
   This bird has a purple-coloured crest above a green head, a red ring around their eyes, and a black bill. The neck and chest are green and brown. The rest of the body is purple, with red flight feathers. Purple-crested turacos are often seen near water sources, where they can be observed drinking and bathing, which helps them maintain their vibrant plumage.
   Purple-crested turacos are considered to be large frugivores that are known to carry cycad seeds from various plant species long distances from feeding to nesting sites. After fruit consumption, they regurgitate the seeds intact where they can germinate nearby. G. porphyreolophus primarily consumes fruits whole like many other large frugivores which are suggested to be necessary for effective ecosystem functioning. Among similar turacos, the purple-crested turaco have faster minimum transit times when consuming smaller seed diets than larger seed diets, and G. porphyreolophus has been shown to have significantly faster pulp (seedless fruit masses) transit time than another closely related Turaco when fed only the pulp of larger-seeding fruits than smaller-seeding fruits.
   In addition to their frugivorous diet, these birds are occasionally seen foraging for other food items such as nuts and leaves, which provide essential nutrients. They are also known to coexist with various other animals, including those that might enjoy strawberries and other similar fruits. The purple-crested turaco's role in seed dispersal is crucial, and their interaction with different elements of their habitat, including water and diverse plant materials, highlights their importance in maintaining ecological balance.\\
   $\dots \dots$ \\
   $\dots \dots$ \\
   \\
   \specialrule{1pt}{0pt}{2pt}
      \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
   % \vskip -1 em
   \caption{The prompt for demonstration }
   \label{tab:demo_prompt}
   % \vskip -1 em
\end{table}


Here we provide the CoT prompt exemplars for GSM8K which is used in section \ref{sec:icl}.
\begin{table}[ht]
   \centering
   \footnotesize
   \resizebox{1\linewidth}{!}{
   \begin{tabular}{p{15cm}}
   \specialrule{1pt}{0pt}{2pt}
  \specialrule{1pt}{0pt}{2pt}

    \multicolumn{1}{c}{GSM8K experiemnt CoT Prompt Exemplars}\\
   \midrule
   Question: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done, there will be 21 trees. How many trees did the grove workers plant today?     

   There are 15 trees originally.
   Then there were 21 trees after some more were planted.
   So there must have been 21 - 15 = 6.
   The answer is 6.
   \\
   Question: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot? 

   There are originally 3 cars.
   2 more cars arrive.
   3 + 2 = 5.
   The answer is 5.
   \\
   Question: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total? 

   Originally, Leah had 32 chocolates.
   Her sister had 42.
   So in total they had 32 + 42 = 74.
   After eating 35, they had 74 - 35 = 39.
   The answer is 39.
   \\
   Question: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops did Jason give to Denny? 

   Jason started with 20 lollipops.
   Then he had 12 after giving some to Denny.
   So he gave Denny 20 - 12 = 8.
   The answer is 8.
   \\
   Question: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does he have now? 

   Shawn started with 5 toys.
   If he got 2 toys each from his mom and dad, then that is 4 more toys.
   5 + 4 = 9.
   The answer is 9.
   \\
   Question: There were nine computers in the server room. Five more computers were installed each day, from monday to thursday. How many computers are now in the server room? 

   There were originally 9 computers.
   For each of 4 days, 5 more computers were added.
   So 5 * 4 = 20 computers were added.
   9 + 20 is 29.
   The answer is 29.
   \\
   Question: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many golf balls did he have at the end of wednesday? 

   Michael started with 58 golf balls.
   After losing 23 on tuesday, he had 58 - 23 = 35.
   After losing 2 more, he had 35 - 2 = 33 golf balls.
   The answer is 33.
   \\
   Question: Olivia has \$23. She bought five bagels for \$3 each. How much money does she have left? 

   Olivia had 23 dollars.
   5 bagels for 3 dollars each will be 5 x 3 = 15 dollars.
   So she has 23 - 15 dollars left.
   23 - 15 is 8.
   The answer is 8.
   \\
   \specialrule{1pt}{0pt}{2pt}
  \specialrule{1pt}{0pt}{2pt}
   \end{tabular}
   }
   % \vskip -1 em
   \caption{GSM8K CoT Prompt Exemplars }
   \label{tab:CoT_prompt}
   % \vskip -1 em
   \end{table}

\section{Impact Statement}
\label{sec:impact_statement}
Our study does not involve human subjects, data collection from individuals, or experiments on protected groups. The models and datasets used in this work are publicly available and widely used in the research community. We have made efforts to ensure our experimental design and reporting of results are fair, unbiased, and do not misrepresent the capabilities or limitations of the methods presented.

In our work on KV cache compression for large language models, we acknowledge the potential broader impacts of improving efficiency in AI systems. While our method aims to reduce computational resources and potentially increase accessibility of these models, we recognize that more efficient language models could also lead to increased deployment and usage, which may have both positive and negative societal implications. We encourage further research and discussion on the responsible development and application of such technologies.

We declare no conflicts of interest that could inappropriately influence our work. All experiments were conducted using publicly available resources, and our code will be made available to ensure reproducibility. We have made every effort to cite relevant prior work appropriately and to accurately represent our contributions in the context of existing research.


\section{Limitations}
\label{sec:limitations}

While ChunkKV demonstrates strong performance and efficiency, we acknowledge several limitations that provide avenues for future research.

First, the core design of ChunkKV prioritizes the preservation of semantic coherence by treating chunks as indivisible units. This approach may be less suitable for tasks requiring absolute semantic fidelity, where every single token can be critical. For example, in domains such as legal or biomedical document analysis, discarding any part of the text, even if it appears less relevant based on attention scores, could lead to the loss of crucial information. Our method is therefore optimized for tasks where capturing the core gist is more important than verbatim retention of every detail.

Second, our experiments with a hybrid compression strategy (using ChunkKV in early layers and a token-level method in deeper layers) revealed a nuanced, task-dependent performance trade-off. While pure ChunkKV excelled in local information retrieval tasks (e.g., Single- and Multi-Document QA), the hybrid model showed surprising strength in tasks requiring global understanding (e.g., Summarization). This suggests that a one-size-fits-all compression strategy may not be optimal for all scenarios. The ideal approach might involve dynamically adapting the compression granularity based on the perceived task requirements.

Finally, our current implementation relies on fixed-size chunks for computational efficiency. While our ablations show this is a robust strategy, adaptively determining chunk boundaries based on linguistic cues (e.g., sentence endings) could potentially improve semantic integrity further, although this would introduce additional inference latency that must be carefully managed.

\section{Licenses}
\label{app_licenses}
For the evaluation dataset, all the datasets, including, GSM8K~\citep{gsm8k}, LongBench~\citep{bai2023longbench} are released under MIT license. NIAH~\citep{needle} is released under GPL-3.0 license.
   

\end{document}