\documentclass{article}
\usepackage[nonatbib,final]{neurips_2025}
\usepackage[numbers]{natbib}

\input{math_command}
\input{graphics}
\usepackage{tabularx}
\newcolumntype{L}[1]{>{\raggedright\arraybackslash\setlength{\parskip}{0pt}\setlength{\topsep}{0pt}}p{#1}} % table
\usepackage{array}
\hypersetup{
    colorlinks,
    linkcolor={red!50!black},
    citecolor={red!50!black},
    urlcolor={red!50!black}
}

\newcommand{\kvc}{\text{KV}_c}
\newcommand{\kvp}{\text{KV}_{c,\text{evicted}}}
\newcommand{\ct}{\mathcal{T}}
\newcommand{\lm}{f_\text{LM}}

\usepackage{pifont} % for check and x marks
\newcommand{\cmark}{\textcolor{dg!80!black!90}{\ding{51}}} % check mark
\newcommand{\xmark}{\textcolor{dr}{\ding{55}}}   % x mark
\renewcommand{\arraystretch}{1.05} % Slightly looser row spacing (default is 1.0)


\title{KVzip: Query-Agnostic KV Cache Compression\\ with Context Reconstruction}

% Using \And between authors leaves it to LaTeX to determine where to break the
% lines. Using \AND forces a line break at that point. 
\author{
  \vspace{-1.7em}\\
  \textbf{Jang-Hyun Kim$^{1\,2}$, ~Jinuk Kim$^{1\,2}$, ~Sangwoo Kwon$^{1}$, ~Jae W. Lee$^{1}$,}\\
  \textbf{Sangdoo Yun$^{3}$, ~Hyun Oh Song\thanks{Corresponding author}~~$^{1\,2}$}\vspace{2pt}\\
  $^{1}$Seoul National University, $^{2}$Neural Processing Research Center, $^{3}$NAVER AI Lab\\
  \texttt{\small\{blue378, hyunoh\}@snu.ac.kr}\vspace{2pt}\\
  \url{https://github.com/snu-mllab/KVzip}
}


\begin{document}

\maketitle

\vspace{-1.2em}
\begin{abstract}
\vspace{-0.5em}
Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces \textit{KVzip}, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by $3$\minus$4\times$ and FlashAttention decoding latency by approximately $2\times$, with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1, Qwen2.5, and Gemma3, with context lengths of up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90\% cache budget ratio under multi-query scenarios.\looseness=-1
\end{abstract}


\input{section/1.intro}
\input{section/2.prelim}
\input{section/3.method}
\input{section/4.experiment}
\input{section/5.conclusion}

\bibliography{main}
\bibliographystyle{abbrvnat}
% \input{section/7.checklist}
\input{section/6.appendix}

\end{document}