\section{Preliminaries}

\subsection{Notation and Problem Formulation}
An autoregressive Transformer LLM $f_{LM}$ caches historical Key and Value states during inference. For an input of $N$ tokens, let $V=\{1,\dots,N\}$ denote the cached token positions. KV cache compression selects a retained subset $S \subset V$ under budget $|S|\le B$, aiming to preserve the behavior of the full cache:
\begin{equation}
f_{LM}(\cdot \mid S) \approx f_{LM}(\cdot \mid V)
\end{equation}
which we evaluate through downstream long-context performance.

\subsection{Submodular Set Functions}
For a set utility $F:2^V\rightarrow\mathbb{R}$, submodularity formalizes diminishing returns~\cite{krause2014submodular}: for $A\subseteq B\subseteq V$ and $v\in V\setminus B$,
\begin{equation}
F(A \cup \{v\}) - F(A) \ge F(B \cup \{v\}) - F(B)
\end{equation}
Monotonicity means $F(A)\le F(B)$ whenever $A\subseteq B$. Cardinality-constrained monotone submodular maximization is NP-hard, but sequential greedy selection achieves a $(1-1/e)$ approximation guarantee~\cite{nemhauser1978analysis}. We use this result only as theoretical motivation for redundancy-aware cache selection.