
Prediction with a Short Memory

\( \def\R{\mathbb{R}} \def\model{q_\theta} \newcommand{\fd}[2]{D_f(#1 \; \| \; #2)} \newcommand{\Pof}[1]{P\left(#1\right)} \newcommand{\Eof}[2]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\p}[1]{\left(#1\right)} \newcommand{\b}[1]{\left[#1\right]} \newcommand{\br}[1]{\left\{#1\right\}} \newcommand{\KL}[2]{D_{\text{KL}}\p{#1\;||\;#2}} \newcommand{\CE}[2]{\text{H}\p{#1, #2}} \newcommand{\H}[1]{\text{H}\p{#1}} \def\given{\;|\;} \)

12/18/2025


Prediction with a Short Memory (Sharan et al., 2018) considers the simple problem of predicting the next observation $x_t$ given a sequence of past observations $x_{0}, x_1, \ldots, x_{t-1}$. For a sequence drawn from an unknown distribution, it might seem hard to match the oracle algorithm, which predicts the true distribution of $x_t$ conditioned on the full past. The paper makes the surprising observation that a predictor that only looks at the $\ell$ most recent observations can achieve low error. Specifically, if the mutual information between the future and the past is $I$ on average, the optimal predictor conditioned on the $\ell$ most recent observations has average KL error bounded by $I/\ell$ (equivalently, achieving error $\epsilon$ only requires looking at the last $I/\epsilon$ observations).


The paper does a thorough job of characterizing this result, with more specific results for HMMs, computational and statistical lower bounds, etc. This blog post shares a very simple proof given in Sections 3 and 4 of the paper, which I personally found the most illuminating and general part. We will walk through a subset of the analysis in these sections.

Preliminaries

The entropy of a random variable $X$ is $H(X) = \Eof{X}{-\log\Pof{X}}$. The mutual information between two random variables $X, Y$ is defined as $I(X;Y) = H(X) - H(X\given Y)$, or how much the entropy of $X$ decreases as a result of knowing $Y$ (i.e. $I(X;Y)$ is $0$ if the variables are independent and is $H(X)$ if $Y$ determines $X$).
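As a concrete check of these definitions, here is a small numerical sketch; the joint distribution is an arbitrary example of mine, not from the paper:

```python
import math

# An arbitrary example joint distribution P(X, Y) over two binary variables.
P = [[0.4, 0.1],
     [0.1, 0.4]]

def entropy(dist):
    """H(X) = E[-log P(X)], in nats."""
    return -sum(p * math.log(p) for p in dist if p > 0)

p_x = [sum(row) for row in P]                              # marginal P(X)
p_y = [sum(P[i][j] for i in range(2)) for j in range(2)]   # marginal P(Y)

# H(X | Y) = sum_y P(y) * H(X | Y = y)
h_x_given_y = sum(
    p_y[j] * entropy([P[i][j] / p_y[j] for i in range(2)])
    for j in range(2)
)

mi = entropy(p_x) - h_x_given_y   # I(X;Y) = H(X) - H(X|Y)
print(mi)   # strictly between 0 (independence) and H(X) (determinism)
```

Since the two variables here are correlated but do not determine each other, the mutual information lands strictly between $0$ and $H(X)$.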


We first observe that the mutual information is symmetric:

\begin{aligned} & I(X;Y) \\ &= \Eof{X,Y}{-\log\Pof{X}} - \Eof{X,Y}{-\log\Pof{X\given Y}} \\ &= \Eof{X,Y}{\log\frac{\Pof{X\given Y}}{\Pof{X}}} \\ &= \Eof{X,Y}{\log\frac{\Pof{Y\given X}}{\Pof{Y}}} \\ &= \Eof{X,Y}{-\log\Pof{Y}} - \Eof{X,Y}{-\log\Pof{Y\given X}} \\ &= I(Y;X) \end{aligned}

In the above derivation, note that simplifying $\Eof{X,Y}{\log\frac{\Pof{X\given Y}}{\Pof{X}}}$ yields

$$I(X ; Y) = \Eof{Y}{\KL{\Pof{X\given Y}}{\Pof{X}}}$$
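Both identities are easy to check numerically on a small joint table (again an arbitrary example of mine): the two expectations of KL divergences below compute $I(X;Y)$ and $I(Y;X)$ through different marginals, and agree.

```python
import math

# Arbitrary example joint distribution P(X, Y).
P = [[0.30, 0.20],
     [0.05, 0.45]]

p_x = [sum(row) for row in P]
p_y = [sum(P[i][j] for i in range(2)) for j in range(2)]

def kl(p, q):
    """D_KL(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# I(X;Y) = E_Y[ KL( P(X|Y) || P(X) ) ]
i_xy = sum(p_y[j] * kl([P[i][j] / p_y[j] for i in range(2)], p_x)
           for j in range(2))
# I(Y;X) = E_X[ KL( P(Y|X) || P(Y) ) ]
i_yx = sum(p_x[i] * kl([P[i][j] / p_x[i] for j in range(2)], p_y)
           for i in range(2))

print(i_xy, i_yx)   # equal, by symmetry
```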

We also note that this definition of mutual information generalizes by conditioning on a random variable $Z$, giving

$$I(X ; Y \given Z) = \Eof{Y,Z}{\KL{\Pof{X\given Y,Z}}{\Pof{X \given Z}}}$$

We define the mutual information between the future and the past of the observation sequence $\{x_t\}$ drawn from the model $\mathcal{M}$ by averaging over every time step at which we make a prediction. Writing $x_i^j$ for \(x_i, \ldots, x_j\), the mutual information is

$$I(\mathcal{M}) = \lim_{T\to\infty} \frac{1}{T} \sum_{t\in[0,T-1]} I(x^{\infty}_t ; x^{t-1}_{-\infty})$$
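As a sanity check on this definition (my own worked example, not from the paper): for a stationary order-1 Markov chain, conditioning on $x_{t-1}$ screens the future off from everything earlier, so $I(x^{\infty}_t ; x^{t-1}_{-\infty})$ collapses to $I(x_t ; x_{t-1})$ and $I(\mathcal{M})$ is just the one-step mutual information. For a symmetric two-state chain that flips state with probability $\epsilon$:

```python
import math

def binary_entropy(p):
    """H(p) of a Bernoulli(p) variable, in nats."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

eps = 0.1   # flip probability of the symmetric two-state chain

# The stationary distribution is uniform, so H(x_t) = log 2, and the
# transition randomness gives H(x_t | x_{t-1}) = H(eps).  Hence
# I(M) = I(x_t ; x_{t-1}) = log 2 - H(eps):
i_m = math.log(2) - binary_entropy(eps)
print(i_m)   # about 0.368 nats: a sticky chain carries real information
```

As $\epsilon \to 1/2$ the chain approaches i.i.d. coin flips and $I(\mathcal{M}) \to 0$, matching the intuition that an i.i.d. source needs no memory at all.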

It will be helpful later to observe that mutual information has a chain rule which can be derived from auto-regressively decomposing the future predictions.

$$\Pof{x_t^{\infty} \given x^{t-1}_{-\infty}} = \prod_{i\in[t,\infty]} \Pof{x_i \given x^{t-1}_{-\infty}, x^{i-1}_t} \\ \Longrightarrow \\ I(x_t^{\infty} ; x^{t-1}_{-\infty}) = \sum_{i\in[t,\infty]} I(x_i ; x^{t-1}_{-\infty} \given x^{i-1}_t)$$
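The chain rule is easy to verify by brute force on a short sequence. The sketch below (with a randomly generated joint distribution, an assumption for illustration) checks $I(x_1, x_2 ; x_0) = I(x_1 ; x_0) + I(x_2 ; x_0 \given x_1)$, computing each side through different marginalizations:

```python
import itertools
import math
import random

# Random joint distribution over three bits (x0, x1, x2).
random.seed(0)
weights = [random.random() for _ in range(8)]
total = sum(weights)
P = {bits: w / total
     for bits, w in zip(itertools.product((0, 1), repeat=3), weights)}

def marginal(P, idx):
    """Marginal distribution over the coordinates listed in idx."""
    out = {}
    for bits, p in P.items():
        key = tuple(bits[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def mi(P, a, b):
    """I(coords a ; coords b), computed from the full joint."""
    p_ab, p_a, p_b = marginal(P, a + b), marginal(P, a), marginal(P, b)
    return sum(p * math.log(p / (p_a[k[:len(a)]] * p_b[k[len(a):]]))
               for k, p in p_ab.items())

def cmi(P, a, b, c):
    """I(a ; b | c) via I(a ; b,c) - I(a ; c)."""
    return mi(P, a, b + c) - mi(P, a, c)

lhs = mi(P, (1, 2), (0,))                            # I(x1,x2 ; x0)
rhs = mi(P, (1,), (0,)) + cmi(P, (2,), (0,), (1,))   # I(x1;x0) + I(x2;x0|x1)
assert abs(lhs - rhs) < 1e-9
```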

Comparing short memory and the oracle

In this section, we will analyze the error of the optimal algorithm that predicts the observations indexed $[\tau, \tau + \ell - 1]$ given no prior history. Note that this is strictly more difficult than always having $\ell$ past observations. Since I train language models, this feels familiar to me as "concatenate all documents and chunk into independent length $\ell$ windows".


We will compare this short memory algorithm with the oracle algorithm that observes the full history. For time $t \in [\tau, \tau + \ell - 1]$, we will define the gap between the short and long memory algorithms as

$$\delta_{\tau}^{(t)} = \Eof{x_{-\infty}^{t-1}}{\KL{\Pof{x_t \given x^{t-1}_{-\infty}}}{\Pof{x_t \given x^{t-1}_{\tau}}}}$$

where our goal is to bound $\frac{1}{\ell}\sum_{t\in[\tau, \tau + \ell - 1]} \delta_{\tau}^{(t)}$. Note that under this definition, the error directly simplifies to

$$\delta_{\tau}^{(t)} = I(x_t ; x^{\tau-1}_{-\infty} \given x^{t-1}_{\tau})$$

This can be seen from the earlier definition of conditional mutual information with the future $X = x_t$, far past $Y = x_{-\infty}^{\tau-1}$, and recent past $Z = x^{t-1}_{\tau}$, and can be read as "how much additional information does the far past contain about the future". This quantifies one of the paper's key intuitions: observations that are difficult to predict from recent memory are exactly those about which the far past carries a lot of information.


Given this equivalence between error and mutual information, we can bound the average error over the window via the slightly harder task of predicting the entire future $x_\tau^{\infty}$, not just the window, from $x_{-\infty}^{\tau-1}$. We now put all of the above steps together.

\begin{aligned} & \frac{1}{\ell}\sum_{t\in[\tau,\tau+\ell-1]} \delta_{\tau}^{(t)} && \text{[average error of window]} \\ = & \frac{1}{\ell}\sum_{t\in[\tau,\tau+\ell-1]} I(x_t ; x^{\tau-1}_{-\infty} \given x^{t-1}_{\tau}) && \text{[error is mutual info]} \\ \leq & \frac{1}{\ell}\sum_{t\in[\tau,\infty]} I(x_t ; x^{\tau-1}_{-\infty} \given x^{t-1}_{\tau}) && \text{[harder to predict full future]} \\ = & \frac{1}{\ell} I(x_\tau^{\infty} ; x^{\tau-1}_{-\infty}) && \text{[chain rule]} \\ \end{aligned}

Woohoo, we have now related the average prediction error of the short memory predictor over a window to the full mutual information between the future and past at time $\tau$! To finish the proof, to predict $x_t$, we can average over predictors chunked at $\tau \in [t-\ell+1, t]$. This results in the average error being bounded by the full mutual information.

\begin{aligned} &\lim_{T\to\infty} \frac{1}{T}\sum_{t\in[0,T-1]} \frac{1}{\ell}\sum_{\tau \in [t-\ell+1,t]} \delta^{(t)}_{\tau} && \text{[average error of chunked predictors]} \\ = &\lim_{T\to\infty} \frac{1}{T}\sum_{\tau\in[0,T-1]} \frac{1}{\ell}\sum_{t \in [\tau,\tau + \ell - 1]} \delta^{(t)}_{\tau} && \text{[re-indexing under limit]} \\ \leq &\lim_{T\to\infty} \frac{1}{T}\sum_{\tau\in[0,T-1]} \frac{1}{\ell} I(x_\tau^{\infty} ; x^{\tau-1}_{-\infty}) && \text{[earlier bound]} \\ = & \frac{1}{\ell} I(\mathcal{M}) && \text{[mutual info of model]} \\ \end{aligned}

This completes the relationship between the difficulty of predicting with a short memory and the difficulty of predicting with the full history (i.e., the mutual information between future and past).
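To see the bound in action (my own worked example, not from the paper): consider a symmetric two-state Markov chain that flips state with probability $\epsilon$. A chunked predictor only pays KL error at the first position of each window, since every later position already sees its true predecessor; its average per-token error is therefore exactly $I(\mathcal{M})/\ell$, so the bound is tight for this chain.

```python
import math

# Symmetric two-state Markov chain with flip probability eps; the
# stationary distribution is uniform.  A chunked predictor restarts every
# l tokens: it predicts the first token of each window from the stationary
# distribution and every later token from the true transition (by the
# Markov property, there is no further error inside the window).
eps = 0.1

def binary_entropy(p):
    """H(p) of a Bernoulli(p) variable, in nats."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

# I(M) for this chain is the one-step mutual information I(x_t ; x_{t-1}).
i_m = math.log(2) - binary_entropy(eps)

def avg_window_error(l):
    # Error at the first position: E[ KL( P(x | prev) || uniform ) ],
    # which equals log 2 - H(eps); every other position has zero error.
    first_position_kl = math.log(2) - binary_entropy(eps)
    return first_position_kl / l

for l in (1, 2, 4, 8, 16):
    assert avg_window_error(l) <= i_m / l + 1e-12   # the bound, tight here
```

Doubling the window halves the error, exactly the $\mathcal{O}(1/\ell)$ rate the proof gives in general.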

Conclusion

The above proof uses rather simple identities to bound the error of prediction with a short memory by $\mathcal{O}\p{\frac{1}{\ell}}$. For me, this lines up with empirical observations that language-model loss decreases at roughly this reciprocal rate in context length, at both training and inference time. I really love the simplicity of this result, both in problem statement and proof. The full paper contains more results with concrete rates and lower bounds for HMMs, which may be more intuitive for others. Thank you for reading, and feel free to reach out with any questions or thoughts!