7/28/2025
Reinforcement learning allows one to train policies by scoring their generations with an arbitrary reward model. In the simplest version of RL, you generate from a model $\pi$ and also differentiate with respect to it. From a systems perspective, it's desirable to have a different policy generate the trajectories: inference might be lagging behind training, you may want to take multiple gradient steps per rollout, inference might use a different precision, etc. In this article, we will take a first-principles look at the design decisions for dealing with this problem of off-policy RL.
Massive credit to Konwoo for teaching me basically everything shared here.
Our goal is to maximize the expected reward of a policy $\pitheta$, which is given by
\begin{aligned} J(\theta) = \Eof{\tau \sim \pitheta} {R(\tau)} \end{aligned}
for rollouts $\tau$ scored by a reward function $R$. Though there are traditionally inputs $x$ associated with the rollouts, we will suppress them for simplicity. Akin to supervised machine learning, we would like to take the derivative of $J(\theta)$ with respect to $\theta$ and update $\theta$ in the direction of the gradient. However, this derivative is different since (1) the reward function is not differentiable and (2) the sampler is your policy, not a separate data generating process. REINFORCE constructs the policy gradient by utilizing the log-gradient trick as shown below.
\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta \Eof{\tau \sim \pitheta} {R(\tau)} \\ &= \nabla_\theta \int_{\tau} \pitheta(\tau) R(\tau) d\tau && \text{[definition of expectation]}\\ &= \int_{\tau} \nabla_\theta \pitheta(\tau) R(\tau) d\tau && \text{[swap integral and gradient]}\\ &= \int_{\tau} \pitheta(\tau) \nabla_\theta \log \pitheta(\tau) R(\tau) d\tau && \text{[log-gradient trick, $\nabla f(x) = f(x) \nabla \log f(x)$]}\\ &= \boxed{\Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau) {R(\tau)}}} && \text{[definition of expectation]} \end{aligned}
The boxed expression is the policy gradient which we want to use to update $\theta$.
To actually compute the policy gradient, it is standard to construct a surrogate reward function that PyTorch's autograd can differentiate through. We note there are two ways to do this. The first method applies the log-gradient trick in reverse and leverages $\frac{\nabla_\theta \pitheta(\tau)}{\pitheta(\tau)}$ with the objective
\begin{equation}\label{eq:on-policy-surrogate-1} J(\theta) = \Eof{\tau \sim \pitheta} {\frac{\pitheta(\tau)}{\texttt{detach}(\pitheta(\tau))} {R(\tau)}} \end{equation}
where the $\texttt{detach}$ operator signifies that we are not differentiating through the denominator. The second method directly exploits the fact that the policy gradient uses $\nabla_\theta \log \pitheta(\tau)$ by constructing the surrogate reward
\begin{equation}\label{eq:on-policy-surrogate-2} J(\theta) = \Eof{\tau \sim \pitheta} {\log \pitheta(\tau) {R(\tau)}} \end{equation}
Note that even though the first and second methods are not equivalent in value, their gradients are the same (both correspond to the policy gradient). Therefore, in our vanilla on-policy setting, both methods yield exactly the same update.
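Here's a tiny PyTorch check of this equivalence (my own toy, not from any reference implementation), treating trajectories as draws from a 5-way categorical policy: the two surrogates have different values but identical gradients.

```python
import torch

# Toy check: the ratio-with-detach surrogate and the log-prob surrogate
# give the same gradient when sampling on-policy.
torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)   # "theta": a policy over 5 trajectories
rewards = torch.randn(5)                      # arbitrary reward per trajectory

taus = torch.distributions.Categorical(logits=logits.detach()).sample((4096,))  # tau ~ pi_theta
logp = torch.log_softmax(logits, dim=-1)[taus]   # log pi_theta(tau)
R = rewards[taus]

loss1 = -(torch.exp(logp - logp.detach()) * R).mean()   # Equation (1): ratio vs detached copy
loss2 = -(logp * R).mean()                               # Equation (2): plain log-prob surrogate

g1, = torch.autograd.grad(loss1, logits, retain_graph=True)
g2, = torch.autograd.grad(loss2, logits)
print(torch.allclose(g1, g2, atol=1e-6))                 # True: identical gradients
```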
Though this is an unbiased estimate of the policy gradient, we are estimating it with a finite number of data points and might want to reduce its variance. One way to do this is to subtract a baseline $b$ from the reward. As long as this is independent of the trajectory $\tau$, this doesn't change the expected policy gradient, as shown below.
\begin{aligned} \nabla_\theta J_{b}(\theta) & = \Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau) ({R(\tau)} - b)} \\ & = \nabla_\theta J(\theta) - b \Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau)} && \text{[$b$ is independent of $\tau$]} \\ & = \nabla_\theta J(\theta) - b \Eof{\tau \sim \pitheta} {\frac{\nabla_\theta \pitheta(\tau)}{\pitheta(\tau)}} && \text{[log-gradient trick]} \\ & = \nabla_\theta J(\theta) - b \int_{\tau}\nabla_\theta {{\pitheta(\tau)}} && \text{[convert to integral]} \\ & = \nabla_\theta J(\theta) - b \nabla_\theta \int_{\tau} {{\pitheta(\tau)}} && \text{[exchange order]} \\ & = \nabla_\theta J(\theta) - b \nabla_\theta 1 && \text{[integral of density is 1]} \\ & = \nabla_\theta J(\theta) && \text{[derivative of constant]} \\ \end{aligned}
Note that this fully follows from the expectation of the log-gradient being zero (shown on its own in Appendix A). A common target for the baseline is the value function, or the mean reward of trajectories sampled from the policy. This can be estimated by learning a value function on the fly or by taking the mean reward of trajectories sampled from the policy in a batch. We will ignore baselines until we analyze the clipped policy gradient.
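As a concrete sketch (mine, not a reference implementation), here's the surrogate loss with a batch-estimated constant baseline; the leave-one-out variant keeps the baseline independent of the trajectory it is subtracted from.

```python
import torch

def reinforce_loss_with_baseline(logp, R, leave_one_out=True):
    """REINFORCE surrogate with a batch-estimated constant baseline.

    logp: log pi_theta(tau) for each sampled trajectory, shape (B,), requires grad
    R:    scalar reward per trajectory, shape (B,), no grad
    """
    B = R.shape[0]
    if leave_one_out:
        # Baseline for sample i is the mean reward of the other B - 1 samples,
        # so it stays independent of trajectory i.
        b = (R.sum() - R) / (B - 1)
    else:
        b = R.mean()
    advantage = (R - b).detach()           # never differentiate through rewards or baselines
    return -(logp * advantage).mean()      # minimizing this ascends the policy gradient
```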
From a systems perspective, the policy generating the data is different from the policy we are updating (Mnih et al, 2016, Espeholt et al, 2018). For example, language model inference is expensive and utilizes a different infrastructure than training (vLLM vs HuggingFace). Therefore, the sampling distribution can be different from the training distribution because it is a few steps behind, it uses different code, or is quantized for efficiency. Unfortunately, the policy gradient assumes that the data is sampled from the policy we are updating.
Off-policy RL methods are designed to handle this. The core solution is to use importance sampling to reweight the likelihood of each trajectory. Therefore, even though the trajectories are sampled from a different policy, we get an unbiased estimate of the policy gradient we would have obtained had we been sampling on-policy. The importance sampled policy gradient is given by
\begin{aligned} \nabla_\theta J(\theta) = \Eof{\tau \sim \pigen} {\frac{\pitheta(\tau)}{\pigen(\tau)} \nabla_\theta \log \pitheta(\tau) {R(\tau)}} \end{aligned}
where $\pigen$ is the policy that generated the data. In expectation, this is the same as the on-policy policy gradient. However, this estimator has incredibly high variance because the importance sampling weights can get very large; this is worsened the further off-policy we are. How can we address this?
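To get a feel for how quickly this bites, here's a toy simulation (my own, not from the post): a 1000-way categorical policy whose generator logits are perturbed by increasing amounts of noise. The mean importance weight stays near 1, but its spread explodes.

```python
import torch

torch.manual_seed(0)
logits_theta = torch.zeros(1000)                 # training policy over 1000 "trajectories"

for drift in [0.0, 0.5, 1.0, 2.0]:
    logits_gen = logits_theta + drift * torch.randn(1000)    # stale / perturbed generator
    p_theta = torch.softmax(logits_theta, dim=-1)
    p_gen = torch.softmax(logits_gen, dim=-1)
    taus = torch.distributions.Categorical(probs=p_gen).sample((20000,))
    w = (p_theta / p_gen)[taus]                  # importance weights pi_theta / pi_gen
    print(f"drift={drift:.1f}  mean={w.mean().item():.3f}  "
          f"std={w.std().item():.3f}  max={w.max().item():.1f}")
```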
The solution employed by Proximal Policy Optimization (PPO) (Schulman et al. 2017), building on the trust-region idea of Trust Region Policy Optimization (TRPO) (Schulman et al. 2015), is to clip the importance sampling weights to be between $1-\epsilon$ and $1+\epsilon$, denoted $\clip{\frac{\pitheta(\tau)}{\pigen(\tau)}}{1-\epsilon}{1+\epsilon}$. On top of this, PPO uses a pessimistic bound on the surrogate reward, taking the minimum of the clipped and unclipped terms. This is given by the following expression adapting Equation \eqref{eq:on-policy-surrogate-1} with importance sampling, clipping, and pessimistic updates:
\begin{aligned} J_{\text{PPO}}(\theta) = \Eof{\tau \sim \pigen} {\min \left(\frac{\pitheta(\tau)}{\pigen(\tau)} {R(\tau)}, \clip{\frac{\pitheta(\tau)}{\pigen(\tau)}}{1-\epsilon}{1+\epsilon} {R(\tau)} \right)} \end{aligned}
Note that similar to Equation \eqref{eq:on-policy-surrogate-1}, this update leverages the fact that the numerator of the importance sampling weight $\pitheta(\tau)$ is differentiable whereas the denominator $\pigen(\tau)$ is detached.
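Concretely, a sequence-level PPO surrogate might look like the following sketch (my rendering, not any particular library's implementation); `logp_gen` would be recorded by the inference engine and treated as a constant.

```python
import torch

def ppo_surrogate_loss(logp_theta, logp_gen, R, eps=0.2):
    """Clipped, pessimistic PPO surrogate (sequence-level, no value function).

    logp_theta: log pi_theta(tau) under the training policy, shape (B,), requires grad
    logp_gen:   log pi_gen(tau) recorded at generation time, shape (B,), no grad
    R:          reward (or advantage) per trajectory, shape (B,)
    """
    ratio = torch.exp(logp_theta - logp_gen.detach())       # only the numerator is differentiated
    unclipped = ratio * R
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * R
    return -torch.minimum(unclipped, clipped).mean()        # pessimistic: take the worse of the two
```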
On a historical note, people have really not tried to change PPO that much. For example, in the original paper, they tuned the clipping parameter $\epsilon$ to be $0.2$ after trying $0.1$ and $0.3$. DAPO (Yu et al, 2025) notes that people have defaulted to this practice since, and they found that using $0.28$ led to much better performance, confirmed by papers such as Magistral (Mistral-AI, 2025).
Also note that this isn't all there is to PPO. It also has other features such as learning a value function for the baseline, bootstrapping, entropy bonuses, etc. We will not study these today.
The conservativeness of the PPO update is motivated by classical intuition from "trust regions" and the natural policy gradient (Kakade, 2002) where one should not stray too far from the reference policy. In older RL settings, the generator policy is frequently reset to the current policy, promoting the use of updates that don't stray too far from the initial policy.
However, this can prevent going further off-policy. One striking failure mode of PPO is that the policy will not move at all for positive updates that trigger clipping. This is because when clipping triggers, the importance sampling weight becomes a constant, the reward is also a constant, and the gradient becomes zero. Though this is desired under the trust region intuition, this prevents more aggressive off-policy RL. Though we still need clipping for stability, we can apply it directly to the policy gradient via the following expression:
\begin{aligned} \nabla_\theta J(\theta) = \Eof{\tau \sim \pigen} {\clip{\frac{\pitheta(\tau)}{\pigen(\tau)}}{1-\epsilon}{1+\epsilon} \nabla_\theta \log \pitheta(\tau) {R(\tau)}} \end{aligned}
If we would like this policy gradient instead of the zeroed-out gradient, we can follow Equation \eqref{eq:on-policy-surrogate-2} instead of Equation \eqref{eq:on-policy-surrogate-1} and get the following expression:
\begin{aligned} J(\theta) = \Eof{\tau \sim \pigen} {\clip{\texttt{detach}\left(\frac{\pitheta(\tau)}{\pigen(\tau)}\right)}{1-\epsilon}{1+\epsilon} \log \pitheta(\tau) {R(\tau)}} \end{aligned}
Similar to the dichotomy between Equation \eqref{eq:on-policy-surrogate-1} and Equation \eqref{eq:on-policy-surrogate-2}, this reward and the PPO reward have different values, but yield the same gradient when there is no clipping. However, when there is clipping, the corrected reward yields nonzero gradient, unlike PPO. This is the approach followed by Minimax in their CISPO objective (Minimax, 2025) and by Meta in their asynchronous RL implementation (Wu et al, 2025). In both implementations, they only set an upper bound for clipping. This less conservative approach works better in practice and enables going further off-policy. From here on out, we will always clip the importance sampling ratio in the gradients, not rewards.
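The change relative to the PPO surrogate is just where the clipping and the detach sit. Here's a minimal sketch of the clipped-in-gradient loss with upper-bound-only clipping, as described above (my rendering, not the exact CISPO or Meta code):

```python
import torch

def clipped_gradient_loss(logp_theta, logp_gen, R, eps_high=0.2):
    """Clip the importance weight inside the gradient, not the surrogate reward.

    The clipped ratio is fully detached, so it only rescales the REINFORCE term;
    clipped trajectories still contribute a nonzero gradient.
    """
    ratio = torch.exp(logp_theta - logp_gen).detach()    # pi_theta / pi_gen, no grad
    w = torch.clamp(ratio, max=1 + eps_high)             # upper bound only
    return -(w * logp_theta * R).mean()
```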
Though clipping induces stability by reducing variance, it makes our gradient estimator biased and inconsistent. This means that even with infinite samples per batch, we will get an incorrect estimate of the gradient and the policy will not converge to the optimal policy. Though we are generally happy to accept some bias in exchange for lower variance, it is unclear that clipping is the best way to do this.
One related property is that unlike on-policy RL, shifting the reward by a constant will change the expected policy gradient. For example, suppose we introduced a constant baseline $b$. As established earlier, when clipping doesn't trigger, the policy gradient doesn't change. However, when upper-bound clipping triggers, the policy gradient gets shifted by
\begin{aligned} -(1+\epsilon) \nabla_\theta \log \pitheta(\tau) b \end{aligned}
This term does not have zero expectation and behaves like a vanilla behavior cloning term on the samples that triggered clipping. Specifically, if $b$ is negative, the shift pushes up the likelihood of the clipped trajectories, imitating the generator policy on those samples; if $b$ is positive, it pushes the policy away from them. Therefore, calibrating the rewards is really important for clipped off-policy RL to prevent unintended cloning, not just for stability.
This fact has been observed in recent papers in the context of REINFORCE with no importance sampling, where calibrating this term helps (Arnal et al, 2025, Le Roux et al, 2025), and in the context of older papers on off-policy bandits at a batch level (Swaminathan and Joachims, 2015). Our bias is slightly different in that it comes from clipped importance sampling weights, which one hopes keep us closer to the true policy gradient.
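To make the broken shift-invariance concrete, here's an exact-enumeration toy (my own, six trajectories and a hand-picked stale generator): without clipping, adding a constant to every reward leaves the expected gradient untouched; with clipping, it doesn't.

```python
import torch

# Six "trajectories", a training policy pi_theta, and a shifted generator pi_gen,
# enumerated exactly so there is no sampling noise.
logits = torch.tensor([1.0, 0.5, 0.0, -0.5, -1.0, 2.0], requires_grad=True)
logp_theta = torch.log_softmax(logits, dim=-1)
logp_gen = torch.log_softmax(
    logits.detach() + torch.tensor([-1.0, 0.5, 0.3, -0.2, 0.4, -0.6]), dim=-1)
p_gen = logp_gen.exp()
R = torch.tensor([1.0, 0.0, 2.0, -1.0, 0.5, 1.5])

def expected_grad(rewards, clip):
    w = (logp_theta - logp_gen).exp().detach()           # importance weights, detached
    if clip:
        w = torch.clamp(w, max=1.2)                      # upper-bound clipping (triggers here)
    loss = -(p_gen * w * logp_theta * rewards).sum()     # exact E_{tau ~ pi_gen}[...]
    g, = torch.autograd.grad(loss, logits, retain_graph=True)
    return g

for clip in (False, True):
    same = torch.allclose(expected_grad(R, clip), expected_grad(R + 5.0, clip), atol=1e-5)
    print(f"clip={clip}: gradient unchanged by a constant reward shift? {same}")
```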
There is a very cute interpretation of the policy gradient as the gradient of the KL divergence between the policy improvement operator and the current policy (which Konwoo and I learned from Ghosh et al, 2020). Specifically, define the policy improvement operator applied to a policy $\pi$, denoted $\mathcal{R}\pi$, by
\begin{aligned} \mathcal{R}\pi(\tau) := \frac{R(\tau) \cdot \pi(\tau)}{\int_{\tau'} R(\tau') \cdot \pi(\tau') \, d\tau'} \end{aligned}
Note that the optimal policy is a fixed point of this operation, namely $\mathcal{R}\pi^* = \pi^*$. When generating data from a (detached) policy $\pigen$ for a policy $\pitheta$, it can be shown that the vanilla REINFORCE policy gradient is proportional to $-\nabla_{\theta} \text{KL}(\mathcal{R}\pigen || \pitheta).$
\begin{aligned} & -\nabla_{\theta} \text{KL}(\mathcal{R}\pigen || \pitheta) \\ &= -\nabla_{\theta} \Eof{\tau \sim \mathcal{R}\pigen}{\log \mathcal{R}\pigen(\tau) - \log \pitheta(\tau)} && \text{[definition]} \\ &= -\nabla_{\theta} \Eof{\tau \sim \mathcal{R}\pigen}{- \log \pitheta(\tau)} && \text{[drop constant]} \\ &= \nabla_\theta \int_{\tau} {\frac{R(\tau) \cdot \pigen(\tau)}{\int_{\tau'} R(\tau')\cdot\pigen(\tau') \, d\tau'} \log \pitheta(\tau)} \, d\tau && \text{[convert to integral]} \\ &\propto \nabla_\theta \int_{\tau} {R(\tau) \pigen(\tau) \log \pitheta(\tau)} \, d\tau && \text{[drop constant]} \\ &= \int_{\tau} {R(\tau) \pigen(\tau) \nabla_\theta \log \pitheta(\tau)} \, d\tau && \text{[change order]} \\ &= \Eof{\tau \sim \pigen}{R(\tau) \nabla_\theta \log \pitheta(\tau)} && \text{[convert to expectation]} \\ \end{aligned}
and we've recovered the original policy gradient! Therefore, we can view the policy gradient as taking one step closer to the policy under the operator.
This explains a lot of the phenomena we're seeing. The most interesting one is the cloning: if we retain a stale generator policy, then we converge to $\mathcal{R}\pigen$, which is not the optimal policy. In on-policy land, since we continually reset the generator, we iteratively approach the optimal policy.
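Here's a tiny numerical illustration of the operator (my own toy; rewards are kept positive so that $\mathcal{R}\pi$ is a valid distribution): iterating $\mathcal{R}$ concentrates mass on the highest-reward trajectory, and a point mass there is a fixed point.

```python
import torch

R = torch.tensor([1.0, 2.0, 5.0, 0.5])     # positive rewards for 4 "trajectories"
pi = torch.tensor([0.4, 0.3, 0.2, 0.1])     # initial policy

def improve(pi):
    out = R * pi                             # R_pi(tau) is proportional to R(tau) * pi(tau)
    return out / out.sum()

for step in range(5):
    pi = improve(pi)
    print(step, pi.numpy().round(3))
# Mass piles onto trajectory 2 (reward 5.0); the point mass on it satisfies R_pi = pi.
```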
So far, we have treated this problem as a contextual bandit where each trajectory is a single step. However, real language models emit many tokens per trajectory. Unlike older RL settings, many popular reward functions (e.g., is the solution to this math problem correct?) only assign outcome-level rewards instead of rewarding intermediate tokens. Therefore, we can mostly proceed with our algorithms as earlier, assuming that each token receives the reward of the final outcome.
However, when we clip the importance sampling ratios, it matters whether we clip at the token level or the sequence level. It is unclear how to handle this: though sequence-level clipping is the principled solution, it incurs much higher variance than token-level importance sampling weights. Group Sequence Policy Optimization (Zheng et al, 2025) claims that sequence level is better, but they have to do a length normalization to make the sampling ratio stable. This shows that there are different ways to handle the bias/variance tradeoff with different properties.
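To make the distinction concrete, here's a sketch of the options as I understand them (the length-normalized variant is my reading of the GSPO idea; the function name and signature are mine):

```python
import torch

def importance_weights(logp_theta_tok, logp_gen_tok, mask, level="token"):
    """Importance weights from per-token log-probs of shape (B, T).

    mask marks real (non-padding) tokens.
    level="token":         one ratio per token
    level="sequence":      one ratio per sequence, exp(sum of per-token log-ratios)
    level="sequence_norm": length-normalized sequence ratio (a GSPO-style geometric mean)
    """
    log_ratio = (logp_theta_tok - logp_gen_tok) * mask
    if level == "token":
        return log_ratio.exp()                             # shape (B, T)
    seq_log_ratio = log_ratio.sum(dim=-1)                  # shape (B,)
    if level == "sequence":
        return seq_log_ratio.exp()
    if level == "sequence_norm":
        return (seq_log_ratio / mask.sum(dim=-1)).exp()
    raise ValueError(level)
```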
A different, potentially more principled approach is to use value-based RL/soft-Q learning. Under this paradigm, we exploit the fact that we can characterize the optimal policy even if we can't compute it. This gives rise to a condition that only the optimal policy satisfies. We can formulate this as a consistency objective over any generations, which we can minimize using gradient descent.
Let's do math (following exposition from Kimi Team, 2025). Suppose we wanted to do KL-constrained RL. Our objective is then
\begin{aligned} \max_\theta \Eof{\tau \sim \pitheta} {R(\tau)} - \beta \text{KL}(\pitheta || \piref) \end{aligned}
It is known that the maximizer of the above objective is the following
\begin{aligned} \pi^*(\tau) = \frac{1}{Z} \piref(\tau) \exp\left(\frac{{R(\tau)}}{\beta}\right) \end{aligned}
where $Z = \int_{\tau} \piref(\tau) \exp\left(\frac{R(\tau)}{\beta}\right) d\tau$ is the partition function, making this a valid distribution (proof in Appendix TODO). $\beta\log Z$ is also called the soft-value function, quantifying the expected reward from generating with the current policy. This means that the optimal policy satisfies
\begin{aligned} {R(\tau)} - \beta \log Z = \beta \log \frac{\pi^*(\tau)}{\piref(\tau)} \end{aligned}
Therefore, the minimizer of the following objective corresponds to the optimal policy
\begin{aligned} \Eof{\tau \sim \pigen} {\left({R(\tau)} - \beta \log Z - \beta \log \frac{\pitheta(\tau)}{\piref(\tau)}\right)^2} \end{aligned}
Kimi Team, 2025 makes the approximation that $\beta \log Z \approx \bar{R} := \Eof{\tau \sim \pigen} {R(\tau)}$, which holds in the limit $\beta \to \infty$. It is unclear whether this is the right thing to do: if you take the other limit, $\beta \to 0$, you get the maximum reward instead of the mean reward. Anyway, the gradient of the above objective (up to scalars) is
\begin{aligned} \Eof{\tau \sim \pigen} {\nabla \log \pi_{\theta}(\tau)(R(\tau) - \bar{R}) - \beta\nabla_{\theta}\frac{1}{2}\left(\log \frac{\pitheta(\tau)}{\piref(\tau)}\right)^2} \end{aligned}
Note that this is pretty close to the policy gradient we derived earlier. This is the REINFORCE gradient with no importance sampling, using the mean reward as a baseline (only because that's our $\beta \log Z$ heuristic), and using an $\ell_2$ regularizer on the log-ratio instead of a KL penalty.
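Putting the pieces together, here's a sketch of the squared consistency loss with the batch-mean approximation for $\beta \log Z$ (my reading of the setup above, not the Kimi Team implementation):

```python
import torch

def consistency_loss(logp_theta, logp_ref, R, beta):
    """Squared consistency objective with beta * log Z approximated by the batch mean reward.

    logp_theta: log pi_theta(tau) on sampled trajectories, shape (B,), requires grad
    logp_ref:   log pi_ref(tau) on the same trajectories, shape (B,), treated as constant
    R:          reward per trajectory, shape (B,)
    """
    target = (R - R.mean()).detach()                 # R(tau) - beta*log Z, approximated
    log_ratio = logp_theta - logp_ref.detach()
    return ((target - beta * log_ratio) ** 2).mean()
```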
RANDOM NOTE: the $\ell_2$ regularizer is an unbiased estimator of the gradient of the KL divergence when the sampler is $\pitheta$, as shown in Appendix B.
Here are a couple of ideas that didn't pan out.
It seems like clipping causes a lot of issues by introducing bias and breaking shift-invariance. Is there a better way to do off-policy RL? It is well known that importance sampling isn't necessarily the lowest-variance estimator. Here, we consider using self-normalized importance sampling for estimating the policy gradient. This corresponds to
\begin{aligned} \nabla_\theta J(\theta) = \Eof{\tau \sim \pigen} {\frac{\frac{\pitheta(\tau)}{\pigen(\tau)}}{\Eof{\tau' \sim \pigen} {\frac{\pitheta(\tau')}{\pigen(\tau')}}} \nabla_\theta \log \pitheta(\tau) {R(\tau)}} \end{aligned}
In practice, for a batch of trajectories, one would estimate the denominator using the mean of the importance sampling weights in the batch. Though this weighting scheme is biased (unlike vanilla importance sampling, which effectively sets the denominator to 1 and is unbiased), it is consistent (i.e. it achieves the true policy gradient in the limit of infinite samples). Additionally, it is often much lower variance than the importance sampled policy gradient.
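In code, the only change from the clipped-gradient loss above is dividing by the batch-mean weight instead of clipping (a sketch, with the weights detached as before):

```python
import torch

def snis_loss(logp_theta, logp_gen, R):
    """Self-normalized importance sampling: each weight is divided by the batch mean weight."""
    w = torch.exp(logp_theta - logp_gen).detach()    # pi_theta / pi_gen, no grad
    w = w / w.mean()                                 # self-normalization over the batch
    return -(w * logp_theta * R).mean()
```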
Unfortunately, this did not work in the toy setting of interest.
TODO MAYBE, doesn't work
TODO, reduces variance but also leads to the wrong loss minimizer. Maybe okay with multiple iterations.
One derivation that comes up a lot is showing $\Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau)} = 0$. Writing it out for personal reference:
\begin{aligned} \Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau)} &= \int_\tau \nabla_\theta \log \pitheta(\tau) \pitheta(\tau) d\tau && \text{[definition]} \\ &= \int_\tau \nabla_\theta \pitheta(\tau) d\tau && \text{[log-gradient trick]} \\ &= \nabla_\theta \int_\tau \pitheta(\tau) d\tau && \text{[exchange order]} \\ &= \nabla_\theta 1 && \text{[integral of density is 1]} \\ &= 0 && \text{[derivative of constant]} \\ \end{aligned}
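A quick numerical sanity check of this identity (my own, exact enumeration over a 10-way categorical):

```python
import torch

logits = torch.randn(10, requires_grad=True)
p = torch.softmax(logits, dim=-1)
logp = torch.log_softmax(logits, dim=-1)
# Exact expectation over the 10 outcomes: sum_tau pi(tau) * grad log pi(tau).
expected_score, = torch.autograd.grad((p.detach() * logp).sum(), logits)
print(expected_score)    # all (near-)zero, up to floating point error
```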
One common way to increase stability (while changing your target policy) is to modify the objective by adding KL regularization with respect to a reference policy. Interestingly, according to Tang and Munos, 2025, most people do this wrong in practice. To do it right, let's first find the derivative of the KL divergence.
\begin{aligned} & \nabla_\theta D_{KL}(\pitheta || \piref) \\ &= \nabla_\theta \Eof{\tau \sim \pitheta} {\log \pitheta(\tau) - \log \piref(\tau)} && \text{[definition]} \\ &= \int_\tau \nabla_\theta \left( \pitheta(\tau) \left( \log \pitheta(\tau) - \log \piref(\tau) \right) \right) && \text{[unwrap expectation]} \\ &= \int_\tau \nabla_\theta \pitheta(\tau) \left( \log \pitheta(\tau) - \log \piref(\tau) \right) + \int_\tau \pitheta(\tau) \nabla_\theta \log \pitheta(\tau) && \text{[product rule]} \\ &= \int_\tau \nabla_\theta \pitheta(\tau) \left( \log \pitheta(\tau) - \log \piref(\tau) \right) && \text{[Appendix A]} \\ &= \int_\tau \pitheta(\tau) \nabla_\theta \log \pitheta(\tau) \left( \log \pitheta(\tau) - \log \piref(\tau) \right) && \text{[log-gradient trick]} \\ &= \Eof{\tau \sim \pitheta} {\nabla_\theta \log \pitheta(\tau) \left( \log \pitheta(\tau) - \log \piref(\tau) \right)} && \text{[wrap expectation]} \\ \end{aligned}
Similar to how there are two ways to set up the surrogate reward for the policy gradient, there are two surrogate quantities that yield the same KL gradient. The first is
\begin{aligned} \Eof{\tau \sim \pitheta} {\frac{1}{2} \left(\log \pitheta(\tau) - \log \piref(\tau) \right)^2} \end{aligned}
This estimator, introduced by John Schulman on his blog, is actually a biased estimator of the KL divergence itself, with low error when the policies are close to each other (Schulman et al, 2017). Interestingly, its gradient is always an unbiased estimator of the gradient of the KL divergence. The second quantity is
\begin{aligned} \Eof{\tau \sim \pitheta} {\log\pitheta(\tau) \texttt{ detach}\left(\log\pitheta(\tau) - \log\piref(\tau)\right)} \end{aligned}
This estimator exploits autograd for a simpler expression, similar to our surrogate reward for the policy gradient. Note that the naive unbiased estimator $\Eof{\tau \sim \pitheta} {\log \pitheta(\tau) - \log \piref(\tau)}$ does not yield the correct derivative: in fact, its expected gradient is zero by Appendix A.
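And a quick exact-enumeration check that both surrogates match the true KL gradient when sampling from $\pitheta$ (my own toy):

```python
import torch

logits = torch.randn(8, requires_grad=True)
logp = torch.log_softmax(logits, dim=-1)
logref = torch.log_softmax(torch.randn(8), dim=-1)      # fixed reference policy
p = logp.exp()

# Exact gradient of the true KL divergence.
kl = (p * (logp - logref)).sum()
g_true, = torch.autograd.grad(kl, logits, retain_graph=True)

# Expected gradient of the squared surrogate under tau ~ pi_theta.
sq = (p.detach() * 0.5 * (logp - logref) ** 2).sum()
g_sq, = torch.autograd.grad(sq, logits, retain_graph=True)

# Expected gradient of the detached-log surrogate.
det = (p.detach() * logp * (logp - logref).detach()).sum()
g_det, = torch.autograd.grad(det, logits)

print(torch.allclose(g_true, g_sq, atol=1e-6), torch.allclose(g_true, g_det, atol=1e-6))
```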
These gradient estimators can incur high variance, so we can add baselines to reduce it. For example, we can center $\log \pitheta(\tau) - \log \piref(\tau)$ by subtracting an estimate of its expected value computed leave-one-out across the batch.
Note that both of these estimators take expectations under the current policy $\pitheta$, whereas our samples come from $\pigen$. To make them principled in the off-policy setting, we need to apply importance sampling, TODO but that incurs high variance. TODO how to solve this?