Authors: Zhenghao Xu, Qin Lu, Changlong Yu, Tuo Zhao
Georgia Tech, Amazon (Work in Progress)
Last updated: February 10, 2026 | [Code] | [Paper]
<aside> 💡
TL;DR
Attaining stability and efficiency in on-policy gradient methods requires various off-policy correction/trust-region tricks. We revisit a simple off-policy method, policy mirror descent (PMD), adopted by Kimi K1.5/K2, and find that it is not simply doing KL regularization with a moving anchor.
Reinforcement Learning (RL) is now the standard paradigm for post-training LLMs, yet scaling these frameworks efficiently and stably remains a challenge. Most current approaches push on-policy gradient methods (GRPO, REINFORCE/RLOO, GSPO, CISPO, etc.) to their limit using asynchronous frameworks such as AReaL or PipelineRL. While these systems prevent long-tail rollouts from blocking updates, their rollouts come from older policies, which inherently creates a training/inference mismatch (even when there is no infra-level mismatch) that can lead to training collapse or degraded performance. This necessitates various off-policy correction methods based on importance sampling (IS), many of which are already built into the algorithms themselves (e.g., PPO/GRPO, GSPO, CISPO). The resulting zoo of variants, combining token-, sequence-, or sequence-mean-level IS with clipping, masking, stop-gradient, or truncation (not to mention more advanced variants that handle positive and negative samples differently), makes it difficult to analyze them all and identify the best option.
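To make the distinctions concrete, here is a minimal sketch of the three basic IS granularities on a single four-token response; the log-probabilities and the 0.2 clip range are made up for illustration, and this is not the exact recipe of any one algorithm:

```python
import numpy as np

# Hypothetical per-token log-probs of one rollout response under the
# behavior policy (which generated it) and the current policy.
logp_behavior = np.array([-1.2, -0.8, -2.1, -0.5])
logp_current = np.array([-1.0, -0.9, -1.8, -0.6])

# Token-level IS ratios (PPO/GRPO-style), clipped per token.
token_ratio = np.exp(logp_current - logp_behavior)
token_ratio_clipped = np.clip(token_ratio, 1 - 0.2, 1 + 0.2)

# Sequence-level IS ratio: the product of all token ratios,
# i.e., exp of the summed log-ratio.
seq_ratio = np.exp(np.sum(logp_current - logp_behavior))

# Sequence-mean variant: length-normalized log-ratio (GSPO-style
# methods use a normalization in this spirit).
seq_mean_ratio = np.exp(np.mean(logp_current - logp_behavior))
```

The sequence-level ratio equals the product of the token-level ratios, which is why it has much higher variance on long responses and why length-normalized and clipped variants exist.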
On-policy gradient methods need on-policy samples to perform well; what, then, about off-policy methods? Off-policy training is typically assumed to yield inferior performance. However, there is a notable counterexample of an off-policy algorithm training strong models: policy mirror descent (PMD) was adopted to train Kimi K1.5 and K2. Ideally, this algorithmic shift allows for larger rollout batches to amortize the inference cost (the main efficiency bottleneck in RL training), and can attain further efficiency via partial rollouts and full co-location of training and inference engines (as Kimi’s report suggested).
To our surprise, there is little research in the open-source community that implements Kimi’s PMD and compares it with popular RL algorithms such as GRPO. Most discussions simply take it as a variant of KL-penalized PG with a periodically updated KL anchor. Interestingly, our analysis suggests that this is not the full story: Kimi’s PMD is implicitly solving a stronger regularized subproblem with an additional $\chi^2$ penalty. This additional regularization has several implications, including less aggressive updates and robustness to finite-sample error when the number of rollout samples is limited. In this blog, we walk through these underlying mechanisms of PMD, along with empirical results showing that its performance is competitive with GRPO and other PG variants.
Suppose $\mathcal{X}$ is the state space (prompts) and $\mathcal{Y}$ is the action space (responses). Ideally, in the policy space, each global step of PMD solves the following KL-regularized subproblem:
$$ \pi_{t+1}(\cdot\mid x)=\argmax_{\pi(\cdot\mid x)\in\Delta(\mathcal{Y})}\mathbb{E}_{y\sim\pi(\cdot\mid x)}[r(x,y)-\tau\mathrm{KL}(\pi(\cdot\mid x)~\|~\pi_t(\cdot\mid x))],\forall x\in\mathcal{X}, $$
which has the closed-form solution:
$$ \begin{align}\pi_{t+1}(y\mid x)=\frac{\pi_t(y\mid x)\exp(r(x,y)/\tau)}{Z_t(x)},~\forall x\in\mathcal{X}, \end{align} $$
where $Z_t(x)=\mathbb{E}_{y\sim\pi_t(\cdot\mid x)}[\exp(r(x,y)/\tau)]$ is the partition function. While on-policy gradient methods can also be used to solve the KL-regularized subproblem (by moving the KL into the reward or adding a KL loss), this deviates from the goal of finding an off-policy alternative. A direct off-policy method is to fit the target $\pi_{t+1}$ for each global step by minimizing the regression loss on rollout samples from $\pi_t$:
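As a sanity check, the closed-form update $\pi_{t+1}\propto\pi_t\exp(r/\tau)$ can be computed exactly when the response set is small and discrete. The following sketch does so for a hypothetical prompt with three candidate responses and made-up rewards:

```python
import numpy as np

def pmd_update(pi_t, rewards, tau):
    """Exact PMD update for one prompt x over a discrete response set:
    pi_{t+1}(y|x) = pi_t(y|x) * exp(r(x,y)/tau) / Z_t(x)."""
    logits = np.log(pi_t) + rewards / tau
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()                 # normalizing implements 1/Z_t(x)

# Toy example: three responses with made-up probabilities and rewards.
pi_t = np.array([0.5, 0.3, 0.2])
rewards = np.array([1.0, 0.0, -1.0])
pi_next = pmd_update(pi_t, rewards, tau=1.0)
```

The update reweights the old policy multiplicatively toward higher-reward responses, with $\tau$ controlling how aggressive the shift is (larger $\tau$ means a more conservative step).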
$$ \mathcal{L}_{\mathrm{partition}}(\theta)=\mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi_t(\cdot\mid x)}\left[\frac{1}{2}\left(\log\frac{\pi_\theta(y\mid x)}{\pi_t(y\mid x)}-\frac{r(x,y)-\tau\log Z_t(x)}{\tau}\right)^2\right]. $$
Assuming $\pi_\theta$ ranges over the whole (interior of the) policy space and we have sufficiently many samples, minimizing $\mathcal{L}_{\mathrm{partition}}$ indeed recovers $\pi_{t+1}$. The method is off-policy by nature since all samples come from the old policy $\pi_t$ rather than $\pi_\theta$. We call it PMD-partition.
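A minimal single-prompt sketch of this loss follows. The partition function is estimated by a log-mean-exp of $r/\tau$ over the rollouts, which is one natural Monte-Carlo choice for illustration rather than any particular production implementation:

```python
import numpy as np

def pmd_partition_loss(logp_theta, logp_t, rewards, tau):
    """Monte-Carlo sketch of L_partition for a single prompt x.
    Inputs are per-rollout log-probs under pi_theta and pi_t, and
    rewards, for responses y ~ pi_t.  log Z_t(x) is estimated via
    log-mean-exp of r/tau over the same rollouts (an assumption
    made for this self-contained example)."""
    log_Z = np.log(np.mean(np.exp(rewards / tau)))
    target = (rewards - tau * log_Z) / tau
    residual = (logp_theta - logp_t) - target
    return 0.5 * np.mean(residual ** 2)
```

The loss is zero exactly when the log-ratio $\log(\pi_\theta/\pi_t)$ matches $(r-\tau\log Z_t)/\tau$ on every rollout, i.e., when $\pi_\theta$ equals the closed-form target $\pi_{t+1}$ (up to the Monte-Carlo estimate of $Z_t$).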
So far, everything looks straightforward, since the KL-regularized RL problem has been extensively studied; except that Kimi’s PMD does not minimize this loss. In particular, an approximation $\tau\log Z_t(x)\approx\mathbb{E}_{y\sim\pi_t(\cdot\mid x)}[r(x,y)]$ has been introduced, resulting in the following PMD-mean (or PMD-kimi) loss: