Reward guidance, also known as posterior sampling, is a popular method for test-time adaptation and post-training of continuous diffusion models. In this paper, we study reward guidance for discrete diffusion language models, where a key obstacle arises: one cannot differentiate through the model's natural outputs because they are discrete tokens.
We introduce EntRGi: Entropy-Aware Reward Guidance to address this issue. EntRGi dynamically interpolates between continuous token relaxations and sampled hard tokens, on a token-by-token basis, using the diffusion model's predictive entropy. We demonstrate that EntRGi maintains both reward-model reliability and optimization accuracy, whereas existing approaches sacrifice one for the other.
We empirically validate our approach on 7B-parameter diffusion language models across two settings: (1) test-time adaptation, and (2) RGRL (Reward-Guided Reinforcement Learning), our recipe for post-training on reward-guided data. In both settings, EntRGi shows consistent improvements over state-of-the-art methods.
EntRGi interpolates between soft embeddings and hard tokens based on entropy: lower entropy emphasizes the continuous relaxation for gradient accuracy, while higher entropy relies on hard tokens via the straight-through estimator (STE) for reward-model reliability. This adaptive weighting resolves the fundamental tension in gradient-based steering of discrete diffusion models.
Concretely, at each masked position $l$, the input to the reward model is constructed as: $$\hat{e}^l \;=\; \tilde{e}^l \;+\; \text{sg}\!\Big(w^l\,(\bar{e}^l - \tilde{e}^l)\Big)$$ where $\tilde{e}^l = \sum_i q_i^l\, \mathbf{E}^R_i$ is the soft embedding (continuous relaxation), $\bar{e}^l = \mathbf{E}^R[x^l]$ is the sampled hard-token embedding, $\text{sg}(\cdot)$ is the stop-gradient operator, and $w^l = H(\mathbf{q}^l)/\log K \in [0,1]$ is the predictive entropy normalized by its maximum over a vocabulary of size $K$. When the model is confident ($w \!\approx\! 0$), the input reduces to the soft embedding, preserving gradient accuracy. When uncertain ($w \!\approx\! 1$), it reverts to STE, ensuring the reward model receives realistic discrete tokens.
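This interpolation takes only a few lines to implement. Below is a minimal PyTorch sketch under assumed shapes and placeholder names (`q`, `E_reward`, `sampled_ids` are illustrative, not the identifiers from our codebase); `detach()` plays the role of the stop-gradient $\text{sg}(\cdot)$.

```python
import math
import torch

def entropy_aware_embedding(q, E_reward, sampled_ids):
    # q:           (L, K) predicted distribution q^l over the vocabulary at each masked position
    # E_reward:    (K, D) reward model's token embedding matrix E^R
    # sampled_ids: (L,)   hard tokens x^l sampled from q
    K = q.size(-1)
    e_soft = q @ E_reward                                   # soft embedding  \tilde{e}^l
    e_hard = E_reward[sampled_ids]                          # hard embedding  \bar{e}^l
    entropy = -(q * q.clamp_min(1e-12).log()).sum(dim=-1)   # H(q^l)
    w = (entropy / math.log(K)).unsqueeze(-1)               # normalized entropy in [0, 1]
    # Forward value is (1 - w) * e_soft + w * e_hard; the detach (stop-gradient)
    # routes all gradients through e_soft, recovering STE when w ≈ 1.
    return e_soft + (w * (e_hard - e_soft)).detach()
```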
We use Dream-v0-Instruct-7B as the base diffusion language model and Skywork-Reward-V2-Qwen3-1.7B as the reward model, evaluated on three multi-skill benchmarking suites: Reward-Bench-2, RM-Bench, and JudgeBench. These datasets contain prompts that measure fine-grained chatbot abilities such as precise instruction following, safety, factuality, and knowledge, with some coverage of math and code.
We report the maximum reward value across samples per prompt (Top@1) at sampling temperature $\tau\!=\!0.7$. We compare against Best-of-$N$ (BoN), a gradient-free baseline that samples $N$ independent trajectories and selects the one with the highest reward score.
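For completeness, a minimal sketch of the BoN baseline is shown below; `sample_fn` and `reward_fn` are placeholder interfaces for the diffusion sampler and the reward-model scorer, not our actual APIs.

```python
def best_of_n(prompt, sample_fn, reward_fn, n=8, temperature=0.7):
    # Gradient-free baseline: sample n completions independently, score each with
    # the reward model, and return the highest-scoring one (Top@1 over n samples).
    completions = [sample_fn(prompt, temperature=temperature) for _ in range(n)]
    rewards = [reward_fn(prompt, c) for c in completions]
    best = max(range(n), key=lambda i: rewards[i])
    return completions[best], rewards[best]
```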
Test-Time Adaptation — Dream-v0-Instruct-7B · Skywork-Reward-V2-Qwen3-1.7B · Top@1 (τ = 0.7)
RGRL generates completions via EntRGi or APS reward-gradient guidance, then iteratively updates the dLLM to maximize their likelihood. This dense reward feedback consistently outperforms diffu-GRPO, which relies only on scalar rewards. On WildChat-IF with Dream we observe up to a 70% relative improvement (+0.90 absolute) over diffu-GRPO.
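A simplified view of one RGRL iteration is sketched below; `guided_sample_fn` stands in for EntRGi (or APS) reward-gradient guidance and `dllm.completion_log_prob` for the model's (approximate) log-likelihood of a completion, both assumed interfaces rather than our exact implementation.

```python
import torch

def rgrl_step(dllm, optimizer, prompts, guided_sample_fn):
    # One RGRL update: generate reward-guided completions, then take a
    # gradient step that increases their likelihood under the dLLM.
    optimizer.zero_grad()
    loss = 0.0
    for prompt in prompts:
        with torch.no_grad():
            completion = guided_sample_fn(dllm, prompt)              # EntRGi/APS-guided sampling
        loss = loss - dllm.completion_log_prob(prompt, completion)   # assumed likelihood method
    (loss / len(prompts)).backward()
    optimizer.step()
```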
Each animation shows the same prompt processed with standard sampling (Base, top) and EntRGi-guided generation (bottom), highlighting differences in token selection order and final output quality.
@article{tejaswi2026entrgi,
title = {Entropy Aware Reward Guidance for Diffusion Language Model Alignment},
author = {Tejaswi, Atula and Rout, Litu and Caramanis, Constantine and
Shakkottai, Sanjay and Sanghavi, Sujay},
journal = {arXiv preprint arXiv:2602.05000},
year = {2026}
}