Reward guidance has been applied with great success to test-time adaptation of continuous diffusion models: it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the model's natural outputs because they are discrete tokens.
Existing approaches either replace these discrete tokens with continuous relaxations or employ techniques such as the straight-through estimator (STE). In this work, we show the downsides of both approaches. The former degrades gradient feedback because the reward model has never been trained on continuous inputs. The latter optimizes incorrectly: gradients evaluated at discrete tokens are used to update continuous logits.
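For concreteness, the sketch below illustrates these two baseline ways of feeding a reward model from the denoiser's token distribution. It is a minimal PyTorch sketch; the tensor names, shapes, and use of argmax decoding are our own illustrative assumptions, not code from the paper.

```python
import torch.nn.functional as F

def soft_relaxation_input(logits, reward_embed):
    # Continuous relaxation: expected embedding under the predicted distribution.
    # Fully differentiable, but a mixture of embeddings that the reward model
    # never saw during training.
    q = F.softmax(logits, dim=-1)          # (L, K) per-position token distributions
    return q @ reward_embed                # (L, d) soft embeddings

def straight_through_input(logits, reward_embed):
    # Straight-through estimator: the reward model sees embeddings of real tokens,
    # but the gradient computed at those hard tokens is copied onto the soft path.
    q = F.softmax(logits, dim=-1)
    soft = q @ reward_embed
    hard = reward_embed[q.argmax(dim=-1)]  # (L, d) hard token embeddings
    return soft + (hard - soft).detach()   # forward: hard; backward: through soft
```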
Our key innovation is to go beyond this tradeoff by introducing EntRGi (Entropy-Aware Reward Guidance), which dynamically regulates the gradients from the reward model. By modulating the continuous relaxation according to the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model.
EntRGi interpolates between soft embeddings and hard tokens based on entropy: lower entropy emphasizes continuous relaxation for gradient accuracy, while higher entropy relies on hard tokens via STE for reward model reliability. This adaptive weighting resolves the fundamental tension in gradient-based steering of discrete diffusion models.
Concretely, at each masked position $l$, the input to the reward model is constructed as: $$\hat{e}^l \;=\; \tilde{e}^l \;+\; \text{sg}\!\Big(w^l\,(\bar{e}^l - \tilde{e}^l)\Big)$$ where $\tilde{e}^l = \sum_i q_i^l\, \mathbf{E}^R_i$ is the soft embedding (continuous relaxation), $\bar{e}^l = \mathbf{E}^R[x^l]$ is the sampled hard token embedding, $\text{sg}(\cdot)$ is the stop-gradient operator, and $w^l = H(\mathbf{q}^l)/\log K$ is the entropy of the predictive distribution $\mathbf{q}^l$ normalized by its maximum $\log K$ (with $K$ the vocabulary size), so that $w^l \in [0,1]$. When the model is confident ($w^l \!\approx\! 0$), the input reduces to the soft embedding, preserving gradient accuracy. When it is uncertain ($w^l \!\approx\! 1$), the input reverts to STE, ensuring the reward model receives embeddings of realistic discrete tokens.
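A minimal sketch of this construction is given below, assuming PyTorch, an $(L, K)$ matrix of per-position logits, and access to the reward model's embedding table $\mathbf{E}^R$; `.detach()` plays the role of $\text{sg}(\cdot)$.

```python
import math
import torch
import torch.nn.functional as F

def entrgi_reward_input(logits, reward_embed):
    # logits:       (L, K) denoiser logits at the masked positions (assumed shape)
    # reward_embed: (K, d) token-embedding matrix E^R of the reward model
    q = F.softmax(logits, dim=-1)                               # q^l
    K = q.shape[-1]

    soft = q @ reward_embed                                     # soft embedding: sum_i q_i^l E^R_i
    hard_ids = torch.multinomial(q, num_samples=1).squeeze(-1)  # sampled token x^l
    hard = reward_embed[hard_ids]                               # hard embedding: E^R[x^l]

    entropy = -(q * q.clamp_min(1e-12).log()).sum(dim=-1)       # H(q^l)
    w = (entropy / math.log(K)).unsqueeze(-1)                   # w^l = H(q^l) / log K, in [0, 1]

    # e_hat^l = soft + sg(w^l * (hard - soft)):
    # the forward pass sees the entropy-weighted mix (close to the hard embedding
    # when uncertain), while the backward pass flows only through the soft embedding.
    return soft + (w * (hard - soft)).detach()
```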
We use Dream-v0-Instruct-7B as the base diffusion language model and Skywork-Reward-V2-Qwen3-1.7B as the reward model, evaluated on three multi-skill benchmarking suites: Reward-Bench-2, RM-Bench, and JudgeBench. These datasets contain prompts that measure fine-grained chatbot abilities such as precise instruction following, safety, factuality, and knowledge, with some coverage of math and code.
For each prompt, we report the maximum reward value across generated samples (Top@1) at sampling temperature $\tau\!=\!0.7$. We compare against Best-of-$N$ (BoN), a gradient-free baseline that samples $N$ independent trajectories and selects the one with the highest reward score.
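For reference, a schematic of the BoN baseline is shown below; `sample_fn` and `reward_fn` are hypothetical callables standing in for the diffusion language model's sampler and the reward model, not interfaces from the paper's code.

```python
import torch

def best_of_n(prompt, sample_fn, reward_fn, n=8, temperature=0.7):
    # Sample N independent trajectories and keep the highest-scoring completion.
    candidates = [sample_fn(prompt, temperature=temperature) for _ in range(n)]
    rewards = torch.tensor([reward_fn(prompt, c) for c in candidates])
    best = int(rewards.argmax())
    return candidates[best], rewards[best].item()  # Top@1 reward = max over samples
```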
Each animation shows the same prompt processed with standard sampling (Base, top) and EntRGi-guided generation (bottom), highlighting differences in token selection order and final output quality.
@article{tejaswi2026entrgi,
  title   = {EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models},
  author  = {Tejaswi, Atula and Rout, Litu and Caramanis, Constantine and Shakkottai, Sanjay and Sanghavi, Sujay},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}