Reward guidance has been applied with great success to test-time adaptation of continuous diffusion models: it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the model's natural outputs because they are discrete tokens.
Existing approaches either replace these discrete tokens with continuous relaxations or employ techniques such as the straight-through estimator (STE). In this work, we show the downsides of both approaches. The former degrades gradient feedback because the reward model has never been trained on continuous inputs. The latter optimizes incorrectly: gradients evaluated at discrete tokens are used to update continuous logits.
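For concreteness, the sketch below illustrates these two baseline ways of feeding a reward model from the denoiser's token distribution. It is a minimal PyTorch sketch; the tensor names, shapes, and use of argmax decoding are our own illustrative assumptions, not code from the paper.

```python
import torch.nn.functional as F

def soft_relaxation_input(logits, reward_embed):
    # Continuous relaxation: expected embedding under the predicted distribution.
    # Fully differentiable, but a mixture of embeddings that the reward model
    # never saw during training.
    q = F.softmax(logits, dim=-1)          # (L, K) per-position token distributions
    return q @ reward_embed                # (L, d) soft embeddings

def straight_through_input(logits, reward_embed):
    # Straight-through estimator: the reward model sees embeddings of real tokens,
    # but the gradient computed at those hard tokens is copied onto the soft path.
    q = F.softmax(logits, dim=-1)
    soft = q @ reward_embed
    hard = reward_embed[q.argmax(dim=-1)]  # (L, d) hard token embeddings
    return soft + (hard - soft).detach()   # forward: hard; backward: through soft
```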
Our key innovation is to go beyond this tradeoff by introducing EntRGi (Entropy-Aware Reward Guidance), which dynamically regulates the gradients from the reward model. By modulating the continuous relaxation according to the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model.
EntRGi interpolates between soft embeddings and hard tokens based on entropy: lower entropy emphasizes continuous relaxation for gradient accuracy, while higher entropy relies on hard tokens via STE for reward model reliability. This adaptive weighting resolves the fundamental tension in gradient-based steering of discrete diffusion models.
Concretely, at each masked position $l$, the input to the reward model is constructed as: $$\hat{e}^l \;=\; \tilde{e}^l \;+\; \text{sg}\!\Big(w^l\,(\bar{e}^l - \tilde{e}^l)\Big)$$ where $\tilde{e}^l = \sum_i q_i^l\, \mathbf{E}^R_i$ is the soft embedding (continuous relaxation), $\bar{e}^l = \mathbf{E}^R[x^l]$ is the sampled hard token embedding, $\text{sg}(\cdot)$ is the stop-gradient operator, and $w^l = H(\mathbf{q}^l)/\log K$ is the entropy of the predictive distribution $\mathbf{q}^l$ normalized by its maximum $\log K$ (with $K$ the vocabulary size), so that $w^l \in [0,1]$. When the model is confident ($w^l \!\approx\! 0$), the input reduces to the soft embedding, preserving gradient accuracy. When it is uncertain ($w^l \!\approx\! 1$), the input reverts to STE, ensuring the reward model receives embeddings of realistic discrete tokens.
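A minimal sketch of this construction is given below, assuming PyTorch, an $(L, K)$ matrix of per-position logits, and access to the reward model's embedding table $\mathbf{E}^R$; `.detach()` plays the role of $\text{sg}(\cdot)$.

```python
import math
import torch
import torch.nn.functional as F

def entrgi_reward_input(logits, reward_embed):
    # logits:       (L, K) denoiser logits at the masked positions (assumed shape)
    # reward_embed: (K, d) token-embedding matrix E^R of the reward model
    q = F.softmax(logits, dim=-1)                               # q^l
    K = q.shape[-1]

    soft = q @ reward_embed                                     # soft embedding: sum_i q_i^l E^R_i
    hard_ids = torch.multinomial(q, num_samples=1).squeeze(-1)  # sampled token x^l
    hard = reward_embed[hard_ids]                               # hard embedding: E^R[x^l]

    entropy = -(q * q.clamp_min(1e-12).log()).sum(dim=-1)       # H(q^l)
    w = (entropy / math.log(K)).unsqueeze(-1)                   # w^l = H(q^l) / log K, in [0, 1]

    # e_hat^l = soft + sg(w^l * (hard - soft)):
    # the forward pass sees the entropy-weighted mix (close to the hard embedding
    # when uncertain), while the backward pass flows only through the soft embedding.
    return soft + (w * (hard - soft)).detach()
```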
We use Dream-v0-Instruct-7B as the base diffusion language model and Skywork-Reward-V2-Qwen3-1.7B as the reward model, evaluated on three multi-skill benchmarking suites: Reward-Bench-2, RM-Bench, and JudgeBench. These datasets contain prompts that measure fine-grained chatbot abilities such as precise instruction following, safety, factuality, and knowledge, with some coverage of math and code.
For each prompt, we report the maximum reward value across generated samples (Top@1) at sampling temperature $\tau\!=\!0.7$. We compare against Best-of-$N$ (BoN), a gradient-free baseline that samples $N$ independent trajectories and selects the one with the highest reward score.
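For reference, a schematic of the BoN baseline is shown below; `sample_fn` and `reward_fn` are hypothetical callables standing in for the diffusion language model's sampler and the reward model, not interfaces from the paper's code.

```python
import torch

def best_of_n(prompt, sample_fn, reward_fn, n=8, temperature=0.7):
    # Sample N independent trajectories and keep the highest-scoring completion.
    candidates = [sample_fn(prompt, temperature=temperature) for _ in range(n)]
    rewards = torch.tensor([reward_fn(prompt, c) for c in candidates])
    best = int(rewards.argmax())
    return candidates[best], rewards[best].item()  # Top@1 reward = max over samples
```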
Each animation shows the same prompt processed with standard sampling (Base, top) and EntRGi-guided generation (bottom), highlighting differences in token selection order and final output quality.
@article{tejaswi2026entrgi,
  title   = {EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models},
  author  = {Tejaswi, Atula and Rout, Litu and Caramanis, Constantine and Shakkottai, Sanjay and Sanghavi, Sujay},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}