Preprint 2026

EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models

Test-time steering of discrete diffusion LLMs using entropy-modulated gradients

Atula Tejaswi*, Litu Rout*, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi

* Equal contribution

The University of Texas at Austin

Abstract

Reward guidance has been applied with great success to the test-time adaptation of continuous diffusion models: it updates each denoising step using gradients from a downstream reward model. We study reward guidance for discrete diffusion language models, where one cannot differentiate through the model's natural outputs because they are discrete tokens.

Existing approaches either replace these discrete tokens with continuous relaxations or employ techniques such as the straight-through estimator (STE). In this work, we show the downsides of both of these methods. The former degrades gradient feedback because the reward model was never trained on continuous inputs. The latter performs incorrect optimization because a gradient evaluated at discrete tokens is used to update continuous logits.

Our key innovation goes beyond this tradeoff: we introduce EntRGi, Entropy-aware Reward Guidance, which dynamically regulates the gradients from the reward model. By modulating the continuous relaxation with the model's confidence, our approach substantially improves reward guidance while providing reliable inputs to the reward model.

Method

EntRGi interpolates between soft embeddings and hard tokens based on entropy: lower entropy emphasizes continuous relaxation for gradient accuracy, while higher entropy relies on hard tokens via STE for reward model reliability. This adaptive weighting resolves the fundamental tension in gradient-based steering of discrete diffusion models.
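To make the two endpoints concrete, below is a minimal PyTorch sketch of the soft-embedding and STE pathways that EntRGi interpolates between. The names q, x, and E (predicted token distribution, sampled token id, and the reward model's embedding table) are illustrative assumptions, not the authors' code.

import torch

def soft_embedding(q: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Continuous relaxation: expected reward-model embedding under the
    predicted token distribution q (..., K), with embedding table E (K, d).
    Fully differentiable, but the reward model never saw such mixtures."""
    return q @ E

def ste_embedding(q: torch.Tensor, x: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Straight-through estimator: the forward pass uses the sampled hard
    token embedding E[x], while gradients flow through the soft embedding."""
    soft = q @ E
    hard = E[x]
    return soft + (hard - soft).detach()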

Concretely, at each masked position $l$, the input to the reward model is constructed as $$\hat{e}^l \;=\; \tilde{e}^l \;+\; \text{sg}\!\Big(w^l\,(\bar{e}^l - \tilde{e}^l)\Big),$$ where $\text{sg}(\cdot)$ is the stop-gradient operator, $\tilde{e}^l = \sum_i q_i^l\, \mathbf{E}^R_i$ is the soft embedding (continuous relaxation) under the predicted token distribution $\mathbf{q}^l$, $\bar{e}^l = \mathbf{E}^R[x^l]$ is the embedding of the sampled hard token $x^l$ in the reward model's embedding table $\mathbf{E}^R$, and $w^l = H(\mathbf{q}^l)/\log K$ is the entropy of $\mathbf{q}^l$ normalized by its maximum value $\log K$ over a vocabulary of size $K$, so that $w^l \in [0,1]$. When the model is confident ($w^l \!\approx\! 0$), the input reduces to the soft embedding, preserving gradient accuracy. When it is uncertain ($w^l \!\approx\! 1$), the input reverts to the STE, ensuring the reward model receives realistic discrete-token embeddings.
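A minimal sketch of this construction in PyTorch, assuming the entropy is computed from the diffusion model's softmax distribution at each masked position; function and variable names are illustrative, not the released implementation.

import math
import torch
import torch.nn.functional as F

def entrgi_embedding(logits: torch.Tensor, x: torch.Tensor, E: torch.Tensor) -> torch.Tensor:
    """Entropy-weighted interpolation at a masked position (sketch).

    logits : (..., K) diffusion-model logits at the masked position
    x      : (...)    sampled hard token ids
    E      : (K, d)   reward-model embedding table E^R
    """
    K = logits.shape[-1]
    q = F.softmax(logits, dim=-1)                       # q^l
    soft = q @ E                                        # tilde{e}^l = sum_i q_i^l E^R_i
    hard = E[x]                                         # bar{e}^l  = E^R[x^l]
    entropy = -(q * q.clamp_min(1e-12).log()).sum(-1)   # H(q^l)
    w = (entropy / math.log(K)).unsqueeze(-1)           # w^l in [0, 1]
    # Stop-gradient on the correction term: gradients always flow through
    # the soft embedding, while the forward input moves toward the hard
    # token embedding as the model's uncertainty grows.
    return soft + (w * (hard - soft)).detach()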

Figure 1. Overall pipeline of Entropy-aware Reward Guidance (EntRGi). The embeddings provided to the reward model at masked positions are constructed as an entropy-weighted interpolation between a continuous relaxation and sampled hard token embeddings.

Results

We use Dream-v0-Instruct-7B as the base diffusion language model and Skywork-Reward-V2-Qwen3-1.7B as the reward model, evaluated on three multi-skill benchmark suites: Reward-Bench-2, RM-Bench, and JudgeBench. These datasets contain prompts that measure fine-grained chatbot abilities such as precise instruction following, safety, factuality, and knowledge, with some coverage of math and code.

We report the maximum reward value across samples per prompt (Top@1) at sampling temperature $\tau\!=\!0.7$. We compare against Best-of-$N$ (BoN), a gradient-free baseline that samples $N$ independent trajectories and selects the one with the highest reward score.
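For reference, a minimal sketch of the BoN baseline; sample_fn and reward_fn are hypothetical stand-ins for the diffusion sampler and the reward-model scorer.

def best_of_n(prompt, sample_fn, reward_fn, n=8):
    """Gradient-free Best-of-N: draw n independent completions and keep
    the one the reward model scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    rewards = [reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: rewards[i])
    return candidates[best], rewards[best]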

Top@1 reward, EntRGi vs. BoN:

Benchmark         EntRGi   BoN
Reward-Bench-2    3.91     2.99
JudgeBench        2.44     1.65
RM-Bench          5.70     5.11

Examples

Each animation shows the same prompt processed with standard sampling (Base, top) and EntRGi-guided generation (bottom), highlighting differences in token selection order and final output quality.

Base vs EntRGi generation: The map was more accurate than expected
Base vs EntRGi generation: What if cats ruled the world?
Base vs EntRGi generation: Explain why the sky is blue to a 5-year-old
Base vs EntRGi generation: The last human on Earth heard a knock

Citation

@article{tejaswi2026entrgi,
  title   = {EntRGi: Entropy Aware Reward Guidance for Diffusion Language Models},
  author  = {Tejaswi, Atula and Rout, Litu and Caramanis, Constantine and 
             Shakkottai, Sanjay and Sanghavi, Sujay},
  journal = {arXiv preprint arXiv:XXXX.XXXXX},
  year    = {2026}
}