Preprint 2026

Entropy Aware Reward Guidance for Diffusion Language Model Alignment

EntRGi for steering · RGRL for post-training

Atula Tejaswi*, Litu Rout*, Constantine Caramanis, Sanjay Shakkottai, Sujay Sanghavi

The University of Texas at Austin

Abstract

Reward guidance, also known as posterior sampling, is a popular method for test-time adaptation and post-training in continuous diffusion models. In this paper, we study reward guidance for discrete diffusion language models, where a core obstacle arises: one cannot differentiate through the model's natural outputs because they are discrete tokens.

We introduce EntRGi: Entropy-aware Reward Guidance to address this issue. EntRGi dynamically interpolates between continuous token relaxations and sampled hard tokens, on a token-by-token basis, using the diffusion model's predictive entropy. We demonstrate that EntRGi maintains both reward model reliability and optimization accuracy, while existing approaches sacrifice one for the other.

We empirically validate our approach on 7B-parameter diffusion language models across two settings: (1) test-time adaptation, and (2) RGRL: Reward Guided Reinforcement Learning, our recipe for post-training on reward-guided data, showing consistent improvements over state-of-the-art methods.

Method

EntRGi interpolates between soft embeddings and hard tokens based on entropy: lower entropy emphasizes the continuous relaxation for gradient accuracy, while higher entropy relies on hard tokens via the straight-through estimator (STE) for reward model reliability. This adaptive weighting resolves the fundamental tension in gradient-based steering of discrete diffusion models.

Concretely, at each masked position $l$, the input to the reward model is constructed as: $$\hat{e}^l \;=\; \tilde{e}^l \;+\; \text{sg}\!\Big(w^l\,(\bar{e}^l - \tilde{e}^l)\Big)$$ where $\tilde{e}^l = \sum_i q_i^l\, \mathbf{E}^R_i$ is the soft embedding (continuous relaxation), $\bar{e}^l = \mathbf{E}^R[x^l]$ is the sampled hard token embedding, and $w^l = H(\mathbf{q}^l)/\log K$ normalizes the entropy coefficient $w^l$ to $[0,1]$. When the model is confident ($w \!\approx\! 0$), the input reduces to the soft embedding, preserving gradient accuracy. When uncertain ($w \!\approx\! 1$), it reverts to STE, ensuring the reward model receives realistic discrete tokens.
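The interpolation above can be sketched in a few lines of PyTorch. This is a minimal illustration of the formula, not the released implementation; the function name `entrgi_embed` and the tensor layout are our own assumptions.

```python
import torch
import torch.nn.functional as F

def entrgi_embed(logits, E_R, sampled_ids):
    """Entropy-weighted interpolation between soft and hard embeddings.

    logits:      (L, K) dLLM logits at the masked positions
    E_R:         (K, d) reward-model embedding table
    sampled_ids: (L,)   hard tokens sampled from softmax(logits)
    """
    q = F.softmax(logits, dim=-1)                    # q^l, shape (L, K)
    K = q.shape[-1]
    soft = q @ E_R                                   # soft embedding  e-tilde
    hard = E_R[sampled_ids]                          # hard embedding  e-bar
    # per-token entropy H(q^l), normalized to [0, 1] by log K
    H = -(q * torch.log(q.clamp_min(1e-12))).sum(-1)
    w = (H / torch.log(torch.tensor(float(K)))).unsqueeze(-1)
    # stop-gradient on the correction term: the forward pass sees the
    # interpolated input, but gradients flow only through the soft path
    return soft + (w * (hard - soft)).detach()
```

Note that at $w \approx 0$ the forward value is the soft embedding, while at $w \approx 1$ it equals the hard embedding, with gradients routed through the continuous relaxation in both regimes, exactly as the formula prescribes.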

Figure 1. Overview of Entropy-aware Reward Guidance (EntRGi). Given logits q from a diffusion language model (dLLM) at masked positions [m], the goal is to update them to maximize a reward R defined by a reward model. Prior work either compromises reward reliability (Murata et al., 2024; Tae et al., 2025), or gradient accuracy (Rout et al., 2025c). EntRGi addresses these limitations by constructing inputs as a dynamic entropy-weighted interpolation between continuous token embeddings and sampled hard tokens.

Results

We use Dream-v0-Instruct-7B as the base diffusion language model and Skywork-Reward-V2-Qwen3-1.7B as the reward model, evaluated on three multi-skill benchmarking suites: Reward-Bench-2, RM-Bench, and JudgeBench. These datasets contain prompts that measure fine-grained chatbot abilities such as precise instruction following, safety, factuality, and knowledge, with some coverage of math and code.

We report the maximum reward value across samples per prompt (Top@1) at sampling temperature $\tau\!=\!0.7$. We compare against Best-of-$N$ (BoN), a gradient-free baseline that samples $N$ independent trajectories and selects the one with the highest reward score.
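The Best-of-$N$ baseline is simple enough to state directly. A minimal sketch, assuming `sample_fn` draws one completion and `reward_fn` scores it (both names are illustrative):

```python
def best_of_n(sample_fn, reward_fn, prompt, n=8):
    """Gradient-free Best-of-N: draw n independent completions and
    keep the one the reward model scores highest (Top@1)."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    scores = [reward_fn(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]
```

Unlike reward guidance, BoN never updates the sampling trajectory itself; it only filters finished samples, which is why it serves as the gradient-free reference point.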

Test-Time Adaptation — Dream-v0-7B-Instruct · Skywork-Reward-V2-Qwen3-1.7B · Top@1 (τ = 0.7)

Benchmark         EntRGi   BoN
Reward-Bench-2    3.91     2.99
JudgeBench        2.44     1.65
RM-Bench          5.70     5.11

RGRL: Post-Training

RGRL generates completions via EntRGi or APS reward-gradient guidance, then iteratively updates the dLLM to maximize their likelihood. This dense reward feedback consistently outperforms diffu-GRPO, which relies only on scalar rewards. On WildChat-IF with Dream we observe up to a 70% relative improvement (+0.90 absolute) over diffu-GRPO.
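One RGRL iteration can be sketched as follows. This is a schematic of the recipe described above, under assumed interfaces: `guided_sample_fn` stands in for EntRGi or APS reward-gradient sampling, and `dllm.log_prob` for the model's completion log-likelihood; neither name is from the released code.

```python
def rgrl_step(dllm, guided_sample_fn, prompts, optimizer):
    """One RGRL update (sketch): generate completions with reward-gradient
    guidance, then fine-tune the dLLM to maximize their likelihood."""
    completions = [guided_sample_fn(dllm, p) for p in prompts]
    optimizer.zero_grad()
    # negative log-likelihood of the guided completions under the dLLM
    loss = -sum(dllm.log_prob(p, c)
                for p, c in zip(prompts, completions)) / len(prompts)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the guided completions carry the reward model's dense, per-token gradient signal, each likelihood update conveys more information than the scalar-reward updates used by diffu-GRPO.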

Figure 2. Training reward curves on WildChat-IF and lmsys-chat-1m with Skywork-Reward-V2-Qwen3-0.6B. RGRL-EntRGi and RGRL-APS consistently outperform diffu-GRPO across both Dream-v0-7B-Instruct and LLaDA-8B-Instruct (tokenizer-mismatched).

Examples

Each animation shows the same prompt processed with standard sampling (Base, top) and EntRGi-guided generation (bottom), highlighting differences in token selection order and final output quality.

Base vs EntRGi generation: The map was more accurate than expected
Base vs EntRGi generation: What if cats ruled the world?
Base vs EntRGi generation: Explain why the sky is blue to a 5-year-old
Base vs EntRGi generation: The last human on Earth heard a knock

Citation

@article{tejaswi2026entrgi,
  title   = {Entropy Aware Reward Guidance for Diffusion Language Model Alignment},
  author  = {Tejaswi, Atula and Rout, Litu and Caramanis, Constantine and 
             Shakkottai, Sanjay and Sanghavi, Sujay},
  journal = {arXiv preprint arXiv:2602.05000},
  year    = {2026}
}