SFT and RL Reweight Pretrained Distributions via Demonstration and Reward Signals
Why does it matter whether post‑training merely pulls out what’s already there or actually expands what a model can do? The debate over large language model fine‑tuning often collapses into two tidy boxes: supervised fine‑tuning (SFT) as pure imitation, reinforcement learning (RL) as pure discovery. That framing, however, glosses over a subtler question. Does a training step raise the odds of behaviors the pretrained model could already generate, or does it shift the frontier of what the model can practically reach?
While the distinction sounds academic, it has concrete implications for how researchers evaluate progress. The authors propose “accessible support” – the set of behaviors a model can produce given realistic compute and data budgets – as the yardstick. Reweighting probabilities inside that set counts as capability elicitation; expanding the set itself counts as capability creation.
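One rough way to picture accessible support is through a sampling budget. The sketch below is our own illustration, not the paper's definition: it assumes a behavior has a fixed per-sample probability `p`, that a budget of `k` independent samples stands in for "realistic compute", and that a behavior counts as accessible when at least one of those samples is likely to show it. The function names and the 0.5 threshold are hypothetical.

```python
def hit_probability(p: float, k: int) -> float:
    """Chance that at least one of k independent samples shows a behavior
    whose per-sample probability is p."""
    return 1.0 - (1.0 - p) ** k


def is_accessible(p: float, k: int, threshold: float = 0.5) -> bool:
    """Illustrative criterion (an assumption, not the paper's): a behavior is
    'accessible' if a budget of k samples surfaces it with probability of at
    least `threshold`."""
    return hit_probability(p, k) >= threshold


if __name__ == "__main__":
    budget = 1000  # assumed sampling budget
    for p in (1e-3, 1e-9, 0.0):
        print(p, is_accessible(p, budget))
```

Under this toy criterion, raising `p` for a behavior that is rare but present can bring it inside the accessible set for a fixed budget, while a behavior with `p = 0` stays out of reach at any budget.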
To make the case, they lean on a free‑energy perspective, treating post‑training as a process that can either reshape existing distributions or alter the underlying support. The paper argues that future work should keep the two phenomena separate, lest we conflate extraction with genuine extension.
SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
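The reweighting picture can be made concrete with the standard KL-regularized form, in which the updated distribution is proportional to the reference probability times exp(-E / beta). The sketch below is our illustration and assumes the paper's free-energy view reduces to this form; the toy behaviors, their energies, and the temperature `beta` are invented for the example.

```python
import math


def reweight(ref_probs: dict, energies: dict, beta: float) -> dict:
    """Tilt a reference distribution toward low-energy behaviors:
    pi(y) is proportional to ref(y) * exp(-E(y) / beta)."""
    weights = {y: p * math.exp(-energies[y] / beta) for y, p in ref_probs.items()}
    z = sum(weights.values())  # normalizing constant
    return {y: w / z for y, w in weights.items()}


# Toy reference distribution over four behaviors (assumed values).
# "c" is rare but present in the base model; "d" has zero probability.
ref = {"a": 0.90, "b": 0.0999, "c": 0.0001, "d": 0.0}

# Energies set by an external signal (demonstrations or rewards); lower is better.
energy = {"a": 2.0, "b": 1.0, "c": 0.0, "d": 0.0}

tilted = reweight(ref, energy, beta=0.5)

if __name__ == "__main__":
    for y in ref:
        print(y, ref[y], tilted[y])
```

In this toy setting the rare behavior "c" gains probability mass, but "d" stays at exactly zero however low its energy: reweighting alone cannot add a behavior that the reference distribution never produces.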
Why this matters
We see a nuanced view emerging: supervised fine‑tuning (SFT) and reinforcement learning (RL) are not fundamentally different tricks but two ways of reweighting the same pretrained distribution, each guided by a distinct external signal. Demonstrations carve out low‑energy regions for SFT; reward functions do the same for RL. This framing challenges the common narrative that SFT merely imitates while RL “discovers” new abilities.
Yet the authors caution that when updates stay near the original model, the primary impact is a modest reshaping of behavior rather than a wholesale capability expansion. It remains unclear how far reweighting can push a model beyond what the base already encodes, especially as the distance from the reference distribution grows. For developers, this suggests that the choice of signal (demonstration versus reward) may matter more than the algorithmic label attached to it.
Researchers should probe the boundary where reweighting gives way to genuine capability creation, a line the paper flags but does not yet map. Our takeaway: the SFT-versus-RL debate needs finer granularity, and practical outcomes will depend on how close future updates stay to the pretrained reference distribution.