SFT and RL Reweight Pretrained Distributions via Demonstration and Reward Signals

Why does it matter whether post‑training merely pulls out what’s already there or actually expands what a model can do? The debate over large language model fine‑tuning often collapses into two tidy boxes: supervised fine‑tuning (SFT) as pure imitation, reinforcement learning (RL) as pure discovery. That framing, however, glosses over a subtler question. Does a training step raise the odds of behaviors the pretrained model could already generate, or does it shift the frontier of what the model can practically reach?

The distinction sounds academic, but it has concrete implications for how researchers evaluate progress. The paper's authors propose “accessible support” (the set of behaviors a model can produce given realistic compute and data budgets) as the yardstick. Reweighting probabilities inside that set counts as capability elicitation; expanding the set itself counts as capability creation.

To make the case, they lean on a free‑energy perspective, treating post‑training as a process that can either reshape existing distributions or alter the underlying support. The paper argues that future work should keep the two phenomena separate, lest we conflate extraction with genuine extension.

SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.
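
The reweighting view can be made concrete with a toy example. What follows is a minimal sketch, assuming the standard KL-regularized form in which the post-trained distribution is proportional to pi_ref(y) * exp(-E(y) / beta); the article does not spell out the paper's exact formulation, so this form is an assumption. The behavior names, probabilities, energies, and beta value are all hypothetical.

```python
import math


def reweight(ref_probs, energies, beta):
    """Tilt a reference distribution toward low-energy behaviors.

    Assumed form (not confirmed by the source):
        pi(y) is proportional to ref_probs[y] * exp(-energies[y] / beta)
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Only behaviors the reference model can already produce are in the support.
    support = [y for y, p in ref_probs.items() if p > 0]
    # Subtract the lowest in-support energy so the exponentials stay bounded.
    e_min = min(energies[y] for y in support)
    weights = {
        y: (p * math.exp(-(energies[y] - e_min) / beta) if p > 0 else 0.0)
        for y, p in ref_probs.items()
    }
    z = sum(weights.values())
    return {y: w / z for y, w in weights.items()}


# Hypothetical pretrained distribution over four behaviors.
# "tool_call" has zero probability: it lies outside the model's support.
ref = {"guess": 0.70, "partial_proof": 0.25, "full_proof": 0.05, "tool_call": 0.0}

# Hypothetical energies. Lower means more preferred by the external signal,
# whether that signal comes from demonstrations (SFT) or from rewards (RL).
energy = {"guess": 2.0, "partial_proof": 1.0, "full_proof": 0.0, "tool_call": -5.0}

post = reweight(ref, energy, beta=0.5)
for name, prob in post.items():
    print(f"{name:>14}: {ref[name]:.3f} -> {prob:.3f}")
```

In this toy setting the rare but preferred behavior ("full_proof") gains probability mass, while "tool_call" stays at exactly zero even though it has the lowest energy. Multiplying by a positive weight cannot create mass where the reference has none; reaching such a behavior would require changing the support itself, which is the elicitation versus creation distinction the paper draws.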

Why this matters

We see a nuanced view emerging: supervised fine‑tuning (SFT) and reinforcement learning (RL) are not fundamentally different tricks but two ways of reweighting the same pretrained distribution, each guided by a distinct external signal. Demonstrations carve out low‑energy regions for SFT; reward functions do the same for RL. This framing challenges the common narrative that SFT merely imitates while RL “discovers” new abilities.

Yet the authors caution that when updates stay near the original model, the primary impact is a modest reshaping of behavior rather than a wholesale capability expansion. It remains unclear how far reweighting can push a model beyond what the base already encodes, especially as the distance from the reference distribution grows. For developers, this suggests that the choice of signal (demonstration versus reward) may matter more than the algorithmic label attached to it.
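
One way to picture "distance from the reference distribution" is the KL divergence between the reweighted distribution and the base. The sketch below is illustrative only: it assumes an exponential-tilting form, pi(y) proportional to pi_ref(y) * exp(-E(y) / beta), which the article does not confirm as the paper's exact setup, and the probabilities, energies, and beta values are made up.

```python
import math


def tilt(ref_probs, energies, beta):
    """Return pi(y) proportional to ref_probs[y] * exp(-energies[y] / beta)."""
    e_min = min(energies)  # shift energies for numerical stability
    weights = [p * math.exp(-(e - e_min) / beta) for p, e in zip(ref_probs, energies)]
    z = sum(weights)
    return [w / z for w in weights]


def kl(p, q):
    """KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical reference distribution and energies over three behaviors.
ref_probs = [0.70, 0.25, 0.05]
energies = [2.0, 1.0, 0.0]

# Smaller beta means a weaker pull toward the reference, i.e. a larger update.
betas = [4.0, 1.0, 0.25, 0.05]
distances = [kl(tilt(ref_probs, energies, b), ref_probs) for b in betas]

# Upper bound on the distance any pure reweighting can reach:
# log(1 / p_min), with p_min the smallest nonzero reference probability.
ceiling = math.log(1.0 / min(p for p in ref_probs if p > 0))

for b, d in zip(betas, distances):
    print(f"beta={b:<5} KL(pi || pi_ref)={d:.4f}")
print(f"ceiling={ceiling:.4f}")
```

In this toy case the distance grows as beta shrinks, but it never exceeds log(1 / p_min): however hard the signal pushes, pure reweighting can at most concentrate all mass on a behavior the reference already supports.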

Researchers should probe the boundary where reweighting transitions from elicitation to genuine capability creation, a line the paper flags but does not yet map. Our takeaway: the debate over SFT versus RL needs finer granularity, and practical outcomes will depend on how tightly future updates cling to the pretrained core.
