SFT and RL Reweight Pretrained Distributions via Demonstration and Reward Signals

By AI Daily Post Edited by Brian Petersen, Editor-in-Chief

May 12, 2026 • 2 min read

Why does it matter whether post‑training merely pulls out what’s already there or actually expands what a model can do? The debate over large language model fine‑tuning often collapses into two tidy boxes: supervised fine‑tuning (SFT) as pure imitation, reinforcement learning (RL) as pure discovery. That framing, however, glosses over a subtler question. Does a training step raise the odds of behaviors the pretrained model could already generate, or does it shift the frontier of what the model can practically reach?

The distinction sounds academic, but it has concrete implications for how researchers evaluate progress. The paper's authors propose “accessible support” (the set of behaviors a model can produce given realistic compute and data budgets) as the yardstick. Reweighting probabilities inside that set counts as capability elicitation; expanding the set itself counts as capability creation.
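
One way to make "accessible support" concrete is to ask which behaviors show up at least once within a fixed sampling budget. The Python sketch below is only an illustration under assumptions of our own, not the paper's definition: the toy behavior names, the `min_hit` threshold, and the helpers `hit_probability` and `accessible_support` are all hypothetical.

```python
def hit_probability(p: float, n_samples: int) -> float:
    """Chance that a behavior with per-sample probability p appears
    at least once in n_samples independent draws."""
    return 1.0 - (1.0 - p) ** n_samples


def accessible_support(probs: dict[str, float], n_samples: int,
                       min_hit: float = 0.5) -> set[str]:
    """Behaviors likely to be seen at least once within the budget.

    Assumption: "accessible" means the hit probability reaches min_hit.
    """
    return {b for b, p in probs.items()
            if hit_probability(p, n_samples) >= min_hit}


# Toy base-model distribution over four behaviors (made-up numbers).
base = {"common": 0.9, "rare": 0.0999, "very_rare": 1e-4, "absent": 0.0}

small_budget = accessible_support(base, n_samples=10)
large_budget = accessible_support(base, n_samples=100_000)

print(sorted(small_budget))
print(sorted(large_budget))
```

In this toy setup a larger budget brings the very rare behavior within reach, but no budget reaches a behavior the base model assigns zero probability. That gap is what the article calls the difference between elicitation and creation.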

To make the case, they lean on a free‑energy perspective, treating post‑training as a process that can either reshape existing distributions or alter the underlying support. The paper argues that future work should keep the two phenomena separate, lest we conflate extraction with genuine extension.

SFT and RL can both be seen as reweighting a pretrained reference distribution, only with different external signals. Demonstration signals define low-energy behavior for SFT, and reward signals define low-energy behavior for RL. When the update remains close to the base model, the main effect is local reweighting, not capability creation. Within this framework, the central question is no longer whether post-training is framed as SFT or RL, but whether it reweights behaviors already within reach, or instead expands the model's reachable behavioral space through search, interaction, tool use, or the incorporation of new information.

On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective - ArXiv AI (cs.AI)
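
The reweighting view quoted above can be illustrated with a toy distribution. The sketch below assumes the standard Gibbs form, in which the updated probability of a behavior is proportional to its reference probability times exp(-energy / beta); the article does not give the paper's exact equations, so the function names, the `beta` temperature, and the toy numbers are assumptions for illustration only. The energy could stand for a demonstration mismatch (SFT) or a negative reward (RL).

```python
import math


def reweight(ref: dict[str, float], energy: dict[str, float],
             beta: float) -> dict[str, float]:
    """Return p(y) proportional to ref(y) * exp(-energy(y) / beta)."""
    weights = {y: p * math.exp(-energy[y] / beta) for y, p in ref.items()}
    z = sum(weights.values())  # normalizing constant
    return {y: w / z for y, w in weights.items()}


def kl(p: dict[str, float], q: dict[str, float]) -> float:
    """KL(p || q), skipping entries where p is zero."""
    return sum(pi * math.log(pi / q[y]) for y, pi in p.items() if pi > 0)


# Toy reference model: behavior "c" has zero probability.
ref = {"a": 0.7, "b": 0.3, "c": 0.0}
# The external signal prefers "b" and "c" (low energy) over "a".
energy = {"a": 1.0, "b": 0.0, "c": 0.0}

mild = reweight(ref, energy, beta=10.0)   # update stays close to ref
sharp = reweight(ref, energy, beta=0.1)   # update moves far from ref

print(mild, kl(mild, ref))
print(sharp, kl(sharp, ref))
```

Both updates shift probability toward "b", and the sharper one moves further from the reference, but "c" stays at zero in either case. Under this assumed form, reweighting alone cannot reach behavior outside the reference model's support, however strongly the signal favors it.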

Why this matters

We see a nuanced view emerging: supervised fine‑tuning (SFT) and reinforcement learning (RL) are not fundamentally different tricks but two ways of reweighting the same pretrained distribution, each guided by a distinct external signal. Demonstrations carve out low‑energy regions for SFT; reward functions do the same for RL. This framing challenges the common narrative that SFT merely imitates while RL “discovers” new abilities.

Yet the authors caution that when updates stay near the original model, the primary impact is a modest reshaping of behavior rather than a wholesale capability expansion. It remains unclear how far reweighting can push a model beyond what the base already encodes, especially as the distance from the reference distribution grows. For developers, this suggests that careful choice of signal—demonstration versus reward—may matter more than the algorithmic label attached to it.

Researchers should probe the boundary where reweighting transitions from elicitation to genuine capability creation, a line the paper flags but does not yet map. Our takeaway: the debate over SFT versus RL needs finer granularity, and practical outcomes will depend on how tightly future updates cling to the pretrained core.
