QPILOTS Offers Test‑Time Q‑Steering for Flow Policies, Avoiding Gradient Loss

Flow‑matching and diffusion‑based policies can generate rich action sequences, yet pulling them into temporal‑difference reinforcement learning has proved tricky. The core obstacle is that the critic’s gradient, which should guide policy improvement, becomes numerically volatile when back‑propagated through the many denoising steps required to turn noise into a clean action. Prior work sidesteps the issue in one of three ways: ignoring the gradient, collapsing the multi‑step process into a single‑shot actor, or continuously re‑training the denoiser as the critic evolves.

QPILOTS takes a different tack: it keeps the original generative policy intact and adjusts the denoising trajectory only at inference time. The approach comes in two flavors: QPILOTS‑U, which relies on a fast single‑point estimate of the clean action, and QPILOTS‑M, which draws differentiable posterior samples via a learned auxiliary network. On a standard offline‑to‑online RL suite, the method tops the leaderboard with a 90% average success rate across 50 tasks.

It also proves effective when paired with a frozen Vision‑Language Action foundation model, matching or beating existing inference‑time techniques on six manipulation benchmarks.

Existing methods work around this instability in one of three ways: discarding gradient information, distilling the policy into a simpler one-step actor, or repeatedly fine-tuning the denoising policy as the critic improves. We propose QPILOTS, a method that leaves the original policy unmodified and steers the denoising process at inference time. At each denoising step, instead of evaluating the critic on the noisy intermediate action, where critic predictions are unreliable, we first project that intermediate state to an estimate of the final clean action and compute the critic gradient there. We introduce two variants: QPILOTS-U uses a fast single-point approximation, while QPILOTS-M draws differentiable posterior samples via a learned auxiliary network.
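To make the "project first, then take the critic gradient" idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation. The Gaussian toy policy (`MU`, `SIGMA`), the quadratic critic (`A_STAR`, `q_value`, `q_grad`), the `guidance` scale, and the additive update rule are all assumptions made for illustration. The sketch also ignores the Jacobian of the projection, so it is at best in the spirit of a single-point estimate rather than a faithful reproduction of either variant.

```python
# Toy setup (all values are illustrative assumptions, not from the paper).
MU = [0.0, 0.0]        # mean of the frozen toy policy's action distribution
SIGMA = 0.5            # standard deviation of that distribution
A_STAR = [1.0, 1.0]    # action the toy critic prefers


def velocity(x, t):
    """Analytic flow-matching velocity for the path
    x_t = (1 - t) * noise + t * action, with noise ~ N(0, I) and
    action ~ N(MU, SIGMA^2 I) drawn independently. Stands in for a
    trained, frozen policy network."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    coef = (t * SIGMA ** 2 - (1.0 - t)) / var_t
    return [m + coef * (xi - t * m) for xi, m in zip(x, MU)]


def q_value(a):
    """Toy critic: highest at A_STAR."""
    return -sum((ai - si) ** 2 for ai, si in zip(a, A_STAR))


def q_grad(a):
    """Analytic gradient of the toy critic with respect to the action."""
    return [-2.0 * (ai - si) for ai, si in zip(a, A_STAR)]


def sample(noise, steps=20, guidance=0.0):
    """Euler integration of the flow from noise (t=0) to an action (t=1).
    With guidance > 0, each step adds the critic gradient evaluated at a
    projected clean-action estimate, not at the noisy intermediate x."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t)
        # One-step projection of the noisy intermediate action to t = 1.
        a_hat = [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]
        g = q_grad(a_hat)
        # Assumed update: policy velocity plus scaled critic gradient.
        x = [xi + dt * (vi + guidance * gi) for xi, vi, gi in zip(x, v, g)]
    return x


noise = [0.3, -0.8]                      # fixed starting noise for comparison
plain = sample(noise)                    # unmodified policy
steered = sample(noise, guidance=0.3)    # same policy, steered at test time
print("Q without steering:", q_value(plain))
print("Q with steering:   ", q_value(steered))
```

Setting `guidance=0.0` recovers the unmodified sampler, which mirrors the point that the policy itself is never retrained: only the sampling trajectory changes.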

Why this matters

QPILOTS shows we can keep a diffusion‑style policy intact while nudging its actions at test time, a detail that could ease integration for teams already invested in flow‑matching generators. The method sidesteps the instability of back‑propagating through multi‑step denoising, which earlier approaches handled by dropping the critic’s gradient, compressing the policy into a one‑step actor, or repeatedly fine‑tuning as the critic evolves. For developers, this means a potentially lighter engineering burden: no need to redesign the policy architecture or maintain a separate distilled actor.

Founders may appreciate that the original model remains usable, preserving any performance gains from extensive pre‑training. Researchers, however, should note that the proposal leaves open how well the steering works across diverse tasks or whether it scales when critics become more complex. Is the test‑time steering robust enough for safety‑critical applications?

It remains unclear whether the approach will generalise beyond the benchmarks reported. We’ll watch how the community validates QPILOTS in broader settings.
