
Brumby-14B-Base Qwen3 variant uses Power Retention, avoids full training cost

The first thing that jumps out about the 14-billion-parameter Brumby-14B-Base is the price tag: a budget far smaller than a full pre-training run would normally demand. The Qwen3 architecture it starts from is often praised for raw capability, but Brumby takes a different route. Instead of spending months on full-scale pre-training, its team reused the weights of the earlier-generation model and then retrained them under the new architecture.
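
To make the weight-reuse idea concrete, here is a minimal, hypothetical sketch in PyTorch - not Manifest AI's code. A new attention-free block copies the projection matrices of a pretrained transformer-style block, and only its genuinely new parameters stay randomly initialized. The module names, the toy dimensions, and the simplified linear recurrence standing in for Power Retention are all assumptions made for illustration.

```python
# Hypothetical sketch of reusing a pretrained transformer block's weights to
# initialize an attention-free block, then fine-tuning instead of pre-training.
# The recurrence below is a generic linear-state update, NOT the published
# Power Retention formulation.
import torch
import torch.nn as nn

DIM = 64  # toy hidden size

class AttentionBlock(nn.Module):
    """Stand-in for one pretrained transformer layer (e.g. a Qwen3 block)."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)

class RetentionBlock(nn.Module):
    """Attention-free stand-in: same projections, but a recurrent state update
    replaces softmax attention."""
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim, bias=False)
        self.k_proj = nn.Linear(dim, dim, bias=False)
        self.v_proj = nn.Linear(dim, dim, bias=False)
        self.o_proj = nn.Linear(dim, dim, bias=False)
        self.decay = nn.Parameter(torch.tensor(0.9))  # new parameter, trained from scratch

    def forward(self, x):  # x: (batch, seq, dim)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        state = torch.zeros(x.size(0), x.size(-1), x.size(-1), device=x.device)
        outs = []
        for t in range(x.size(1)):  # linear-time recurrence instead of attention
            state = self.decay * state + k[:, t].unsqueeze(-1) * v[:, t].unsqueeze(-2)
            outs.append(torch.einsum("bd,bde->be", q[:, t], state))
        return self.o_proj(torch.stack(outs, dim=1))

pretrained = AttentionBlock(DIM)   # imagine this holds the earlier generation's weights
new_block = RetentionBlock(DIM)

# Copy every tensor whose name and shape match; parameters with no counterpart
# (here, `decay`) remain randomly initialized and are learned during fine-tuning.
missing, unexpected = new_block.load_state_dict(pretrained.state_dict(), strict=False)
print("reused tensors:", len(pretrained.state_dict()) - len(unexpected), "| new:", missing)

y = new_block(torch.randn(2, 8, DIM))  # (batch=2, seq=8, dim) -> same shape
print("output shape:", tuple(y.shape))
```

In a real retraining run, the copied weights would then be adapted briefly on language-modeling data, which is the step the article credits with closing the gap to the Qwen3 baseline.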

This “Power Retention” approach appears to get the model to performance levels surprisingly close to a from-scratch run while avoiding most of the cost of starting over. Buckman, one of the project’s leads, thinks this could change how fast new models reach researchers: he calls building on existing weights “a critical accelerant” for the adoption of a new modeling paradigm. In other words, the savings aren’t just a bonus - they’re the reason Brumby can exist at that price.

"Brumby could not be trained from scratch for that price." Still, Buckman emphasized the significance of that result: "The reason this is important is that the ability to build on the weights of the previous generation of model architectures is a critical accelerant for the adoption of a new modeling paradigm." He argues this demonstrates how attention-free systems can catch up to transformer performance "for orders-of-magnitude less" investment. In the loss curves released by Manifest AI, Brumby's training loss quickly converges to that of the Qwen3 baseline within 3,000 training steps, even as the architecture diverges significantly from its transformer origins.

Related Topics: #Brumby-14B-Base #Qwen3 #PowerRetention #transformer #attention-free #fine-tuning #pre-training #ManifestAI

Can a single technique really trim the bill? Brumby-14B-Base, a Qwen3 offshoot, leans on Power Retention to dodge the cost of a full-scale training run. It re-uses its predecessor’s weights - something Buckman calls a critical accelerant for getting new models into the wild.

The write-up points out, though, that Brumby couldn’t be trained from scratch at that price, which highlights the very budget wall Power Retention hopes to knock down. The method starts from weights trained with the attention backbone that has powered every major LLM since 2017, then swaps that attention out for Power Retention layers, so it stays grounded in familiar territory even as the architecture changes. Still, the wider ripple effect is hazy; we don’t know whether other teams will copy the approach or whether hidden performance hits will surface.

It’s a handy shortcut, yet without third-party benchmarks the promised savings remain unverified. On the numbers released so far, the model looks like it can match the baseline’s capability while skipping the massive training tab. Whether this reshapes the whole development pipeline or stays a niche trick is something I’m still watching.

Common Questions Answered

What is the Power Retention method used in Brumby-14B-Base?

Power Retention is the attention-free mechanism that replaces the transformer’s attention layers in Brumby-14B-Base. Because the rest of the model stays close to its Qwen3 parent, the team could re-use the earlier generation’s weights and fine-tune them instead of running a full-scale pre-training, dramatically reducing computational cost while preserving performance.
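
As a rough, assumption-laden sketch of what "fine-tune them" means in practice (a toy stand-in, not the actual Brumby recipe): copy the reusable tensors into the new model, then run a comparatively short optimization pass with a modest learning rate instead of a full pre-training schedule.

```python
# Toy illustration of "fine-tune instead of pre-train from scratch".
# Model, data, and objective are stand-ins; none of this is Manifest AI's code.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(64, 128, bias=False), nn.ReLU(), nn.Linear(128, 64, bias=False))

# Pretend these tensors were copied from the earlier-generation checkpoint.
reused = {"0.weight": torch.randn(128, 64) * 0.02, "2.weight": torch.randn(64, 128) * 0.02}
model.load_state_dict(reused, strict=False)  # anything not in `reused` stays randomly initialized

# Short adaptation run with a small learning rate, rather than a long
# pre-training schedule that starts from random initialization.
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(300):
    x = torch.randn(32, 64)
    loss = nn.functional.mse_loss(model(x), x)  # stand-in objective, not a language-modeling loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"toy loss after the short fine-tune: {loss.item():.4f}")
```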

Why does Buckman consider Power Retention a critical accelerant for new model adoption?

Buckman argues that building on the weights of a previous architecture speeds up development and lowers financial barriers, enabling faster adoption of novel modeling paradigms. This accelerant effect is especially important for attention‑free systems trying to match transformer performance.

How does Brumby-14B-Base compare to training a Qwen‑3 model from scratch in terms of cost?

The article states that Brumby‑14B‑Base could not be trained from scratch for the same price, highlighting the prohibitive expense of full‑scale training. Power Retention allows the model to achieve comparable results at a fraction of the cost.

What does the article suggest about the performance of attention‑free systems versus transformers?

It suggests that attention‑free systems, when using techniques like Power Retention, can catch up to transformer performance while requiring orders‑of‑magnitude less investment. This challenges the notion that transformers are the only viable high‑performance architecture.