
Google's nested learning, based on the brain's fast‑slow circuits, curbs LLM forgetting


Google’s latest tweak to large language models tackles a problem that’s been nagging researchers for years: when a model learns new data, it often overwrites what it already knows. The issue, known as catastrophic forgetting, can make an LLM’s performance wobble after each update, forcing engineers to retrain from scratch or resort to cumbersome workarounds. In its Nested Learning paper, presented at NeurIPS 2025, the team proposes a hierarchical training scheme that layers a fast‑learning component on top of a slower, more stable one.

The approach promises to keep previously acquired knowledge intact while still absorbing fresh information. But why look to biology for a solution? Neuroscience has long shown that the brain processes information on multiple timescales, preserving essential patterns while letting the rest fade.

That observation underpins the new method, suggesting that mimicking the brain’s timing could give AI a steadier memory. The sections below explain how the researchers translate those neural principles into code.

How nested learning borrows from the brain

Like many machine learning advances, nested learning is inspired by neuroscience. The brain runs at different speeds: fast circuits handle the present, slower ones consolidate important patterns into long-term memory. Most experiences fade quickly; only a few become lasting memories, thanks to neuroplasticity, the brain's ability to rewire itself while preserving essential information.

The authors contrast this with current LLMs, whose knowledge remains limited to their context window or static pretraining. Nested learning treats every part of an AI model, including the optimizer and training algorithm, as memory. Backpropagation stores links between data and errors, and the optimizer's state, like momentum, acts as memory too.
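The idea of treating optimizer state and update frequency as forms of memory can be made concrete with a toy example. The sketch below is an illustration of the general multi-timescale idea, not Google's implementation; the fast/slow split, learning rates, and period are assumptions. A "fast" weight block updates every step, with a momentum buffer acting as short-term memory of past gradients, while a "slow" block updates only every tenth step with a small learning rate, so it changes gradually and retains older structure.

```python
# Minimal sketch (not Google's code) of multi-timescale updates:
# a fast block adapts every step, a slow block only every SLOW_PERIOD steps.
import numpy as np

rng = np.random.default_rng(0)

fast_w = rng.normal(size=4)        # hypothetical fast-adapting weights
slow_w = rng.normal(size=4)        # hypothetical slow, consolidating weights

FAST_LR, SLOW_LR = 0.1, 0.01
SLOW_PERIOD = 10                   # slow weights update 10x less often

momentum = np.zeros_like(fast_w)   # optimizer state: itself a kind of memory

def grad(w, x, y):
    """Gradient of squared error for a toy linear model y ~ w @ x."""
    return 2 * (w @ x - y) * x

for step in range(100):
    x = rng.normal(size=4)
    y = x.sum()                        # toy target

    g = grad(fast_w + slow_w, x, y)    # both blocks contribute to the prediction

    # Fast path: updated every step; momentum accumulates recent gradients.
    momentum = 0.9 * momentum + g
    fast_w -= FAST_LR * momentum

    # Slow path: updated rarely and with a small step, so older structure persists.
    if (step + 1) % SLOW_PERIOD == 0:
        slow_w -= SLOW_LR * g
```

The point of the toy is only the timing: the slowly updated block is disturbed far less by any single batch, which is the intuition behind giving a model "temporal depth."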

The Continuum Memory System (CMS) splits memory into modules that update at different rates, giving the model temporal depth.

HOPE: Nested Learning in practice

Google's HOPE architecture puts this to work. HOPE uses long-term memory modules called Titans, which store information based on how surprising it is to the model.

It layers different types of memory and uses CMS blocks for larger context windows. Fast layers process live input, slower layers distill what's important for long-term storage, and the system can adapt its update rules as it learns. This goes beyond the typical "pretrain and freeze" model.
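To illustrate the surprise-based flavor of Titans-style storage, here is a minimal sketch under assumed names; the class, threshold, and learning rate are hypothetical and not taken from the paper. Every input passes through a fast predictor that always adapts a little, but an item is written into the long-term store only when the prediction error, standing in for "surprise," exceeds a threshold.

```python
# Minimal sketch (not the HOPE code) of a surprise-gated long-term store:
# the fast path handles every input, the slow store keeps only surprising items.
import numpy as np

rng = np.random.default_rng(1)

class SurpriseGatedMemory:
    def __init__(self, dim, threshold=0.5):
        self.threshold = threshold
        self.keys, self.values = [], []   # slow, long-term store
        self.w = np.zeros(dim)            # fast predictor, updated every step

    def step(self, x, y):
        pred = self.w @ x
        surprise = abs(y - pred)          # how unexpected this input is

        # Fast path: always adapt slightly to the current input.
        self.w += 0.05 * (y - pred) * x

        # Slow path: only sufficiently surprising items are consolidated.
        if surprise > self.threshold:
            self.keys.append(x.copy())
            self.values.append(y)
        return surprise

mem = SurpriseGatedMemory(dim=8)
for _ in range(200):
    x = rng.normal(size=8)
    y = x[:4].sum()                       # toy target
    mem.step(x, y)

print(f"{len(mem.keys)} of 200 inputs were surprising enough to store")
```

As the fast predictor improves, fewer inputs clear the surprise threshold, so the long-term store grows selectively rather than being overwritten by every new batch.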

The team tested HOPE on language modeling and reasoning. With models at 1.3 billion parameters trained on 100 billion tokens, HOPE outperformed Transformer++ and newer models like RetNet and DeltaNet.


Can a brain‑inspired architecture really keep a language model from forgetting? Google’s nested learning proposes exactly that, borrowing the fast‑slow circuit motif from neuroscience to separate immediate processing from longer‑term consolidation. The NeurIPS 2025 paper points out that current LLMs store nothing beyond the context window or the static pre‑training weights, and that widening the window or retraining periodically merely postpones the loss of information, akin to treating amnesia with a bandage.

By embedding a slower‑learning module, the authors aim to create a durable memory trace that survives subsequent updates. Early experiments suggest the approach mitigates catastrophic forgetting, yet the evidence is limited to the authors’ benchmarks. It is unclear whether the method scales to the diverse tasks and data streams encountered in real‑world deployment.

Moreover, the long‑term computational cost of maintaining dual learning rates has not been quantified. The concept is intriguing, but further independent validation will be needed before its practical impact can be judged.


Common Questions Answered

How does Google's nested learning architecture aim to prevent catastrophic forgetting in LLMs?

Nested learning introduces a hierarchical training scheme that separates fast, immediate processing from slower, long‑term consolidation, mirroring the brain's fast‑slow circuits. By consolidating important patterns into a stable memory layer, the model retains previously learned knowledge while still integrating new data.

What neuroscience concepts inspired the design of nested learning for large language models?

The approach draws on the brain's fast‑slow circuit motif, where rapid neural activity handles present stimuli and slower circuits consolidate lasting memories through neuroplasticity. This analogy guides the separation of short‑term updates from long‑term knowledge storage in LLMs.

Why do widening the context window or periodic retraining only postpone information loss in current LLMs?

Current LLMs store information solely in the context window and static pre‑training weights, so expanding the window or retraining merely delays the inevitable overwrite of older knowledge. Without a dedicated long‑term memory mechanism, these tactics cannot fundamentally stop catastrophic forgetting.

What evidence does the NeurIPS 2025 paper provide about the effectiveness of nested learning?

The NeurIPS 2025 paper reports that models using nested learning maintain higher performance on previously learned tasks after successive updates, compared to baseline LLMs that suffer noticeable degradation. These results suggest that the fast‑slow consolidation strategy successfully mitigates forgetting.