Google Cloud AI launches ReasoningBank with MaTTS memory-aware scaling
Google Cloud AI’s research group has unveiled ReasoningBank, a new framework designed to capture how large language model agents succeed—or stumble—when they reason. By cataloguing the steps that lead to correct answers and the missteps that cause failures, the system builds a reusable memory of strategies that can be queried later. The aim is to make future agents more efficient, especially when they face tasks that demand multi-step logic.
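The paper's exact memory schema is not reproduced here, but the core idea—a store of distilled strategies, drawn from both successes and failures, that an agent can query at run time—can be sketched roughly as follows. All class and field names are illustrative assumptions, not the authors' API, and the keyword-overlap retrieval stands in for whatever embedding-based lookup the real system uses:

```python
from dataclasses import dataclass

@dataclass
class MemoryItem:
    """One distilled reasoning strategy, learned from a success or a failure."""
    title: str          # short name for the strategy
    description: str    # when the strategy applies
    content: str        # the actionable lesson itself
    from_success: bool  # lessons are kept from failed trajectories too

class ReasoningBankSketch:
    """Toy strategy store: keyword-overlap retrieval stands in for embeddings."""
    def __init__(self) -> None:
        self.items: list[MemoryItem] = []

    def add(self, item: MemoryItem) -> None:
        self.items.append(item)

    def retrieve(self, query: str, k: int = 3) -> list[MemoryItem]:
        # Rank stored strategies by word overlap with the new task description.
        words = set(query.lower().split())
        ranked = sorted(
            self.items,
            key=lambda m: len(words & set(m.description.lower().split())),
            reverse=True,
        )
        return ranked[:k]
```

The point of the sketch is the shape of the data, not the retrieval mechanics: each item records both *when* it applies and *what* to do, and failures are first-class entries rather than discarded runs.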
While the concept of storing reasoning traces isn’t new, the team’s emphasis on a structured “memory” that can be queried at run‑time sets this work apart. That raises a practical question: can such a memory be coupled with the dynamic scaling techniques already used to boost performance on math and coding problems? The answer, according to the authors, lies in a method that ties ReasoningBank to test‑time compute adjustments, promising a tighter link between stored knowledge and on‑the‑fly resource allocation.
MaTTS: Pairing Memory with Test-Time Scaling

The research team goes further and introduces memory-aware test-time scaling (MaTTS), which links ReasoningBank with test-time compute scaling -- a technique that has already proven powerful in math reasoning and coding tasks. The insight is simple but important: scaling at test time generates multiple trajectories for the same task. Instead of just picking the best answer and discarding the rest, MaTTS uses the full set of trajectories as rich contrastive signals for memory extraction.
Parallel scaling generates k independent trajectories for the same query, then uses self-contrast -- comparing what went right and wrong across all trajectories -- to extract higher-quality, more reliable memory items. Sequential scaling iteratively refines a single trajectory using self-refinement, capturing intermediate corrections and insights as memory signals. The result is a positive feedback loop: better memory guides the agent toward more promising rollouts, and richer rollouts forge even stronger memory.
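The parallel-scaling loop described above can be sketched in a few lines. The `rollout`, `judge`, and `contrast` callables are hypothetical placeholders for the agent, the success check, and the self-contrast memory extractor (the paper's actual prompts and models are not shown here); only the control flow is meant to mirror the description:

```python
from typing import Callable, Optional

def matts_parallel(
    task: str,
    rollout: Callable[[str], str],   # runs the agent once, returns a trajectory
    judge: Callable[[str], bool],    # labels a trajectory as success or failure
    contrast: Callable[[list[str], list[str]], list[str]],  # self-contrast extractor
    k: int = 5,
) -> tuple[Optional[str], list[str]]:
    """Generate k independent trajectories, then contrast successes against
    failures to extract memory items (illustrative sketch only)."""
    trajectories = [rollout(task) for _ in range(k)]
    successes = [t for t in trajectories if judge(t)]
    failures = [t for t in trajectories if not judge(t)]
    # The answer comes from a successful trajectory, if any exists...
    best = successes[0] if successes else None
    # ...but ALL trajectories feed memory extraction, not just the winner.
    memories = contrast(successes, failures)
    return best, memories
```

Sequential scaling would replace the independent `rollout` calls with repeated refinement of a single trajectory, harvesting the intermediate corrections as memory signals instead.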
The paper notes that at k=5, parallel scaling (55.1% SR) edges out sequential scaling (54.5% SR) on WebArena-Shopping -- sequential gains saturate quickly once the model reaches a decisive success or failure, while parallel scaling keeps providing diverse rollouts that the agent can contrast and learn from.

Results Across Three Benchmarks

Tested on WebArena (a web navigation benchmark spanning shopping, admin, GitLab, and Reddit tasks), Mind2Web (which tests generalization across cross-task, cross-website, and cross-domain settings), and SWE-Bench-Verified (a repository-level software engineering benchmark with 500 verified instances), ReasoningBank consistently outperforms all baselines across all three datasets and all tested backbone models. On WebArena with Gemini-2.5-Flash, ReasoningBank improved overall success rate by +8.3 percentage points over the memory-free baseline (40.5% → 48.8%), while reducing average interaction steps by up to 1.4 compared to no-memory and up to 1.6 compared to other memory baselines.
What does ReasoningBank actually achieve? By distilling reasoning strategies from both successes and failures, the framework attempts to give agents a foothold in long‑term context, a capability that many current systems lack. The memory‑aware design promises to curb the “amnesia” that forces agents to repeat the same mistakes across similar tasks.
Moreover, the MaTTS component ties this memory to test‑time compute scaling, a technique already shown to boost performance on math‑reasoning and coding benchmarks. Yet the description stops short of detailing how the scaling interacts with diverse workloads beyond those domains. It remains unclear whether the combined approach will generalise to more open‑ended or real‑world applications where task boundaries are fuzzy.
The research team describes its core insight as “simple but important,” though the implementation details behind that simplicity have yet to be fully explored in practice. In short, ReasoningBank and MaTTS represent a concrete step toward more persistent AI agents, but their broader impact and robustness across varied scenarios are still uncertain.
Further Reading
- ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory - arXiv
- ReasoningBank: Enabling agents to learn from experience - Google Research Blog
- Google Unveils ReasoningBank to Enhance AI Learning from Experience - Phemex
- Google Launches ReasoningBank to Enhance AI Agent Learning from Success and Failure - KuCoin
Common Questions Answered
How does ReasoningBank capture and improve AI reasoning strategies?
ReasoningBank catalogues the steps that lead to correct answers and the missteps that cause failures, creating a reusable memory of reasoning strategies. By storing these traces, the framework aims to help future AI agents become more efficient at solving complex multi-step logical tasks.
What is MaTTS and how does it enhance AI reasoning?
MaTTS (memory-aware test-time scaling) links ReasoningBank with test-time compute scaling, generating multiple reasoning trajectories for the same task. Instead of discarding the alternative paths, MaTTS uses the full set of trajectories as contrastive signals for memory extraction, building on scaling techniques that have already proven powerful in math reasoning and coding tasks.
What problem does ReasoningBank aim to solve in AI systems?
ReasoningBank attempts to address the 'amnesia' problem in AI agents, where systems repeatedly make the same mistakes across similar tasks. By creating a long-term context memory that captures both successful and unsuccessful reasoning strategies, the framework seeks to help AI agents learn and improve over time.