MRAgent beats RAG, A-MEM, MemoryOS, LangMem, Mem0 with 118K tokens/query
MRAgent is the latest entry in a crowded field of agentic memory frameworks. A-MEM takes a graph-based approach and MemoryOS layers memory hierarchically, while LangMem and Mem0 also promise persistent context across long interactions. The researchers tested these systems on two benchmarks, LoCoMo and LongMemEval, using Gemini 2.5 Flash and Claude Sonnet 4.5 as backbone models.
Across every question type and both models, MRAgent scored higher than standard RAG, A-MEM, MemoryOS, LangMem and Mem0. But the numbers that matter most to enterprise developers are token and runtime costs. On LongMemEval, MRAgent kept prompt tokens to 118K per sample; A-MEM needed 632K, and LangMem burned through 3.26M.
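To put the token figures on a common scale, the short sketch below divides each reported LongMemEval prompt-token count by MRAgent's. The numbers are the ones quoted above; the dictionary and the helper name `ratio_to_mragent` are illustrative only.

```python
# Prompt tokens per LongMemEval sample, as reported in the article.
prompt_tokens = {
    "MRAgent": 118_000,
    "A-MEM": 632_000,
    "LangMem": 3_260_000,
}

baseline = prompt_tokens["MRAgent"]


def ratio_to_mragent(system: str) -> float:
    """How many times MRAgent's prompt tokens the given system used."""
    return prompt_tokens[system] / baseline


for system in prompt_tokens:
    print(f"{system}: {ratio_to_mragent(system):.1f}x MRAgent's prompt tokens")
```

By this arithmetic, A-MEM used a little over five times MRAgent's prompt tokens and LangMem more than 27 times.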
Runtime fell from 1,122 seconds with A-MEM to 586 seconds with MRAgent, roughly half. The efficiency stems from on-demand tag evaluation, pruning of irrelevant paths and an autonomous stop condition that prevents redundant searches. In short, the framework trims both context size and compute time while delivering stronger benchmark performance.
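The article does not describe how MRAgent implements these three ideas, so the following is only a toy sketch of the general pattern, not the framework's actual code. Everything in it is an assumption made for illustration: the `MemoryNode` structure, the `retrieve` function, the tag-overlap scorer and the "two pieces of evidence" stop rule.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class MemoryNode:
    # Hypothetical memory record: some text, a few tags, optional children.
    text: str
    tags: list[str]
    children: list["MemoryNode"] = field(default_factory=list)


def retrieve(
    roots: list[MemoryNode],
    score_tags: Callable[[list[str]], float],
    is_sufficient: Callable[[list[str]], bool],
    prune_below: float = 0.5,
) -> tuple[list[str], int]:
    """Depth-first search that returns (evidence, number of nodes scored)."""
    evidence: list[str] = []
    scored = 0
    stack = list(reversed(roots))
    while stack:
        node = stack.pop()
        # On-demand evaluation: tags are scored only when a node is reached.
        scored += 1
        if score_tags(node.tags) < prune_below:
            continue  # Pruning: skip this node and its entire subtree.
        evidence.append(node.text)
        if is_sufficient(evidence):
            break  # Stop condition: no further searching once satisfied.
        stack.extend(reversed(node.children))
    return evidence, scored


query_terms = {"paris", "trip"}


def overlap(tags: list[str]) -> float:
    """Fraction of query terms that appear among a node's tags."""
    return len(query_terms & set(tags)) / len(query_terms)


memory = [
    MemoryNode("work notes", ["job"], [MemoryNode("standup log", ["job", "meeting"])]),
    MemoryNode("Paris trip, May", ["paris", "trip"], [MemoryNode("hotel booking", ["paris", "trip"])]),
    MemoryNode("Paris restaurants", ["paris", "trip"]),
]

evidence, scored = retrieve(memory, overlap, lambda found: len(found) >= 2)
print(evidence, f"scored {scored} of 5 nodes")
```

In this toy run the search scores three of the five nodes: the "work notes" branch is pruned before its child is ever examined, and the loop exits before reaching "Paris restaurants". Fewer nodes touched means less text pulled into the prompt, which is the kind of saving the reported token counts point to.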
Why this matters
MRAgent uses roughly 118K prompt tokens per query while still beating RAG, A-MEM, MemoryOS, LangMem and Mem0 across models and question types. That gap is notable, especially when LangMem spends 3.26M tokens on the same benchmark. Yet the headline numbers don't tell the whole story.
Enterprise teams care most about computational cost, and while the article reports prompt tokens and runtime, it gives no dollar figures or hardware requirements. Without those details, it's unclear how far the token efficiency translates into lower latency or cheaper cloud bills in production. Moreover, the results come from two benchmarks, LoCoMo and LongMemEval, and it remains open how well they carry over to real-world workloads.
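Until such figures are published, teams can plug their own rates into a rough estimate. The sketch below converts the reported prompt-token counts into a per-sample prompt cost at a flat price. The price is a placeholder assumption, not any vendor's actual rate, and the calculation ignores output tokens, caching discounts and infrastructure overhead.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_million: float) -> float:
    """Prompt-side cost for one sample at a flat per-token price."""
    return prompt_tokens / 1_000_000 * usd_per_million


# Prompt tokens per LongMemEval sample, as reported in the article.
REPORTED_PROMPT_TOKENS = {
    "MRAgent": 118_000,
    "A-MEM": 632_000,
    "LangMem": 3_260_000,
}

# ASSUMPTION: placeholder price in USD per million input tokens.
# Replace it with the rate you actually pay.
PLACEHOLDER_USD_PER_MILLION = 1.00

for system, tokens in REPORTED_PROMPT_TOKENS.items():
    cost = prompt_cost_usd(tokens, PLACEHOLDER_USD_PER_MILLION)
    print(f"{system}: ${cost:.3f} per sample at the placeholder price")
```

Because the estimate is linear in the price, the relative gap between systems stays the same whatever rate is used; only the absolute bill changes.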
We appreciate the clear win in raw token usage, but we remain cautious until we see transparent cost metrics and broader testing across diverse workloads. For developers and founders, the takeaway is that a new memory framework shows promise, yet further evidence is needed before committing resources to replace existing pipelines.
Further Reading
- New agentic memory framework uses 118K tokens per query ... LangMem burns through 3.26M - VentureBeat
- Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory - ArXiv
- AI Agent Memory Compared: Mem0, OpenAI, LangMem, MemGPT - Deepak Gupta
- Benchmarked OpenAI Memory vs LangMem vs MemGPT vs Mem0 for Long-Term Memory - Mem0 Blog
- AI Agent Memory Systems in 2026: Mem0, Zep, Hindsight, Memvid and Everything in Between Compared - DevGenius