
Multi-Step AI Agents Outperform RAG by 38% in New Study

Databricks finds multi-step agents beat single-turn RAG by 21% to 38% on STaRK


Databricks’ latest study pits its most capable language model against a newly‑designed multi‑step reasoning agent across three semi‑structured retrieval tasks. While both approaches draw on external data—tables of Amazon listings, a sprawling academic citation graph, and a curated biomedical knowledge base—their interaction patterns differ. The single‑turn retrieval‑augmented generation (RAG) pipeline fetches a document, feeds it to the model, and returns an answer in one pass.

In contrast, the multi‑step agent can query, refine, and re‑query before producing its final response, effectively looping through the data sources. Researchers measured accuracy on each domain, looking for any edge the larger model might have when forced into a single‑shot format. The results show a consistent advantage for the iterative method, even when the underlying model is the most powerful one in the suite.
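The contrast between the two interaction patterns can be sketched in a few lines. This is an illustrative toy, not Databricks' implementation: `retrieve` and `llm` are hypothetical stand-ins for a retriever and a language model, and the stopping check is a placeholder.

```python
def retrieve(query: str) -> str:
    """Stub retriever: returns a pseudo-document for the query."""
    return f"doc about {query}"

def llm(prompt: str) -> str:
    """Stub model: returns a trivially derived answer."""
    return f"answer based on [{prompt}]"

def single_turn_rag(question: str) -> str:
    # One pass: fetch a document, feed it to the model, return an answer.
    doc = retrieve(question)
    return llm(f"{question}\n{doc}")

def multi_step_agent(question: str, max_steps: int = 3) -> str:
    # Iterative loop: query, refine, and re-query before the final response.
    query = question
    context: list[str] = []
    draft = ""
    for _ in range(max_steps):
        context.append(retrieve(query))
        draft = llm(f"{question}\n" + "\n".join(context))
        if "answer" in draft:          # placeholder for a real stopping check
            return draft
        query = f"refined: {query}"    # placeholder for query reformulation
    return draft
```

The only structural difference is the loop: the agent can accumulate context and reformulate its query between retrievals, which the single-pass pipeline cannot.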

This gap widens noticeably between the academic and biomedical sections of the benchmark, prompting a closer look at how complexity and domain specificity affect retrieval‑driven answering.

The stronger model still lost to the multi-step agent by 21% on the academic domain and 38% on the biomedical domain. STaRK is a benchmark published by Stanford researchers covering three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph, and a biomedical knowledge base.

How the Supervisor Agent handles what RAG cannot

Databricks built the Supervisor Agent as the production implementation of this research approach, and its architecture illustrates why the gains are consistent across task types.

The approach includes three core steps. The first is parallel tool decomposition: rather than issuing one broad query and hoping the results cover both structured and unstructured needs, the agent fires SQL and vector search calls simultaneously, then analyzes the combined results before deciding what to do next. That parallel step is what allows it to handle queries that cross data-type boundaries without requiring the data to be normalized first.
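The parallel decomposition step can be sketched with Python's standard thread pool. `run_sql` and `vector_search` here are hypothetical stubs, not the Supervisor Agent's actual tool interfaces; the point is only the concurrent fan-out and merged result set.

```python
from concurrent.futures import ThreadPoolExecutor

def run_sql(query: str) -> list[str]:
    # Stub for a structured (SQL) retrieval call.
    return [f"sql-row for {query}"]

def vector_search(query: str) -> list[str]:
    # Stub for an unstructured (vector search) retrieval call.
    return [f"chunk for {query}"]

def decompose_and_retrieve(question: str) -> list[str]:
    # Fire both tools simultaneously, then merge the results so the agent
    # can analyze them together before deciding its next step.
    with ThreadPoolExecutor() as pool:
        sql_future = pool.submit(run_sql, question)
        vec_future = pool.submit(vector_search, question)
        return sql_future.result() + vec_future.result()
```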

The second is failure recovery: when an initial retrieval attempt hits a dead end, the agent detects the failure, reformulates the query, and tries a different path. On a STaRK benchmark task that requires finding a paper by an author with exactly 115 prior publications on a specific topic, the agent first queries both SQL and vector search in parallel.
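The dead-end recovery loop described above might look like the following sketch. `search` and the list of reformulations are illustrative assumptions standing in for whatever tools and rewriting strategy the real agent uses.

```python
def search(query: str) -> list[str]:
    # Stub: only a query mentioning the publication count succeeds,
    # mimicking a broad first attempt that comes back empty.
    return ["match"] if "publication count" in query else []

def retrieve_with_retry(query: str, max_attempts: int = 3) -> list[str]:
    # Try the original query first; on each empty result, reformulate
    # and try a different path instead of giving up.
    reformulations = [
        query,
        f"{query} publication count",   # illustrative rewrite
        f"{query} filtered by topic",   # illustrative rewrite
    ]
    for attempt in reformulations[:max_attempts]:
        results = search(attempt)
        if results:            # success: stop retrying
            return results
    return []                  # all paths exhausted
```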

What does the data suggest? Multi-step agents consistently outpace single-turn RAG on the STaRK benchmark, delivering improvements between 21% and 38% across three semi-structured domains. The study examined nine enterprise knowledge tasks, each requiring a blend of structured databases and unstructured documents, such as sales figures with customer reviews or citation counts with academic papers.

When the stronger single-turn model was pitted against the multi-step approach, it fell short by 21% in the academic domain and 38% in the biomedical domain, underscoring a persistent failure mode for one-shot retrieval-augmented generation. Yet the results are confined to the STaRK suite, which includes Amazon product data, the Microsoft Academic Graph, and a biomedical knowledge set; it remains unclear whether comparable gaps exist in other enterprise settings. Moreover, the research doesn't address how the agents perform under varying query complexity or real-time constraints.

In short, the findings highlight a measurable advantage for multi‑step reasoning in the tested scenarios, while leaving open questions about broader applicability and operational trade‑offs.


Common Questions Answered

How did the multi-step agent outperform single-turn RAG in the Databricks study?

The multi-step agent demonstrated superior performance by achieving 21% improvement in the academic domain and 38% improvement in the biomedical domain on the STaRK benchmark. This approach allows for more sophisticated reasoning and interaction with external data sources compared to traditional single-turn retrieval-augmented generation (RAG) methods.

What domains were examined in the STaRK benchmark used by Databricks?

The STaRK benchmark, published by Stanford researchers, covered three semi-structured retrieval domains: Amazon product data, the Microsoft Academic Graph, and a biomedical knowledge base. These domains represent diverse and complex information landscapes that test the capabilities of advanced language models and reasoning agents.

What makes the Supervisor Agent different from traditional RAG approaches?

The Supervisor Agent implements a multi-step reasoning approach that goes beyond single-turn retrieval, allowing for more nuanced and iterative information gathering and analysis. Unlike traditional RAG methods that fetch a document and generate an answer in one pass, this approach enables more sophisticated interaction with external data sources and knowledge bases.