Two-Stage RAG Pipeline Uses Initial LLM Call to Match TOC Sections
Why does this matter? Enterprise retrieval‑augmented generation (RAG) hinges on more than raw vector similarity; it needs a clear anchor that tells a downstream model where a match lands and what context to expand. In the system described, two structured tables—line_df and toc_df—store each candidate’s anchor and its surrounding text.
The pipeline that creates those anchors runs in three stages. First, a keyword detector runs against both tables; the article calls it “free” and notes that it starts working from the first millisecond. Second, an embedding‑based retriever runs in parallel with the keyword detector. It is optional, but useful when the query and the document use different vocabulary; with pre‑computed indices, the article puts the query‑time cost in microseconds.
Finally, a single LLM call ranks the aggregated hits, supplying reasons that auditors can later inspect. The example query, “How is attention computed?” on the Transformer paper, yields six candidate pages. Only one mentions softmax, query, key and d_k together, sitting under the TOC heading “Scaled Dot‑Product Attention.” Neither keyword nor embedding alone could pinpoint that page; the third step reads the candidates side‑by‑side, selects the right one, and records its justification.
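The flow described above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the tables are plain lists of dicts standing in for line_df and toc_df, the column names (section_id, start, end, line_no, text) and the hand-picked term list are assumptions, the embedding detector is omitted, and the final LLM call is represented only by the prompt it would receive.

```python
from collections import defaultdict

# Toy stand-ins for the article's toc_df and line_df (hypothetical columns and text).
toc_rows = [
    {"section_id": "s1", "title": "Scaled Dot-Product Attention", "start": 0, "end": 2},
    {"section_id": "s2", "title": "Multi-Head Attention", "start": 3, "end": 4},
]
line_rows = [
    {"line_no": 0, "text": "We compute dot products of the query with all keys."},
    {"line_no": 1, "text": "Each is divided by sqrt(d_k) and passed through a softmax."},
    {"line_no": 2, "text": "The output is a weighted sum of the values."},
    {"line_no": 3, "text": "Multi-head attention projects each query h times."},
    {"line_no": 4, "text": "Outputs of the heads are concatenated."},
]


def keyword_hits(terms, lines):
    """Keyword detector: return (line_no, matched_terms) for lines containing any term."""
    hits = []
    for row in lines:
        text = row["text"].lower()
        matched = [t for t in terms if t in text]
        if matched:
            hits.append((row["line_no"], matched))
    return hits


def aggregate_to_sections(hits, toc):
    """Aggregation: roll line-level hits up to the TOC section that contains them."""
    by_section = defaultdict(set)
    for line_no, matched in hits:
        for sec in toc:
            if sec["start"] <= line_no <= sec["end"]:
                by_section[sec["section_id"]].update(matched)
                break
    return {sid: sorted(found) for sid, found in by_section.items()}


def build_arbiter_prompt(query, candidates, toc):
    """Arbiter input: one prompt listing every candidate for a single LLM call."""
    titles = {sec["section_id"]: sec["title"] for sec in toc}
    rows = [f"- {sid} | {titles[sid]} | matched: {', '.join(found)}"
            for sid, found in candidates.items()]
    return (f"Question: {query}\nCandidates:\n" + "\n".join(rows)
            + "\nPick the best section and give a one-sentence reason.")


terms = ["softmax", "query", "key", "d_k"]  # assumed query terms, chosen by hand
candidates = aggregate_to_sections(keyword_hits(terms, line_rows), toc_rows)
prompt = build_arbiter_prompt("How is attention computed?", candidates, toc_rows)
print(prompt)
```

Passing the matched terms alongside each section title is one way to give the arbiter something concrete to cite in its justification; a real pipeline would presumably also merge in embedding hits before building the prompt.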
4.1 Reason-then-match (two-LLM-call alternative)
A two-stage pipeline that uses an extra LLM call up front: the LLM reads the TOC, picks the relevant sections, returns a short list of section IDs, then keyword retrieval runs only on the lines within those sections.
When this is worth the extra call: A 100-page contract has 50 sections; the LLM picks 2 to 3 in one call; keyword retrieval then operates on a few hundred lines instead of the full 15,000.
The trade-off versus the single-arbiter pattern: you pay two LLM calls instead of one, but the second-stage keyword search runs over a much smaller pool, which matters when the pool is huge (think: a 500-page regulatory filing).
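A minimal sketch of the reason-then-match idea, under stated assumptions: the tables are toy lists of dicts rather than the article's DataFrames, the column names are hypothetical, and the up-front LLM call is replaced by a stub function (fake_llm_pick) that selects sections by title so the example runs offline.

```python
# Toy contract: three sections, two lines each (hypothetical data and columns).
toc_rows = [
    {"section_id": "s1", "title": "Definitions", "start": 0, "end": 1},
    {"section_id": "s2", "title": "Termination", "start": 2, "end": 3},
    {"section_id": "s3", "title": "Governing Law", "start": 4, "end": 5},
]
line_rows = [
    {"line_no": 0, "text": "Notice means a written notice."},
    {"line_no": 1, "text": "Term means the contract period."},
    {"line_no": 2, "text": "Either party may terminate with 30 days notice."},
    {"line_no": 3, "text": "Fees accrued before termination remain payable."},
    {"line_no": 4, "text": "This agreement is governed by the laws of the named state."},
    {"line_no": 5, "text": "Notice of disputes goes to the courts there."},
]


def fake_llm_pick(toc):
    """Stub for the first LLM call: returns section IDs judged relevant.

    A real system would send the TOC and the question to a model; here we
    simply pick sections whose title mentions 'termination'.
    """
    return [sec["section_id"] for sec in toc if "termination" in sec["title"].lower()]


def reason_then_match(terms, toc, lines, pick_sections):
    """Narrow the pool to the chosen sections, then run keyword search on it."""
    chosen = set(pick_sections(toc))
    ranges = [(s["start"], s["end"]) for s in toc if s["section_id"] in chosen]
    pool = [r for r in lines if any(a <= r["line_no"] <= b for a, b in ranges)]
    hits = [r["line_no"] for r in pool
            if any(t in r["text"].lower() for t in terms)]
    return pool, hits


# Keyword search over every line, for comparison with the narrowed search.
unfiltered = [r["line_no"] for r in line_rows if "notice" in r["text"].lower()]
pool, hits = reason_then_match(["notice"], toc_rows, line_rows, fake_llm_pick)
print(len(line_rows), "lines ->", len(pool), "in pool; hits:", hits)
```

On this toy data the unfiltered search also matches the word in the Definitions and Governing Law sections, while the narrowed search only returns the line inside the selected section. That is the benefit the extra call is meant to buy; whether it is worth the second call depends on pool size, as the passage above notes.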
Why this matters
We see a concrete shift toward structuring retrieval as a filtered process rather than a blanket search. The three‑stage pipeline described (parallel keyword detection and embeddings, aggregation to a structural unit, then a single LLM call) keeps LLM usage to one call per query. The trade‑off is that the final LLM must infer relevance from aggregated hits alone, which may miss nuanced context.
The alternative two‑stage approach adds an upfront LLM read of the table of contents, returning a short list of section IDs before keyword retrieval. This extra call could tighten focus, but it also introduces latency and cost. For developers, the decision hinges on whether the marginal gain in precision justifies the additional inference step.
Founders might ask whether the architecture scales when document collections grow beyond modest sizes. Researchers are left with an open question: does anchoring retrieval to TOC sections improve downstream generation quality, or does it merely shift complexity? Until empirical results are shared, the practical benefit remains uncertain.
Further Reading
- Anchor Detection for RAG: Parallel Detectors, Then One LLM Call at the End - Towards Data Science
- Zero-Shot Document Understanding using Pseudo Table of Contents - arXiv
- Boost RAG Accuracy with Two-Stage Retrieval Pattern - LinkedIn
- Two-Stage Multi-Pass Retrieval for Context Retention in Azure AI Search RAG Pipelines - gopenai Blog
- Common Challenges in RAG and How to Solve Them in Production - Unstructured.io