

Contextual AI unveils Agent Composer, hits top FACTS benchmark score


Here’s the thing: Contextual AI just rolled out Agent Composer, a tool built to push enterprise Retrieval‑Augmented Generation from prototype to production‑grade AI agents. While the tech is impressive, the real test is whether it can deliver answers that stay on target and don’t drift into hallucination. The company tackled that problem by taking Meta’s open‑source Llama models and fine‑tuning them on Google Cloud’s Vertex AI platform, zeroing in on the tendency of large language models to generate unfounded claims. If the numbers hold up, the effort could set a new bar for grounded, reliable outputs in business‑critical settings.
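The case study doesn't publish Contextual AI's training recipe, but the broad shape of the approach, supervised fine-tuning of a Llama checkpoint on question-answer pairs that stay inside their retrieved context, can be sketched with off-the-shelf tooling. This minimal example uses Hugging Face transformers rather than Vertex AI's managed pipelines; the dataset file, column names, and hyperparameters are assumptions for illustration, not details from the case study.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-3.1-8B"  # any Llama checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without one
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(batch):
    # Each training example pairs retrieved context and a question with an
    # answer that stays strictly inside that context.
    texts = [f"{c}\n\nQ: {q}\nA: {a}"
             for c, q, a in zip(batch["context"], batch["question"],
                                batch["answer"])]
    return tokenizer(texts, truncation=True, max_length=2048)

ds = load_dataset("json", data_files="grounded_qa.jsonl")["train"]
ds = ds.map(tokenize, batched=True,
            remove_columns=["context", "question", "answer"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama-grounded",
                           per_device_train_batch_size=1,
                           num_train_epochs=1),
    train_dataset=ds,
    # Causal-LM collator copies input_ids to labels so the Trainer has a loss.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```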


According to a Google Cloud case study, Contextual AI achieved the highest performance on Google's FACTS benchmark for grounded, hallucination-resistant results. The company fine-tuned Meta's open-source Llama models on Google Cloud's Vertex AI platform, focusing specifically on reducing the tendency of AI systems to invent information. Agent Composer, which promises to turn complex engineering workflows into minutes of work, extends Contextual AI's existing platform with orchestration capabilities: the ability to coordinate multiple AI tools across multiple steps to complete complex workflows. The platform offers three ways to create AI agents.
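The case study doesn't document Agent Composer's internals, but the orchestration pattern it describes, a model planning tool calls across several steps until it can answer, can be illustrated generically. In this sketch the tools, the `call_llm` planner, and the JSON action format are all hypothetical stand-ins, not Contextual AI's API.

```python
import json
from typing import Callable

def search_docs(query: str) -> str:
    """Stub retriever; a real agent would query a document index."""
    return f"(top passages for {query!r})"

def summarize(text: str) -> str:
    """Stub summarizer; a real agent would call a model."""
    return text[:80]

TOOLS: dict[str, Callable[[str], str]] = {
    "search_docs": search_docs,
    "summarize": summarize,
}

def call_llm(transcript: str) -> str:
    """Stand-in planner. A real orchestrator would send the transcript to a
    grounded LM and receive the next action back as JSON."""
    if "search_docs ->" not in transcript:
        return json.dumps({"type": "tool", "tool": "search_docs",
                           "input": "wafer defect thresholds"})
    return json.dumps({"type": "final",
                       "answer": "Summary of retrieved passages."})

def run_agent(task: str, max_steps: int = 5) -> str:
    transcript = [f"Task: {task}"]
    for _ in range(max_steps):
        # Ask the planner for the next action: a tool call or a final answer.
        action = json.loads(call_llm("\n".join(transcript)))
        if action["type"] == "final":
            return action["answer"]
        result = TOOLS[action["tool"]](action["input"])
        transcript.append(f"{action['tool']} -> {result}")
    return "(step budget exhausted)"

print(run_agent("Check current wafer defect thresholds"))
```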


Agent Composer is now live, and the platform promises production‑grade agents. Backed by Bezos Expeditions and Bain Capital Ventures, the two‑and‑a‑half‑year‑old startup has positioned Agent Composer as a bridge between research‑grade retrieval‑augmented generation and the rigorous reliability demands of aerospace and semiconductor engineering teams.

The FACTS result, from a benchmark intended to gauge grounded, hallucination‑resistant output, is the centerpiece of that pitch; the fine‑tuning explicitly targeted reduced hallucination tendencies. While the benchmark result is concrete, it is unclear whether the same performance will hold across diverse enterprise workloads.

The claim that model quality, rather than the models themselves, is the primary barrier to adoption remains to be validated in real‑world deployments. Nonetheless, investor backing, a focused engineering audience and documented benchmark success suggest a measured step forward. Future evaluations will need to confirm whether Agent Composer can consistently deliver production‑ready AI agents beyond the test environments described.


Common Questions Answered

What is the FACTS benchmark, and why is it important for Contextual AI's Grounded Language Model (GLM)?

The FACTS benchmark is a comprehensive evaluation tool designed to measure how accurately large language models ground their responses in provided source materials and avoid hallucinations. Google DeepMind developed the benchmark with 1,719 carefully crafted examples to test long-form responses ([deepmind.google](https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/)). Contextual AI's GLM achieved top performance on this benchmark, demonstrating its ability to minimize hallucinations and provide precise, attributable responses.
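That judging step can be made concrete. A common way to automate a grounding check, which FACTS formalizes with frontier-model judges, is an LLM-as-judge prompt over each document/response pair. The prompt wording and the `judge_model` callable below are illustrative assumptions, not DeepMind's exact templates.

```python
JUDGE_PROMPT = """You are checking factual grounding.

Document:
{document}

Response:
{response}

Is every claim in the response fully supported by the document?
Answer with exactly one word: SUPPORTED or UNSUPPORTED."""

def is_grounded(document: str, response: str, judge_model) -> bool:
    """Ask a judge model (any callable str -> str) for a grounding verdict."""
    verdict = judge_model(JUDGE_PROMPT.format(document=document,
                                              response=response))
    return verdict.strip().upper().startswith("SUPPORTED")

def grounding_rate(examples, model, judge_model) -> float:
    """Fraction of model responses the judge deems fully grounded."""
    hits = sum(is_grounded(ex["document"],
                           model(ex["document"], ex["request"]),
                           judge_model)
               for ex in examples)
    return hits / len(examples)
```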

How does Contextual AI's Grounded Language Model (GLM) differ from traditional foundation models in handling enterprise AI applications?

Unlike traditional foundation models that may hallucinate or prefer their own parametric knowledge, the GLM is specifically engineered to minimize hallucinations in Retrieval-Augmented Generation (RAG) and agentic use cases. Contextual AI notes that the model provides inline attributions, directly citing the sources of retrieved knowledge within its responses ([contextual.ai](https://contextual.ai/blog/introducing-grounded-language-model)). This approach addresses critical enterprise risks by delivering precise responses that are strongly grounded in specific retrieved source data.
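The GLM's citation behavior is built into the model itself, but the surrounding pattern is easy to sketch: number the retrieved passages and require the answer to cite them inline. The `retriever` and `grounded_lm` callables below are hypothetical stand-ins for whatever retrieval index and model endpoint a deployment uses.

```python
def answer_with_citations(question: str, retriever, grounded_lm) -> str:
    """RAG with inline attributions: every claim cites a numbered source."""
    passages = retriever(question, k=4)  # assumed retriever signature
    numbered = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer using ONLY the numbered sources below. After each claim, "
        "cite its source inline as [n]. If the sources do not contain the "
        "answer, say so instead of guessing.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}\nAnswer:"
    )
    return grounded_lm(prompt)
```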

What unique features does the FACTS Grounding dataset provide for evaluating language models?

The FACTS Grounding dataset includes 1,719 examples designed to test long-form responses, with documents ranging up to 32,000 tokens in length. DeepMind split it into a public set of 860 examples and a private set of 859 examples to prevent benchmark contamination ([deepmind.google](https://deepmind.google/discover/blog/facts-grounding-a-new-benchmark-for-evaluating-the-factuality-of-large-language-models/)). The benchmark also maintains a public leaderboard to track progress in language model factuality and grounding.
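The public/private split matters for scoring: models can be tuned against the public half, so leaderboard positions rest on the held-out examples. A simplified sketch of how such a benchmark can aggregate a score, filtering out responses that dodge the request and averaging verdicts from an ensemble of judges, follows; the eligibility check and judge interfaces here are assumptions, not DeepMind's exact protocol.

```python
from statistics import mean

def benchmark_score(examples, model, judges, is_eligible) -> float:
    """Aggregate a grounding score over one evaluation split.

    `model(document, request)` produces a response; `is_eligible` screens out
    responses that ignore the request; each judge in `judges` returns True if
    the response is fully supported by the document. All three are assumed
    callables, not a published API.
    """
    per_example = []
    for ex in examples:
        response = model(ex["document"], ex["request"])
        if not is_eligible(ex["request"], response):
            per_example.append(0.0)  # off-topic or refusal scores zero
            continue
        # Average the binary verdicts of the judge ensemble for this example.
        per_example.append(mean(float(j(ex["document"], response))
                                for j in judges))
    return mean(per_example)
```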