OpenAGI says its agent beats OpenAI and Anthropic; study suggests such claims are over‑optimistic
OpenAGI has just stepped out of stealth, boasting an AI agent that “crushes” the offerings from OpenAI and Anthropic. The claim sparked immediate interest because the market has been awash with headlines touting ever‑higher benchmark scores. Yet, those numbers often come from automated tests that can miss real‑world nuance.
That’s why a group of researchers at Ohio State decided to take a different approach, evaluating leading web agents with careful human judgment rather than relying on automated scoring.
The study’s findings run counter to the hype surrounding the newest agents, prompting the researchers to caution that earlier results may have been overly rosy. The quote that follows captures that tension and underscores why the community should pause before accepting boastful press releases at face value.
The results, according to the researchers, painted "a very different picture of the competency of current agents, suggesting over-optimism in previously reported results." When the Ohio State team tested five leading web agents with careful human evaluation, they found that many recent systems -- despite heavy investment and marketing fanfare -- did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among commercial offerings in their study, achieved only 61 percent success. "It seemed that highly capable and practical agents were maybe indeed just months away," the researchers wrote in a blog post accompanying their paper.
"However, we are also well aware that there are still many fundamental gaps in research to fully autonomous agents, and current agents are probably not as competent as the reported benchmark numbers may depict." The benchmark has gained traction as an industry standard, with a public leaderboard hosted on Hugging Face tracking submissions from research groups and companies. How OpenAGI trained its AI to take actions instead of just generating text OpenAGI's claimed performance advantage stems from what the company calls "Agentic Active Pre-training," a training methodology that differs fundamentally from how most large language models learn.
OpenAGI’s Lux arrives with a bold claim: an 83.6 percent success rate at controlling desktop applications, and a price tag that undercuts OpenAI and Anthropic’s offerings. Yet the Ohio State evaluation of five leading web agents, conducted with careful human judgment, paints a more cautious picture. Their findings suggest that recent reports may have overstated agent competence, and they flag a gap between laboratory metrics and real‑world performance.
Lux’s ability to interpret screenshots and issue actions autonomously is intriguing, but the study stops short of confirming superiority over established models. Moreover, the limited scope of the Ohio State test—focused on web agents rather than the specific desktop tasks Lux targets—leaves open questions about direct comparability. As the company moves from stealth to public scrutiny, the evidence for Lux’s edge remains mixed.
Whether Lux can consistently deliver the promised efficiency gains across diverse environments is still unclear, and further independent benchmarking will be necessary to substantiate the startup’s assertions.
Further Reading
- Papers with Code Benchmarks - Papers with Code
- Chatbot Arena Leaderboard - LMSYS
Common Questions Answered
What success rate does OpenAGI's Lux claim for controlling desktop applications?
OpenAGI's Lux claims an 83.6 percent success rate when controlling desktop applications. The figure is presented as evidence of the agent's superior real‑world capability compared to competitors.
How did the Ohio State researchers evaluate the competency of leading web agents?
The Ohio State team conducted careful human evaluation of five leading web agents, assessing their performance on real‑world tasks rather than relying solely on automated benchmark scores. Their approach revealed a gap between laboratory metrics and actual usability.
Which agent did the Ohio State study find that many newer systems failed to outperform?
The study found that many newer systems, despite heavy investment and marketing fanfare, did not outperform SeeAct, a relatively simple agent released in January 2024. Even OpenAI's Operator, the best performer among the commercial offerings tested, achieved only a 61 percent success rate in the human evaluation.
What concern does the Ohio State evaluation raise about previously reported benchmark results?
Researchers warned that earlier benchmark results may have been overly optimistic, as they often rely on automated tests that miss nuanced, real‑world interactions. This suggests that reported scores could overstate the true competence of current AI agents.