Gemini‑SQL2 leads BIRD benchmark with 80.04% execution accuracy
Google Research has introduced Gemini‑SQL2, a text‑to‑SQL system built on the Gemini 3.1 Pro foundation. It takes plain‑language questions and turns them into SQL queries that actually run. On the BIRD benchmark, a test of how often generated queries execute correctly, Gemini‑SQL2 achieved 80.04 percent execution accuracy, enough to claim first place, according to Google.
By contrast, OpenAI’s GPT‑5.5‑xhigh scored about 72.8 percent and Anthropic’s Claude Opus 4.6 landed near 70.9 percent. Models from Databricks, AWS, Tencent and Alibaba all fell further behind. Why does this matter?
Translating natural language into valid SQL is notoriously tricky: data structures are often layered, and queries must respect intricate business logic. Google says the SQL the system generates is not only syntactically sound but also executes successfully against real databases. Better SQL understanding could feed into broader natural‑language features across Google’s data services, the team notes.
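To make the metric concrete, here is a minimal sketch of how an execution-accuracy score could be computed: each predicted query is run against a database, and it counts as correct only if it executes and returns the same rows as a reference query. This is an illustration, not BIRD's official evaluator, which may differ in details. The function names, the `orders` table, and the sample queries are all hypothetical, and SQLite stands in for whatever database a real evaluation would use.

```python
import sqlite3


def run_query(conn, sql):
    """Run a query and return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # a query that does not execute can never count as correct
    # Simplification: compare results without regard to row order.
    return sorted(rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, reference_sql) pairs whose results match."""
    correct = 0
    for predicted_sql, reference_sql in pairs:
        expected = run_query(conn, reference_sql)
        actual = run_query(conn, predicted_sql)
        if expected is not None and actual == expected:
            correct += 1
    return correct / len(pairs)


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, "ada", 30.0), (2, "bob", 12.5), (3, "ada", 7.5)],
    )
    return conn


# Hypothetical (predicted, reference) query pairs.
DEMO_PAIRS = [
    # Different text, same result rows: counts as correct.
    ("SELECT customer FROM orders WHERE total > 10 ORDER BY id",
     "SELECT customer FROM orders WHERE total > 10"),
    # Valid SQL that runs but answers the wrong question: incorrect.
    ("SELECT SUM(total) FROM orders WHERE customer = 'bob'",
     "SELECT SUM(total) FROM orders WHERE customer = 'ada'"),
    # Misspelled column, so the query fails to execute: incorrect.
    ("SELECT totl FROM orders",
     "SELECT total FROM orders"),
]

if __name__ == "__main__":
    score = execution_accuracy(build_demo_db(), DEMO_PAIRS)
    print(f"Execution accuracy: {score:.2%}")
```

The second pair shows why the metric is stricter than a syntax check: the predicted query is valid and runs, but it still scores zero because its result differs from the reference.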
The research group has not announced a public release, nor published a paper, leaving the rollout timeline unclear.
Why this matters
Gemini‑SQL2’s 80.04 percent execution accuracy on the BIRD benchmark puts it clearly ahead of OpenAI’s GPT‑5.5‑xhigh (about 72.8 percent) and Anthropic’s Claude Opus 4.6 (about 70.9 percent). For developers building natural‑language interfaces to databases, the gap suggests a potentially more reliable translation layer, at least within the confines of this test set. Yet the benchmark reflects a single, curated workload; we don’t yet know how the model handles noisy queries, complex schemas, or production‑scale latency constraints.
Founders eyeing AI‑augmented analytics should ask whether the reported numbers translate into cost‑effective deployments, or whether additional engineering will be required to meet real‑world robustness. Researchers may find the Gemini 3.1 Pro foundation a useful reference point, but the article offers no insight into training data, model size, or inference efficiency, details that often shape practical adoption. In short, the result is encouraging, but it remains unclear whether the advantage persists beyond BIRD and whether it will survive the varied demands of everyday enterprise environments.