Gemini 3.0, Claude, and Grok rank GPT‑5.1 top in Karpathy’s LLM Council
Andrej Karpathy has turned his personal AI lineup into a makeshift peer‑review panel. By feeding the same prompts to Gemini 3.0, Anthropic’s Claude and xAI’s Grok, he watches the models argue over which answer feels most coherent, accurate or useful. The latest round focused on a handful of GPT‑5.1 outputs, and the three independent systems all tipped the scales in favor of OpenAI’s newest release.
It isn’t just a vanity contest; the experiment offers a rare glimpse into how LLMs judge each other when they’re asked to rank competing responses. Karpathy’s “LLM Council” has become a sandbox for testing whether models can act as unbiased critics, or whether they simply echo one another’s biases. The results, he says, raise questions about the reliability of self‑assessment in AI.
"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT…
"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between." Karpathy's experiment setup is a three-step loop. First, the user's query is sent to all models separately, and their answers are shown side-by-side without revealing who wrote what. Next, each model sees the others' responses, still anonymised, and ranks them based on accuracy and insight.
Did the LLM Council prove decisive? Karpathy’s experiment sent a single query to several models, let them anonymously rank each other’s answers, and then compiled a consensus. In that setting, OpenAI’s GPT‑5.1 emerged as the top‑ranked model, according to Gemini 3.0, Claude and Grok.
The three judges repeatedly chose GPT‑5.1’s response over their own, behavior Karpathy summed up by noting the models are “surprisingly willing to select another LLM’s response as superior to their own.” This outcome contrasts with earlier benchmarks that suggested Google’s Gemini 3.0 had overtaken OpenAI overall. Yet the test involved only a handful of informal prompts and a limited pool of models, so whether the ranking holds across broader tasks is unclear. The council’s anonymous voting mechanism offers an intriguing evaluation angle, but its scalability and resistance to bias remain to be demonstrated.
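The article does not say how the anonymous rankings are merged into a consensus, so the snippet below assumes a simple Borda-style tally purely for illustration; `borda_consensus` and the example rankings are hypothetical, not Karpathy's aggregation rule.

```python
from collections import defaultdict
from typing import Dict, List


def borda_consensus(rankings: Dict[str, List[str]]) -> List[str]:
    """Aggregate per-judge rankings (best first) into one consensus order.

    Each judge awards n-1 points to its top pick, n-2 to the next, and so on;
    candidates are then sorted by total points. This scheme is an assumption.
    """
    scores: Dict[str, float] = defaultdict(float)
    for judge, order in rankings.items():
        n = len(order)
        for position, candidate in enumerate(order):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical rankings from three judges over four de-anonymized answers.
    example = {
        "gemini-3.0": ["gpt-5.1", "grok", "gemini-3.0", "claude"],
        "claude":     ["gpt-5.1", "gemini-3.0", "grok", "claude"],
        "grok":       ["gpt-5.1", "gemini-3.0", "grok", "claude"],
    }
    print(borda_consensus(example))  # gpt-5.1 first, claude last
```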
As the results stand, GPT‑5.1 currently holds the highest rank within this specific experiment, though broader conclusions about superiority are not yet established.
Further Reading
- Gemini 3 vs GPT-5 vs Claude 4.5 vs Grok 4.1 - The Ultimate Reasoning Performance Battle - Vertu
- Grok 4.1 vs Gemini 3.0 : 2025 Frontier AI Showdown - Skywork.ai
- Claude vs. GPT-4.5 vs. Gemini: A Comprehensive Comparison - Evolution.ai
- AI Models Comparison 2025: Claude, Grok, GPT & More - Collabnix
- The Best AI in October 2025? We Compared ChatGPT, Claude, Grok, Gemini & Others - FelloAI
Common Questions Answered
What is the purpose of Andrej Karpathy's LLM Council experiment as described in the article?
Karpathy created a makeshift peer‑review panel by feeding identical prompts to Gemini 3.0, Anthropic’s Claude, and xAI’s Grok. The models then anonymously rank each other's responses to gauge which answer feels most coherent, accurate, or useful. This setup provides a rare glimpse into how large language models evaluate one another.
Which model was consistently ranked as the top performer in Karpathy’s latest round?
In the most recent experiment, all three judges—Gemini 3.0, Claude, and Grok—tipped the scales in favor of OpenAI’s GPT‑5.1. The consensus ranking placed GPT‑5.1’s response above the others, highlighting it as the most insightful and accurate among the tested outputs.
How did the LLM Council models behave when comparing their own answers to those of other models?
The models were surprisingly willing to select another LLM’s response as superior to their own, often choosing GPT‑5.1 over their own generated answer. This pattern was noted by Karpathy as an interesting evaluation strategy, showing that the models can objectively recognize higher‑quality outputs regardless of origin.
Which model was repeatedly identified as the worst performer by the LLM Council?
Claude was consistently ranked as the worst performer in Karpathy’s experiment. Both Gemini 3.0 and Grok, along with the overall consensus, placed Claude’s responses at the bottom of the quality hierarchy.