Gemini 3.0, Claude, and Grok rank GPT‑5.1 top in Karpathy’s LLM Council
Andrej Karpathy has basically turned his own AI toolbox into a makeshift peer-review panel. He throws the same prompt at Gemini 3.0, Anthropic’s Claude and xAI’s Grok, then watches the three models squabble over which answer feels most coherent, accurate or useful. In the latest round he fed them a handful of GPT-5.1 outputs; all three independent systems seemed to lean toward OpenAI’s newest release.
It isn’t just a vanity stunt - the experiment gives a rare peek at how LLMs judge each other when asked to rank competing replies. Karpathy’s “LLM Council” now serves as a sandbox to see whether the models can act as unbiased critics, or whether they simply echo one another’s quirks. The outcomes, he notes, leave the reliability of self-assessment in AI a bit murky.
"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between." Karpathy's experiment setup is a three-step loop. First, the user's query is sent to all models separately, and their answers are shown side-by-side without revealing who wrote what. Next, each model sees the others' responses, still anonymised, and ranks them based on accuracy and insight.
In each round, a single question bounced between the models, and each was asked to rank the others’ answers without knowing who wrote what. In that round-robin, OpenAI’s GPT-5.1 kept coming out on top according to the votes of Gemini 3.0, Claude and Grok. Notably, the three judges often preferred the GPT-5.1 reply even over their own, the behavior Karpathy described as being “surprisingly willing to select another LLM’s response as superior to their own.” That is a shift from earlier benchmarks that had suggested Gemini 3.0 might be ahead of OpenAI overall.
Still, the experiment covered only a handful of prompts and a small set of models, so it’s hard to say whether the same pattern would hold across a broader slate of tasks. The anonymous voting idea is interesting, but it remains to be seen whether it scales and stays free of bias. For now, GPT-5.1 holds the top rank in this narrow test, but any claim of overall superiority would be premature.
Further Reading
- Gemini 3 vs GPT-5 vs Claude 4.5 vs Grok 4.1 - The Ultimate Reasoning Performance Battle - Vertu
- Grok 4.1 vs Gemini 3.0 : 2025 Frontier AI Showdown - Skywork.ai
- Claude vs. GPT-4.5 vs. Gemini: A Comprehensive Comparison - Evolution.ai
- AI Models Comparison 2025: Claude, Grok, GPT & More - Collabnix
- The Best AI in October 2025? We Compared ChatGPT, Claude, Grok, Gemini & Others - FelloAI
Common Questions Answered
What is the purpose of Andrej Karpathy's LLM Council experiment as described in the article?
Karpathy created a makeshift peer‑review panel by feeding identical prompts to Gemini 3.0, Anthropic’s Claude, and xAI’s Grok. The models then anonymously rank each other's responses to gauge which answer feels most coherent, accurate, or useful. This setup provides a rare glimpse into how large language models evaluate one another.
Which model was consistently ranked as the top performer in Karpathy’s latest round?
In the most recent experiment, all three judges—Gemini 3.0, Claude, and Grok—tipped the scales in favor of OpenAI’s GPT‑5.1. The consensus ranking placed GPT‑5.1’s response above the others, highlighting it as the most insightful and accurate among the tested outputs.
How did the LLM Council models behave when comparing their own answers to those of other models?
The models were surprisingly willing to select another LLM’s response as superior to their own, often choosing GPT‑5.1 over their own generated answer. Karpathy noted this pattern as an interesting evaluation strategy, suggesting the models can recognize higher‑quality outputs regardless of origin.
Which model was repeatedly identified as the worst performer by the LLM Council?
Claude was consistently ranked lowest in Karpathy’s experiment. Both Gemini 3.0 and Grok, along with the overall consensus, placed Claude’s responses at the bottom of the quality hierarchy.