Gemini 3.0, Claude, and Grok rank GPT‑5.1 top in Karpathy’s LLM Council
Andrej Karpathy has turned his personal AI lineup into a makeshift peer‑review panel. By feeding the same prompts to Gemini 3.0, Anthropic’s Claude and xAI’s Grok, he watches the models argue over which answer feels most coherent, accurate or useful. The latest round focused on a handful of GPT‑5.1 outputs, and the three independent systems all tipped the scales in favor of OpenAI’s newest release.
It isn’t just a vanity contest; the experiment offers a rare glimpse into how LLMs judge each other when they’re asked to rank competing responses. Karpathy’s “LLM Council” has become a sandbox for testing whether models can act as unbiased critics, or whether they simply echo one another’s biases. The results, he says, raise questions about the reliability of self‑assessment in AI.
"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT…
"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between." Karpathy's experiment setup is a three-step loop. First, the user's query is sent to all models separately, and their answers are shown side-by-side without revealing who wrote what. Next, each model sees the others' responses, still anonymised, and ranks them based on accuracy and insight.
Did the LLM Council prove decisive? Karpathy’s experiment sent a single query to several models, let them anonymously rank each other’s answers, and then compiled a consensus. In that setting, OpenAI’s GPT‑5.1 emerged as the top‑ranked model, according to Gemini 3.0, Claude and Grok.
The three judges repeatedly chose GPT‑5.1’s response over their own, behavior Karpathy summed up by noting the models are “surprisingly willing to select another LLM’s response as superior to their own.” This outcome contrasts with earlier benchmarks that suggested Google’s Gemini 3.0 had overtaken OpenAI overall. Yet the test involved only a handful of informal prompts and a limited pool of models, so whether the ranking holds across broader tasks is unclear. The council’s anonymous voting mechanism offers an intriguing evaluation angle, but its scalability and resistance to bias remain to be demonstrated.
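The article does not say how the anonymous rankings are merged into a consensus, so the snippet below assumes a simple Borda-style tally purely for illustration; `borda_consensus` and the example rankings are hypothetical, not Karpathy's aggregation rule.

```python
from collections import defaultdict
from typing import Dict, List


def borda_consensus(rankings: Dict[str, List[str]]) -> List[str]:
    """Aggregate per-judge rankings (best first) into one consensus order.

    Each judge awards n-1 points to its top pick, n-2 to the next, and so on;
    candidates are then sorted by total points. This scheme is an assumption.
    """
    scores: Dict[str, float] = defaultdict(float)
    for judge, order in rankings.items():
        n = len(order)
        for position, candidate in enumerate(order):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical rankings from three judges over four de-anonymized answers.
    example = {
        "gemini-3.0": ["gpt-5.1", "grok", "gemini-3.0", "claude"],
        "claude":     ["gpt-5.1", "gemini-3.0", "grok", "claude"],
        "grok":       ["gpt-5.1", "gemini-3.0", "grok", "claude"],
    }
    print(borda_consensus(example))  # gpt-5.1 first, claude last
```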
As the results stand, GPT‑5.1 currently holds the highest rank within this specific experiment, though broader conclusions about superiority are not yet established.
Further Reading
- Gemini 3 vs GPT-5 vs Claude 4.5 vs Grok 4.1 - The Ultimate Reasoning Performance Battle - Vertu
- Grok 4.1 vs Gemini 3.0 : 2025 Frontier AI Showdown - Skywork.ai
- Claude vs. GPT-4.5 vs. Gemini: A Comprehensive Comparison - Evolution.ai
- AI Models Comparison 2025: Claude, Grok, GPT & More - Collabnix
- The Best AI in October 2025? We Compared ChatGPT, Claude, Grok, Gemini & Others - FelloAI
Common Questions Answered
What is the purpose of Andrej Karpathy's LLM Council experiment as described in the article?
Karpathy created a makeshift peer‑review panel by feeding identical prompts to Gemini 3.0, Anthropic’s Claude, and xAI’s Grok. The models then anonymously rank each other's responses to gauge which answer feels most coherent, accurate, or useful. This setup provides a rare glimpse into how large language models evaluate one another.
Which model was consistently ranked as the top performer in Karpathy’s latest round?
In the most recent experiment, all three judges—Gemini 3.0, Claude, and Grok—tipped the scales in favor of OpenAI’s GPT‑5.1. The consensus ranking placed GPT‑5.1’s response above the others, highlighting it as the most insightful and accurate among the tested outputs.
How did the LLM Council models behave when comparing their own answers to those of other models?
The models were surprisingly willing to select another LLM’s response as superior to their own, often choosing GPT‑5.1 over their own generated answer. This pattern was noted by Karpathy as an interesting evaluation strategy, showing that the models can objectively recognize higher‑quality outputs regardless of origin.
Which model was repeatedly identified as the worst performer by the LLM Council?
Claude was consistently ranked as the worst performer in Karpathy’s experiment. Both Gemini 3.0 and Grok, along with the overall consensus, placed Claude’s responses at the bottom of the quality hierarchy.