
LLM Council Ranks GPT-5.1 Top in Revolutionary AI Evaluation

Gemini 3.0, Claude, and Grok rank GPT-5.1 top in Karpathy’s LLM Council


AI researchers have a new twist on model evaluation, and it's surprisingly collaborative. Andrej Karpathy's latest experiment creates an "LLM Council" in which different AI models assess and rank one another's responses, with intriguing results.

The unconventional approach sees models like Gemini 3.0, Claude, and Grok participating in a unique comparative assessment that crowned GPT-5.1 as the top performer. But this isn't just another benchmarking exercise.

What makes the method fascinating is its inherent transparency. By allowing models to critique and compare responses, Karpathy has potentially uncovered a more nuanced way of understanding AI capabilities.

The implications are significant. Traditional AI evaluations often rely on human-designed metrics, which can introduce bias or miss subtle performance differences. This peer-review approach suggests AI might be better at assessing its own limitations than we previously thought.

Curious how these models actually judge each other? Karpathy's insights reveal a surprisingly candid interaction that challenges our understanding of artificial intelligence.

"Quite often, the models are surprisingly willing to select another LLM's response as superior to their own, making this an interesting model evaluation strategy more generally," said Karpathy. "For example, reading book chapters together with my LLM Council today, the models consistently praise GPT 5.1 as the best and most insightful model, and consistently select Claude as the worst model, with the other models floating in between."

Karpathy's setup is a three-step loop. First, the user's query is sent to every model separately, and their answers are shown side by side without revealing who wrote what. Next, each model sees the others' responses, still anonymised, and ranks them on accuracy and insight. Finally, a chairman model combines the responses and rankings into the single answer returned to the user.
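The council loop described above can be sketched in a few lines of Python. This is an illustrative skeleton, not Karpathy's actual llm-council code: the function name `run_council` and the `rank_fn`/`chairman_fn` callables are hypothetical stand-ins for real LLM API calls.

```python
import random

def run_council(query, models, rank_fn, chairman_fn):
    """Hypothetical sketch of a council loop: fan out, anonymise, rank, synthesise.

    models      -- dict mapping model name -> callable(query) -> answer text
    rank_fn     -- callable(model_name, query, labeled_answers) -> list of
                   labels, best first (stands in for asking each model to rank)
    chairman_fn -- callable(query, labeled_answers, rankings) -> final answer
    """
    # Step 1: send the query to every model separately.
    answers = {name: ask(query) for name, ask in models.items()}

    # Anonymise authorship: shuffle, then relabel as "Response A", "Response B", ...
    names = list(answers)
    random.shuffle(names)
    labels = [f"Response {chr(ord('A') + i)}" for i in range(len(names))]
    label_to_name = dict(zip(labels, names))
    labeled = {label: answers[name] for label, name in label_to_name.items()}

    # Step 2: each council member ranks the anonymised responses.
    rankings = {name: rank_fn(name, query, labeled) for name in models}

    # Step 3: a chairman model synthesises responses and rankings
    # into the single answer shown to the user.
    final = chairman_fn(query, labeled, rankings)
    return final, rankings, label_to_name
```

The anonymisation step is the key design choice: because reviewers see only shuffled labels, a model cannot simply favour its own answer, which is what makes the cross-rankings informative.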

The methodology, in which language models evaluate each other's responses, introduces a peer-review dynamic largely absent from standard AI benchmarks.

GPT-5.1 emerged as the top performer, with models consistently selecting it as the most insightful. Claude, interestingly, was just as consistently ranked at the bottom of the evaluation.

The experiment highlights an unexpected willingness among the models to rank a rival's response above their own. Karpathy's three-step evaluation loop suggests a more nuanced way of probing AI performance than traditional benchmarking alone.

While the results are provocative, they also raise questions about the reliability of AI self-evaluation. Can models be objective judges of their own output? The consistent rankings across different book chapters suggest the judgments are stable, though stability is not the same as correctness.

This approach could represent a novel method for assessing large language model capabilities. Still, more research is needed to validate the consistency and reproducibility of such self-referential evaluation techniques.

Common Questions Answered

How does Andrej Karpathy's LLM Council approach model evaluation differently from traditional benchmarking?

Karpathy's LLM Council involves language models directly assessing and ranking each other's performance, creating a unique peer-review dynamic. Unlike traditional benchmarking, this method allows AI models to collaboratively evaluate responses, with models often willing to acknowledge another model's superior performance.

Which AI model was ranked top in Karpathy's LLM Council experiment?

GPT-5.1 emerged as the top performer in the LLM Council experiment, consistently praised by the other models as the most insightful. The participating models repeatedly selected GPT-5.1 as the best model in the comparative assessment.

What surprising outcome did Karpathy observe during the LLM Council evaluation?

Karpathy noted that the models were surprisingly willing to select another language model's response as superior to their own, creating an innovative self-assessment approach. Additionally, the models consistently ranked Claude as the worst performer in the evaluation, while other models were ranked in between.