Research & Benchmarks

Open ASR Leaderboard Tests 60+ Speech Recognition Models for Accuracy and Speed

When you ask Siri for the fastest route or tell Alexa to spin up a playlist, you’re already leaning on speech-recognition tech. It’s everywhere now, yet the experience is uneven: some systems catch what you say almost instantly, while others stumble over a thick accent or a noisy kitchen. And it’s still hard to tell which open-source models actually deliver the best results, especially when they’re built by labs and startups across the globe.

Enter the Open ASR Leaderboard, a sort of standardized quiz for these voice-AI engines. A handful of folks from Hugging Face, Nvidia, the University of Cambridge, and Mistral AI put it together, and they’ve already run more than 60 models through the same set of tests, checking both how accurate they are and how fast they run. I think the real value lies in the side-by-side numbers: they let developers spot the right fit for their product and nudge the whole community toward faster, cleaner tech. It also gives tiny research projects a chance to be measured against big commercial systems, which should help keep the field transparent and, hopefully, a bit more innovative.

A research group from Hugging Face, Nvidia, the University of Cambridge, and Mistral AI has released the Open ASR Leaderboard, an evaluation platform for automatic speech recognition systems. The leaderboard is meant to provide a clear comparison of open-source and commercial models. According to the project's study, more than 60 models from 18 companies have been tested so far.

The evaluation covers three main categories: English transcription, multilingual recognition (German, French, Italian, Spanish, and Portuguese), and long audio files over 30 seconds. The last category highlights how some systems perform differently on long versus short recordings. Two main metrics are used:

- Word Error Rate (WER) measures the share of incorrectly transcribed words.
- Inverse Real-Time Factor (RTFx) measures speed. For example, an RTFx of 100 means one minute of audio is transcribed in 0.6 seconds.

To keep comparisons fair, transcripts are normalized before scoring. The process removes punctuation and capitalization, standardizes how numbers are written, and drops filler words like "uh" and "mhm." This matches the normalization standard used by OpenAI's Whisper.
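As a rough illustration of how these two numbers come about, here is a minimal sketch that scores one transcript with Whisper-style normalization and computes an RTFx value. It assumes the `jiwer` package for WER and the `EnglishTextNormalizer` shipped with OpenAI's Whisper; the leaderboard's actual evaluation pipeline may differ in its details, and the example texts and timings are made up.

```python
# Minimal sketch of the two leaderboard metrics, assuming `jiwer` for WER
# and OpenAI Whisper's English text normalizer (pip install jiwer openai-whisper).
import jiwer
from whisper.normalizers import EnglishTextNormalizer

normalizer = EnglishTextNormalizer()

reference = "Uh, the meeting starts in room ten."   # ground-truth transcript
hypothesis = "the meeting starts in room 10"        # model output

# Normalization lowercases, strips punctuation, unifies number formats,
# and drops filler words such as "uh" and "mhm".
ref_norm = normalizer(reference)
hyp_norm = normalizer(hypothesis)

# Word Error Rate: fraction of reference words that were substituted,
# deleted, or inserted after normalization.
wer = jiwer.wer(ref_norm, hyp_norm)
print(f"WER: {wer:.2%}")

# Inverse Real-Time Factor: seconds of audio processed per second of compute.
audio_seconds = 60.0       # one minute of audio
compute_seconds = 0.6      # hypothetical transcription time
rtfx = audio_seconds / compute_seconds
print(f"RTFx: {rtfx:.0f}")  # -> 100, i.e. one minute transcribed in 0.6 seconds
```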

The leaderboard shows clear differences between model types in English transcription. Systems built on large language models deliver the most accurate results. Nvidia's Canary Qwen 2.5B leads with a WER of 5.63 percent. However, these accurate models are slower to process audio. Simpler systems, like Nvidia's Parakeet CTC 1.1B, transcribe audio 2,728 times faster than real time, but only rank 23rd in accuracy.
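In practice, picking a model from the leaderboard means weighing both columns at once. The sketch below shows one hypothetical way a developer might shortlist models: keep everything above a speed floor, then take the most accurate survivor. Only Canary Qwen's WER and Parakeet's RTFx come from the figures quoted above; the remaining numbers are placeholders for illustration.

```python
# Hypothetical shortlisting of ASR models by accuracy and speed.
# Canary Qwen's WER (5.63) and Parakeet's RTFx (2728) are taken from the
# leaderboard figures quoted above; the other values are made-up placeholders.
from dataclasses import dataclass

@dataclass
class AsrEntry:
    name: str
    wer_percent: float   # lower is better
    rtfx: float          # higher is faster

entries = [
    AsrEntry("nvidia/canary-qwen-2.5b", wer_percent=5.63, rtfx=150.0),   # RTFx is a placeholder
    AsrEntry("nvidia/parakeet-ctc-1.1b", wer_percent=8.0, rtfx=2728.0),  # WER is a placeholder
]

# Keep models that are "fast enough", then take the most accurate of those.
MIN_RTFX = 100.0
candidates = [e for e in entries if e.rtfx >= MIN_RTFX]
best = min(candidates, key=lambda e: e.wer_percent)
print(f"Pick: {best.name} (WER {best.wer_percent}%, RTFx {best.rtfx:.0f})")
```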

Multilingual models lose some specialization

Tests across several languages show a trade-off between versatility and accuracy. Models narrowly trained on one language outperform broader multilingual models on that language, but struggle with the others. Whisper models trained only on English beat the multilingual Whisper Large v3 at English transcription, but can't reliably transcribe other languages.
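To make the distinction concrete, here is a small sketch of loading one English-only and one multilingual Whisper checkpoint through the Hugging Face transformers pipeline. The model IDs are real public checkpoints; the audio file names are placeholders, and the leaderboard itself may load and run models differently.

```python
# Sketch: English-only vs. multilingual Whisper checkpoints via transformers.
# The audio file paths are placeholders for your own recordings.
from transformers import pipeline

english_only = pipeline("automatic-speech-recognition", model="openai/whisper-medium.en")
multilingual = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# The ".en" checkpoint only transcribes English; the multilingual one also
# covers German, French, Italian, Spanish, Portuguese, and many more languages.
print(english_only("english_sample.wav")["text"])
print(multilingual("german_sample.wav", generate_kwargs={"language": "german"})["text"])
```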

In multilingual tests, Microsoft's Phi-4 multimodal instruct leads in German and Italian.


Seeing the Open ASR Leaderboard appear feels like a small step toward more openness in a space that’s usually full of closed-door benchmarks and hype. It gives us a public, standardized place to compare models, so it’s not just a score sheet; it’s a shared reference point for anyone trying to move forward. Including both speed and accuracy seems important, because most deployments need a mix of the two rather than just the highest numbers.

I think this could push innovation a bit faster by pointing out where big, general-purpose models excel and where niche systems have an edge. For developers and companies, having an independent checkpoint probably cuts down the guesswork when picking a solution and might even spark a bit more healthy competition. In the end, the real payoff may be that the industry starts looking beyond raw accuracy and begins to value the kinds of trade-offs that actually make speech recognition useful in the messy, varied settings we work in.

Common Questions Answered

Which organizations collaborated to create the Open ASR Leaderboard?

The Open ASR Leaderboard was developed by a research group from Hugging Face, Nvidia, the University of Cambridge, and Mistral AI. This collaboration brought together expertise from academia and industry to create a standardized evaluation platform for automatic speech recognition systems.

How many models have been tested on the Open ASR Leaderboard so far?

According to the project's study, more than 60 models from 18 different companies have been evaluated on the Open ASR Leaderboard. This extensive testing provides a broad comparison of both open-source and commercial speech recognition systems.

What are the main evaluation categories covered by the Open ASR Leaderboard?

The leaderboard's evaluation covers three primary categories: English transcription, multilingual recognition (German, French, Italian, Spanish, and Portuguese), and long audio files over 30 seconds. These categories help assess how well models handle different languages and longer recordings.

Why is the inclusion of both speed and accuracy metrics important for the Open ASR Leaderboard?

Including both speed and accuracy metrics is crucial because real-world applications require a balance between performance and practicality. This comprehensive approach acknowledges that users need systems that are not only accurate but also responsive in various usage scenarios.