NVIDIA's AF-Next Audio Model Beats Phi-4-mm Benchmark
NVIDIA, UMD release AF-Next audio model, beats Phi-4-mm by 12 points on Arabic
NVIDIA and researchers at the University of Maryland have just put a new name on the table for large‑scale audio‑language work. Their model, dubbed Audio Flamingo Next (AF‑Next), is billed as the first fully open audio‑language system that can operate at internet scale. The collaboration promises a “super‑powerful” large audio‑language model (LALM) that can ingest raw sound and generate text across many languages, a capability that has been hard to achieve without proprietary data.
While earlier models have shown promise, AF‑Next aims to push the envelope on both size and openness, letting anyone experiment with a truly massive audio understanding engine. The team highlights five takeaways, the first being that AF‑Next scales audio understanding like no prior open model. That claim matters because benchmark results on the CoVoST2 speech‑translation suite reveal a striking gap over the competing Phi‑4‑mm system, especially for Arabic.
The numbers suggest a concrete leap in performance, setting the stage for the detailed comparison that follows.
On CoVoST2 speech translation, AF-Next shows a particularly notable 12-point improvement over Phi-4-mm on Arabic EN→X translation (21.9 vs. 9.9).

Key Takeaways

The paper highlights five key takeaways, including:

- A Fully Open Audio-Language Model at Internet Scale: AF-Next is considered the first LALM to scale audio understanding to internet-scale data -- approximately 108 million samples and 1 million hours of audio.
- Temporal Audio Chain-of-Thought Solves Long-Audio Reasoning: Unlike prior CoT approaches, AF-Next explicitly anchors each intermediate reasoning step to a timestamp in the audio before producing an answer.
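To make the timestamp-anchoring idea concrete, here is a minimal sketch of what a temporally grounded reasoning trace might look like as a data structure. The class, field, and function names are illustrative assumptions for this article, not AF-Next's actual interface:

```python
from dataclasses import dataclass

# Illustrative sketch only: names and structure are assumptions,
# not AF-Next's real API.
@dataclass
class TimedStep:
    start_s: float   # timestamp in the audio where the evidence begins
    end_s: float     # timestamp where the evidence ends
    thought: str     # intermediate reasoning tied to that audio span

def temporal_cot_trace(steps: list[TimedStep]) -> str:
    """Render each reasoning step anchored to its audio timestamps,
    mimicking a chain-of-thought that cites where it 'listened'."""
    return "\n".join(
        f"[{s.start_s:.1f}s-{s.end_s:.1f}s] {s.thought}" for s in steps
    )

steps = [
    TimedStep(0.0, 4.2, "A door creaks open."),
    TimedStep(4.2, 9.8, "Footsteps approach, growing louder."),
]
print(temporal_cot_trace(steps))
```

The point of the design, as the paper frames it, is that each intermediate step must cite a specific region of the audio rather than reasoning in free text, which keeps long-audio reasoning grounded in the signal.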
AF‑Next arrives as the newest entry in the Audio Flamingo family, and its creators present it as the most capable model to date. While the paper highlights a 12‑point gain over Phi‑4‑mm on Arabic speech‑translation tasks (21.9 versus 9.9), the broader implications for other languages and domains remain uncertain. The researchers stress that audio‑language modeling has lagged behind vision, and that this fully open, internet‑scale system is the first large audio‑language model to claim such breadth.
Yet, the evaluation cited focuses solely on CoVoST2, leaving open questions about robustness to longer recordings, noisy environments, or music‑centric queries. Moreover, the claim of “most capable” rests on a single benchmark, and it is unclear whether comparable gains will appear on tasks beyond speech translation. Still, the release marks a concrete step toward more open audio‑language research, and the community now has a publicly available model to probe.
Whether AF‑Next will translate into practical applications or sustain its reported performance across diverse settings remains to be seen.
Common Questions Answered
How does AF-Next improve upon previous audio-language models in Arabic speech translation?
AF-Next demonstrates a 12-point improvement over Phi-4-mm on CoVoST2 Arabic EN→X speech translation, scoring 21.9 versus 9.9. This represents a notable advance in handling complex audio-to-text translation tasks, particularly for Arabic language processing.
What makes AF-Next unique in the field of large-scale audio-language modeling?
AF-Next is the first fully open audio-language model designed to operate at internet scale, processing approximately 108 million audio samples and 1 million hours of audio. The model introduces a novel Temporal Audio Chain-of-Thought approach that enhances reasoning capabilities for long-form audio content.
What collaborative effort led to the development of the AF-Next audio model?
NVIDIA and researchers from the University of Maryland collaborated to create the Audio Flamingo Next (AF-Next) model, aiming to develop a powerful audio-language system capable of processing raw sound and generating text across multiple languages. This partnership represents a significant step forward in open-source audio-language modeling.