Research & Benchmarks - Page 5 of 28

Academic AI research, performance benchmarks, scientific breakthroughs, and peer-reviewed studies advancing artificial intelligence frontiers.

547 articles

GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6) for 1/6 cost

For one-sixth the cost, GLM-5.2 just punched above its weight on SWE-bench Pro, scoring 62.1 to GPT-5.5’s 58.6. That single number, though, is only the start.

June 16, 2026 • 3 min read

AMD builds Llama 3.1 8B pretraining benchmark for MLPerf, using random weights

AMD has submitted a training benchmark for a model that doesn't learn. For the latest MLPerf results, the company pretrained Meta's Llama 3.1 8B architecture using entirely random weights. No data, no real gradients.

June 16, 2026 • 3 min read

AMD's MI355X CDNA4 GPU Shows Competitive Training Times in MLPerf v6.0

AMD just matched Nvidia’s top chip. In a head-to-head sprint, the company's new MI355X accelerator fine-tuned a Llama 2 70B model in the same time as an Nvidia B200 GPU, using an eight-accelerator setup.

June 16, 2026 • 3 min read

NVIDIA Blackwell Leads MLPerf Training 6.0 with Full‑Stack Scale

Nvidia won everything. The MLPerf Training 6.0 benchmark results are in, and the company's Blackwell platform took first place in every single test. That clean sweep isn't about a chip.

June 16, 2026 • 3 min read

DR-DCI Enables Agent-Callable Retrieval to Expand Local Workspace Efficiently

Most AI research papers sell you a new way to fail. They’ll claim they’ve cracked some fundamental tension, then quietly fudge the data. This one’s different. DR-DCI fixes a real problem.

June 16, 2026 • 4 min read

Fused kernels boost MoE training, forward and backward passes up to 1.3×

Training a large Mixture-of-Experts model often feels like herding cats on a supercomputer. The GPU is constantly starting and stopping tiny tasks, stuck waiting for messages between experts, and never really working at full capacity.

June 15, 2026 • 3 min read

Hybrid Open-Ended Tri-Evolution Improves Deep Research for AI Agents

Deep research is where AI agents usually fail. They can pull up facts, but they can't learn from them. Their knowledge is frozen. Meanwhile, a separate line of work, called agent evolution, has shown real promise.

June 15, 2026 • 3 min read

Microsoft Research Mirage adds persistent spatial memory to video generation

Video generation has long suffered from a quiet, expensive flaw: every time a model renders a new viewpoint, it must rebuild the world from scratch, pixel by pixel, memory bleeding away with each frame.

June 14, 2026 • 4 min read

Amazon security research prompts White House ban on Anthropic Fable

Amazon's security team found a hole. The White House answered the phone. Now some of the best researchers in the country can't touch their own work. Andy Jassy got a call from Washington after his team flagged something in Anthropic's Fable model.

June 14, 2026 • 3 min read

Study: AI coding agents locate correct file but miss key lines in bugs

AI coding assistants are great at finding the file. They're terrible at reading it.

June 14, 2026 • 3 min read

OpenAI confirms cooperation as state attorneys general launch investigation

OpenAI just beat Elon Musk in court. Now it's facing a different kind of fight. State attorneys general are investigating the company, and OpenAI says it's cooperating. It won't say which attorneys general are involved or what they're asking for. This isn't happening in a vacuum.

June 13, 2026 • 3 min read

Gemini‑SQL2 leads BIRD benchmark with 80.04% execution accuracy

Google's Gemini-SQL2 just hit 80.04% execution accuracy on the BIRD benchmark. That's a specific, hard number. For context, OpenAI's GPT-5.5-xhigh sits at 72.8%. Claude Opus 4.6 is at 70.9%.

June 13, 2026 • 3 min read

NVIDIA tops AA‑AgentPerf benchmark, credits Vera Rubin platform

Leaderboards are usually marketing noise. This one is different. NVIDIA just topped the first major benchmark for AI agent performance, and the margin isn't close.

June 12, 2026 • 3 min read

Perplexity routes deep‑research subtasks across 20+ models using Gemini agent

Most AI search is still just a fancy text predictor. Perplexity decided to build a factory instead. Its new system breaks a single query into pieces and farms them out to over twenty different AI models, all working at once.

June 12, 2026 • 3 min read

Visual model exploits similarity of 打, 拍, 拉; text model starts from embeddings

Being able to see is not the same as being able to read. A new comparison of training methods for Chinese characters proves it. The visual model gets a head start. It recognizes that the characters 打, 拍, and 拉 all share the same hand radical (扌).

June 12, 2026 • 3 min read

New arXiv Paper Introduces Strategic Decision Support for AI Agents

We used to think of support as something a computer gave a person. Now the person is often the backup for the computer. This is a real problem. An AI agent booking a flight or managing a supply chain doesn't have a bad day.

June 12, 2026 • 3 min read

LSEG integrates trusted data into ChatGPT workflows, says Max Grigoryev

Everyone selling AI says it's trustworthy. Most are hoping you don't check. The London Stock Exchange Group is taking a different, more literal approach: they're feeding their own vetted financial data directly into ChatGPT's machinery.

June 11, 2026 • 3 min read

Hermes Agent Builder Unites Identity, Model, Skills, Servers in One Dashboard

The dashboard is the command center. Hermes Agent’s Profile Builder collapses fragmentation into one coherent pane of glass: identity, model, skills, and MCP servers all flow together.

June 11, 2026 • 3 min read

SciConBench launches with 9.11K questions to test AI scientific synthesis

Forget whether AI can write your emails. The real question is whether it can do science. A new, brutally difficult benchmark called SciConBench makes it clear that the answer, for now, is a hard no.

June 11, 2026 • 4 min read

Language Agents Self‑Gate Clarification: Mandatory vs Opportunistic Modes

Most AI that talks to you is guessing. It’s making a probabilistic wager on what you meant. A new paper suggests the smartest move a language model can make isn't a better guess, but knowing when to stop guessing altogether.

June 11, 2026 • 4 min read
