Research & Benchmarks - Latest AI News & Updates
Academic AI research, performance benchmarks, scientific breakthroughs, and peer-reviewed studies advancing artificial intelligence frontiers.
Academic AI research, performance benchmarks, scientific breakthroughs, and peer-reviewed studies advancing artificial intelligence frontiers.
The paper arXiv:2606.18557v1 introduces DeFAb, a benchmark built to test defeasible abduction in foundation models.
OpenAI researchers have unveiled a new approach to gauge how often an AI model will slip up once it reaches users.
Why does this matter? Because robots still stumble on the kind of grasping a human takes for granted.
Why does model choice matter in computational pharmacovigilance? While the InferBERT framework promises to blend transformer architectures with Do‑calculus for causal inference, its performance still hinges on the classifier underneath.
OpenAI has rolled out a new safety checkpoint it calls Deployment Simulation. The premise is straightforward: before a model reaches customers, OpenAI runs a rehearsal of how it would behave in the wild.
Why does this matter? Because an open‑weights model is now posting scores that challenge the big proprietary players. Z.ai’s GLM‑5.2 hit 62.1 on SWE‑bench Pro, edging out OpenAI’s GPT‑5.5, which logged 58.6, while costing roughly one‑sixth as much.
AMD has posted a detailed guide for anyone wanting to reproduce its MLPerf Training v6.0 results.
AMD has laid out its MLPerf Training v6.0 results, showcasing how the latest Instinct GPUs perform on three high‑profile benchmarks.
NVIDIA just swept the latest MLPerf Training v6.0 results, a benchmark suite run by the MLCommons consortium. Why does this matter?
Agentic search over massive text collections still leans on retriever‑mediated front ends—think BM25 or ColBERT—to pull in candidate passages.
Mixture‑of‑experts models are now a staple of large‑scale AI, letting engineers expand capacity while only a slice of parameters fires for each token.
Why does this matter now? Researchers have long split AI progress into two strands: pulling together scattered data to answer complex queries, and letting agents learn by trial and error.
Here's the thing: generating video that stays coherent as the camera pans has long been a pain point.
Why does this matter? A Wall Street Journal report says an Amazon security paper helped spark a White House export‑control order that forced Anthropic to cut off foreign‑national access to its Fable 5 and Mythos 5 models.
A new benchmark is pulling back the curtain on a blind spot in AI‑driven bug fixing.
A coalition of state attorneys general has opened an investigation into OpenAI, and the company was served with a subpoena from New York’s attorney general on Friday, the Wall Street Journal reports.
Google Research has rolled out Gemini‑SQL2, a text‑to‑SQL system built on the Gemini 3.1 Pro foundation. It takes plain‑language questions and turns them into SQL queries that actually run.
AI agents have upended how we think about inference workloads. While the hype is loud, the industry has long lacked a clear yardstick for measuring performance under these new conditions.
Perplexity has shifted its Deep Research capability into Computer, the company’s new multi‑model orchestration platform that debuted in late February 2026. Why does that matter?
Three renditions of 人工智能—full, 80 % retained, 50 % retained—appear side by side. You can read each instantly, even though the latter two show only a slice of the original image.