Research & Benchmarks - Page 5 of 24
Academic AI research, performance benchmarks, scientific breakthroughs, and peer-reviewed studies advancing artificial intelligence frontiers.
Peter Steinberger, the mind behind the open‑source project OpenClaw, has built a tiny team that leans heavily on AI.
Modern generators such as Sora 2, Seedance 2.0 and Veo 3.1 can now churn out clips that look almost photorealistic. But a new benchmark from Tsinghua University shows that looking good isn’t the same as getting it right.
The Allen Institute for AI and UC Berkeley have unveiled EMO, a mixture‑of‑experts language model that builds internal modules around content domains instead of pure syntax.
Researchers have introduced RecursiveMAS, a framework that reshapes how multiple language models collaborate.
ArXiv is tightening the screws on AI‑generated text. Starting now, any preprint that shows “incontrovertible evidence” the authors skipped a sanity check on large‑language‑model output will trigger a one‑year ban from the repository.
Here’s the thing: building an AI agent that only shows the tools it truly needs isn’t a fantasy—it’s the core of this tutorial.
Why does the reliability of AI agent benchmarks matter now? Because they steer everything from research funding to real‑world deployments.
“How do you know your agent isn’t hallucinating patient symptoms?” That question haunted a team that already had unit tests, integration tests and a model that shone on demo data—yet lacked any way to gauge hallucination rates, context faithfulness,...
The mouse pointer has been the quiet workhorse of personal computing for more than half a century—tracking position, registering clicks, and otherwise staying invisible. Google DeepMind researchers say that’s about to change.
LoRA has become the go‑to method for trimming the cost of fine‑tuning massive pretrained models.
We launched Parameter Golf to see how a tightly bounded problem would stir the machine‑learning community.
Tilde Research has released Aurora, a new optimizer that patches a hidden flaw in Muon.
OpenAI just rolled out Daybreak, a cybersecurity program that leans on its latest AI models, the Codex Security agent, and a growing roster of security partners.
Why does this matter? Because platforms that let people voice opinions in full sentences are increasingly common, yet the algorithms that group those inputs still rely on representations tuned for meaning, not for agreement.
Baidu has rolled out Ernie 5.1, a distilled version of its earlier Ernie 5.0. While the new model runs on roughly a third of the total parameters and uses about half the active parameters per query, Baidu says pre‑training costs dropped to just six...
As of May 10, 2026, Hermes Agent, an open‑source model from Nous Research, has claimed the top spot on OpenRouter’s global daily app and agent rankings.
Palisade Research has put AI agents through a practical test that reads like a cyber‑war scenario.
Why does this matter? As AI systems grow more capable, the gap between what they can do and what humans can verify widens.
You've probably typed a query into a search bar and got results that matched your words but missed the point. That's the gap most users feel: exact‑match engines versus meaning‑aware retrieval.
Remember the days when a data scientist’s pride came from squeezing that extra 0.7 % accuracy out of an XGBoost model after an overnight grid search?