Research & Benchmarks Articles - Complete AI News Archive
475 articles in this category
- 1. Google DeepMind uses MITRE ATT&CK to monitor AI agents as rogue employees
- 2. DeFAb Benchmark Enforces Polynomial-Time Checks for Logical Rigor
- 3. OpenAI researchers aim to forecast AI model failure rates pre‑launch
- 4. Nvidia AI Agent Trains Robots Autonomously, Editing Code from Papers
- 5. XGBoost, ALBERT, BioBERT, Med‑LLaMA evaluated for pharmacovigilance
- 6. OpenAI's Deployment Simulation Beats Baseline, Adds Risk Checks to Agentic Code
- 7. GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6) at 1/6 the cost
- 8. AMD builds Llama 3.1 8B pretraining benchmark for MLPerf, using random weights
- 9. AMD's MI355X CDNA4 GPU Shows Competitive Training Times in MLPerf v6.0
- 10. NVIDIA Blackwell Leads MLPerf Training 6.0 with Full‑Stack Scale
- 11. DR-DCI Enables Agent-Callable Retrieval to Expand Local Workspace Efficiently
- 12. Fused kernels boost MoE training, forward and backward passes up to 1.3×
- 13. Hybrid Open-Ended Tri-Evolution Improves Deep Research for AI Agents
- 14. Microsoft Research Mirage adds persistent spatial memory to video generation
- 15. Amazon security research prompts White House ban on Anthropic Fable
- 16. Study: AI coding agents locate correct file but miss key lines in bugs
- 17. OpenAI confirms cooperation as state attorneys general launch investigation
- 18. Gemini‑SQL2 leads BIRD benchmark with 80.04% execution accuracy
- 19. NVIDIA tops AA‑AgentPerf benchmark, credits Vera Rubin platform
- 20. Perplexity routes deep‑research subtasks across 20+ models using Gemini agent
- 21. Visual model exploits similarity of 打, 拍, 拉; text model starts from embeddings
- 22. New arXiv Paper Introduces Strategic Decision Support for AI Agents
- 23. LSEG integrates trusted data into ChatGPT workflows, says Max Grigoryev
- 24. Hermes Agent Builder Unites Identity, Model, Skills, Servers in One Dashboard
- 25. SciConBench launches with 9.11K questions to test AI scientific synthesis
- 26. Language Agents Self‑Gate Clarification: Mandatory vs Opportunistic Modes
- 27. Study Defines Privacy-Utility Frontier for Agent Memory via PR and AER
- 28. Model 5 tops penalized PR-AUC, recall and F1-score in scoring model training
- 29. NVIDIA Nsight Designer Streams ONNX Editing and TensorRT Engine Build
- 30. AI moves beyond automation to plan, optimize and execute business initiatives
- 31. NVIDIA FLARE Auto-FL Enables Agent-Led Coding in Controlled Experiments
- 32. Multiverse reduces inference cost by favoring low‑cost prefill over decoding
- 33. AI agents solve neuroscience pipeline tasks on datasets larger than benchmarks
- 34. ML models predict World Cup outcomes, but miss draws, capture team strength
- 35. Reddit releases AI comment archive to study LLM persuasion tactics
- 36. Nvidia plans PC reboot, Apple unveils smart glasses on Vergecast
- 37. Open LLM v2, 12‑benchmark suite, LiveBench show d_eff 2.86‑4.80
- 38. NSF renews MIT AI‑physics institute, adds museum and hackathon outreach
- 39. From Prompt Tools to Workflow‑Driven AI: Managing Learning Curves
- 40. Geospatial ML Models Show Uneven Reliability Across Sparse Strata
- 41. Agents automate data retrieval, cleaning, analysis, modeling and reporting
- 42. Explainable ML Classifies Alzheimer's Early in 1,641 ADNI Subjects
- 43. MIT researchers train AI to read charts, streamlining downstream workflows
- 44. Lightweight CNN Boosts Adversarial Robustness in EEG‑Based Brain‑Computer Interfaces
- 45. Hundreds sign Leiden Declaration as AI threatens mathematicians' profession
- 46. Transformer tops Gait2Hip-60 benchmark with 0.819 R² in hip force prediction
- 47. QASM-Eval Introduces First Dataset for Training LLMs on OpenQASM-3
- 48. Parallax adds learned covariance correction to linear attention, retains softmax
- 49. Men use AI coding agents over twice as often as women; economists at 39%
- 50. Molecule-trained AI gives better chicken pairing suggestions than recipe AI
- 51. AI search agents favor confirming hits, sideline gut answers, study finds
- 52. OpenAI gives free life‑sciences AI model to aid government pandemic prep
- 53. Review paper claims code defines AI agents' reasoning and behavior
- 54. NVIDIA research moves robotics simulation to reality, revealing robot confusion
- 55. CVPR 2026 Friday Session: STARFlow‑V Video Modeling Poster #178, 4‑6 PM
- 56. E^3‑Agent splits fast router from LLM meta‑controller for edge inference
- 57. Sakana AI's DiffusionBlocks Apply Uniform [4,4,4] Layers Across Three Blocks
- 58. AI Agent Auto-Identifies Unreadable Model Parameters from CSV Files
- 59. Learn to Build AI Projects: n8n Automation, Financial Data, Summaries, Reports
- 60. AI Agents Falter in Production as Backward Design Overburdens Model
- 61. Hugging Face releases LeRobot Humanoid: 3D‑printable legs for robot research
- 62. Synthetic 1,000‑Customer Dataset Uses Gender and Income to Test Bias
- 63. SciAtlas Introduces Large-Scale Knowledge Graph to Aid Automated Research
- 64. Google outperforms OpenAI on math benchmark, winning by a 9-to-1 ratio
- 65. ByteDance study: LMMs answer questions better than full-page transcription
- 66. Language Models Forecast Research Success Using 11,488 Comparative Idea Pairs
- 67. Researchers use triplet loss to train high-quality Horn logic embeddings
- 68. Positive-IC 46.4% indicates negative bias; |IC| just under 0.02 after two runs
- 69. AgentNLQ released as a general‑purpose NL2SQL agent; accuracy lags human writers
- 70. SuperAI Conference Highlights Growing AI Startup and Infrastructure Scene in Asia
- 71. CODEX Agent Adds AI‑Q Deep Research Skill from GitHub Repository
- 72. AI models learn chemistry; talent and collaborations offset location concerns
- 73. Real-Time Diffusion on Apple M3 Ultra: CoreML, Quantization, Neural Engine
- 74. Evaluating AI Agents: Does the Engine Grasp Instructions and Reason Facts?
- 75. AgentWall adds runtime safety layer for local AI agents' actions
- 76. LangSmith Engine automates agent debugging; OpenAI's Frontier offers platform
- 77. VideoWorld paper links prediction, simulation, reasoning in robotics
- 78. Channel-independent models tolerate modalities but falter on within-modality gaps
- 79. Study Finds Current ToM Benchmarks Overlook First‑Person, Dynamic Interaction
- 80. Graph‑Enhanced RAG Architecture Cuts Latency in Meta‑Scale Production
- 81. OpenClaw founder runs 100 AI agents at USD 1.3M/month to write code, review PRs, and find bugs
- 82. New benchmark shows AI video generators look realistic but lack reasoning
- 83. Researchers train AI model achieving near-full performance using 12.5% of experts
- 84. RecursiveMAS cuts multi-agent inference time 2.4×, slashes token use 75%
- 85. ArXiv to ban authors of papers with unchecked LLM‑generated content
- 86. Building an MCP-Routed AI Agent: Dynamic Tool Exposure via Keywords, Tags, and Constraints
- 87. BenchJack proposes secure-by-design AI benchmark audit with eight-flaw taxonomy
- 88. 12‑Metric AI Agent Eval Harness Built in 9‑14 Days Across 100+ Deployments
- 89. Google DeepMind adds Gemini-powered cursor to Chrome for visual queries
- 90. BaLoRA adds Bayesian uncertainty to low‑rank adaptation, but lags fine‑tuning
- 91. Community review tools guide novices in AI research, study finds
- 92. Tilde Research's Aurora optimizer beats Muon and NorMuon at 340M scale
- 93. OpenAI unveils Daybreak to secure Codex, with industry and government rollout
- 94. New embeddings prioritize preferential similarity over semantics for clustering
- 95. Baidu's Ernie 5.1 Cuts 94% Pre‑Training Costs Using Once‑For‑All Framework
- 96. Hermes Agent tops use as Nous Research’s self‑improving model leads OpenRouter
- 97. Palisade Research: Open‑weight AI like Qwen boosts autonomous hacking
- 98. Study proposes method to curb AI reward hacking in safety tests
- 99. Build Python Vector Search with Cosine Similarity for Scale‑Invariant Matching
- 100. AI success shifts from 95% accuracy to latency, cost, and reliability