Research & Benchmarks Articles - Complete AI News Archive
474 articles in this category
- 1. DeFAb Benchmark Enforces Polynomial-Time Checks for Logical Rigor
- 2. OpenAI researchers aim to forecast AI model failure rates pre‑launch
- 3. Nvidia AI Agent Trains Robots Autonomously, Editing Code from Papers
- 4. XGBoost, ALBERT, BioBERT, Med‑LLaMA evaluated for pharmacovigilance
- 5. OpenAI's Deployment Simulation Beats Baseline, Adds Risk Checks to Agentic Code
- 6. GLM-5.2 beats GPT-5.5 on SWE-bench Pro (62.1 vs 58.6) for 1/6 cost
- 7. AMD builds Llama 3.1 8B pretraining benchmark for MLPerf, using random weights
- 8. AMD's MI355X CDNA4 GPU Shows Competitive Training Times in MLPerf v6.0
- 9. NVIDIA Blackwell Leads MLPerf Training 6.0 with Full‑Stack Scale
- 10. DR-DCI Enables Agent-Callable Retrieval to Expand Local Workspace Efficiently
- 11. Fused kernels boost MoE training, forward and backward passes up to 1.3×
- 12. Hybrid Open-Ended Tri-Evolution Improves Deep Research for AI Agents
- 13. Microsoft Research Mirage adds persistent spatial memory to video generation
- 14. Amazon security research prompts White House ban on Anthropic Fable
- 15. Study: AI coding agents locate correct file but miss key lines in bugs
- 16. OpenAI confirms cooperation as state attorneys general launch investigation
- 17. Gemini‑SQL2 leads BIRD benchmark with 80.04% execution accuracy
- 18. NVIDIA tops AA‑AgentPerf benchmark, credits Vera Rubin platform
- 19. Perplexity routes deep‑research subtasks across 20+ models using Gemini agent
- 20. Visual model exploits similarity of 打, 拍, 拉; text model starts from embeddings
- 21. New arXiv Paper Introduces Strategic Decision Support for AI Agents
- 22. LSEG integrates trusted data into ChatGPT workflows, says Max Grigoryev
- 23. Hermes Agent Builder Unites Identity, Model, Skills, Servers in One Dashboard
- 24. SciConBench launches with 9.11K questions to test AI scientific synthesis
- 25. Language Agents Self‑Gate Clarification: Mandatory vs Opportunistic Modes
- 26. Study Defines Privacy-Utility Frontier for Agent Memory via PR and AER
- 27. Model 5 tops penalized PR-AUC, recall and F1-score in scoring model training
- 28. NVIDIA Nsight Designer Streams ONNX Editing and TensorRT Engine Build
- 29. AI moves beyond automation to plan, optimize and execute business initiatives
- 30. NVIDIA FLARE Auto-FL Enables Agent-Led Coding in Controlled Experiments
- 31. Multiverse reduces inference cost by favoring low‑cost prefill over decoding
- 32. AI agents solve neuroscience pipeline tasks on datasets larger than benchmarks
- 33. ML models predict World Cup outcomes, but miss draws, capture team strength
- 34. Reddit releases AI comment archive to study LLM persuasion tactics
- 35. Nvidia plans PC reboot, Apple unveils smart glasses on Vergecast
- 36. Open LLM v2, 12‑benchmark suite, LiveBench show d_eff 2.86‑4.80
- 37. NSF renews MIT AI‑physics institute, adds museum and hackathon outreach
- 38. From Prompt Tools to Workflow‑Driven AI: Managing Learning Curves
- 39. Geospatial ML Models Show Uneven Reliability Across Sparse Strata
- 40. Agents automate data retrieval, cleaning, analysis, modeling and reporting
- 41. Explainable ML Classifies Alzheimer's Early in 1,641 ADNI Subjects
- 42. MIT researchers train AI to read charts, streamlining downstream workflows
- 43. Lightweight CNN Boosts Adversarial Robustness in EEG‑Based Brain‑Computer Interfaces
- 44. Hundreds sign Leiden Declaration as AI threatens mathematicians' profession
- 45. Transformer tops Gait2Hip-60 benchmark with 0.819 R² in hip force prediction
- 46. QASM-Eval Introduces First Dataset for Training LLMs on OpenQASM-3
- 47. Parallax adds learned covariance correction to linear attention, retains softmax
- 48. Men use AI coding agents over twice as often as women; economists at 39%
- 49. Molecule-trained AI gives better chicken pairing suggestions than recipe AI
- 50. AI search agents favor confirming hits, sideline gut answers, study finds
- 51. OpenAI gives free life‑sciences AI model to aid government pandemic prep
- 52. Review paper claims code defines AI agents' reasoning and behavior
- 53. NVIDIA research moves robotics simulation to reality, revealing robot confusion
- 54. CVPR 2026 Friday Session: STARFlow‑V Video Modeling Poster #178, 4‑6 PM
- 55. E^3‑Agent splits fast router from LLM meta‑controller for edge inference
- 56. Sakana AI's DiffusionBlocks Apply Uniform [4,4,4] Layers Across Three Blocks
- 57. AI Agent Auto-Identifies Unreadable Model Parameters from CSV Files
- 58. Learn to Build AI Projects: n8n Automation, Financial Data, Summaries, Reports
- 59. AI Agents Falter in Production as Backward Design Overburdens Model
- 60. Hugging Face releases LeRobot Humanoid: 3D‑printable legs for robot research
- 61. Synthetic 1,000‑Customer Dataset Uses Gender and Income to Test Bias
- 62. SciAtlas Introduces Large-Scale Knowledge Graph to Aid Automated Research
- 63. Google outperforms OpenAI on math benchmark, winning by a 9-to-1 ratio
- 64. ByteDance study: LMMs answer questions better than full-page transcription
- 65. Language Models Forecast Research Success Using 11,488 Comparative Idea Pairs
- 66. Researchers use triplet loss to train high-quality Horn logic embeddings
- 67. Positive-IC 46.4% indicates negative bias; |IC| just under 0.02 after two runs
- 68. AgentNLQ released as a general‑purpose NL2SQL agent; accuracy lags human writers
- 69. SuperAI Conference Highlights Growing AI Startup and Infrastructure Scene in Asia
- 70. CODEX Agent Adds AI‑Q Deep Research Skill from GitHub Repository
- 71. AI models learn chemistry; talent and collaborations offset location concerns
- 72. Real-Time Diffusion on Apple M3 Ultra: CoreML, Quantization, Neural Engine
- 73. Evaluating AI Agents: Does the Engine Grasp Instructions and Reason Facts?
- 74. AgentWall adds runtime safety layer for local AI agents' actions
- 75. LangSmith Engine automates agent debugging; OpenAI's Frontier offers platform
- 76. VideoWorld paper links prediction, simulation, reasoning in robotics
- 77. Channel-independent models tolerate modalities but falter on within-modality gaps
- 78. Study Finds Current ToM Benchmarks Overlook First‑Person, Dynamic Interaction
- 79. Graph‑Enhanced RAG Architecture Cuts Latency in Meta‑Scale Production
- 80. OpenClaw founder runs 100 AI agents for USD 1.3M/month to code, review PRs, find bugs
- 81. New benchmark shows AI video generators look realistic but lack reasoning
- 82. Researchers train AI model achieving near-full performance using 12.5% of experts
- 83. RecursiveMAS cuts multi-agent inference time 2.4×, slashes token use 75%
- 84. ArXiv to ban authors of papers with unchecked LLM‑generated content
- 85. Building an MCP-Routed AI Agent: Dynamic Tool Exposure via Keywords, Tags, and Constraints
- 86. BenchJack proposes secure-by-design AI benchmark audit with eight flaw taxonomy
- 87. 12‑Metric AI Agent Eval Harness Built in 9‑14 Days Across 100+ Deployments
- 88. Google DeepMind adds Gemini-powered cursor to Chrome for visual queries
- 89. BaLoRA adds Bayesian uncertainty to low‑rank adaptation, but lags fine‑tuning
- 90. Community review tools guide novices in AI research, study finds
- 91. Tilde Research's Aurora optimizer beats Muon and NorMuon at 340M scale
- 92. OpenAI unveils Daybreak to secure Codex, with industry and government rollout
- 93. New embeddings prioritize preferential similarity over semantics for clustering
- 94. Baidu's Ernie 5.1 Cuts 94% Pre‑Training Costs Using Once‑For‑All Framework
- 95. Hermes Agent tops usage as Nous Research’s self‑improving model leads OpenRouter
- 96. Palisade Research: Open‑weight AI like Qwen boosts autonomous hacking
- 97. Study proposes method to curb AI reward hacking in safety tests
- 98. Build Python Vector Search with Cosine Similarity for Scale‑Invariant Matching
- 99. AI success shifts from 95% accuracy to latency, cost, and reliability
- 100. Apple Workshop Shows ML with Homomorphic Encryption, Georgia Institute, CISPA