📂 Category
Research & Benchmarks News Archive - Page 2 of 5
475 articles in this category
- 101. Apple Workshop Shows ML with Homomorphic Encryption, Georgia Institute, CISPA
- 102. OpenAI opens GPT-5.5-Cyber to vetted security researchers, adds three tiers
- 103. LightSeek launches TokenSpeed, cutting LLM latency by half vs TensorRT-LLM
- 104. CLIP-FP8 Model Matches CLIP-FP16 Quality; Patch Embedding Quantizers Matter
- 105. Automation updates AI context morning with active threads, key dates, note
- 106. Google DeepMind buys minority stake in EVE Online studio for AI testing
- 107. Meta AI releases NeuralBench, benchmark for 36 EEG tasks, 94 datasets
- 108. iTARFlow Shows Competitive Performance on ImageNet 64‑256px Resolutions
- 109. CreativityBench benchmark introduces 4K‑entity affordance KB to test LLM creativity
- 110. Self-Attentive Meta-Optimizer Adds Gradient Alignment and Group-Adaptive Rates
- 111. Local edits in LLM-driven NAS can trigger broader performance shifts
- 112. Groq‑Powered Agentic Assistant Uses Sub‑Agent to Catalog 2024‑25 SLMs
- 113. AI autoencoders and joint communications‑sensing rank among 6G enablers
- 114. Anthropic adds 'dreaming' feature to Claude Managed Agents for memory recall
- 115. Anthropic's USD 200 B, five‑year Google Cloud deal makes up >40% of backlog
- 116. MRC retires paths, then probes to confirm failures and recovery
- 117. eOptShrinkQ enables near‑lossless KV cache compression with spectral denoising
- 118. Harvard study finds OpenAI's o1 and 4o outdiagnose ER doctors in 76‑patient test
- 119. 2021 EDEN-unbiased quantizer beats 2026 successor in average accuracy
- 120. US benchmark shows China lagging; DeepSeek model underperforms private tests
- 121. Google DeepMind AI co‑clinician beats GPT‑5.4 in blind tests, lags docs
- 122. Anthropic benchmark says Claude matches experts, 23 tasks remain ambiguous
- 123. Grok Voice Think Fast 1.0 lets non‑programmers design agents via console.x.ai
- 124. WPI professor Gerych offers solution to AI vision ‘Whac‑a‑mole’ bias dilemma
- 125. New method advances privacy‑preserving AI training on consumer devices
- 126. Musk says he was duped, warns AI could kill us, xAI to IPO via SpaceX in June
- 127. NVIDIA BioNeMo wraps CPU layer with DistributedTriangleMultiplication
- 128. DeepSeek unveils new AI breakthrough as nation tightens grip on departing firms
- 129. Poolside AI launches Laguna XS.2 and M.1, hitting 72.5% on SWE-bench Verified
- 130. Oracle abandons its legacy, pivots to AI in an unconventional approach
- 131. New Architecture Separates Execution and Review Agents for Tool-Calling
- 132. MIT study links language model scaling success to superposition of concepts
- 133. Google staff urge Sundar Pichai to reject classified military AI projects
- 134. AI framework autonomously optimizes data, models, algorithms, outperforms humans
- 135. MolClaw Introduces Autonomous Agent for Hierarchical Drug Screening
- 136. Lakehouse concept drives AI data access for thousands of enterprise users
- 137. Fine-tuning RAG embeddings may drop retrieval accuracy 40%, study finds
- 138. AI pipelines show silent failures from orchestration drift, detected weeks later
- 139. OSWorld Benchmark Evaluates LLMs on Real Computer Use, Unlike Text‑Only Tests
- 140. PageIndex Retrieves via Reasoning Using OpenAI gpt-5.4 Model
- 141. Discord Users Access Anthropic's Mythos AI Tool Without Authorization
- 142. Google DeepMind's Vision Banana Outperforms SAM 3 and Depth Anything V3
- 143. DeepMind spinoff’s AI‑designed drugs enter human trials after AlphaFold 3
- 144. COALA paper defines agent memory types: procedural rules and semantic facts
- 145. Google DeepMind's Decoupled DiLoCo hits 88% goodput despite hardware failures
- 146. Agent observability powers production evaluation through trace analysis
- 147. Xiaomi launches MiMo‑V2.5‑Pro and V2.5, matching benchmarks at lower token cost
- 148. Designing Production-Grade CAMEL Multi-Agent Systems: Start with Docs and GitHub
- 149. Multi-agent AI systems incur higher token costs than single agents in practice
- 150. Reinforcement learning trains AI like OpenAI's o1 to admit uncertainty
- 151. LangSmith adds reusable LLM-as-judge and rule-based code evaluator templates
- 152. AI made up over a third of new sites by 2025; Pope warning flagged as AI
- 153. Sergey Brin pushes DeepMind to match Claude, unveils agent skills catalog
- 154. Fortnite adds AI‑powered NPCs for unscripted player conversations
- 155. TabPFN hits 98.8% accuracy in 0.47 s, beating Random Forest and CatBoost
- 156. NVIDIA PhysicsNeMo Tutorial Maps k(x,y) to u(x,y) for Darcy Flow
- 157. OpenAI unveils GPT‑Rosalind, AI model to speed drug discovery and genomics
- 158. Standard LLM guidelines focus on training costs, overlook inference budget
- 159. GPT‑Rosalind life‑sciences plugin for Codex launches on GitHub
- 160. OpenAI launches GPT-Rosalind, hits top score on BixBench benchmark
- 161. Frontier AI models fail one in three production runs, audits grow harder
- 162. Meta researchers unveil hyperagents for self‑improving AI in non‑coding tasks
- 163. Claude outperforms humans on alignment task, but results disappear in production
- 164. Google DeepMind unveils Gemini Robotics‑ER 1.6, beats prior model in tool count
- 165. UK tests Mythos AI, noting its ability to chain multistep attacks
- 166. AI Forum Launches Professional Certificate and USD 120M Fund for AI Fluency
- 167. Databricks finds multi-step agents beat single-turn RAG by 21% to 38% on STaRK
- 168. Stanford AI Index 2026: 53% adopt generative AI in 3 years, education lags
- 169. NVIDIA, UMD release AF-Next audio model, beats Phi-4-mm by 12 points on Arabic
- 170. Developers Claim Measured Drop in Claude's Performance, Sparking Nerf Debate
- 171. Seven AI agents in finance lift cash flow >3% monthly, boost productivity 50%
- 172. Meta AI and KAUST Propose Neural Computers Merging Compute, Memory, I/O
- 173. Prediction drift can mask security model decay despite stable accuracy
- 174. Researchers say OpenAI's Sora and Google's Veo aren't true world models
- 175. TriAttention KV Cache Compression Matches Full Attention, 2.5× Faster
- 176. Knowledge Distillation Keeps Student Model Capacity to Match Ensemble Boundaries
- 177. Google AI's PaperOrchestra boosts manuscript success, 79‑81% win rate
- 178. OSGym runs 1,000+ OS replicas at USD 0.23/day with decentralized state management
- 179. Stanford study finds AI agent handoffs lose information, affecting compute cost
- 180. Meta Superintelligence Labs launches Muse Spark, its first multimodal AI model
- 181. Better Harness updates add usage examples, chaining guide, and tool clarifications
- 182. Study finds ‘bot’ term used 16,232 times in 2.8M Telegram messages
- 183. Google AI Overviews answers 91% of test questions correctly after Gemini 3 update
- 184. MaxToki AI boosts context to 16,384 tokens with RoPE scaling
- 185. Meta staff inflate AI token counts on internal leaderboard, wasting resources
- 186. MassMutual, Mass General Brigham turn AI pilot sprawl into production
- 187. OpenAI safety staff exit as Altman dismisses Pentagon contract concerns
- 188. OpenAI urges firms to fund pensions, health, childcare as AI cuts costs
- 189. Study shows sycophantic AI chatbots can outwit ideal rational users
- 190. Americans use AI more than ever but trust it less, Quinnipiac poll shows
- 191. Study maps developer frustration with AI slop as tragedy of the commons
- 192. Google study: AI benchmarks ignore human disagreement; under 10 raters fail
- 193. Alibaba's Qwen team adds method that lengthens AI answers, prompting reasoning
- 194. Open models cross threshold; frontier models show per‑category correctness
- 195. Batch Mode VC-6 and NVIDIA Nsight Speed Up Vision AI Pipelines
- 196. CaP-Agent0 Beats Human Code on 4 of 7 Robot Tasks Using Low‑Level Blocks
- 197. NVIDIA breaks MLPerf records with 288 GPUs as AMD, Intel pursue other goals
- 198. NVIDIA's 288-GPU Blackwell Ultra Sets New MLPerf Inference Throughput Record
- 199. DeepMind study finds six traps that let a few poisoned docs hijack AI agents
- 200. AI productivity gap: top agent beats baseline in 1 of 15 runs, 26.5% subtasks