AI Code Benchmarks Fail the "Vibe Check," Says New DeepMind Study
When you hand an AI a coding prompt, how can you tell if the result is actually useful? For a long time, we checked just one thing: does it run? A recent paper argues that this yes-or-no view probably misses much of what developers actually care about.