Skip to main content
Study reveals AI coding agents finding correct files but missing critical bug lines in code review, highlighting limitations

Editorial illustration for Study: AI coding agents locate correct file but miss key lines in bugs

Study: AI coding agents locate correct file but miss key...

Study: AI coding agents locate correct file but miss key lines in bugs

3 min read

A new benchmark is pulling back the curtain on a blind spot in AI‑driven bug fixing. While most studies have judged an agent solely on whether it eventually patches a defect, the SWE‑Explore suite asks a different question: can the model point to the exact code fragments that matter? Researchers from Shanghai Jiao Tong University and partners built a test set of 848 bugs, then fed each one to powerful models—including GPT‑5.4, Gemini 3 Pro, Claude Sonnet 4.6 and Kimi K2.6.

For every problem they collected at least two successful solution attempts and traced which files and lines the AI actually examined before producing a fix. Passages that multiple independent runs converged on were treated as “useful context.” The benchmark evaluates only the first phase—returning a ranked list of relevant sections after receiving a bug description and the full project. Early results show agents often land in the right neighborhood, selecting the proper file, yet they regularly miss the precise lines that drive the bug.

The work highlights how a single “did it fix the bug?” metric can mask deeper shortcomings.

A bug description like "RuntimeWarning on Overflow" contains terms that show up far more often in a project's templates and docs than in the actual source code. AI agents pull ahead clearly because they search the project step by step instead of sorting all hits at once. Line-level accuracy drops off a cliff At the file level, the agents do fine.

They find the right source file, rank it early, and keep the selection tight. But the moment the test zooms in to individual lines of code, the system falls apart. General coding agents cover only 14 to 19 percent of the lines that actually matter.

Throwing a stronger language model at the problem doesn't fix it. The team ran the same agent with six different models from OpenAI, Anthropic, Google, Moonshot, and Zhipu. The GPT family leads, but the pattern holds.

File hit rates stay consistently higher than actual line coverage. The various agent architectures land strikingly close to each other. Claude Code, Codex, OpenHands, Mini-SWE-Agent, and AweAgent post nearly identical scores across every metric.

It scans code as a network of interconnected building blocks and achieves much higher line coverage. Among the specialized localization systems, AutoCodeRover works precisely but stays conservative, while OrcaLoca produces little noise but misses many relevant spots. Repairs fail below a minimum context threshold In a controlled ablation experiment, the team artificially varied the context.

The repair model saw only 0, 25, 50, 75, or 100 percent of the core regions, sometimes padded with irrelevant non-core code. For the easier tasks in the dataset, a clear threshold effect shows up.

Why this matters We now know AI coding agents can locate the right file but still miss the line that actually needs fixing. This nuance matters because previous benchmarks judged success only by whether the bug disappeared, masking a hidden weakness in the agents’ understanding of code context. For developers, the finding suggests that relying on an AI’s output without a line‑level sanity check could introduce subtle regressions.

Founders building AI‑assisted development tools must consider adding diagnostics that surface search accuracy, not just end‑state correctness. Researchers are left with an open question: will improving step‑by‑step project search translate into better line‑level precision, or does the gap reflect deeper limitations in how models interpret bug descriptions that often echo templates rather than source? The new benchmark offers a clearer target, yet it also reveals that current agents’ performance drops when the metric tightens.

Until we see methods that consistently bridge the search‑to‑fix gap, we should treat AI‑generated patches with cautious optimism and maintain rigorous code review practices.

Further Reading