Google AI launches Auto-Diagnose, LLM tool flags 84.3% of reports as ‘Please fix’
Google’s new Auto‑Diagnose system promises to sift through integration‑test failures using a large language model, then hand its findings back to developers. The tool was trialed with a pool of 437 distinct developers who collectively generated 517 feedback reports. Reviewers—370 in total—were asked to classify each diagnosis, and the resulting interaction pattern reveals how the system is being received on the ground.
While the rollout aimed to streamline bug triage, the data shows a clear tendency toward a single type of response. Moreover, developers themselves rated the usefulness of the feedback, producing a measurable helpfulness ratio. This backdrop frames the next set of numbers, which illustrate just how often reviewers are urging authors to act on the diagnoses.
Across the 517 total feedback reports from 437 distinct developers, 436 (84.3%) were "Please fix" responses from 370 reviewers, by far the dominant interaction and a sign that reviewers are actively asking authors to act on the diagnoses. Among developer-side feedback, the helpfulness ratio H / (H + N), where H and N count "Helpful" and "Not helpful" responses, is 62.96%, and the "Not helpful" rate N / (PF + H + N), where PF counts "Please fix" responses, is 5.8%, well under Google's 10% threshold for keeping a tool live. Across 370 tools that post findings to Critique, Auto-Diagnose ranks #14 in helpfulness, putting it in the top 3.78%.
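The two ratios can be reproduced from the reported counts. Note that only the "Please fix" total (436) and the overall total (517) are stated directly; the Helpful and Not-helpful counts below (51 and 30) are a reconstruction implied by the published 62.96% and 5.8% figures, so treat them as an assumption rather than numbers from the report.

```python
# Reconstructing Auto-Diagnose's feedback ratios from the reported counts.
# PF (Please fix) = 436 is stated; H = 51 and N = 30 are inferred from the
# published percentages (62.96% helpfulness, 5.8% not-helpful) and are an
# assumption, not figures quoted in the report.
PF, H, N = 436, 51, 30
total = PF + H + N            # 517 feedback reports

helpfulness = H / (H + N)     # dev-side helpfulness ratio
not_helpful = N / total       # "Not helpful" rate across all feedback

print(f"total reports:    {total}")
print(f"please-fix share: {PF / total:.1%}")   # ~84.3%
print(f"helpfulness:      {helpfulness:.2%}")  # ~62.96%
print(f"not-helpful rate: {not_helpful:.1%}")  # ~5.8%
```

The inferred split is internally consistent: 436 + 51 + 30 = 517, matching the stated report total.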
The manual evaluation also surfaced a useful side effect. Of the seven cases where Auto-Diagnose failed, four were because test driver logs were not properly saved on crash, and three were because SUT component logs were not saved when the component crashed -- both real infrastructure bugs, reported back to the relevant teams. In production, around 20 'more information is needed' diagnoses have similarly helped surface infrastructure issues.
Key Takeaways

- Auto-Diagnose hit 90.14% root-cause accuracy on a manual evaluation of 71 real-world integration-test failures spanning 39 teams at Google, addressing a problem that 6,059 developers ranked among their top five complaints in the EngSat survey.
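The accuracy figure squares with the failure count given earlier: 90.14% of 71 evaluated cases corresponds to 64 correct diagnoses, leaving exactly the seven failures attributed to missing logs. The 64/7 split below is inferred from the stated percentage, not quoted directly:

```python
# Sanity-checking the reported accuracy: 71 evaluated failures, of which
# 7 diagnoses failed (4 due to missing test-driver logs, 3 due to missing
# SUT component logs). The 64-correct figure is inferred from the stated
# 90.14% accuracy, not quoted directly in the report.
evaluated = 71
failed = 7
correct = evaluated - failed   # 64

accuracy = correct / evaluated
print(f"{accuracy:.2%}")       # ~90.14%
```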
Google’s Auto‑Diagnose reads integration‑test logs, extracts a root cause and drops a short diagnosis into the relevant code review. In a manual evaluation of 71 real‑world failures, the system produced a diagnosis for each case. Across 517 feedback reports from 437 developers, reviewers responded “Please fix” in 436 instances—84.3% of the total—showing that they are prompting authors to act on the suggestions.
Developers rated the tool helpful in roughly 63% of the interactions, a figure that suggests a modest level of utility. Yet the data set is limited; it is unclear whether the same response rates will hold in larger, more diverse codebases or under different review cultures. The study does not address turnaround time, nor does it measure any downstream impact on defect resolution speed.
Consequently, while the early numbers point to active engagement and a measurable helpfulness ratio, further evidence will be needed to gauge the broader effectiveness of Auto‑Diagnose in everyday development workflows.
Common Questions Answered
How accurate is Google's Auto-Diagnose system in identifying integration-test failures?
In a manual evaluation of 71 real-world integration-test failures, Auto-Diagnose achieved 90.14% root-cause accuracy and produced a diagnosis for each case. Separately, 84.3% of feedback reports received a 'Please fix' response from reviewers; that figure measures engagement rather than accuracy, but it suggests reviewers trust the diagnoses enough to ask authors to act on them.
What percentage of developers found the Auto-Diagnose tool helpful?
According to the study, developers rated the Auto-Diagnose tool helpful in approximately 62.96% of interactions. The tool also maintained a low 'Not helpful' rate of 5.8%, which is well below Google's 10% threshold for maintaining a tool's viability.
How many developers and feedback reports were involved in the Auto-Diagnose trial?
The Auto-Diagnose system was trialed with 437 distinct developers who generated 517 feedback reports. In total, 370 reviewers classified the diagnoses, and 436 reports (84.3%) received a 'Please fix' response, demonstrating significant engagement with the tool's suggestions.