Weaker LLMs Accidentally Delete Content, Shrinking Documents Over Time
Why does this matter? As AI moves from answering questions to handling whole workflows, we’re trusting models with the very files we rely on—legal contracts, codebases, research notes. A new study shows that trust may be misplaced.
Researchers built a benchmark called DELEGATE‑52, covering 52 professional domains ranging from Python scripts to crystallography data. Using a “round‑trip” test—ask the model to edit a document, then ask it to undo the change—they evaluated 19 large language models. In an ideal run the original file would reappear untouched; in reality, even top‑tier systems like Gemini Pro, Claude Opus and GPT‑5 altered about a quarter of the content after just 20 back‑and‑forth edits.
Weaker models fared worse, approaching a 50 % degradation rate. While the numbers are stark, the study stops short of explaining why the errors happen. Still, the findings raise a practical question: can we safely delegate complex, multi‑step editing to LLMs without risking silent corruption of the documents we depend on?
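The round-trip idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the benchmark's actual scoring: the `edit` and `undo` callables are hypothetical stand-ins for model calls, and "degradation" is approximated here as the share of the original's words that can no longer be matched in order after each round.

```python
import difflib
from typing import Callable, List


def degradation(original: str, current: str) -> float:
    """Fraction of the original's words not recovered in `current` (0.0 = intact).

    Assumption: word-level sequence matching is a rough proxy for content loss;
    the study's own metric is not described in this article.
    """
    orig_words = original.split()
    cur_words = current.split()
    if not orig_words:
        return 0.0
    matcher = difflib.SequenceMatcher(a=orig_words, b=cur_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - matched / len(orig_words)


def round_trip(
    document: str,
    edit: Callable[[str], str],   # hypothetical: asks a model to apply an edit
    undo: Callable[[str], str],   # hypothetical: asks a model to revert that edit
    rounds: int,
) -> List[float]:
    """Apply edit-then-undo repeatedly and record degradation after each round."""
    scores = []
    current = document
    for _ in range(rounds):
        current = undo(edit(current))
        scores.append(degradation(document, current))
    return scores
```

With a perfect `undo`, every score stays at 0.0. Any rising trend across rounds indicates that content is being lost or rewritten along the way.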
Weaker models tend toward deletion: they accidentally drop content, and the problem becomes noticeable after several interactions because the document visibly shrinks. In frontier LLMs, however, the root issue is not deletion but corruption: they preserve the document's overall "look and feel", even keeping the word count nearly intact, but they silently mistype, modify, or replace factual information with plausible-sounding fabrications. Here's the irony: the smarter the model, the harder its corruption is to detect, because the final output still looks legitimate at first glance.
Context Overload and Distractor Attachments
In messy conditions, with a lot of context or many attached documents, models struggle to keep information structurally intact. As the document grows or more "distractor files" are included in the prompt context, degradation becomes sharply more severe: the model loses its grip on accurate details and fills the gaps with predictions. It stops adhering to the source text because guessing is easier.
The Importance of Domain Familiarity
One last reason models degrade documents during complex delegated interactions is the nature of the use case and how familiar the model is with it.
Why this matters
Do we trust a tool that silently trims our work? We've seen weaker LLMs drop paragraphs, causing documents to shrink after a handful of delegated edits. That phenomenon becomes obvious only after several interactions, when the missing sections stand out.
In contrast, frontier models appear to retain length but introduce subtle corruption, mutating phrasing or structure without obvious deletions. Our teams therefore need to treat delegation as a reversible process, keeping snapshots or version control checkpoints before each AI pass. It is unclear whether stronger models will ever eliminate these errors entirely, or if the trade‑off will shift toward more nuanced distortion.
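One way to make each AI pass reversible is to save a copy of the file before handing it over. The following is a minimal sketch, assuming documents live on the local filesystem; the `snapshot` and `restore` helpers and the directory layout are hypothetical, and a real version control system would serve the same purpose.

```python
import hashlib
from pathlib import Path


def snapshot(doc_path: Path, snapshot_dir: Path) -> Path:
    """Copy the document into snapshot_dir under a content-hash name.

    Returns the path of the saved copy so it can be restored later.
    """
    data = doc_path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()[:16]
    snapshot_dir.mkdir(parents=True, exist_ok=True)
    target = snapshot_dir / f"{doc_path.stem}.{digest}{doc_path.suffix}"
    target.write_bytes(data)
    return target


def restore(snapshot_path: Path, doc_path: Path) -> None:
    """Overwrite the working document with a previously saved snapshot."""
    doc_path.write_bytes(snapshot_path.read_bytes())
```

Calling `snapshot` before every delegated edit and keeping the returned path means any pass can be rolled back with `restore`, regardless of how the output looks at first glance.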
Developers should monitor output for both missing content and altered meaning, especially in legal or technical drafts. Founders might consider building guardrails that flag unexpected reductions in token count. Researchers are left with an open question: can training regimes be adjusted to reduce accidental deletion without sacrificing creativity?
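A guardrail of the kind suggested above could look roughly like this. It is a sketch under stated assumptions: whitespace-separated words stand in for model tokens, and the function name `check_shrinkage` and the 10% default threshold are illustrative choices, not values from the study.

```python
from typing import Tuple


def check_shrinkage(before: str, after: str, max_drop: float = 0.10) -> Tuple[float, bool]:
    """Return (relative drop in word count, whether it exceeds max_drop).

    Assumption: whitespace-separated words approximate tokens well enough
    to spot a document that is shrinking between edits.
    """
    before_count = len(before.split())
    after_count = len(after.split())
    if before_count == 0:
        return 0.0, False
    drop = (before_count - after_count) / before_count
    return drop, drop > max_drop


# Example: an edit that silently loses two of ten words gets flagged.
drop, flagged = check_shrinkage(
    "one two three four five six seven eight nine ten",
    "one two three four five six seven eight",
)
```

A length check like this only catches the deletion pattern seen in weaker models. It would not catch the corruption described for frontier models, where word count stays nearly intact, so it complements a content review and does not replace one.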
Until we have concrete mitigations, caution remains advisable.