OpenAI, Microsoft sued for scraping newspapers and using text without license

A wave of legal challenges is converging on two of the most visible AI players, accusing them of pulling content from news sites without permission. Plaintiffs claim the firms built large language models on material that belongs to newspapers, then offered that same material through a search‑type feature. The complaint frames the behavior as a breach of established publishing rules, pointing to the removal of copyright notices and the absence of any licensing agreement.

By naming both the developer of the AI system and the cloud partner, the suit paints Microsoft not merely as a host but as an active participant in the alleged infringement. The stakes are high: if the allegations hold, the case could reshape how AI services source and reuse copyrighted text. The following excerpt lays out the core accusations in the plaintiffs’ own words.

The lawsuit alleges that OpenAI and Microsoft simply ignored these rules. The companies allegedly scraped the sites, stripped the copyright notices, and used the text without a license for both training and direct search results. Microsoft is targeted not just as an infrastructure provider, but as a co-designer of the models and a direct beneficiary of the alleged theft.

The plaintiffs are seeking damages exceeding $10 billion, citing US laws that allow for up to $150,000 per work for willful infringement and up to $25,000 for removing copyright information. They argue that because higher-quality datasets were sampled more frequently during training, professional press content had a disproportionate impact on the models. They also want the nuclear option: the destruction of all GPT models and training sets containing their work, a demand the New York Times also made in late 2023.

The mystery of the deleted book datasets

It's not just newspapers. OpenAI faces ongoing litigation from authors and publishers over the books used to train its AI. The dispute focuses on internal datasets dubbed "Books1" and "Books2," which allegedly contain massive amounts of e-books downloaded from the pirate library Library Genesis (LibGen).

According to an opinion and order by Magistrate Judge Ona T.

Will the courts draw a line around AI training data? Nine regional newspapers argue that OpenAI and Microsoft crossed it, filing a lawsuit in New York that seeks damages exceeding $10 billion. At the same time, a federal judge has compelled OpenAI to produce internal emails about its book datasets, which the plaintiffs in that litigation say came from a pirate library.

The complaint accuses the companies of scraping news sites, stripping copyright notices, and feeding the text into both their models and direct search tools without any license. Microsoft is named not only as a cloud provider but also as a co‑developer of services such as Copilot. If the allegations prove accurate, the case could reshape how large language models acquire content.

Yet the factual basis for claims that the models can reproduce articles “almost word‑for‑word” remains contested. The litigation is still unfolding, and no ruling has yet clarified liability. As the parties prepare for discovery, the broader question of what constitutes permissible data use for AI training stays unresolved.

Common Questions Answered

What specific actions do the plaintiffs allege OpenAI and Microsoft took with newspaper content?

The plaintiffs claim OpenAI and Microsoft scraped news sites, stripped away copyright notices, and used the text without any licensing agreement for both training their large language models and providing a search‑type feature. This alleged behavior is presented as a direct violation of established publishing rules.

How much in damages are the nine regional newspapers seeking, and what legal provision supports that amount?

The newspapers are seeking damages exceeding $10 billion, citing U.S. copyright law, which permits statutory damages of up to $150,000 per work for willful infringement. This high figure reflects the large number of articles they allege were unlawfully used.

Why is Microsoft being targeted in the lawsuit beyond its role as a cloud provider?

Microsoft is accused of acting not only as an infrastructure provider but also as a co‑designer of the AI models and a direct beneficiary of the alleged theft of newspaper content. The complaint suggests that Microsoft’s involvement goes deeper than merely hosting the technology.

What recent court order has added pressure on OpenAI regarding its training data sources?

A federal judge recently ordered OpenAI to produce internal emails concerning its book datasets, which the plaintiffs argue were sourced from a pirate library. This order intensifies scrutiny over how OpenAI obtains and uses copyrighted material for model training.
