AI firms scrape the web and millions of books without permission to fuel LLMs
The surge of large language models has turned data into a commodity, and the market's appetite shows no signs of slowing. Companies that once trained on publicly available text now chase ever larger corpora, hoping to squeeze more capability out of every additional parameter. That drive has pushed many firms beyond the readily searchable corners of the internet, prompting them to harvest content at scale.
Legal scholars and publishers have started flagging the practice, arguing that copying entire collections of books without clearance crosses the line between innovation and infringement. Meanwhile, investors watch the same metrics (model size, token count, user engagement) grow year after year, rewarding any approach that promises a competitive edge. As the industry leans into "all-access" agents that can answer questions, draft prose, or generate code on demand, the source of their knowledge base becomes a pivotal issue.
Fast forward a few years, and data-hungry AI firms scraped huge swaths of the web and copied millions of books, often without permission or payment, to build the LLMs and generative AI systems they're currently expanding into agents. Having exhausted much of the web, many companies made it their default position to train AI systems on user data, requiring people to opt out instead of opt in. While some privacy-focused AI systems are being developed and some privacy protections are in place, much of the data processing by agents will take place in the cloud, and data passing from one system to another introduces new privacy risks.
That scraping has become a cornerstone of recent AI development. Over the past two years, firms behind tools like ChatGPT have copied millions of books and vast portions of the web, and that material now fuels the large language models powering emerging AI agents.
Free services from major tech companies already trade user data for convenience, and the next generation of generative AI appears set to demand even broader access. As companies turn to sources beyond the publicly available web, the legal and ethical boundaries remain blurred, and it is unclear whether regulatory frameworks will catch up or how creators of copyrighted material will be compensated.
Meanwhile, users continue to hand over personal information in exchange for cloud‑based tools. The tension between convenience and privacy persists, and the scale of data collection behind AI agents raises questions that have yet to be resolved.
Further Reading
- AI firms are racing to buy up books for training data — and judges are starting to draw a line on ‘pirated’ texts - Tech Policy Press
- Judge: AI Training on Books Is Fair Use, But Piracy Isn’t - The National Law Review
- Federal Court Finds That Training AI on Copyrighted Books is ‘Quintessentially Transformative’ Fair Use - Neal, Gerber & Eisenberg LLP
- Meta’s Massive AI Training Book Heist: What Authors Need to Know - Authors Guild
- Reporter Sues AI Companies for Training Chatbots with Copyright Books - Android Headlines
Common Questions Answered
Why are AI firms increasingly scraping the web and millions of books without permission?
The surge in large language models has turned data into a commodity, prompting firms to harvest massive corpora beyond publicly searchable content to boost model capability. This scraping often occurs without permission or payment, as companies prioritize scale over licensing.
How have AI companies shifted the way they obtain training data from users?
Many companies have made it the default to train AI systems on user data, requiring users to opt out rather than opt in. This approach builds on free services that exchange convenience for personal data.
What legal and ethical concerns are raised by the practice of copying millions of books for LLM training?
Legal scholars and publishers argue that copying entire books without permission infringes copyright and deprives authors of compensation. Critics increasingly contend that such data scraping disregards intellectual property rights.
What impact does data scraping have on the development of emerging AI agents?
Data scraping has become a cornerstone of recent AI development, fueling the large language models that power new AI agents. The next generation of generative AI appears set to demand even broader data access, intensifying the controversy.