AI's Data Grab: Scraping Books Without Creator Consent

AI firms scrape the web and millions of books without permission to fuel LLMs


The quiet heist was years in the making. AI firms scraped the open web wholesale, copied millions of books without a by-your-leave, and paid no royalties, all to feed the ravenous appetites of large language models. They burned through the internet's public bounty.

Then, when the well ran dry, they flipped the default. Your data, your keystrokes, your private conversations became the new feedstock. Opt out if you can find the switch.

Now, as these systems evolve into ubiquitous agents, the data flows into the cloud and across the porous borders between systems. Permission was never the starting point; it became an afterthought.

Fast forward a few years, and data-hungry AI firms scraped huge swaths of the web and copied millions of books, often without permission or payment, to build the LLMs and generative AI systems they're currently expanding into agents. Having exhausted much of the web, many companies made it their default position to train AI systems on user data, making people opt out instead of opt in. While some privacy-focused AI systems are being developed, and some privacy protections are in place, much of the data processing by agents will take place in the cloud, and data moving from one system to another could cause problems.

The web was a commons, not a quarry. The books were written by people, not for machines to digest without a nod. Now the default is permission by omission, and the agents are coming.

Every data transfer, every cloud-bound query, becomes a potential spill. This is not a technical inevitability. It is a choice, one made in back rooms, not parliaments.

The silence of the user has been mistaken for a nod. We can reverse that. Demand that consent be active, not buried in a terms-of-service labyrinth.

Before the agents make that decision for us.

Common Questions Answered

How are AI companies obtaining training data for large language models?

AI firms are scraping text at massive scale from across the internet, including books, websites, and other creative works, without explicit permission or compensation. This aggressive collection strategy harvests millions of texts, often treating intellectual property as an open resource for training AI systems.

What ethical concerns arise from AI companies' current data collection practices?

The current data collection approach raises serious questions about creator rights and consent, with tech companies prioritizing corporate ambition over individual creators' ownership of their work. Many AI firms now train on user data by default, requiring people to opt out rather than seeking prior permission or offering compensation.

Why are AI companies scraping such large volumes of web and book content?

AI companies are desperately seeking training data to build powerful language models and generative AI systems, having already exhausted much of the publicly available web content. This 'digital gold rush' involves collecting massive datasets to improve the capabilities of artificial intelligence technologies, often without considering the ethical implications of such widespread data extraction.
