May 18, 2023

Why Harry Potter is the copyright timebomb under generative AI models

Court cases roll in and regulators are beginning to act

Earlier this month, the EU greenlit proposals for new copyright rules for generative AI models — a world first, if the bill is passed into law. 

The legislation would force companies making AI models like GPT-4 to disclose any copyrighted material used to develop their products.

AI companies developing large language models (LLMs) — which are fed massive datasets to train them to mimic human cognition — are already dealing with copyright complaints. But so far, they've been lawsuits brought by data-holding parties over copyrighted images and code, like stock photo company Getty Images’ case against Stability AI. And, with no cases reaching a verdict yet, it's an untested legal battleground.


The EU legislation would potentially put all LLM companies in the line of fire, because many of their models are trained on copyrighted writing. And the proposed law would impact companies globally, as anyone offering products and services in any EU country would have to comply. Cribbing off Harry Potter might be GenAI’s next copyright timebomb.

The Pile

Though GenAI models are often described as “black boxes”, their ingredients will be surprisingly familiar to many. 

British startup Stability AI says its new StableLM is trained on “a new experimental dataset built on The Pile”. The Pile is an open-source dataset containing the text of more than 190k pirated books, including JK Rowling’s Harry Potter and George RR Martin’s Game of Thrones.

Companies such as Aleph Alpha and Microsoft-backed OpenAI, meanwhile, have relied on a dataset called Common Crawl, which also contains copyrighted material. Even if it's impossible to pinpoint copyrighted work in what ChatGPT spits out, writers and publishers are furious about the free use of their material in training LLMs.

“Where it starts is always at the low end of the market. With music it starts with AI-generated jingles and background music. Where it starts with text is marketing copy,” says Peter Schoppert, director of the National University of Singapore Press, who also runs a Substack on AI and copyright issues. He says that LLMs are “narrowing the market” for writers (ie, affecting their ability to work).

In the US, the National Writers Union testified to the country’s Copyright Office in April, saying that its “members have created works which have been scraped from the internet, copied and used for training generative AI without permission or payment and without respect for our moral rights”. Earlier this month, textbooks giant Pearson announced it had sent a cease and desist letter to an unnamed AI company over the use of its IP.

For the courts to decide

While no major lawsuits have been filed over copyrighted writing, the half a dozen or so ongoing copyright cases against GenAI companies suggest that the odds aren't in their favour. 

In addition to the Getty case, Stability AI is in the dock with competitors Midjourney and DeviantArt in a separate class action case. Microsoft, GitHub and OpenAI, too, are facing a court case for copyright infringement of open source code.

Getty Images' lawsuit against Stability AI — in which it claims the company copied more than 12m copyrighted images — could be crucial in laying the foundations for future cases, says Cerys Wyn Davies, intellectual property law partner at law firm Pinsent Masons.

“The Getty Images case is testing the issue of how copyright works around training AI, and it’s fundamental to that position,” she says. One lawyer who preferred not to be named tells Sifted that the case could take 18 months to conclude.


While the Getty case could be a way off being settled, the stock photo company will probably win, says Lindsay Gledhill, head of intellectual property at law firm Harper James. “It’s likely a judge would say that damages would be adequate remedy for using Getty images,” she tells Sifted.

Simon Portman, commercial intellectual property lawyer at Marks & Clerk law firm in London, tells Sifted that writers have similar grounds for legal complaint as visual artists. “I think there's going to be a challenge in the same way [visual] artists are complaining about AI generation,” he says.

‘Crisis in copyright’

This isn’t the first time a new technology has raised questions around copyright, Gledhill points out.

“With every tech innovation, we have a crisis in copyright,” she says. “At first there’s a Wild West period, where anything goes. It was the same when song streaming started 20 years ago.”

While file-sharing service Napster was closed down in 2001, following a landmark lawsuit with a number of record companies, it didn’t stop Spotify and Apple Music becoming the dominant music platforms a little further down the line.

Multiple sources tell Sifted that they believe the future of regulation for generative AI will be some kind of licensing system, where companies like OpenAI and Stability AI would have to pay copyright holders a fee for using their work in training data.

“We will end up with a licensing model,” says Gledhill, who adds that such an outcome could still end up being bad for those who make and distribute copyrighted work, who’ll only see a fractional share of the value created from GenAI. “Content providers will be lower down the food chain.”

European laws in practice 

If the proposed EU law were to be adopted, it’s likely that the law would be enforced by representative bodies in each EU member state in a similar way to GDPR privacy laws, says Matt Hervey, head of AI law at law firm Gowling WLG. 

But while GDPR laws became the de facto standard around the world, the issue of rights ownership and AI is far more nuanced, Hervey tells Sifted. “There could be clashes between different pieces of regulation which pull people and companies in different directions, and companies could have to adapt their AI models by jurisdiction.” 

AI regulation is still in its infancy, but there are already signs that lawmakers around the world are taking different stances. The UK and the US have indicated that they’re opting for a more hands-off approach than the EU.

The European Parliament will vote on the legislation in June. The proposal will then enter the last stage of the legislative process, with the EU Council and Commission negotiating the key details of the bill.

Kai Nicol-Schwarz

Kai Nicol-Schwarz is a reporter at Sifted. He covers UK tech and healthtech, and can be found on X and LinkedIn

Tim Smith

Tim Smith is news editor at Sifted. He covers deeptech and AI, and produces Startup Europe — The Sifted Podcast. Follow him on X and LinkedIn