Protect your online content from web scraping by AI providers

Lesley Broos, Lawyer (Partner)

There are currently lawsuits pending worldwide against providers and developers of (mostly general purpose) AI tools that have trained their systems with large amounts of data that are subject to copyright or database rights of third parties.

Pending proceedings

For example, in December 2023, The New York Times initiated proceedings against OpenAI and its partner Microsoft because OpenAI allegedly used millions of news articles from The New York Times without permission to train its AI system. Similar proceedings are also pending concerning the competing AI tools of Google / Alphabet (Bard, Imagen, MusicLM, Duet AI & Gemini).

How can you prevent your online content from being used, against your will, for AI training purposes by third parties? And is web scraping allowed under European legislation?

European legislation

The new European AI Act emphasizes (recital 108) that it does not affect the enforcement of copyright rules as provided for under Union law. On this basis, one might think that copyright-protected works or databases, even if published online, are protected against reproduction by AI developers who “scrape” content from the internet, as long as you as the rightholder have not given permission (“granted a licence”) to copy those works or databases as training material for AI tools. However, this is a misconception. In 2019, an important exception to this old IP law principle was introduced by the European Directive on copyright and related rights in the Digital Single Market (Directive (EU) 2019/790). In short, text and data mining of protected material made publicly available online is permitted, including for commercial purposes, unless the rightholder has expressly reserved such rights in an appropriate manner.

Machine-readable means (for example, rules in a robots.txt file that scraping tools can understand) are considered “appropriate” in this regard. If you do not make such a reservation, or do not make it in an appropriate manner, you run the risk of no longer being able to take successful action against third parties who have lawful access to your online content and reproduce it for text and data mining purposes.
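By way of illustration, a machine-readable reservation in a robots.txt file could look roughly like the rules in the sketch below. This is a minimal example under stated assumptions, not legal advice or a complete opt-out: the crawler names (GPTBot, Google-Extended, CCBot) are examples of tokens that AI-related crawlers have published, the example.com address is a placeholder, and the helper function `is_allowed` is hypothetical. The sketch uses Python's standard `urllib.robotparser` module to check how a crawler that respects robots.txt would interpret the rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The crawler names are illustrative
# assumptions; check each provider's own documentation for current tokens.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given crawler may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Placeholder URL; replace with a page on your own website.
    url = "https://www.example.com/articles/some-article"
    for agent in ("GPTBot", "Google-Extended", "CCBot", "SomeSearchBot"):
        status = "allowed" if is_allowed(ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {status}")
```

In this sketch the listed AI crawlers are refused for the whole site, while all other crawlers (such as regular search engines) remain allowed. Whether a particular set of rules amounts to an "appropriate" reservation in a given Member State remains a legal question.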

Differences among Member States

Because the relevant European directive has been implemented separately in the national legal systems of the EU Member States, the rules are not yet fully harmonized. The law offices affiliated with Ecovis can help you determine how to effectively protect the IP in your online content in the various European Member States, and of course also how AI developers from within or outside Europe can lawfully train their AI systems using third-party sources.

Scraping personal data as well?

When training AI systems, the IP perspective is not the only relevant one. Restrictions under personal data protection law must be taken into account as well. If the online content in question also contains personal data, web scraping is often problematic from that perspective too. It is not without reason that the Dutch Data Protection Authority wrote earlier this year that scraping content containing personal data is ‘almost always illegal’.

More information

For more information, please contact IP lawyer Lesley Broos from Kienhuis Legal NV.
