Copyrights and LLMs

Large Language Models (LLMs) like OpenAI’s GPT, Google’s Gemini, Meta’s LLaMA, and DeepSeek’s R1 have transformed AI by generating human-like text, code, and creative content. Their rapid progress, however, has raised serious concerns about copyright infringement: many of these models are trained on vast amounts of copyrighted material, including books, articles, research papers, and even proprietary media, without explicit permission from authors or publishers.

This ethical and legal dilemma forces us to ask: Should AI companies be allowed to use copyrighted data without compensation? And if they do, shouldn’t they be required to either:

  1. Open-source their models (to ensure transparency and public benefit), or

  2. Pay for the rights to the copyrighted material they use?

The Problem: LLMs Are Built on Copyrighted Works

Most leading LLMs are trained on datasets scraped from the internet, including:

  • Books (fiction, non-fiction, academic)

  • News articles and journalistic content

  • Research papers and technical documentation

  • Proprietary code from platforms like GitHub

Many of these sources are protected under copyright law. AI companies, however, argue that their use qualifies as "fair use," a legal doctrine that permits limited use of copyrighted material for purposes such as research, education, or commentary. That argument becomes less convincing when AI-generated content competes directly with the works the model was trained on, for instance when AI-written books displace those by human authors.

Why Should AI Companies Open-Source Their Models?

If AI firms refuse to pay for copyrighted training data, they should at least open-source their models to:

  • Ensure Transparency: Users and regulators can audit the training data and model behavior.

  • Prevent Monopolization: Closed-source LLMs give tech giants an unfair advantage, stifling competition.

  • Enable Public Benefit: Open models allow researchers, startups, and nonprofits to innovate without corporate restrictions.

Meta’s LLaMA and DeepSeek’s models are steps in this direction, though releases like LLaMA are better described as open-weight than open-source (the weights are public, but the license restricts how they may be used), and many leading AI systems remain fully proprietary.

If Not Open-Source, AI Firms Must Pay for Rights

If companies insist on keeping their models closed, they should negotiate licenses with copyright holders. Some possible approaches:

  • Direct Licensing Deals (e.g., OpenAI partnering with publishers like Axel Springer)

  • Royalty Systems (compensating authors per AI-generated output)

  • Opt-Out Mechanisms (letting creators exclude their work from training datasets; see the sketch below)
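
In practice, the most widely deployed opt-out today works through robots.txt: OpenAI publishes the "GPTBot" user agent specifically so site owners can block its crawler. Below is a minimal Python sketch, using only the standard library, of how a training-data pipeline could check such an opt-out before ingesting a page. The function name and example URL are hypothetical; only the GPTBot user agent is OpenAI's published convention.

```python
# Minimal sketch: honoring a robots.txt opt-out before using a page as
# training data. Standard library only; `may_train_on` and the example
# URL are hypothetical names for illustration.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def may_train_on(url: str, user_agent: str = "GPTBot") -> bool:
    """Return True only if the site's robots.txt permits this crawler."""
    parts = urlsplit(url)
    # robots.txt always lives at the site root
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()  # fetch and parse the site's robots.txt
    return robots.can_fetch(user_agent, url)

if __name__ == "__main__":
    # A site whose robots.txt contains "User-agent: GPTBot" followed by
    # "Disallow: /" would make this return False.
    print(may_train_on("https://example.com/articles/some-essay"))
```

The obvious weakness is that compliance is voluntary: robots.txt is a convention, not a contract, which is why many creators argue that licensing, rather than opt-out, is the only durable fix.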

The New York Times lawsuit against OpenAI and the parallel case brought by Indian news organizations (ANI v. OpenAI) raise a pointed question: if an AI model reproduces paywalled content verbatim, should its maker be liable for copyright infringement?

The current practice of scraping copyrighted works without permission is unsustainable. AI companies must choose:

  1. Open-source their models to democratize AI and avoid legal risks, or

  2. Pay for licensed data, ensuring creators are fairly compensated.

Without reform, the AI industry risks legal battles, public backlash, and an erosion of trust. The future of AI should be built on ethical data use, not the unchecked exploitation of copyrighted material.

Written by Siddhesh Agarwal, a CSE student from Coimbatore, India.