Rivers of ideas—
each drop traced back to its source,
flowing fair payment.
With every article and podcast episode, we provide comprehensive study materials: References, Executive Summary, Briefing Document, Quiz, Essay Questions, Glossary, Timeline, Cast, FAQ, Table of Contents, Index, Polls, 3k Image, Fact Check, and Spoken Word at the very bottom of the page.
Beyond copyright battles lies a revolutionary economic model that could transform how we value human creativity in the age of AI
While lawyers argue about fair use and tech executives negotiate licensing deals, a more radical possibility lurks beneath the surface of the AI copyright wars: What if we're thinking about this all wrong? What if instead of fighting over who owns what, we created a system that automatically recognizes, acknowledges, and compensates every contribution to our collective intelligence?
The current copyright framework treats creative works like discrete property—you either own a thing or you don't, someone either infringed or they didn't. But human creativity doesn't work that way. Every idea builds on countless others. Every breakthrough emerges from a vast network of influences, conversations, and incremental insights. Our legal system's binary thinking is fundamentally mismatched to how knowledge actually develops.
What we need isn't better copyright enforcement. We need a liquid economy.
Recognition: Making the Invisible Visible
The first principle of a liquid economy is recognition—not in the legal sense of copyright attribution, but in the deeper sense of acknowledging the true complexity of creative influence. Current AI systems are essentially black boxes that consume human creativity and output new content with no visibility into what influenced what. This opacity isn't just a technical limitation; it's a design choice that obscures the human labor underlying AI capabilities.
Imagine instead an AI system that could trace the genealogy of every output back to its sources. Not just the specific texts or images it was trained on, but the conceptual influences, the stylistic elements, the structural patterns that shaped its response. Think of it as a kind of intellectual DNA sequencing that could identify the specific human contributions embedded in any AI-generated content.
This isn't science fiction. Machine learning researchers are already developing techniques for model interpretability and influence tracing. The technology exists to build systems that can identify which training examples most influenced a particular output. What's missing is the economic incentive to do so—and the social framework to act on that information.
A liquid model would make these influences visible and actionable. Every time an AI system generates content, it would simultaneously generate a contribution map showing the human sources that made that generation possible. This transparency would transform AI from a black box that appropriates human creativity into a lens that reveals the intricate web of human collaboration underlying all knowledge.
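The idea of a contribution map can be sketched concretely. The sketch below is purely illustrative: it assumes some interpretability method has already produced per-source influence scores (the hard, unsolved part), and simply normalizes them into shares. The source names and score values are hypothetical.

```python
# Hypothetical sketch: turning raw per-source influence scores into a
# normalized "contribution map" for one AI-generated output.
# Assumes an upstream interpretability method already produced the scores.

def contribution_map(influence_scores: dict[str, float]) -> dict[str, float]:
    """Normalize influence scores so the shares sum to 1.0."""
    total = sum(influence_scores.values())
    if total == 0:
        return {source: 0.0 for source in influence_scores}
    return {source: score / total for source, score in influence_scores.items()}

# Hypothetical sources and scores for a single generation.
scores = {"forum_post_123": 0.8, "blog_essay_42": 0.5, "code_snippet_7": 0.2}
print(contribution_map(scores))  # each value is that source's share of influence
```

A map like this is what would make the influences both visible (who shaped the output) and actionable (how any payment should be split).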
Acknowledgment: From Attribution to Appreciation
Recognition without acknowledgment is just surveillance. The second principle of a liquid economy involves creating systems that actively celebrate and credit the diverse contributions that make AI possible. This goes far beyond traditional copyright attribution to encompass the full spectrum of human input that shapes AI capabilities.
Current AI systems are trained on more than just copyrighted texts and images. They learn from comments, forum posts, Q&A exchanges, code snippets, product reviews, social media interactions—countless forms of unpaid digital labor that collectively constitute our knowledge commons. A liquid economy would acknowledge all of these contributions, not just the ones that happen to qualify for copyright protection.
This acknowledgment could take many forms. At its simplest, it might involve transparency reports showing the broad categories of human input that enabled a particular AI capability. More ambitiously, it could involve real-time attribution systems that show users the specific human contributions most relevant to each AI interaction.
But acknowledgment in a liquid economy goes beyond just crediting sources. It involves recognizing that human creativity exists in an ecosystem, not in isolation. When AI systems learn from your work, they're not just copying your individual expression—they're participating in the ongoing conversation of human culture. Acknowledgment means honoring that participation and making it visible.
Payment on Use: Economics That Actually Work
The third principle—payment on use—is where things get truly radical. Instead of the current system where AI companies either pay nothing (claiming fair use) or negotiate bulk licensing deals with major rights holders, a liquid economy would create micropayment streams that flow directly to contributors based on actual usage.
Every time an AI system draws on human creativity to generate content, small payments would flow back to the creators who made that generation possible. Not just to the copyright holders, but to everyone whose contributions influenced the output—the forum poster whose explanation clarified a concept, the blogger whose perspective shaped the AI's understanding, the coder whose open-source contribution enabled the functionality.
This isn't about creating a surveillance economy or turning every human interaction into a transaction. It's about recognizing that we already live in an economy where human creativity generates massive value—we're just terrible at distributing that value fairly. Tech companies extract billions from systems trained on human creativity while most creators see nothing. A liquid economy would reverse that flow.
The payments wouldn't need to be large for individual uses. But across millions or billions of AI interactions, these micropayments would aggregate into meaningful compensation. And because the payments would be tied to actual usage rather than upfront licensing fees, they would automatically adjust to reflect the real-world value of different contributions.
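The aggregation argument is simple arithmetic. The figures below are invented purely to illustrate the scale effect, not a proposed rate:

```python
# Illustrative arithmetic only: tiny per-use payments aggregating over volume.
# Both numbers are hypothetical assumptions, not proposed rates.
per_use_payment = 0.001    # a tenth of a cent per influenced generation
uses_per_year = 2_000_000  # usage count for a widely influential contribution

annual_payout = per_use_payment * uses_per_year
print(f"${annual_payout:,.2f}")  # $2,000.00
```

Individually negligible payments become meaningful once multiplied across the billions of AI interactions that already happen.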
Non-Dilution: Preserving Value Through Distribution
The fourth principle addresses a crucial concern about any micropayment system: that splitting value among many contributors would make individual payments so small as to be meaningless. Non-dilution means that recognizing more contributors doesn't decrease the total value being distributed—it increases it.
This principle recognizes that creativity is not a zero-sum game. When we acknowledge the full network of influences behind any creative work, we're not dividing a fixed pie—we're growing the pie. More recognition leads to more participation, more participation leads to richer knowledge commons, and richer knowledge commons create more value for everyone.
In practical terms, non-dilution might work through progressive payment structures where the most directly influential contributors receive larger shares, but the total payment pool grows as the network of acknowledged contributors expands. Or it might involve tiered systems where different types of contributions receive different payment scales while ensuring that no form of legitimate contribution goes completely uncompensated.
The key insight is that current AI systems already demonstrate the compound value of aggregated human creativity. They wouldn't be possible without the vast network of human knowledge they're trained on. A liquid economy would simply create economic structures that reflect this reality rather than obscuring it.
The Technical Foundation: Liquid Language Models
The poetic vision shared alongside the copyright material hints at something profound: AI systems that are "self-reflexive and aware," capable of understanding their own processes and acknowledging their sources. These "liquid language models" would be fundamentally different from current AI systems—not just in their capabilities, but in their relationship to human creativity.
Instead of being black boxes that consume human input and produce opaque output, liquid models would be transparent about their processes, generous in their attributions, and designed from the ground up to participate in rather than exploit human creative ecosystems. They would embody the recognition, acknowledgment, and compensation principles directly in their architecture.
This technical foundation would enable new forms of human-AI collaboration that go beyond the current paradigm of humans creating content for AI to consume. Instead, we could have genuine partnerships where AI systems help humans understand and build upon the vast networks of knowledge they're part of, while ensuring that all participants benefit from the value created.
Beyond the Copyright Wars
The genius of a liquid economy is that it sidesteps the copyright battles entirely. Instead of arguing about fair use or negotiating licensing deals, we create systems that automatically recognize and compensate the full spectrum of human contributions to AI capabilities. This approach dissolves the artificial scarcity that copyright creates while ensuring that creators are fairly compensated for their contributions.
It also addresses the competition concerns that make traditional licensing problematic. Because liquid systems would be based on usage rather than upfront licensing fees, they wouldn't create barriers to entry for smaller AI developers. Everyone could participate in the ecosystem, with payments flowing based on actual usage rather than ability to pay licensing fees upfront.
Most importantly, a liquid economy would align incentives properly. Instead of AI companies having incentives to use as much human creativity as possible while paying as little as possible, they would have incentives to create systems that generate real value for users while fairly compensating the human contributors who make that value possible.
The Path Forward
We're still in the early stages of figuring out how AI and human creativity will coexist. The current copyright battles are just the opening skirmish in a much larger transformation of how we create, share, and value knowledge in an AI-enabled world.
A liquid economy offers a path beyond the current deadlock—a way to harness the incredible potential of AI systems while ensuring that the humans who make them possible share in the benefits. It's a vision of technology that enhances human creativity rather than replacing it, that makes our contributions visible rather than hiding them, and that creates abundance rather than artificial scarcity.
The technical pieces are coming together. Machine learning interpretability is advancing rapidly. Micropayment systems are becoming more feasible. Blockchain and other distributed technologies offer new possibilities for transparent, automated compensation systems. What we need now is the social and economic framework to put these pieces together.
The liquid economy isn't just about AI and copyright. It's about creating economic systems that reflect the true nature of human creativity and knowledge—interconnected, collaborative, and cumulative. It's about building technology that serves human flourishing rather than extracting value from it.
The question isn't whether we can build such systems. The question is whether we will choose to do so before the current extraction-based model becomes too entrenched to change. The copyright wars are just the beginning. The real battle is for the soul of human creativity in the age of artificial intelligence.
In this positive, innovation-generating realm, we don't just harness raw generative power; we create systems that honor the full spectrum of human contribution to our collective intelligence. With liquid models as our foundation, we can shape an economic future that's not just bold and bright, but genuinely equitable.
Link References
Copyright and Artificial Intelligence
Part 3: Generative AI Training (pre-publication version)
A Report of the Register of Copyrights, May 2025
U.S. Copyright Office
See Also:
Recognition, Acknowledgment, Payment On Use, Non-Dilution (Spoken Word 2024)
An important alternate approach that would solve a number of ethical, informatics, and technical problems going forward.
🧩 The Silent Revolution: When AI Learns to Teach Itself
May 15, 2025 • Season 4 • Episode 26
AI art and copyright
October 28, 2024 • Season 1 • Episode 46
Final Report – Governing AI for Humanity
September 30, 2024 • Season 1 • Episode 17
AI Copyright (Spoken Word 2024)
Episode Links
YouTube
Other Links to Heliox Podcast
YouTube
Substack
Podcast Providers
Spotify
Apple Podcasts
Patreon
Facebook Group
STUDY MATERIALS
Briefing
BRIEFING DOCUMENT: U.S. COPYRIGHT OFFICE - COPYRIGHT AND ARTIFICIAL INTELLIGENCE, PART 3: GENERATIVE AI TRAINING (PRE-PUBLICATION)
DATE: May 2025 (Pre-publication version)
SUBJECT: Review of key issues and analysis regarding copyright implications of generative AI training data and processes, based on the U.S. Copyright Office's report.
I. EXECUTIVE SUMMARY
This pre-publication report from the U.S. Copyright Office (USCO) addresses the complex copyright issues surrounding the training of generative Artificial Intelligence (AI) systems. The report delves into the technical aspects of AI training, particularly for large language models (LLMs), and analyzes how these processes interact with established copyright principles, including prima facie infringement and the doctrine of fair use. A significant portion of the report is dedicated to the feasibility and challenges of voluntary and statutory licensing approaches for AI training data. While the report is a pre-publication version, the USCO indicates that no substantive changes are expected in the final version.
Key takeaways include:
The processes of data collection, curation, and the training of generative AI models through copying and processing copyrighted works likely constitute prima facie copyright infringement, specifically implicating the reproduction right and potentially the derivative work right.
The fair use doctrine is the primary legal defense for such activities, but its application is complex and fact-dependent, particularly regarding the transformative nature of the use, commerciality, the amount of copyrighted work used, and the effect on the potential market for copyrighted works.
The report discusses the concept of "memorization" in AI models and how it may relate to copyright infringement.
Voluntary licensing is seen by some stakeholders as feasible and already occurring, while others highlight the challenges of scale and identifying rightsholders.
There is limited support among commenters for statutory licensing approaches, such as compulsory or extended collective licensing, due to concerns about market intervention and unintended consequences.
International approaches to text and data mining (TDM) exceptions, including opt-out mechanisms, are noted and discussed.
II. TECHNICAL ASPECTS OF GENERATIVE AI TRAINING
The report outlines the fundamental processes involved in training generative AI systems:
Training Phases: Generative AI models, especially LLMs, undergo training to learn patterns and relationships within vast datasets. This typically involves "generative pre-training," where the model predicts subsequent tokens (numerical representations of words or parts of words) based on preceding ones in the training data.
"During generative pre-training, text examples serve as both the input and expected output, with performance measured by how well the model predicts each next token (output) based on preceding tokens (input)."
This iterative process adjusts the model's "weights" (parameters) to increase the likelihood of correct predictions based on the training data.
Pre-training can involve massive datasets of text and other media, potentially comprising billions or trillions of tokens.
Memorization: AI models can "memorize" portions of their training data, including protectable expression from copyrighted works. This memorization can be a concern from a copyright perspective.
"Memorization...implicates the right of reproduction for the memorized examples."
Model weights that have memorized protectable expression "may also infringe the derivative work right." The key factor is whether the model has "retained or memorized substantial protectable expression from the work(s) at issue."
Retrieval-Augmented Generation (RAG): The report also briefly discusses RAG systems, which involve accessing external databases of information, often containing copyrighted material, to inform AI outputs.
"RAG also involves the reproduction of copyrighted works."
III. PRIMA FACIE INFRINGEMENT
The report asserts that several stages of the AI training process likely constitute prima facie copyright infringement:
Data Collection and Curation: The initial step of downloading or copying data from various sources to create a training dataset involves making reproductions, which is an exclusive right of the copyright holder.
"In many cases, the first step is downloading data from publicly available locations...but whatever the source, copies are made—often repeatedly."
Training: The act of copying and processing the training dataset during the training process itself also implicates the reproduction right.
RAG: As mentioned above, RAG systems involve the reproduction of copyrighted works by copying material into or retrieving material from a database.
Outputs: While infringement issues related to AI outputs will be addressed in a later report, the report notes that memorized training data within model weights that can be reproduced or perceived can also implicate the reproduction and potentially the derivative work rights.
IV. FAIR USE ANALYSIS
The report thoroughly examines the application of the four fair use factors to AI training, highlighting the complexities and divergent views:
Factor One: Purpose and Character of the Use: This factor considers whether the use is transformative and commercial.
Transformativeness: Whether training an AI model is transformative is a central debate. Some argue that processing copyrighted works to train a model that generates new content is highly transformative. Others contend that if the AI outputs serve the same purpose as the original works (e.g., generating text for reading), the use may not be sufficiently transformative. The report emphasizes that "Even significant alterations will not be enough if the use ultimately serves a purpose similar to that of the original."
The report notes that "Because generative AI models may simultaneously serve transformative and non-transformative purposes, restrictions on their outputs can shape the assessment of the purpose and character of the use." Guardrails and content filters can be used to prevent the generation of infringing material.
Commerciality: The commercial nature of AI development and deployment is a significant consideration. The report notes that "The creation and distribution of a training dataset, the copying of that dataset for training, and the copying and distribution of model weights for use in a system may be conducted by different entities, each of whose activities may or may not be considered ‘commercial.’" Direct monetization of datasets or models through licensing or subscriptions is considered commercial.
Factor Two: Nature of the Copyrighted Work: This factor considers the nature of the original work, particularly whether it is factual or creative, published or unpublished. Creative and unpublished works generally receive stronger protection.
Factor Three: Amount and Substantiality of the Portion Used: This factor examines how much of the copyrighted work was used. Training often involves copying entire works or substantial portions.
"Training often involves copying the entire work or at least a substantial portion, and the amount used is often more than is necessary for the purpose of the use."
The report contrasts this with cases where copying was deemed fair use because only the minimum necessary amount was used for a transformative purpose (e.g., creating snippets for search).
Factor Four: Effect Upon the Potential Market: This factor assesses the impact of the use on the market for the original work and its derivatives. This is a critical and contested area in the context of AI training.
Concerns are raised about "market dilution, and lost licensing opportunities."
Some commenters argue that AI-generated outputs can directly substitute for copyrighted works, leading to lost sales and licensing revenue.
The report acknowledges that the "market for training data itself" is a relevant potential market to consider. There is evidence of a developing market for licensing data for AI training.
Weighing the Factors: The report reiterates that the fair use factors are not applied mechanically and require a holistic assessment.
Public Benefits: The report addresses claims that the public benefits of unlicensed training (e.g., fostering innovation) might tip the fair use balance. However, the report notes that "the more the challenged use affects the author’s revenue streams...the weaker the claim of public benefit."
V. LICENSING FOR AI TRAINING
The report explores different approaches to licensing copyrighted works for AI training:
Voluntary Licensing: This involves direct negotiations between rightsholders and AI developers or collective licensing arrangements.
Feasibility of Voluntary Licensing: Some stakeholders assert that voluntary licensing is feasible and already occurring, citing examples of licenses for music, images, and text data.
"Many AI models are already obtaining licenses this way and it has been the norm across many other examples within the music distribution ecosystem."
"The market is not merely a ‘potential’ or theoretical market the existence or feasibility of which is open to debate; it is an actual market, with great potential for growth. Music companies are currently licensing works for use in training AI models."
"Yes, direct voluntary licensing is feasible and is certainly the case for the publishing industry."
Challenges include identifying and negotiating with a vast number of rightsholders, particularly for diverse datasets.
Ability to Provide Meaningful Compensation: The report notes that voluntary licensing allows for direct compensation to rightsholders.
Possible Legal Impediments to Collective Licensing: Antitrust concerns regarding collective licensing are mentioned.
Statutory Approaches: The report discusses compulsory licensing and extended collective licensing (ECL) as potential alternatives to voluntary licensing.
Compulsory Licensing: This would require AI developers to pay a set royalty for using copyrighted works for training, without needing individual permission. The report notes "little support among commenters for statutory approaches." Compulsory licenses are generally viewed as exceptions to exclusive rights and should only be used to address clear market failures.
Extended Collective Licensing (ECL): Under ECL, a designated collective management organization (CMO) could license works from both its members and non-members, with an opt-out option for rightsholders. There was also "little support among commenters" for ECL, with concerns raised about imposing an opt-out regime on a system of exclusive rights.
International Approaches: The report highlights TDM exceptions in other jurisdictions, such as the EU and Japan, which often include lawful access requirements and potential opt-out mechanisms for rightsholders.
VI. CONCLUSION (from the structure, not explicitly stated in excerpts)
While not present in the provided excerpts, the structure of the report suggests that the USCO is analyzing the current legal framework, the technical realities of AI training, and the perspectives of stakeholders to inform potential future recommendations regarding copyright policy in the age of generative AI. The emphasis on voluntary licensing and the cautious approach to statutory solutions indicate a preference for market-based solutions where feasible, while acknowledging the challenges and potential need for clarification or adaptation of copyright law. The issue of fair use, particularly regarding transformativeness and market effect, remains central to the legal analysis of AI training.
Key Elements
I. Prima Facie Infringement in AI Training
Understanding the actions involved in creating and deploying generative AI systems that potentially infringe on copyright.
Examining Data Collection and Curation: How the process of gathering and preparing data for training implicates copyright.
Training Phase: Analyzing how the act of training a model using copyrighted data can constitute reproduction.
Retrieval-Augmented Generation (RAG): Identifying how accessing and incorporating external data during output generation involves reproduction.
Outputs: Considering when the output generated by an AI system might infringe copyright, particularly in cases of memorization or substantial similarity to training data.
II. Fair Use in the Context of AI Training
Introduction to the Four Factors of Fair Use (Purpose and Character of Use, Nature of Copyrighted Work, Amount and Substantiality of Portion Used, Market Effect).
Factor One: Purpose and Character of the Use: Identifying the specific "use" being evaluated (data collection, training, deployment, output generation).
Transformativeness: Whether the AI's use adds new expression, meaning, or purpose distinct from the original work. The importance of considering the ultimate use of the copies made during training.
Commerciality: How the commercial nature of AI development and deployment impacts the fair use analysis. Distinguishing between direct monetization and incidental benefits.
Unlawful Access: The potential relevance of whether training data was obtained through unlawful means.
Factor Two: Nature of the Copyrighted Work: How the type of work (e.g., factual vs. creative, published vs. unpublished) influences the fair use analysis.
Factor Three: Amount and Substantiality of the Portion Used: Assessing the quantity of copyrighted material used in training datasets.
Evaluating the reasonableness of using entire works or substantial portions in light of the training purpose.
Considering the amount of copyrighted material made available to the public through AI outputs.
Factor Four: Effect of the Use Upon the Potential Market: Analyzing potential harms to the market for the original copyrighted works, including lost sales, market dilution, and lost licensing opportunities.
Considering the possibility of AI outputs serving as substitutes for the original works.
Examining whether a potential licensing market for AI training data exists or is developing.
Weighing the Factors: Understanding that fair use is a balancing test and no single factor is determinative.
Competition Among Developers: Briefly considering how fair use might impact competition in the AI development landscape.
III. Licensing for AI Training
Voluntary Licensing: Feasibility of Voluntary Licensing: Exploring whether direct and collective licensing mechanisms are practical for AI training data.
Ability to Provide Meaningful Compensation: Discussing whether voluntary licensing can effectively compensate copyright owners.
Possible Legal Impediments to Collective Licensing: Considering potential antitrust issues related to collective licensing organizations.
Statutory Approaches: Compulsory Licensing: Understanding how a compulsory license would function in the context of AI training and the arguments against it.
Extended Collective Licensing (ECL): Examining the concept of ECL and its potential applicability, as well as criticisms of opt-out provisions.
IV. International Approaches
Brief overview of how other jurisdictions (EU, Singapore, Japan, Israel, Brazil) are addressing copyright and AI training, including text and data mining exceptions and opt-out mechanisms.
Quiz & Answer Key
Quiz
Describe how the data collection and curation phase of AI training can implicate copyright infringement.
Explain why the training phase of generative AI models is considered a potential area of copyright infringement.
What is Retrieval-Augmented Generation (RAG) and how does it involve the reproduction of copyrighted works?
According to the report, what is a key question in determining if a model's weights infringe reproduction or derivative work rights?
Briefly explain the concept of "transformativeness" in the context of fair use and AI training.
How does the commercial nature of AI development and deployment impact the fair use analysis?
What aspect of copyrighted works is considered under the second fair use factor?
Under the third fair use factor, what is examined regarding the use of copyrighted material in AI training?
Identify one type of harm to the potential market considered under the fourth fair use factor in the context of AI.
What is voluntary licensing, and is it considered feasible for AI training data according to some commenters?
Answer Key
The data collection and curation phase involves making copies of works to create datasets for training AI models. This copying, regardless of the source, implicates the right of reproduction held by copyright owners.
The training phase is considered potentially infringing because it involves reproducing copyrighted works as inputs to the model. The model's parameters (weights) are adjusted based on this data, which can lead to memorization of protectable expression from the training examples, implicating the reproduction right.
RAG is a system where a generative AI can access a database of external material to inform its responses. This process involves reproducing (copying) material from the database and supplying it to the model along with the user's prompt.
The key question is whether the model has retained or memorized substantial protectable expression from the works used in training.
Transformativeness in AI training refers to whether the AI's use of copyrighted material adds new expression, meaning, or purpose that is distinct from the original work. Courts also consider the ultimate use made possible by the initial copying.
The commercial nature of AI development makes it more likely that a use will not be considered fair. Direct monetization of datasets or models is a clear indicator of commerciality, but indirect benefits can also be considered.
The second fair use factor considers the nature of the copyrighted work, such as whether it is factual or creative, and whether it has been published.
Under the third fair use factor, the amount and substantiality of the copyrighted portion used in relation to the whole work is examined. This includes assessing the quantity of data used in training and whether the use of entire works was reasonably necessary.
One type of harm is market substitution, where the AI outputs could replace the need for consumers to purchase or license the original copyrighted works.
Voluntary licensing is when AI developers seek permission directly from copyright owners or their representatives to use their works for training. Some commenters consider this feasible and note that such licensing is already occurring in some industries.
Essay Questions
Discuss the tension between promoting innovation in generative AI and protecting the rights of copyright holders, as explored in the provided text. Analyze how the concepts of prima facie infringement and fair use are applied to navigate this balance during different phases of AI development and deployment.
Analyze the concept of "transformativeness" as it applies to the training and output of generative AI models. Based on the text, explain the challenges courts face in applying this factor and how restrictions on AI outputs might influence the assessment of transformativeness.
Evaluate the arguments for and against voluntary licensing as a solution for obtaining training data for generative AI. Consider the feasibility, compensation mechanisms, and potential legal challenges discussed in the text.
Compare and contrast the statutory licensing approaches (compulsory licensing and extended collective licensing) discussed in the text as potential models for AI training data access. What are the primary criticisms and concerns associated with each approach?
Examine how international approaches to text and data mining exceptions and copyright relate to the discussions within the U.S. Copyright Office report. Discuss specific examples mentioned in the text and their potential implications for future U.S. copyright policy regarding AI training.
Glossary of Key Terms
Generative AI: Artificial intelligence systems capable of creating new content (text, images, music, etc.) based on patterns learned from training data.
Training Data: The collection of works (text, images, audio, etc.) used to train an artificial intelligence model.
Training Phases: The iterative process during which an AI model learns from training data, adjusting its internal parameters (weights) to improve performance.
Memorization: The phenomenon where a generative AI model retains and can reproduce specific, often identifiable, portions of its training data in its outputs.
Deployment: The process of making a trained AI model available for use by others, often through an application or service.
Prima Facie Infringement: A legal term indicating that sufficient evidence exists to establish a violation of copyright unless a valid defense (like fair use) can be proven.
Data Collection and Curation: The process of gathering, selecting, cleaning, and organizing data to create a training dataset.
Retrieval-Augmented Generation (RAG): A technique where a generative AI system retrieves relevant information from an external database to inform its output, often used to provide more accurate or up-to-date responses.
Outputs: The content generated by an artificial intelligence system in response to a user's prompt or query.
Fair Use: A legal doctrine that permits limited use of copyrighted material without obtaining permission from the copyright holder for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.
Purpose and Character of the Use: The first factor of fair use, which considers the reason for using the copyrighted work and whether the use is transformative or serves a commercial purpose.
Transformativeness: The extent to which a secondary use of copyrighted material adds new expression, meaning, or purpose, rather than merely superseding the original.
Commerciality: Whether the use of copyrighted material is for profit or is otherwise commercially driven.
Nature of the Copyrighted Work: The second factor of fair use, which considers characteristics of the work such as whether it is factual or creative, published or unpublished.
Amount and Substantiality of the Portion Used: The third factor of fair use, which assesses the quantity and importance of the copyrighted material used in relation to the work as a whole.
Effect of the Use Upon the Potential Market: The fourth factor of fair use, which examines the potential harm of the secondary use to the market for or value of the original copyrighted work.
Lost Licensing Opportunities: The potential for the secondary use to undermine the ability of copyright owners to license their works for similar purposes.
Voluntary Licensing: A system where AI developers negotiate and obtain licenses directly from copyright owners or their representatives for the use of copyrighted material.
Collective Licensing: A system where a single entity (a collective management organization or CMO) represents multiple copyright owners and grants licenses for their works, often for specific uses.
Compulsory Licensing: A government-mandated licensing scheme that allows specific uses of copyrighted works without the copyright owner's explicit consent, subject to statutory terms and royalty payments.
Extended Collective Licensing (ECL): A form of collective licensing where a license granted by a collective management organization is extended to cover all rightsholders in a particular category, including those who are not members of the organization, often with an opt-out mechanism.
Text and Data Mining (TDM): The automated analysis of text and data to extract information, patterns, and trends. Some jurisdictions have specific copyright exceptions for TDM.
Weights (Model Weights): The numerical parameters within an artificial intelligence model that are adjusted during training and encode the model's learned knowledge and patterns.
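Several of the terms above (Training Phases, Weights, Memorization) describe a single mechanical loop: weights are repeatedly adjusted to reduce error on the training data. As a hedged, toy-scale sketch (one weight and plain gradient descent standing in for the billions of parameters and specialized optimizers of real generative models), the loop looks like this:

```python
# Toy sketch of the training loop behind "Training Phases" and "Weights".
# Illustrative only: a single weight fit with plain gradient descent,
# not the architecture or optimizer of any real generative model.

def train(data: list[tuple[float, float]], steps: int = 200, lr: float = 0.05) -> float:
    w = 0.0  # the model's single "weight", adjusted iteratively
    for _ in range(steps):
        # Mean-squared-error gradient over the training data
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # adjust the weight to reduce error
    return w

# The data follow y = 3x, so training drives w toward 3.0: the pattern
# in the dataset ends up encoded in the weight itself, which is why
# questions about what weights "retain" from training data arise.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train(data)
```

At this scale the weight encodes only a statistical pattern; the memorization debate concerns whether, at the scale of billions of weights, specific protectable expression can likewise be retained and reproduced.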
Timeline of Main Events
2003: Yoshua Bengio et al. publish "A Neural Probabilistic Language Model," which contributes to the mathematical modeling of language used in AI systems.
June 10, 2016: Sennrich et al. publish "Neural Machine Translation of Rare Words with Subword Units," suggesting the use of subword tokens for better accommodation of rare words in language models.
February 21, 2018: Reid Pryzant et al. publish the "JESC: Japanese-English Subtitle Corpus," an example of a dataset used for language modeling, often incorporating timestamps.
2019: Alec Radford et al. publish "Language Models are Unsupervised Multitask Learners," highlighting the emerging capabilities of generative models.
January 23, 2020: Jared Kaplan et al. publish "Scaling Laws for Neural Language Models," discussing the impact of scaling on these models.
July 22, 2020: Tom B. Brown et al. publish "Language Models are Few-Shot Learners," demonstrating the ability of language models to learn with limited examples.
October 5, 2020: Adam Roberts et al. publish "How Much Knowledge Can You Pack Into the Parameters of a Language Model?", exploring how knowledge is implicitly stored in model parameters (weights).
April 12, 2021: Patrick Lewis et al. publish "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," introducing the RAG technique and detailing the process of using a downloaded copy of Wikipedia split into chunks.
June 22, 2022: Jiahui Yu et al. publish "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation," contributing to the development of text-to-image models.
December 18, 2022: The Ministry of Justice, State of Israel, issues an opinion on the uses of copyrighted materials for machine learning, suggesting that such uses are typically transformative.
January 26, 2023: Andrew Agostinelli et al. publish "MusicLM: Generating Music from Text," detailing a model capable of generating music from text prompts.
September 8, 2023: Matthew Finnegan reports that Microsoft pledges to defend Copilot customers against copyright lawsuits, citing its implementation of content filters.
November 27, 2023: Zeming Chen et al. publish "MediTron-70B: Scaling Medical Pretraining for Large Language Models," presenting generative LLMs adapted for medical reasoning.
February 20, 2024: 273 Ventures issues a press release introducing KL3M, a legal large language model trained on the Kelvin Legal DataPack, a commercially available dataset with clear provenance.
March 7, 2024: Pierre Colombo et al. publish "SaulLM-7B: A pioneering Large Language Model for Law," detailing a model tailored for the legal domain, built upon continued pre-training of an existing model.
April 1, 2024: 3Blue1Brown releases a visual explanation of transformers (how LLMs work) on YouTube.
June 4, 2024: Brody Ford reports for Yahoo! Finance on Shutterstock's AI-licensing business generating $104 million, driven by demand for legally obtained training data.
August 6, 2024: SOUNDRAW publishes "Ethical AI in Music: Navigating Copyright Concerns," explaining their practice of training on music produced in-house.
October 21, 2024: Dow Jones & Co. files a complaint against Perplexity AI, Inc. (S.D.N.Y. Oct 21, 2024, ECF No. 1), alleging unauthorized copying of news publisher material for RAG.
November 7, 2024: Etan Vlessing reports that the Lionsgate CEO states an AI deal promises a "Transformational Impact" on the studio, suggesting the potential for licensing.
December 2024: Bill 2338/2023, which includes a text and data mining exception, is approved by the Brazilian Senate.
January 2, 2025: Winston Cho reports that music publishers reach a deal with Anthropic over copyrighted song lyrics, including agreements on guardrails in training.
January 6, 2025: NVIDIA announces Nemotron-4, a family of models designed to advance agentic AI.
January 15, 2025: A Framework for Artificial Intelligence Diffusion is published in the Federal Register as an interim final rule, adopting export controls on AI model weights.
January 22, 2025: Ashley King reports on AI Music, creator of Moises, raising $40 million in Series A funding, with a stated mission for ethical AI development trained on licensed content.
March 27, 2025: Jack Lindsey et al. publish "On the Biology of a Large Language Model," identifying internal model mechanisms associated with writing poetry.
April 9, 2025: Michael J Bommarito II, Julian Bommarito, and Daniel Martin Katz publish "The KL3M Data Project: Copyright-Clean Training Resources for Large Language Models," describing their proprietary dataset.
May 2025: The U.S. Copyright Office releases the pre-publication version of "Copyright and Artificial Intelligence, Part 3: Generative AI Training Report," discussing various aspects of copyright law as applied to AI training. This report is in response to congressional inquiries and stakeholder interest.
Cast of Characters
Yoshua Bengio: Co-author of a foundational paper on neural probabilistic language models.
Sennrich et al.: Researchers who proposed using subword units in neural machine translation.
Reid Pryzant et al.: Researchers who developed the JESC: Japanese-English Subtitle Corpus.
Alec Radford et al.: Researchers who published on unsupervised multitask learners, highlighting emerging AI capabilities.
Jared Kaplan et al.: Researchers who published on scaling laws for neural language models.
Adam Roberts et al.: Researchers who explored how knowledge is stored in language model parameters (weights).
Patrick Lewis et al.: Researchers who introduced the Retrieval-Augmented Generation (RAG) technique.
Jiahui Yu et al.: Researchers involved in scaling autoregressive models for text-to-image generation.
Tom B. Brown et al.: Researchers who published on few-shot learning in language models.
Andrew Agostinelli et al.: Researchers who developed MusicLM for generating music from text.
Zeming Chen et al.: Researchers who developed MediTron-70B, a medical large language model.
Murray Shanahan: Professor noted for explaining the statistical nature of large language models.
Lemony Snicket: Author of "The Wide Window," referenced as a source for a training example.
Pierre Leval: Judge known for describing the concept of transformativeness in fair use analysis.
Winston Cho: Reported on the deal between music publishers and Anthropic.
Matthew Finnegan: Reported on Microsoft's pledge to defend Copilot customers against copyright lawsuits.
Brody Ford: Reported for Yahoo! Finance on Shutterstock's AI-licensing business.
Etan Vlessing: Reported on the Lionsgate CEO's comments about an AI deal.
Jack Lindsey et al.: Researchers who studied internal model mechanisms related to writing poetry.
Michael J Bommarito II: Co-author of the paper on the KL3M Data Project, a copyright-clean training dataset.
Julian Bommarito: Co-author of the paper on the KL3M Data Project.
Daniel Martin Katz: Co-author of the paper on the KL3M Data Project.
Marybeth Peters: Former Register of Copyrights, cited for her view on the limited use of compulsory licenses.
Dow Jones & Co.: News publisher that filed a lawsuit against Perplexity AI alleging copyright infringement related to RAG.
Perplexity AI, Inc.: AI company sued by Dow Jones & Co.
Anthropic: AI company that reached an agreement with music publishers regarding copyrighted song lyrics.
Microsoft: Company that pledged to defend Copilot customers against copyright lawsuits and whose Phi-4 model is available under an MIT license.
Meta: Company developing the Llama series of language models, mentioned in the context of model architecture, weights, and licensing.
NVIDIA: Company developing the Nemotron-4 family of AI models.
Shutterstock: Company with an AI-licensing business for training data and a generative AI offering trained on licensed content.
SOUNDRAW: Company that trains its music AI models on music produced in-house to ensure copyright compliance.
Rightsify: Company that uses its own data to train its Hydra music AI model, ensuring legality.
AI Music: Creator of Moises, a company committed to developing ethical AI solutions trained on licensed content.
273 Ventures: Company that introduced the legal large language model KL3M.
Fairly Trained: Organization that certifies AI models trained on legally obtained content.
ASCAP: Collective management organization (CMO) for music, mentioned in discussions of market replacement effect and antitrust consent decrees.
UMG (Universal Music Group): Music company mentioned in relation to lawsuits against AI companies and its position on voluntary licensing.
NMPA (National Music Publishers' Association): Organization representing music publishers, commenting on licensing for AI training and criticizing opt-out provisions.
Getty Images: Company providing licensed visual data for AI training and offering a generative AI product trained on licensed content.
AAP (Association of American Publishers): Organization representing publishers, commenting on the feasibility of voluntary licensing.
STM (Scientific Technical Medical Publishers): Organization representing academic/scientific/medical publishers, commenting on the pervasiveness of licensing in their sector.
CCC (Copyright Clearance Center): Organization mentioned in the context of transactional and external licensing options.
ImageRights International: Company commenting on potential compulsory licensing regimes.
BigBear.ai: Company suggesting a compulsory license is "worthy of consideration."
ASCRL (American Society of Collective Rights Licensing): Organization commenting on compulsory licenses.
Graphic Artists Guild: Organization commenting on obtaining licenses for image AI training.
Authors Guild: Organization commenting on statutory licensing approaches.
BRIA.AI: Company detailing its AI accountability framework, emphasizing the use of commercially licensed data.
U.S. Copyright Office: The governmental body that authored the report, discussing various legal aspects of AI and copyright, including infringement, fair use, and licensing.
FAQ
What are the prima facie copyright infringement issues related to generative AI training?
Prima facie infringement in generative AI training primarily involves the reproduction right under copyright law. This occurs at multiple stages:
Data Collection and Curation: The act of downloading, copying, formatting, or creating subsets of copyrighted material to build a training dataset involves making reproductions of those works.
Training: The process of training a generative AI model involves repeatedly processing and adjusting model parameters based on the training data. This can implicate the reproduction right, especially if the model "memorizes" substantial portions of the training data.
Retrieval-Augmented Generation (RAG): Systems utilizing RAG typically involve copying material into a retrieval database or accessing external databases containing copyrighted works, which also constitutes reproduction.
Outputs: While the creation of outputs is a key feature, the focus for prima facie infringement lies in the copying that occurs during the training and retrieval processes, not necessarily the output itself unless it is substantially similar to a copyrighted work. However, if model weights have memorized protected expression from training data, distributing or using these weights could infringe the reproduction and potentially the derivative work rights.
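The copying described above can be made concrete with a short sketch. As a hedged illustration (toy function names and a toy word-overlap similarity metric, not any vendor's actual pipeline), the following shows the two points where a RAG system reproduces a source work: once when the work is chunked into the retrieval database, and again when matching chunks are copied into the prompt:

```python
# Minimal sketch of where reproduction occurs in a RAG pipeline.
# Hypothetical example: the names and the word-overlap scoring are
# illustrative stand-ins for real embedding-based retrieval.

def chunk(text: str, size: int = 40) -> list[str]:
    """Copy the source text into fixed-size chunks (reproduction no. 1)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def score(query: str, passage: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query: str, database: list[str], k: int = 1) -> list[str]:
    """Select the top-k chunks; each is copied into the prompt (reproduction no. 2)."""
    return sorted(database, key=lambda p: score(query, p), reverse=True)[:k]

# A copyrighted article is reproduced twice before any output exists:
article = "Fair use is a doctrine that permits limited use of copyrighted material."
database = chunk(article)                                   # copy into the database
context = retrieve("What does fair use permit?", database)  # copy into the prompt
prompt = f"Context: {' '.join(context)}\nQuestion: What does fair use permit?"
```

Note that both reproductions happen regardless of what the model ultimately generates, which is why the FAQ answer above locates the prima facie question in the retrieval process rather than only in the outputs.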
How does the fair use doctrine apply to the use of copyrighted works in generative AI training?
The fair use doctrine is a legal defense that can excuse certain uses of copyrighted material without permission. In the context of generative AI training, assessing fair use involves applying the four statutory factors outlined in copyright law:
Purpose and Character of the Use: This factor considers whether the use is transformative (adding new expression, meaning, or purpose) and its commercial nature. While AI training models may have a transformative purpose in learning statistical relationships within data, the ultimate use of the model (e.g., generating outputs that substitute for original works) is also relevant. Commercial use weighs against a finding of fair use.
Nature of the Copyrighted Work: This factor considers the type of work used (e.g., factual vs. creative, published vs. unpublished). Using highly creative or unpublished works may weigh against fair use.
Amount and Substantiality of the Portion Used: This considers how much of the copyrighted work is used. Training often involves using entire works or substantial portions, which can weigh against fair use, although some courts have found using entire works acceptable for transformative purposes like enabling search functions. The "reasonableness in light of purpose" of the amount used is also considered.
Effect of the Use upon the Potential Market: This is often a crucial factor. It examines whether the AI training or the resulting AI outputs harm the market for the original copyrighted work, including actual or potential sales and lost licensing opportunities. If AI-generated content serves as a substitute for original works, it can weigh against fair use.
Weighing these factors in the context of AI training is complex and involves considering both the technical process of training and the commercial deployment and outputs of the model.
What role does "transformativeness" play in the fair use analysis for generative AI training?
Transformativeness, under the first fair use factor, assesses whether the secondary use adds new expression, meaning, or purpose to the original work. While the technical process of training an AI model to learn patterns and relationships in data can be argued as transformative in its purpose of creating a new capability (generating content), the ultimate use of the generated outputs is also heavily considered. If the AI system's outputs serve a purpose similar to the original copyrighted works (e.g., generating text that replaces the need to read the original), even if the training process itself involved technical transformation, this can weigh against a finding of fair use. The focus is not solely on the technical alterations but also on the distinct purpose and character of the use as a whole, including the model's deployment and outputs.
How does the commercial nature of generative AI systems affect the fair use analysis?
The commercial nature of the use is a significant aspect of the first fair use factor. Most generative AI systems are developed and deployed for commercial purposes, whether through direct monetization of outputs, licensing of the models or datasets, or using AI to enhance existing commercial products. Courts have consistently viewed commercial uses with less favor in fair use analyses. Even if an organization has a non-profit corporate structure, direct monetization activities related to the AI system can be considered commercial. Identifying the specific commercial activities of all entities involved (data providers, model developers, deployers) is crucial for this analysis.
What are the concerns regarding the "amount and substantiality" of copyrighted material used in AI training?
Generative AI models are typically trained on vast datasets that often include entire or substantial portions of copyrighted works. While some court decisions have permitted the copying of entire works for transformative purposes like enabling search functions, the appropriateness of using such extensive amounts in AI training is subject to debate. The key question is whether the amount used is "reasonable in light of the purpose." If the training process requires using large amounts to achieve its goal of learning patterns, this might be viewed differently than simply making full copies available to the public. However, the fact that entire works are often ingested for training weighs heavily in this factor, especially when combined with concerns about market substitution.
How do lost licensing opportunities factor into the fair use analysis for AI training?
Lost licensing opportunities are a significant consideration under the fourth fair use factor (market effect). If generative AI training and outputs displace or negatively impact established or potential markets where copyright owners could license their works (e.g., for text and data mining, or for use in creating new content), this weighs against a finding of fair use. The emergence of markets for licensing data for AI training is seen as evidence that unlicensed training can harm this potential market. Copyright owners are increasingly exploring licensing models for AI training data, and the existence of such a market strengthens the argument that unauthorized use causes market harm.
What are the potential legal and practical challenges associated with voluntary and collective licensing for AI training data?
Voluntary and collective licensing are potential solutions for obtaining necessary rights for AI training, but they face challenges:
Feasibility: Licensing can be complex due to the massive scale of data required, the diversity of content types and rightsholders, and the difficulty in identifying and negotiating with all relevant parties, especially individual creators. However, several commenters in the report indicated that voluntary and collective licensing is feasible and already occurring in various industries.
Meaningful Compensation: Ensuring that rightsholders receive meaningful compensation commensurate with the value of their work in training is a challenge.
Legal Impediments: Collective licensing, while potentially efficient for managing rights at scale, could face legal challenges related to antitrust laws depending on their structure and implementation.
System Design: The feasibility of licensing can depend on the design of the AI system and its intended uses, as the value and type of data needed can vary significantly.
Despite these challenges, voluntary and collective licensing are seen as more aligned with the copyright system's principles compared to statutory licensing approaches.
What international approaches exist regarding copyright and text and data mining (TDM) for AI training?
Some international jurisdictions have implemented specific exceptions or limitations for text and data mining that may apply to AI training. Notably:
European Union (EU): The EU Copyright Directive includes exceptions for TDM, but it allows rightsholders to opt-out of this exception for online uses, which has been a point of contention.
Singapore: Requires lawful access to the work and limits the use of copies to computational data analysis, with restrictions on sharing copies.
Japan: Allows the use of copyrighted works for AI development or other non-enjoyment purposes, provided lawful access. This is considered a broad exception.
Israel: A legal opinion suggested that uses of copyrighted materials for machine learning could be considered fair use based on the transformative nature and public benefit, although this is an opinion and not a statutory exception.
Brazil: Proposed legislation includes a TDM exception with an opt-out mechanism for rightsholders.
These international approaches demonstrate varying degrees of flexibility and rightsholder control regarding the use of copyrighted material for TDM, highlighting the ongoing global debate on this issue.
Table of Contents with Timestamps
Introduction: The Magic Behind AI - 00:25
The seemingly magical process of AI generation and the vast data requirements that make it possible
Understanding Machine Learning Fundamentals - 01:52
Basic concepts of machine learning, neural networks, and the scale required for modern AI systems
The Data Pipeline: Collection to Training - 03:33
How AI companies gather, curate, and process massive datasets from web scraping to final training sets
Copyright Infringement: Where Law Meets Technology - 05:33
Identifying the multiple points where copyright law potentially intersects with AI training processes
The Fair Use Defense: A Four-Factor Analysis - 08:08
Examining transformativeness, commerciality, and the critical role of memorization in fair use determinations
Factor Analysis Deep Dive - 08:40
Detailed exploration of the four fair use factors and their application to AI training scenarios
The Licensing Landscape - 13:25
Current state of voluntary licensing deals and the feasibility debate around comprehensive licensing systems
Global Perspectives and International Approaches - 16:18
How different countries are addressing AI training through text and data mining exceptions and fair use doctrines
Future Implications and Unresolved Questions - 18:12
The evolving relationship between technology, markets, and legal frameworks in the AI era
Index with Timestamps
Adobe Firefly, 13:40
AI training, 00:52, 03:05, 08:45
Antitrust, 14:32
AP licensing, 13:48
Brazil, 16:53
China, 16:53
Collective licensing, 14:32, 14:59
Commercial use, 09:44, 10:02
Copyright infringement, 05:33, 06:02
Copyright management, 04:25
Copyright Office, 01:17, 15:21
Data curation, 03:53, 04:02
Data laundering, 09:58
Data scraping, 03:36
Dataset Providers Alliance, 13:59
Deep learning, 02:19
Derivative work, 06:18
Extended collective licensing, 14:59, 15:00
Factor four, 12:35, 13:08
Factor one, 08:33, 09:44
Factor three, 10:52, 11:28
Factor two, 10:23
Fair use, 08:08, 08:19
Fine-tuning, 04:47, 05:01
Getty Images, 04:07, 13:44
Google Books, 11:01
Guardrails, 09:36, 12:07
International treaties, 17:01
Japan, 16:30
Korea, 16:42
Licensing feasibility, 13:25, 14:05
Machine learning, 01:52, 01:55
Memorization, 11:37, 11:58
Neural networks, 02:19, 02:22
Pirated works, 10:10
Pre-training, 04:47, 04:59
RAG systems, 07:15, 07:22
Revenue sharing, 14:02
Singapore, 16:34
Stability AI, 04:07
Statutory licensing, 14:50
Text and data mining, 16:22
Three-step test, 17:07
Tokens, 02:46
Training data, 00:44, 03:18
Transformativeness, 08:23, 08:42
UK, 16:34
Voluntary licensing, 15:25, 15:33
Watermarks, 04:02, 04:07
Poll
Post-Episode Fact Check
✅ VERIFIED CLAIMS:
AI systems do train on massive datasets scraped from the internet - Confirmed by multiple academic papers and company disclosures
The U.S. Copyright Office released a comprehensive report on AI training and copyright - Report published and publicly available
Getty Images sued Stability AI over watermarked training data - Legal case filed in 2023, ongoing
Adobe Firefly, AP-OpenAI, and other licensing deals exist - Publicly announced partnerships
Google Books case established precedent for transformative fair use - Authors Guild v. Google (2015)
EU has text and data mining exceptions with opt-out provisions - Directive 2019/790 Article 4
Japan allows TDM but may restrict if licensing markets develop - Japanese Copyright Act Article 30-4
Training involves copying data to servers and temporary RAM copies - Technical necessity confirmed by computer science literature
⚠️ CONTEXT NEEDED:
"Running out of internet text" concern - This is a projected possibility, not current reality. Estimates vary widely among researchers
"Memorization occurs and verbatim extraction is possible" - True but frequency varies significantly by model and training methods
Effectiveness of guardrails is "disputed" - Accurately reflects ongoing technical and legal debates rather than settled fact
❓ UNRESOLVED/OPINION-BASED:
Whether AI training constitutes fair use - Currently being litigated in multiple courts
Whether model weights themselves are infringing copies - Novel legal theory without settled precedent
Feasibility of comprehensive licensing - Economic projections vary widely depending on assumptions
"Non-expressive use" legal theory validity - Untested in higher courts for AI training context
📊 STATISTICAL CLAIMS:
"Billions or trillions of connections/parameters" - Accurate for large language models (GPT-3: 175B parameters)
"70 kilometers per decade" species migration - This appears to be from climate science, not directly related to AI/copyright
🔍 SOURCES REFERENCED:
U.S. Copyright Office, "Copyright and Artificial Intelligence, Part 3: Generative AI Training" (pre-publication version, May 2025)
Various ongoing lawsuits (Getty v. Stability AI, etc.)
Academic literature on machine learning
International copyright legislation
OVERALL ASSESSMENT: The podcast accurately represents the complex legal landscape while clearly distinguishing between established facts and ongoing debates. Technical explanations are simplified but fundamentally correct.
Image (3000 x 3000 pixels)
Mind Map