I wrote about the mess this will make for copyright law here: journals.library.columbia.edu/index.php/st...
Thanks, another good read, though it's addressing a different question: that of copyright in AI-emitted content (and perhaps prompts). I'm concerned with AI-emitted content infringing human-authored works from the training corpus. I claim that whether it infringes depends on how LLMs work internally.
That training an LLM is fair use does not imply that every use of the trained LLM is also fair use. An LLM can be used for fuzzy search and phrase embeddings (fair use), or it can emit near-verbatim passages from its training corpus.
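To make the contrast concrete, here is a toy sketch (not any specific model; the corpus, embedding, and functions are invented for illustration). The retrieval use returns a pointer to a work; the generative use can reproduce the expression itself.

```python
# Toy illustration: the same learned representation can back two very different uses.
import numpy as np

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "a slow red fox walks past a sleeping hound",
    "completely unrelated sentence about copyright law",
]

def embed(text: str) -> np.ndarray:
    """Stand-in for a learned embedding: here, a simple bag-of-words vector."""
    vocab = sorted({w for doc in corpus for w in doc.split()})
    vec = np.zeros(len(vocab))
    for w in text.split():
        if w in vocab:
            vec[vocab.index(w)] += 1.0
    return vec

def fuzzy_search(query: str) -> str:
    """Retrieval use: output is an index/pointer, not the protected expression."""
    q = embed(query)
    sims = [q @ embed(doc) / (np.linalg.norm(q) * np.linalg.norm(embed(doc)) + 1e-9)
            for doc in corpus]
    return f"best match: document #{int(np.argmax(sims))}"

def generate(prompt: str) -> str:
    """Generative use: a model that has memorized training text can emit it
    near-verbatim. Hard-coded here purely to illustrate the failure mode."""
    return corpus[0]  # imagine this came out of a decoder, token by token

print(fuzzy_search("quick fox jumping over dogs"))
print(generate("write a sentence about a fox"))
```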
Suppose a work would not be found substantially similar under the tests designed for human authors, but it is then discovered that it was created by copying a source and making thesaurus/grammar substitutions until a plagiarism detector turned green. Surely that discovery would undermine any claim that it was an independent expression of ideas.
If LLMs are understood to manipulate expression rather than ideas, their generation process is mechanistically closer to that detector-gaming, and the cases of near-verbatim output can be understood not as incidental similarity but as sloppiness in evading the test while engaging in infringing behavior.
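A toy version of that detector-gaming loop, just to pin down the mechanism being compared (the synonym table, threshold, and overlap score are all invented for illustration):

```python
# Start from a verbatim copy, then swap in synonyms until a naive
# n-gram-overlap "plagiarism score" goes green.
import random

SYNONYMS = {"quick": "swift", "lazy": "idle", "jumps": "leaps",
            "brown": "russet", "dog": "hound"}

def ngram_overlap(a: str, b: str, n: int = 3) -> float:
    """Fraction of b's word n-grams that also appear verbatim in a."""
    def grams(s: str) -> set:
        words = s.split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(gb), 1)

original = "the quick brown fox jumps over the lazy dog"
rewrite = original                               # step 1: verbatim copying

random.seed(0)
while ngram_overlap(original, rewrite) > 0.2:    # step 2: substitute until "green"
    candidates = [w for w in rewrite.split() if w in SYNONYMS]
    if not candidates:                           # nothing left to swap
        break
    word = random.choice(candidates)
    rewrite = rewrite.replace(word, SYNONYMS[word], 1)

print(rewrite)  # superficially dissimilar, yet produced by manipulating expression only
```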
If LLM developers wish to claim that their models manipulate ideas, they should adopt cleanroom practices: train LLM A on copyrighted content and train LLM B exclusively on content licensed for the purpose (or in the public domain). LLM A can create an auditable intermediate representation that would be read into LLM B.
User queries only ever go to LLM B, which has never accessed the copyrighted materials. Many researchers in this area would agree that LLM "quality" would drop precipitously in this setting. There are many studies supporting the thesis that LLMs do not, and perhaps cannot, manipulate ideas.
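A minimal sketch of that cleanroom split, just to make the boundary explicit (the names and interfaces are hypothetical, not an existing system): LLM A sees copyrighted text and emits only an auditable intermediate representation; LLM B is trained solely on licensed/public-domain text plus that representation, and only B ever sees user queries.

```python
from dataclasses import dataclass

@dataclass
class IntermediateRepresentation:
    """Auditable artifact passed from A to B: abstracted facts/claims,
    deliberately stripped of the source's protected expression."""
    claims: list[str]

class ModelA:
    """Trained on copyrighted works; never queried by users."""
    def distill(self, copyrighted_doc: str) -> IntermediateRepresentation:
        # Placeholder: a real system would extract ideas/facts, and the
        # representation would be logged for audit before crossing the boundary.
        return IntermediateRepresentation(claims=["<abstracted idea from doc>"])

class ModelB:
    """Trained only on licensed/public-domain text plus audited representations."""
    def __init__(self) -> None:
        self.knowledge: list[str] = []

    def ingest(self, ir: IntermediateRepresentation) -> None:
        self.knowledge.extend(ir.claims)

    def answer(self, query: str) -> str:
        # B can only recombine licensed text and the audited claims;
        # it has no path to the original copyrighted expression.
        return f"answer to {query!r} using {len(self.knowledge)} audited claims"

# User queries cross only this boundary:
a, b = ModelA(), ModelB()
b.ingest(a.distill("some copyrighted source text"))
print(b.answer("what does the source say?"))
```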
Existing tests for substantial similarity are circumstantial evidence of the mechanism by which a human produced the allegedly infringing work. I believe those tests lack construct validity for machine-generated content.
(Having done some work on this, I think this type of cleanroom practice isn't actually super feasible. Even if you get curation like this right, a huge if, I'm finding that these types of counterfactuals are really brittle in practice.)