The RAG Copyright Trap: Why Retrieval Is Not Safe Harbor
Encypher Team

RAG copyright liability may exceed training risk. The US Copyright Office, two federal rulings, and active litigation indicate that retrieval can create fresh infringement exposure at every query.

When a Fortune 500 company's legal team approved a RAG deployment last year to summarize licensed news feeds for internal analysts, they signed off because the system did not retrain on the content - it just retrieved it. The general counsel's memo distinguished RAG from fine-tuning on exactly this basis: no model weights were updated, so the copyright exposure associated with training did not apply. That reasoning was wrong. The US Copyright Office's Part 3 report on generative AI found that RAG "involves making copies of that additional content and inputting that copy as part of an updated prompt" - and that this copying is less likely to qualify as transformative use than training itself. A November 2025 federal ruling in the Southern District of New York then declined to dismiss direct infringement claims against a RAG operator, indicating that even non-verbatim summaries may plausibly support a direct infringement claim. Enterprises that adopted RAG to avoid copyright risk may have increased it - and without machine-readable provenance metadata traveling with retrieved content, they have no technical record to prove licensing consent at the moment of retrieval.

This post discusses legal developments for informational purposes only and does not constitute legal advice. Encypher is a technology company, not a law firm. Consult qualified legal counsel for advice specific to your situation.

What RAG Actually Does at Inference Time

The legal distinction between training and RAG is smaller than most enterprise risk assessments assume, and it cuts in the wrong direction.

When a large language model trains on copyrighted works, it ingests millions of documents and encodes statistical patterns into model weights. The original text is not stored. The model produces outputs that reflect broad patterns learned across a diverse corpus. This process has a plausible - if contested - transformative use argument under fair use doctrine: the model learned from the works but does not reproduce them.

RAG does something different. When a user queries a RAG system, the system retrieves specific documents from an index and copies them - sometimes in full, sometimes in relevant chunks - directly into the prompt context window. The model then generates a response using that copied text as source material. The Copyright Office described this mechanism precisely: RAG retrieves "targeted works" with specific intent, rather than drawing on a "large, diverse dataset." The Office concluded that this targeting makes RAG outputs "less likely to be transformative where the purpose is to generate outputs that summarize or abridge the source work."
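To make the copying step concrete, here is a minimal sketch of how a retrieval pipeline typically assembles a prompt. Everything in it is illustrative: the `Chunk` type, the toy keyword-overlap `retrieve` function (standing in for a real vector search), and the sample documents are assumptions, not any vendor's implementation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str  # hypothetical identifier for the source document
    text: str    # the source text as stored in the index


def retrieve(query: str, index: list[Chunk], k: int = 2) -> list[Chunk]:
    """Toy keyword-overlap ranking, standing in for a vector search."""
    terms = set(query.lower().split())
    ranked = sorted(
        index,
        key=lambda c: len(terms & set(c.text.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    # The retrieved text is copied verbatim into the prompt context.
    context = "\n\n".join(f"[{c.doc_id}]\n{c.text}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}"


index = [
    Chunk("news-001", "Regulators approved the merger after a long review."),
    Chunk("news-002", "The central bank held interest rates steady."),
    Chunk("memo-003", "Quarterly planning notes for the analytics team."),
]
query = "What did regulators decide about the merger?"
prompt = build_prompt(query, retrieve(query, index, k=1))
```

The point of the sketch is the `build_prompt` step: whatever ranking method is used, the selected source text lands in the prompt as a literal copy before the model generates anything.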

The counter-intuitive result: RAG may face greater copyright exposure than general model training in some respects. Training at least transforms content into statistical representations. RAG copies targeted text into memory and serves derivative responses from it at every query.

The Litigation Evidence

Two major federal cases confirm that publishers have identified this vulnerability and are building litigation strategy around it.

In Advance Local Media LLC v. Cohere Inc., fourteen publishers - including Forbes, The Guardian, the Los Angeles Times, The Atlantic, and Conde Nast - sued Cohere over its RAG-based products. The complaint identifies more than 4,000 allegedly infringed works and 75 output examples that the plaintiffs say closely track the structure, sequencing, tone, and expressive choices of the original reporting. In November 2025, Judge McMahon denied Cohere's motion to dismiss, holding that "substitutive summaries" - non-verbatim outputs that mirror the expressive structure and journalistic storytelling of originals - may plausibly infringe copyright. The court declined to set a word-count threshold for infringement. Cohere did not challenge the RAG-retrieval theory at dismissal, leaving it fully intact as a basis for the case to proceed.

The substitutive-summary holding matters for every enterprise running RAG. Most enterprise RAG systems produce exactly this kind of output: summaries, abridgments, and synthesized answers drawn from source documents. Judge McMahon's ruling indicates that the test is not whether the output copies words verbatim but whether it substitutes for the market of the original work.

In Dow Jones v. Perplexity AI, the Wall Street Journal and New York Post allege two independent counts of copyright infringement: one for copying articles into Perplexity's RAG index, and a second for providing users with "full or partial verbatim reproductions" of copyrighted articles in responses. The complaint also asserts Lanham Act trademark dilution for AI-hallucinated content attributed to plaintiffs' publications using their trademarks. The case remains active as of February 2026 with no settlement. The two-count structure maps directly to the two-stage liability that any enterprise RAG operator faces: liability at ingestion when content enters the index, and liability at output when the system serves a response derived from that content.

The Licensing Defense and Where It Falls Short

The strongest counterargument is straightforward: enterprises using properly licensed content in their RAG indexes face no exposure. A company that holds valid licenses for every document in its retrieval corpus has a contractual defense. Several major licensing deals in 2025 - Amazon with the New York Times, and separately with Hearst and Conde Nast - suggest the market can solve RAG liability commercially.

Three problems undercut this defense at scale.

First, most enterprise RAG systems do not operate on a single, curated corpus. They pull from licensed feeds, open-web crawls, internal documents that themselves contain excerpted copyrighted material, and third-party data providers whose own licensing status is opaque. An internal strategy memo that quotes three paragraphs from a Financial Times article enters the index alongside the licensed wire feeds. At query time, the system does not distinguish between them. Bilateral deals with flagship publishers do not cover the long tail of content in a typical enterprise RAG corpus.

Second, even a fully licensed retrieval corpus creates output-stage liability if the system returns substitutive summaries that compete with the market of the original work. Judge McMahon's ruling does not distinguish between licensed and unlicensed source material when assessing whether the output substitutes for the original. A license to retrieve content is not necessarily a license to generate competing summaries from it - that depends on the specific terms of the agreement, which most blanket licensing deals do not address at the per-query level.

Third, licensing contracts produce no per-query audit record. A court or regulator asking what specific content was retrieved on a specific date, and under what authorization, cannot be satisfied by a blanket license agreement. The defense requires evidence that consent was checked for each document at the moment of retrieval. No major RAG platform produces that evidence today.
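As a rough illustration of what a per-query audit record could contain, the sketch below logs which documents were retrieved, when, and whether a license identifier was on file for each. The field names, the `license_ids` lookup, and the sample identifiers are all hypothetical; a production record would also need tamper-evident storage, which this sketch does not attempt.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(query, retrieved, license_ids, when=None):
    """Describe one retrieval event: what was pulled, when, under which license."""
    when = when or datetime.now(timezone.utc)
    return {
        "retrieved_at": when.isoformat(),
        "query_sha256": sha256_hex(query),
        "documents": [
            {
                "doc_id": doc_id,
                "content_sha256": sha256_hex(text),
                # None means no license was on file when the document was retrieved.
                "license_id": license_ids.get(doc_id),
                "license_on_file": doc_id in license_ids,
            }
            for doc_id, text in retrieved
        ],
    }


record = build_audit_record(
    "merger ruling summary",
    [
        ("wire-17", "Licensed wire story text."),
        ("memo-03", "Memo quoting an outside article."),
    ],
    {"wire-17": "LIC-2025-001"},  # hypothetical license lookup
    when=datetime(2026, 2, 1, 12, 0, tzinfo=timezone.utc),
)
log_line = json.dumps(record, sort_keys=True)  # one append-only log line per query
```

A record like this answers the "what was retrieved, on what date, under what authorization" question directly, and it surfaces unlicensed documents such as the quoted memo instead of hiding them behind a blanket agreement.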

The EU Intellectual Property Office reached a similar conclusion in its May 2025 analysis. The EUIPO noted that RAG's temporary storage of retrieved content might qualify for the temporary-reproduction exception under EU law - but conditioned this on how the system is technically implemented: if retrieved content is stored locally rather than handled transiently, it may constitute long-term reproduction. The temporary-reproduction defense is not automatic. It depends on whether the content was covered by a rights signal or license at retrieval time - facts that only a technical audit trail can establish.

The Provenance Gap

The legal exposure described above has a specific technical shape. At the moment of retrieval, a RAG system needs to know the rights status of each source document: who published it, under what terms, and whether the current use is permitted. That information must travel with the content itself, because content moves through crawlers, APIs, data pipelines, and vector databases before it reaches the retrieval layer. Metadata stored in a separate system - a CMS, a licensing database, a robots.txt file - is not present at the point where the RAG system makes its retrieval decision.

This is the gap that machine-readable content credentials address. A C2PA authentication manifest embedded in a document at publication time creates a tamper-evident, cryptographically signed record of authorship, licensing terms, and permitted downstream uses. That record travels with the content through every system it passes through. When a C2PA-signed document enters a RAG corpus, the signed credential persists: who published it, under what license, whether text-and-data-mining is permitted. A C2PA-aware RAG pipeline can read those credentials at index time, filter or flag content based on its rights status, and log the credential hash as part of the retrieval record.
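The index-time step can be sketched roughly as follows. This is not C2PA manifest parsing or signature verification: the `credential` dictionary and its `publisher`, `license`, and `tdm_permitted` fields are hypothetical stand-ins for rights data that is assumed to have already been extracted and verified upstream.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    doc_id: str
    text: str
    # Hypothetical, already-verified rights fields; None if the document carries none.
    credential: Optional[dict] = None


def credential_hash(credential: dict) -> str:
    # Canonical JSON so the same credential always produces the same hash.
    canonical = json.dumps(credential, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def index_decision(doc: Document) -> dict:
    """Decide whether to index, exclude, or flag a document based on its rights data."""
    if doc.credential is None:
        return {"doc_id": doc.doc_id, "action": "flag",
                "reason": "no credential", "credential_sha256": None}
    digest = credential_hash(doc.credential)
    if not doc.credential.get("tdm_permitted", False):
        return {"doc_id": doc.doc_id, "action": "exclude",
                "reason": "text-and-data-mining not permitted",
                "credential_sha256": digest}
    return {"doc_id": doc.doc_id, "action": "index",
            "reason": "text-and-data-mining permitted",
            "credential_sha256": digest}


docs = [
    Document("wire-17", "Licensed wire story.",
             {"publisher": "Example Wire", "license": "LIC-2025-001", "tdm_permitted": True}),
    Document("mag-04", "Magazine feature.",
             {"publisher": "Example Magazine", "license": "all-rights-reserved", "tdm_permitted": False}),
    Document("crawl-88", "Open-web page with no rights data."),
]
decisions = [index_decision(d) for d in docs]
```

Storing the credential hash alongside each index entry is what would later let a retrieval log point back to the exact rights statement that was in force when the document was admitted.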

We co-authored Section A.7 of the C2PA specification, published January 8, 2026, which defines this mechanism for unstructured text. The standard was developed with review from Google, OpenAI, Adobe, Microsoft, the New York Times, BBC, and AP through the C2PA consortium. It exists because the technical requirement follows directly from the legal landscape: if the content itself carries no machine-readable rights signal, no downstream system can make an informed decision about whether to use it, and no enterprise can reconstruct its compliance posture after the fact.

Without signed provenance metadata, "we had a license" remains a contract claim that says nothing about which documents were retrieved on which dates - and that is precisely the question that the Cohere and Perplexity litigation will force enterprises to answer.

What Comes Next

If the Cohere and Perplexity cases proceed on their current trajectories, the next few years will likely produce the first final judgment - not just a motion-to-dismiss denial - in a RAG-specific copyright case. The Cohere case has survived dismissal with the RAG theory intact. The Perplexity case asserts both input-stage and output-stage liability with no settlement in sight. The Copyright Office has stated its position. Three federal courts in the Southern District of New York are developing the doctrine in parallel.

Enterprises that wait for a final ruling before building provenance infrastructure will be retrofitting under litigation pressure. The question that every RAG operator should be able to answer today is specific: for any given query response your system produces, can you identify which source documents contributed to it, and can you demonstrate that each of those documents was licensed for that use at the time of retrieval?
