How Robots.txt Fails Publishers: In-Content Rights Signals
Encypher Team

Robots.txt has no legal standing as a rights reservation mechanism. A federal ruling, EU consultations, and empirical data show publishers need in-content credentials.

In December 2025, a federal judge in the Southern District of New York compared robots.txt to a "keep off the grass" sign. The case was Ziff Davis v. OpenAI. Ziff Davis argued that OpenAI's crawlers violated the DMCA by circumventing robots.txt restrictions on its sites. Judge Sidney H. Stein dismissed the claim, ruling that robots.txt is not a "technological measure that effectively controls access" to copyrighted works under Section 1201 of the DMCA. The protocol relies on crawlers choosing to comply. It enforces nothing. Publishers spent three years updating robots.txt files to block AI crawlers. A federal court has now indicated that the mechanism they relied on has no legal standing as a rights reservation tool - and its architectural failures run deeper than the legal ones.

This post discusses legal developments for informational purposes only and does not constitute legal advice. Encypher is a technology company, not a law firm. Consult qualified legal counsel for advice specific to your situation.

Three Failure Modes of a 1994 Protocol

Robots.txt was written in 1994 to solve a server load problem. Web crawlers were hammering small servers, and site operators needed a way to tell crawlers which directories to skip. The protocol was never designed to express intellectual property rights. It has three structural limitations that no amount of updating can fix.

First, it has no authentication. Any crawler can identify itself as whatever it wants. A bot claiming to be Googlebot may or may not be Googlebot. Robots.txt directives are addressed to user-agent strings - self-reported identifiers that require no verification. A publisher blocking "GPTBot" in robots.txt has no way to confirm that the bot reading the file is actually GPTBot, or that other bots are not accessing the same content under different names. Data from Tollbit, a crawler monetization company, illustrates the practical result: across all AI bots, 13.26 percent of requests ignored robots.txt directives in Q2 2025, up from 3.3 percent in Q4 2024. Non-compliance quadrupled in six months.
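This first failure is trivial to demonstrate. A minimal sketch in Python, with a hypothetical publisher URL: any HTTP client can present any user-agent string, and the protocol has no way to tell an impostor from the real crawler. (Genuine verification requires out-of-band checks such as reverse DNS lookups, which robots.txt does not provide.)

```python
import urllib.request

# Nothing authenticates a crawler's identity. The user-agent string is
# the only identifier robots.txt directives can address, and it is
# entirely self-reported.
req = urllib.request.Request(
    "https://publisher.example/article",  # hypothetical URL
    headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
)
html = urllib.request.urlopen(req).read()  # served as if Googlebot had asked
```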

Second, robots.txt has no binding legal force. The Ziff Davis ruling supported what intellectual property lawyers had long suspected: a polite request is not a technological protection measure. The sign expresses a preference. It does not constitute a fence. Publishers who treated robots.txt as a legal instrument were building on a foundation that, as of December 2025, a federal court has indicated does not hold.

Third, robots.txt cannot express granular permissions. A publisher cannot say "you may index this article for search results but not use it for model training" or "you may retrieve this for RAG but must attribute and link back." The protocol is binary: allow or disallow a given path for a given user-agent. The licensing relationships that publishers and AI companies are negotiating require far more precision than a directory-level allow/disallow toggle can express.
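The protocol's entire vocabulary makes the point. A representative file (paths hypothetical) can express who may fetch which directories, and nothing else:

```
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /archive/
```

There is no field that distinguishes search indexing from model training, no way to require attribution, and no place to reference licensing terms.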

The Fourth Failure: Content That Leaves Your Server

These three problems might be fixable with a better protocol at the server level. The fourth is not.

Robots.txt exists on the publisher's origin server. It governs the behavior of crawlers that visit that server. The moment content leaves the origin - through syndication feeds, licensing agreements, RSS aggregators, API partnerships, caching layers, or a single copy-paste - the robots.txt file on the publisher's domain is irrelevant. The content now sits on a different server, or in a different system entirely, with no rights signal attached.

This is not an edge case. It describes how most publisher content actually circulates. A newspaper article published at 8 a.m. may appear by noon on three aggregator sites and two licensed partner platforms, as a cached version in a search engine index, and as an archived copy on the Wayback Machine. The robots.txt file on the newspaper's domain says nothing about any of these copies. If an AI crawler scrapes the aggregator site, the newspaper's robots.txt directives never enter the picture.
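The scoping is visible in how a compliant crawler actually consults the protocol. A minimal sketch using Python's standard-library parser, with hypothetical domains: the check resolves per host, so each copy of the article is governed by whichever robots.txt the hosting site serves.

```python
from urllib.robotparser import RobotFileParser

# robots.txt is resolved per host. The newspaper's directives govern
# only fetches from the newspaper's own domain.
newspaper = RobotFileParser("https://newspaper.example/robots.txt")
newspaper.read()
newspaper.can_fetch("GPTBot", "https://newspaper.example/article-123")

# The syndicated copy is governed solely by the aggregator's file; the
# newspaper's directives never enter this check.
aggregator = RobotFileParser("https://aggregator.example/robots.txt")
aggregator.read()
aggregator.can_fetch("GPTBot", "https://aggregator.example/article-123")
```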

Nathan Reitinger's empirical study of robots.txt deployment found that unless a website is in the top 1% of the most popular and best-resourced sites in the world, robots.txt is not effectively used to control AI crawlers. The bottom 99% of publishers - regional newspapers, trade publications, independent media, academic journals - either do not maintain robots.txt files that address AI crawlers, or maintain files that the crawlers ignore. The syndication problem compounds this: even publishers who maintain perfect robots.txt files lose control the moment their content appears on a site that does not.

Server-level blocking through Cloudflare's AI bot filtering or CDN-level controls addresses some of these problems at the origin. It cannot address content that has already left. It cannot attach rights signals to content circulating through third-party systems. And it can only deny access - it cannot communicate affirmative licensing terms that would enable legitimate use.

The EU Is Designing a Replacement

The European Commission understands the problem. In December 2025, the Commission launched a stakeholder consultation on machine-readable protocols for reserving rights against text and data mining under the AI Act and the GPAI Code of Practice. The consultation closed in January 2026. The Commission explicitly sought protocols that go beyond existing conventions such as robots.txt - standardized, machine-readable solutions that can be implemented across different media, languages, and sectors.

The strongest counterargument to this analysis is that the existing system is improving. Major AI companies - OpenAI, Google, Anthropic - have publicly committed to honoring robots.txt. Bilateral licensing deals between AI companies and publishers are multiplying. If voluntary compliance and commercial agreements solve the problem, the infrastructure question may be secondary.

This argument has two weaknesses. Voluntary commitments from the largest players do not bind the hundreds of smaller model operators, open-source scrapers, and non-US actors that also crawl publisher content. The Tollbit data showing a four-fold increase in non-compliance suggests that even voluntary compliance is eroding, not strengthening. And bilateral licensing deals - Amazon with the New York Times, Amazon with Hearst and Conde Nast - cover named parties for specific time periods. They create no per-document, per-query record that a given retrieval was authorized at the moment it occurred. They do not cover the long tail of content outside the signatories' catalogs. A licensing agreement is a contract between two organizations. It is not a machine-readable signal that travels with the content through every system it enters.

The EU consultation explicitly asked whether proposed protocols can be implemented across different media types. This is the right question. A rights reservation mechanism that works only on a publisher's origin web server, only for crawlers that identify themselves honestly, and only at the moment of the initial crawl does not meet the requirement the Commission described.

In-Content Credentials as the Technical Answer

The architectural requirement that emerges from each of these failures is specific: the rights signal must be embedded in the content itself. Not in a file on the origin server. Not in a separate database. Not in a bilateral contract. In the content - so that wherever the content goes, the signal goes with it.

C2PA content credentials meet this requirement. A C2PA authentication manifest embedded in a document at publication time creates a cryptographically signed record of authorship, licensing terms, and permitted downstream uses. The credential is tamper-evident: stripping or modifying it is detectable. It can carry granular assertions - rights holder identity, TDM permission flags, licensing terms, a resolution URL for rights queries. And it travels with the content through syndication, caching, aggregation, and retrieval.
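To make the shape of those assertions concrete, here is an illustrative sketch as a Python dict. The field names are hypothetical, not the normative C2PA 2.3 schema; they simply mirror the categories above: rights holder identity, TDM flags, licensing terms, and a resolution URL.

```python
# Illustrative only: field names are hypothetical stand-ins, not the
# normative C2PA 2.3 assertion schema. In a real manifest this payload
# is covered by the cryptographic signature, so stripping or editing
# it is detectable.
rights_assertion = {
    "rights_holder": "Example News LLC",           # hypothetical publisher
    "tdm_permissions": {                           # text-and-data-mining flags
        "search_indexing": "allowed",
        "ai_training": "notAllowed",
        "rag_retrieval": "constrained",            # e.g. attribution required
    },
    "license_terms_url": "https://example.news/licensing",      # hypothetical
    "rights_query_url": "https://rights.example.news/resolve",  # hypothetical
}
```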

We co-authored Section A.7 of the C2PA 2.3 specification, published January 8, 2026, which defines this mechanism for unstructured text. The standard was developed with review from Google, OpenAI, Adobe, Microsoft, the New York Times, BBC, and AP through the C2PA consortium. It exists because the technical requirement follows directly from the problem set: if the content itself carries no machine-readable rights signal, no downstream system can make an informed decision about whether to use it.

A C2PA-aware crawler or RAG pipeline reads the credential at the moment of retrieval - not at the moment of first crawl, and not by consulting a separate file on a server the content may have left months ago. Publishers who sign their content today are building the audit record that licensing deals and future regulation will require. They are also providing the machine-readable signal that the EU Commission's consultation process is designed to identify and standardize.
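A minimal sketch of that retrieval-time check, assuming a hypothetical verify_credential() stand-in for a real C2PA validation library and the illustrative assertion fields sketched above:

```python
from typing import Any

def verify_credential(document_bytes: bytes) -> dict[str, Any] | None:
    """Hypothetical stand-in for a real C2PA validator: returns the
    parsed manifest when the embedded credential verifies, or None when
    the credential is missing or tampering is detected."""
    return None  # placeholder; a real validator checks the signature chain

def retrieve_for_rag(document_bytes: bytes) -> bytes | None:
    # The check runs on the content itself, wherever this copy now
    # lives - no call back to the origin server's robots.txt required.
    manifest = verify_credential(document_bytes)
    if manifest is None:
        return None  # unsigned or tampered: treat as unlicensed
    if manifest["tdm_permissions"]["rag_retrieval"] == "notAllowed":
        return None  # rights holder opted out of retrieval use
    return document_bytes  # permitted; keep the manifest as the audit record
```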

The technology is not speculative. The standard is published. The consortium members include the major AI companies, the major publishers, and the major platform operators. The open question is adoption - whether publishers will embed credentials before or after the regulatory and litigation landscape forces the issue. The EU Commission's agreed list of machine-readable opt-out protocols will be published and reviewed at least every two years. Publishers waiting to learn which protocol to adopt are waiting on a process that is converging on the only mechanism that solves the syndication problem: credentials that travel with the content, not files that stay on the server.
