Commit e7af051

Address Roman's feedback
1 parent b4a6507 commit e7af051

1 file changed: +25 -7 lines changed
draft-iab-ai-control-report.md

Lines changed: 25 additions & 7 deletions
@@ -66,6 +66,24 @@ informative:
         org: European Parliament
     date: 2024-06-13

+  DECLINE:
+    title: "Consent in Crisis: The Rapid Decline of the AI Data Commons"
+    target: https://www.ietf.org/slides/slides-aicontrolws-consent-in-crisis-the-rapid-decline-of-the-ai-data-commons-00.pdf
+    author:
+      -
+        ins: S. Longpre
+        name: Shayne Longpre
+      -
+        ins: R. Mahari
+        name: Robert Mahari
+      -
+        ins: A. Lee
+        name: Ariel Lee
+      -
+        ins: C. Lund
+        name: Campbell Lund
+    date: 2025
+
 --- abstract

 The AI-CONTROL Workshop was convened by the Internet Architecture Board (IAB) in September 2024. This report summarizes its significant points of discussion and identifies topics that may warrant further consideration and work.
@@ -80,7 +98,7 @@ The Internet Architecture Board (IAB) holds occasional workshops designed to con

 The Internet is one of the major sources of data used to train large language models (Large Language Models (LLMs), or more generally, "Artificial Intelligence (AI)"). Because this use was not envisioned by most publishers of information on the Internet, a means of expressing the owners' preferences regarding AI crawling has emerged, sometimes backed by law (e.g., in the European Union's AI Act {{AI-ACT}}).

-The IAB convened the AI-CONTROL Workshop to "explore practical opt-out mechanisms for AI and build an understanding of use cases, requirements, and other considerations in this space" {{CFP}}. In particular, the emerging practice of using the Robots Exclusion Protocol {{?RFC9309}} -- also known as "robots.txt" -- has been uncoordinated, and may or may not be a suitable way to control AI crawlers. However, discussion was not limited to consideration of robots.txt, and approaches other than opt-out were considered.
+The IAB convened the AI-CONTROL Workshop on 19-20 September 2024 to "explore practical opt-out mechanisms for AI and build an understanding of use cases, requirements, and other considerations in this space" {{CFP}}. In particular, the emerging practice of using the Robots Exclusion Protocol {{?RFC9309}} -- also known as "robots.txt" -- has not been coordinated between AI crawlers, resulting in considerable differences in how they treat it. Furthermore, robots.txt may or may not be a suitable way to control AI crawlers. However, discussion was not limited to consideration of robots.txt, and approaches other than opt-out were considered.

 To ensure many viewpoints were represented, the program committee invited a broad selection of technical experts, AI vendors, content publishers, civil society advocates, and policymakers.

@@ -101,9 +119,9 @@ Furthermore, the content of the report comes from presentations given by worksho

 The workshop began by surveying the state of AI control.

-Currently, Internet publishers express their preferences for how their content is treated for purposes of AI training using a variety of mechanisms, including declarative ones, such as terms of service and robots.txt {{RFC9309}}, and active ones, such as the use of paywalls and selective blocking of crawlers (e.g., by IP address, User-Agent).
+Currently, Internet publishers express their preferences for how their content is treated for purposes of AI training using a variety of mechanisms, including declarative ones, such as terms of service, embedded metadata, and robots.txt {{RFC9309}}, and active ones, such as use of paywalls and selective blocking of crawlers (e.g., by IP address, User-Agent).

-There was disagreement about the implications of AI opt-out overall. Research indicates that the use of such controls is becoming more prevalent, reducing the availability of data for AI training. Some of the participants expressed concern about the implications of this -- although at least one AI vendor seemed less concerned by this, indicating that "there are plenty of tokens available" for training, even if many opt out. Others expressed a need to opt out of AI training because of how they perceive its effects on their control over content, seeing AI as usurping their relationships with customers and a potential threat to whole industries.
+There was disagreement about the implications of AI opt-out overall. Research presented at the workshop {{DECLINE}} indicates that the use of such controls is becoming more prevalent, reducing the availability of data for AI training. Some of the participants expressed concern about the implications of this -- although at least one AI vendor seemed less concerned by this, indicating that "there are plenty of tokens available" for training, even if many opt out. Others expressed a need to opt out of AI training because of how they perceive its effects on their control over content, seeing AI as usurping their relationships with customers and a potential threat to whole industries.

 However, there was quick agreement that both viewpoints were harmed by the current state of AI opt-out -- a situation where "no one is better off" (in the words of one participant).

@@ -133,13 +151,13 @@ This means that while publishers' preferences may be available when content is c

 This leaves a few unappealing choices to AI vendors that wish to comply with those preferences. They can simply omit such data from foundation models, thereby reducing their viability. Or, they can create a separate model for each permutation of preferences -- with a likely proliferation of models as the set of permutations expands.

-Compounding this issue was the observation that preferences change over time, whereas LLMs are created over long time frames and cannot easily be updated to reflect those changes. Of particular concern to some was how an opt-out regime makes the default stickier.
+Compounding this issue was the observation that preferences change over time, whereas LLMs are created over long time frames and cannot easily be updated to reflect those changes. Of particular concern to some was how this makes an opt-out regime "stickier" because content that has no associated preference (such as that which predates the authors' knowledge of LLMs) is allowed to be used for these unforeseen purposes.

 ## Trust

 This disconnection between the statement of preferences and its application was felt by participants to contribute to a lack of trust in the ecosystem, along with the typical lack of attribution for data sources in LLMs, lack of an incentive for publishers to contribute data, and finally (and most noted) a lack of any means of monitoring compliance with preferences.

-This lack of trust led some participants to question whether communicating preferences is sufficient in all cases without an accompanying way to mitigate or track cases of those preferences being followed. Some participants also indicated that a lack of trust was the primary cause of the increasingly prevalent blocking of AI crawler IP addresses, among other measures.
+This lack of trust led some participants to question whether communicating preferences is sufficient in all cases without an accompanying way to enforce them, or even to audit adherence to them. Some participants also indicated that a lack of trust was the primary cause of the increasingly prevalent blocking of AI crawler IP addresses, among other measures.

 ## Attachment

@@ -190,7 +208,7 @@ Although no conclusions regarding exact vocabulary were reached, it was generall

 # Conclusions

-Participants seemed to agree that on its current path, the ecosystem is not sustainable. As one remarked, "robots.txt is broken and we broke it."
+Participants generally agreed that on its current path, the ecosystem is not sustainable. As one remarked, "robots.txt is broken and we broke it."

 Legal uncertainty, along with fundamental limitations of opt-out regimes pointed out above, limit the effectiveness of any technical solution, which will be operating in a system unlike either robots.txt (where there is a symbiotic relationship between content owners and the crawlers) or copyright (where the default is effectively opt-in, not opt-out).

@@ -286,7 +304,7 @@ Attendees of the workshop are listed with their primary affiliation. Attendees f
 * Fred von Lohmann, OpenAI
 * Shayne Longpre, Data Provenance Initiative
 * Don Marti, Raptive
-* Sarah McKenna, Alliance for Responsible Data Collection; CEO, Sequentum
+* Sarah McKenna, Alliance for Responsible Data Collection; Sequentum
 * Eric Null, Center for Democracy and Technology
 * Chris Needham, BBC
 * Mark Nottingham, Cloudflare (PC)
