Commit d86f804
Update and rename draft-todo-yourname-protocol.md to draft-iab-ai-control-report.md
Added initial version of adopted IAB draft of AI control workshop report.
1 parent 57b1c73 commit d86f804

File tree

2 files changed
+331 -89 lines changed

draft-iab-ai-control-report.md

Lines changed: 331 additions & 0 deletions
---
title: "IAB AI-CONTROL Workshop Report"
category: info

docname: draft-iab-ai-control-report-latest
submissiontype: IAB
number:
date:
consensus: true
v: 3
keyword:
 - policy
 - Artificial Intelligence
 - Robots Exclusion Protocol
 - web crawler
 - robots.txt

pi:
  compact: yes
  subcompact: yes

author:
 -
    ins: M. Nottingham
    name: Mark Nottingham
    organization: Cloudflare
    postal:
      - Prahran
    country: Australia
    uri: https://www.mnot.net/
 -
    ins: S. Krishnan
    name: Suresh Krishnan
    organization: Cisco

normative:

informative:

  CHATHAM-HOUSE:
    title: Chatham House Rule
    target: https://www.chathamhouse.org/about-us/chatham-house-rule
    author:
      -
        org: Chatham House

  CFP:
    title: IAB Workshop on AI-CONTROL
    target: https://datatracker.ietf.org/group/aicontrolws/about/
    author:
      -
        org: Internet Architecture Board

  PAPERS:
    title: IAB Workshop on AI-CONTROL Materials
    target: https://datatracker.ietf.org/group/aicontrolws/materials/
    author:
      -
        org: Internet Architecture Board

  AI-ACT:
    title: Regulation (EU) 2024/1689 of the European Parliament and of the Council
    target: https://eur-lex.europa.eu/eli/reg/2024/1689/oj
    author:
      -
        org: European Parliament
    date: 2024-06-13

--- abstract

The AI-CONTROL Workshop was convened by the Internet Architecture Board (IAB) in September 2024. This report summarizes its significant points of discussion and identifies topics that may warrant further consideration and work.


--- middle

# Introduction

The Internet Architecture Board (IAB) holds occasional workshops designed to consider long-term issues and strategies for the Internet, and to suggest future directions for the Internet architecture. This long-term planning function of the IAB is complementary to the ongoing engineering efforts performed by working groups of the Internet Engineering Task Force (IETF).

The Internet is one of the major sources of data used to train large language models (LLMs, or more generally "AI"). Because this use was not envisioned by most publishers of information on the Internet, a means of expressing the owners' preferences regarding AI crawling has emerged, sometimes backed by law (e.g., in the European Union's AI Act {{AI-ACT}}).

The IAB convened the AI-CONTROL Workshop to "explore practical opt-out mechanisms for AI and build an understanding of use cases, requirements, and other considerations in this space." {{CFP}} In particular, the emerging practice of using the Robots Exclusion Protocol {{?RFC9309}} -- also known as "robots.txt" -- has been uncoordinated, and may or may not be a suitable way to control AI crawlers. However, discussion was not limited to consideration of robots.txt, and approaches other than opt-out were considered.

To ensure many viewpoints were represented, the program committee invited a broad selection of technical experts, AI vendors, content publishers, civil society advocates, and policymakers.


## Chatham House Rule

Participants agreed to conduct the workshop under the Chatham House Rule {{CHATHAM-HOUSE}}, so this report does not attribute statements to individuals or organizations without express permission. Most submissions to the workshop were public and thus attributable; they are used here to provide substance and context.

{{attendees}} lists the workshop participants, unless they requested that this information be withheld.

## Views Expressed in this Report

This document is a report on the proceedings of the workshop. The views and positions documented in this report were expressed during the workshop by participants and do not necessarily reflect the IAB's views and positions.

Furthermore, the content of the report comes from presentations given by workshop participants and notes taken during the discussions, without interpretation or validation. Thus, the content of this report follows the flow and dialogue of the workshop but does not attempt to capture a consensus.

# Overview of the AI Crawling Landscape

The workshop began by surveying the state of AI control.

Currently, Internet publishers express their preferences for how their content is treated for purposes of AI training using a variety of mechanisms, including declarative ones, such as terms of service and robots.txt {{RFC9309}}, and active ones, such as use of paywalls and selective blocking of crawlers (e.g., by IP address or User-Agent).
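
As an illustration of the active mechanisms, a server operator might refuse requests whose User-Agent matches known crawler product tokens. This nginx fragment (placed inside a `server` block; the token list is illustrative, not exhaustive) is one common pattern:

```
if ($http_user_agent ~* "(GPTBot|CCBot)") {
    return 403;
}
```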

There was disagreement about the implications of AI opt-out overall. Research indicates that the use of such controls is becoming more prevalent, reducing the availability of data for AI training. Some of the participants expressed concern about the implications of this -- although at least one AI vendor seemed less concerned, indicating that "there are plenty of tokens available" for training, even if many opt out. Others expressed a need to opt out of AI training because of how they perceive its effects on their control over content, seeing AI as usurping their relationships with customers and a potential threat to whole industries.

However, there was quick agreement that both viewpoints were harmed by the current state of AI opt-out -- a situation where "no one is better off" (in the words of one participant).

Much of that dysfunction was attributed to the lack of coordination and standards for AI opt-out. Currently, content publishers need to consult with each AI vendor to understand how to opt out of training their products, as there is significant variance in each vendor's behaviour. Furthermore, publishers need to continually monitor both for new vendors, and for changes to the policies of the vendors they are aware of.

Underlying those immediate issues, however, are significant constraints that could be attributed to uncertainties in the legal context, the nature of AI, and the implications of needing to opt out of crawling for it.

## Crawl Time vs. Inference Time

Perhaps most significant is the "crawl time vs. inference time" problem. Statements of preference are apparent at crawl time, bound to content either by location (e.g., robots.txt) or embedded inside the content itself as metadata. However, the target of those directives is often disassociated from the crawler, either because the crawl data is not only used for training AI models, or because the preferences are applicable at inference time.

### Multiple Uses for Crawl Data

A crawl's data might have multiple uses because the vendor also has another product that uses it (e.g., a search engine), or because the crawl is performed by a party other than the AI vendor. Both are very common patterns: operators of many Internet search engines also train AI models, and many AI models use third-party crawl data. In either case, conflating different uses can change the incentives for publishers to cooperate with the crawler.

Well-established uses of crawling such as Internet search were seen by participants as at least partially aligned with the interests of publishers: they allow their sites to be crawled, and in return they receive higher traffic and attention due to being in the search index. However, several participants pointed out that this symbiotic relationship does not exist for AI training uses -- with some viewing AI as hostile to publishers, because it has the capacity to take traffic away from their sites.

Therefore, when a crawler has multiple uses that include AI, participants observed that "collateral damage" was likely for non-AI uses, especially when publishers take more active control measures, such as blocking or paywalls, to protect their interests.

Several participants expressed concerns about this phenomenon's effects on the ecosystem, effectively "locking down the Web", with one opining that there were implications for freedom of expression overall.

### Application of Preferences

When data is used to train an LLM, the resulting model cannot selectively use only a portion of that data when performing a task: inference uses the whole model, and it is not possible to identify which input data contributed to a given output.

This means that while publishers' preferences may be available when content is crawled, they generally are not when inference takes place. Those preferences that are stated in reference to use by AI -- for example, "no military uses" or "non-commercial only" -- cannot be applied by a general-purpose "foundation" model.

This leaves AI vendors that wish to comply with those preferences a few unappealing choices. They can simply omit such data from foundation models, thereby reducing their viability. Or, they can create a separate model for each permutation of preferences -- with a likely proliferation of models as the set of permutations expands.
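
The scale of that proliferation is easy to see: with n independent binary preferences, honouring every combination could require up to 2^n distinct models. A minimal sketch (the preference names are hypothetical):

```python
from itertools import product

# Hypothetical binary preference dimensions a publisher might state.
preferences = ["no-military-use", "non-commercial-only", "no-derivatives"]

# Each combination of granted/denied preferences would need its own model.
combinations = list(product((True, False), repeat=len(preferences)))
print(len(combinations))  # 2**3 = 8 models for just three preferences
```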

Compounding this issue was the observation that preferences change over time, whereas LLMs are created over long time frames and cannot easily be updated to reflect those changes. Of particular concern to some was how an opt-out regime makes the default stickier.

## Trust

This disconnection between the statement of preferences and its application was felt by participants to contribute to a lack of trust in the ecosystem, along with the typical lack of attribution for data sources in LLMs, lack of an incentive for publishers to contribute data, and finally (and most noted) a lack of any means of monitoring compliance with preferences.

This lack of trust led some participants to question whether communicating preferences is sufficient in all cases without an accompanying way to detect or mitigate cases of those preferences not being followed. Some participants also indicated that lack of trust was the primary cause of increasingly prevalent blocking of AI crawler IP addresses, among other measures.

## Attachment

One of the primary focuses of the workshop was on _attachment_ -- how preferences are associated with content on the Internet. A range of mechanisms was discussed.

### robots.txt (and similar)

The Robots Exclusion Protocol {{RFC9309}} is widely recognised by AI vendors as an attachment mechanism for preferences. Several deficiencies were discussed.

First, it does not scale to offer granular control over large sites where authors might want to express different policies for a range of content (for example, YouTube).

robots.txt is also typically under the control of the site administrator. If a site has content from many creators (as is often the case for social media and similar platforms), the administrator may not allow them to express their preferences fully, or at all.

If content is copied or moved to a different site, the preferences at the new site need to be explicitly transferred, because robots.txt is a separate resource.

These deficiencies led many participants to feel that robots.txt cannot be the only solution to opt-out: rather, it should be part of a larger system that addresses its shortcomings.

Participants noted that other, similar attachment mechanisms have been proposed. However, none appear to have gained as much attention or implementation (both by AI vendors and content owners) as robots.txt.
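
For reference, a site-wide opt-out expressed in Robots Exclusion Protocol syntax is minimal: a file served at `/robots.txt` that disallows a given crawler's product token while leaving others unaffected (the token below is hypothetical):

```
User-Agent: ExampleAICrawler
Disallow: /

User-Agent: *
Allow: /
```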

### Embedding

Another mechanism for associating preferences with content is to embed them into the content itself. Many formats used on the Internet allow this; for example, HTML has the `<meta>` tag, images have XMP and similar metadata sections, and XML and JSON have rich potential for extensions to carry such data.
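
For example, a page-level preference could in principle be carried in an HTML `<meta>` tag. The name and value below are hypothetical; no such vocabulary has been standardised:

```
<meta name="ai-training" content="disallow">
```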

Embedded preferences were seen to have the advantage of granularity, and of "travelling with" content as it is produced, when it is moved from site to site, or when it is stored offline.

However, several participants pointed out that embedded preferences are easily stripped from most formats. This is a common practice for reducing the size of a file (thereby improving performance when downloading it), and for assuring privacy (since metadata often leaks information unintentionally).

Furthermore, some types of content are not suitable for embedding. For example, it is not possible to embed preferences into purely textual content, and Web pages with content from several producers (such as a social media or comments feed) cannot easily reflect preferences for each one.

Participants noted that the means of embedding preferences in many formats would need to be determined by or coordinated with organisations outside the IETF. For example, HTML and many image formats are maintained by external bodies.

### Registries

In some existing copyright management regimes, it is already common to have a registry of works that is consulted upon use. For example, this approach is often used for photographs, music, and video.

Typically, registries use hashing mechanisms to create a "fingerprint" for the content that is robust to changes.
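
A naive sketch of such a fingerprint follows. Real registries use perceptual hashing that is far more robust to transformation; this normalisation-based hash only survives whitespace and case changes, and is shown purely to illustrate the idea:

```python
import hashlib

def fingerprint(text: str) -> str:
    # Normalize before hashing so trivial edits yield the same fingerprint.
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Whitespace and case changes do not alter the fingerprint...
assert fingerprint("Hello  World") == fingerprint("hello world\n")
# ...but any substantive edit does.
assert fingerprint("Hello World") != fingerprint("Hello, World")
```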

Using a registry decouples the content in question from its location, so that it can be found even if moved. It is also claimed to be robust against stripping of embedded metadata, which is a common practice to improve performance and/or privacy.

However, several participants pointed out issues with deploying registries at Internet scale. While they may be effective for (relatively) closed and well-known ecosystems such as commercial music publishing, applying them to a diverse and very large ecosystem like the Internet has proven problematic.

## Vocabulary

Another major focus area for the workshop was on _vocabulary_ -- the specific semantics of the opt-out signal. Several participants noted that there are already many proposals for vocabularies, as well as many conflicting vocabularies already in use. Several examples were discussed, including where existing terms were ambiguous, did not address common use cases, or were used in conflicting ways by different actors.

Although no conclusions regarding exact vocabulary were reached, it was generally agreed that a complex vocabulary is unlikely to succeed.


# Conclusions

Participants seemed to agree that on its current path, the ecosystem is not sustainable. As one remarked, "robots.txt is broken and we broke it."

Legal uncertainty, along with the fundamental limitations of opt-out regimes pointed out above, limits the effectiveness of any technical solution, which will be operating in a system unlike either robots.txt (where there is a symbiotic relationship between content owners and the crawlers) or copyright (where the default is effectively opt-in, not opt-out).

However, the workshop ended with general agreement that positive steps could be taken to improve communication of preferences from content owners for AI use cases. In discussion, it was evident that discovery of preferences from multiple attachment mechanisms is necessary to meet the diverse needs of content authors, and that therefore defining how they are combined is important.
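
As a sketch of what such a combination regime could look like (the precedence rule here is an assumption for illustration, not a workshop outcome), a preference embedded in the content might override a site-wide robots.txt statement, with a regime-wide default applying when neither is present:

```python
from typing import Optional

def ai_training_allowed(robots_pref: Optional[bool],
                        embedded_pref: Optional[bool],
                        default: bool = True) -> bool:
    # More specific signals win: embedded metadata travels with the content,
    # so it overrides the site-wide robots.txt statement (an assumed rule).
    if embedded_pref is not None:
        return embedded_pref
    if robots_pref is not None:
        return robots_pref
    # Today's opt-out regime makes "allowed" the effective default.
    return default

print(ai_training_allowed(robots_pref=False, embedded_pref=None))  # False
```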

We outline a proposed standards program below.

## Potential Standards Work

The following items were felt to be good starting points for IETF work:

* Attachment to Web sites by location (in robots.txt or a similar mechanism)
* Attachment via embedding in IETF-controlled formats (e.g., HTTP headers)
* Definition of a common core vocabulary
* Definition of the overall regime; e.g., how to combine preferences discovered from multiple attachment mechanisms
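
For the second item, attachment in an IETF-controlled format might take the shape of a response header field. The field name and value syntax below are entirely hypothetical, shown only to illustrate the attachment point:

```
HTTP/1.1 200 OK
Content-Type: text/html
AI-Preferences: train=no
```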

It would be expected that the IETF would coordinate with other SDOs to define embedding in other formats (e.g., HTML).

### Out of Initial Scope

It was broadly agreed that it would not be useful to work on the following items, at least to begin with:

* Enforcement mechanisms for preferences
* Registry-based solutions
* Identifying or authenticating crawlers and/or content owners
* Audit or transparency mechanisms

# Security Considerations

_TODO_


--- back


# About the Workshop

The AI-CONTROL Workshop was held on 2024-09-19 and 2024-09-20 at Wilkinson Barker Knauer in Washington DC, USA.

Workshop attendees were asked to submit position papers. These papers are published on the IAB website {{PAPERS}}, unless the submitter requested that they be withheld.

The workshop was conducted under the Chatham House Rule {{CHATHAM-HOUSE}}, meaning that statements cannot be attributed to individuals or organizations without explicit authorization.

## Agenda

This section outlines the broad areas of discussion on each day.

### Thursday 2024-09-19

Setting the stage
: An overview of the current state of AI opt-out, its impact, and existing work in this space

Lightning talks
: A variety of perspectives from participants

### Friday 2024-09-20

Opt-Out Attachment: robots.txt and beyond
: Considerations in how preferences are attached to content on the Internet

Vocabulary: what opt-out means
: What information the opt-out signal needs to convey

Discussion and wrap-up
: Synthesis of the workshop's topics and how future work might unfold

## Attendees {#attendees}

Attendees of the workshop are listed with their primary affiliation. Attendees from the program committee (PC) and the Internet Architecture Board (IAB) are also marked.

* Jari Arkko, Ericsson
* Hirochika Asai, Preferred Networks
* Farzaneh Badiei, Digital Medusa (PC)
* Fabrice Canel, Microsoft (PC)
* Lena Cohen, EFF
* Alissa Cooper, Knight-Georgetown Institute (PC, IAB)
* Marwan Fayed, Cloudflare
* Christopher Flammang, Elsevier
* Carl Gahnberg
* Max Gendler, The News Corporation
* Ted Hardie
* Dominique Hazaël-Massieux, W3C
* Gary Ilyes, Google (PC)
* Sarah Jennings, UK Department for Science, Innovation and Technology
* Paul Keller, Open Future
* Elizabeth Kendall, Meta
* Suresh Krishnan, Cisco (PC, IAB)
* Mirja Kühlewind, Ericsson (PC, IAB)
* Greg Leppert, Berkman Klein Center
* Greg Lindahl, Common Crawl Foundation
* Mike Linksvayer, GitHub
* Fred von Lohmann, OpenAI
* Shayne Longpre, Data Provenance Initiative
* Don Marti, Raptive
* Sarah McKenna, Alliance for Responsible Data Collection; CEO, Sequentum
* Eric Null, Center for Democracy and Technology
* Chris Needham, BBC
* Mark Nottingham, Cloudflare (PC)
* Paul Ohm, Georgetown Law (PC)
* Braxton Perkins, NBC Universal
* Chris Petrillo, Wikimedia
* Sebastian Posth, Liccium
* Michael Prorock
* Matt Rogerson, Financial Times
* Peter Santhanam, IBM
* Jeffrey Sedlik, IPTC/PLUS
* Rony Shalit, Alliance for Responsible Data Collection; Bright Data
* Ian Sohl, OpenAI
* Martin Thomson, Mozilla
* Thom Vaughan, Common Crawl Foundation (PC)
* Kat Walsh, Creative Commons
* James Whymark, Meta

The following participants requested that their identity and/or affiliation not be revealed:

* A government official


# IAB Members at the Time of Approval
{:numbered="false"}

Internet Architecture Board members at the time this document was approved for publication were:

_TBC_


# Acknowledgements
{:numbered="false"}

The Program Committee and the IAB would like to thank Wilkinson Barker Knauer for their generosity in hosting the workshop.

We also thank our scribes for capturing notes that assisted in production of this report:

* Zander Arnao
* Andrea Dean
* Patrick Yurky