Description
i learned last week that at least the CC snapshots provide a unique WARC record id (presumably a UUID), which some consumers of the HPLT datasets would find valuable to have available as a unique reference, e.g. when comparing the outputs from different data extraction pipelines (it remains to be seen whether e.g. FineWeb has preserved these identifiers, but in principle we should preserve them in the HPLT data).
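to make the record id idea concrete, here is a minimal sketch (not warc2text code) of pulling the `WARC-Record-ID` header out of one uncompressed WARC record using only the python standard library. the function name `warc_record_id` and the sample record, including its UUID, are made up for illustration; a real pipeline would more likely rely on an existing WARC reader.

```python
from typing import Optional


def warc_record_id(record: bytes) -> Optional[str]:
    """Return the WARC-Record-ID of a single uncompressed WARC record, if any."""
    # The WARC header block ends at the first empty line.
    header_block = record.split(b"\r\n\r\n", 1)[0]
    lines = header_block.decode("utf-8", errors="replace").split("\r\n")
    # lines[0] is the version line (e.g. "WARC/1.0"); named fields follow.
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "warc-record-id":
            # The value is a URI wrapped in angle brackets.
            return value.strip().strip("<>")
    return None


# Hypothetical record header; the UUID below is invented for this example.
SAMPLE_RECORD = b"\r\n".join([
    b"WARC/1.0",
    b"WARC-Type: response",
    b"WARC-Record-ID: <urn:uuid:12345678-1234-5678-1234-567812345678>",
    b"WARC-Target-URI: http://example.com/",
    b"Content-Length: 0",
]) + b"\r\n\r\n"

print(warc_record_id(SAMPLE_RECORD))
```

carrying this value through as an extra field on each extracted document would let downstream users join outputs from different pipelines on the same source record.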
another thing i learned about relatively recently is the text & data mining reservation protocol. this appears to be a relatively new invention, so presumably it will hardly be present in crawls before 2024. but for newer data, it would be interesting to see what is recorded by e.g. the CC snapshots, or whether they possibly do not even archive pages that declare a TDM opt-out this way. if they do archive such pages, i think warc2text should at least offer an option to filter WARC records on this basis.
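a rough sketch of what such a filter could look like, assuming the opt-out is signalled through a `tdm-reservation: 1` HTTP response header (my reading of the TDMRep draft; the protocol also describes other channels, such as an HTML meta tag and a well-known JSON file, which this sketch ignores). the helper names and sample responses are hypothetical, and this is not an existing warc2text option.

```python
from typing import Dict, List


def http_headers(http_block: bytes) -> Dict[str, str]:
    """Parse the HTTP response headers stored in a WARC response payload."""
    head = http_block.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    headers: Dict[str, str] = {}
    # Skip the status line ("HTTP/1.1 200 OK"); header names are case-insensitive.
    for line in head.split("\r\n")[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def has_tdm_opt_out(http_block: bytes) -> bool:
    """True if the response declares a TDM reservation via its HTTP header."""
    return http_headers(http_block).get("tdm-reservation") == "1"


def drop_opted_out(payloads: List[bytes]) -> List[bytes]:
    """Keep only the payloads that do not declare a TDM opt-out."""
    return [p for p in payloads if not has_tdm_opt_out(p)]


# Hypothetical payloads for illustration only.
OPTED_OUT = (b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n"
             b"TDM-Reservation: 1\r\n\r\n<html></html>")
PLAIN = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

kept = drop_opted_out([OPTED_OUT, PLAIN])
print(len(kept))
```

whether this check is useful in practice depends on the open question above, i.e. whether the crawl stores the response headers of opted-out pages at all.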