extract additional WARC metadata #78

i learned last week that at least the CC snapshots provide a unique WARC record id (presumably a UUID), which some consumers of the HPLT datasets would find valuable to have available as a unique reference, e.g. when comparing the outputs from different data extraction pipelines.  it remains to be seen whether e.g. FineWeb has preserved these identifiers, but in principle we should preserve them in the HPLT data.
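to make the request concrete, here is a minimal sketch of pulling the identifier out of a record's WARC header block, using only the python standard library.  it assumes the `WARC-Record-ID` field has the common `<urn:uuid:...>` form; the sample record and the function names are made up for illustration and are not part of `warc2text`.

```python
import re
import uuid
from typing import Optional

# Hypothetical sample record; the URI and UUID are invented for illustration.
SAMPLE_RECORD = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Record-ID: <urn:uuid:12345678-1234-5678-1234-567812345678>\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
)

def parse_warc_headers(record: bytes) -> dict:
    """Parse the WARC header block (everything before the first blank line).

    Header names are lowercased; folded continuation lines are not handled.
    """
    head, _, _ = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    headers = {}
    for line in lines[1:]:  # skip the version line, e.g. "WARC/1.0"
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers

def record_uuid(headers: dict) -> Optional[uuid.UUID]:
    """Return the record id as a UUID, or None if it is absent or not a urn:uuid."""
    raw = headers.get("warc-record-id", "")
    match = re.fullmatch(r"<urn:uuid:([0-9a-fA-F-]{36})>", raw)
    if not match:
        return None
    try:
        return uuid.UUID(match.group(1))
    except ValueError:
        return None

if __name__ == "__main__":
    print(record_uuid(parse_warc_headers(SAMPLE_RECORD)))
```

returning `None` instead of raising keeps the sketch usable on records whose id is not a UUID; whether the output should carry the raw field or the parsed UUID is an open design choice.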

another thing i learned about relatively recently is the [text & data mining reservation protocol](https://www.w3.org/community/reports/tdmrep/CG-FINAL-tdmrep-20240510/).  this appears to be a relatively new invention, so presumably it will hardly be present in crawls before 2024.  but for newer data, it would be interesting to see what is recorded by e.g. the CC snapshots, or whether they possibly do not even archive pages that declare a TDM opt-out this way?  if they do archive such pages, i think `warc2text` should at least offer an option to filter WARC records on this basis.
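as a rough illustration of what such a filter could check, the sketch below looks at the payload of a WARC `response` record (HTTP status line, headers, blank line, body) for the two per-document signals i understand the protocol to define: a `tdm-reservation: 1` HTTP response header and a `<meta name="tdm-reservation" content="1">` element.  the function names are hypothetical and this is python rather than the C++ of `warc2text`; it only shows the logic.

```python
from html.parser import HTMLParser

def http_headers(payload: bytes) -> dict:
    """Parse the HTTP header block of a response payload into a lowercase-keyed dict."""
    head, _, _ = payload.partition(b"\r\n\r\n")
    headers = {}
    for line in head.decode("iso-8859-1").split("\r\n")[1:]:  # skip status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers

class _TdmMetaParser(HTMLParser):
    """Sets .reserved when a <meta name="tdm-reservation" content="1"> is seen."""

    def __init__(self):
        super().__init__()
        self.reserved = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {key: (value or "") for key, value in attrs}
        name = attributes.get("name", "").strip().lower()
        content = attributes.get("content", "").strip()
        if name == "tdm-reservation" and content == "1":
            self.reserved = True

def tdm_reserved(payload: bytes) -> bool:
    """True if the HTTP headers or the HTML body declare a TDM reservation."""
    if http_headers(payload).get("tdm-reservation") == "1":
        return True
    _, _, body = payload.partition(b"\r\n\r\n")
    parser = _TdmMetaParser()
    parser.feed(body.decode("utf-8", errors="replace"))
    return parser.reserved
```

this does not cover the site-wide `/.well-known/tdmrep.json` file, which is not part of an individual page's record and would need a separate lookup per host.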

