RFC to introduce Datasets + Common Crawl as a Dataset #5248
desmondcheongzx
started this conversation in
Ideas
A concrete implementation of this proposal might look like #5244
This dataset has been released :) https://docs.daft.ai/en/stable/datasets/common-crawl/
Motivation
Users can read Common Crawl data with daft.read_warc, but they must manually construct complex S3 glob patterns to target specific segments and file types. By making Common Crawl a first-class citizen in Daft, we can dramatically simplify this workflow and make the dataset more accessible to researchers and practitioners.
Current user experience
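To make the manual step concrete, here is a minimal sketch of the kind of glob a user has to assemble by hand today. It assumes the public Common Crawl layout (`s3://commoncrawl/crawl-data/<crawl>/segments/<segment>/<kind>/`); the helper name `manual_glob` and the crawl ID are illustrative only and are not part of Daft.

```python
# Assumed layout of the public Common Crawl bucket:
#   s3://commoncrawl/crawl-data/<crawl>/segments/<segment>/<warc|wat|wet>/<file>
COMMON_CRAWL_ROOT = "s3://commoncrawl/crawl-data"

# File suffix used inside each content directory.
SUFFIXES = {"warc": "warc.gz", "wat": "warc.wat.gz", "wet": "warc.wet.gz"}


def manual_glob(crawl: str, segment: str = "*", kind: str = "warc") -> str:
    """Build the S3 glob a user currently writes by hand (hypothetical helper)."""
    if kind not in SUFFIXES:
        raise ValueError(f"unknown file kind: {kind!r}")
    return f"{COMMON_CRAWL_ROOT}/{crawl}/segments/{segment}/{kind}/*.{SUFFIXES[kind]}"


# Every segment of one crawl, raw WARC files only.
pattern = manual_glob("CC-MAIN-2025-33")
print(pattern)
# The resulting string is what would then be handed to daft.read_warc.
```

The user has to know the bucket name, the directory layout, and the per-kind file suffix before reading a single record.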
The current approach relies on the read_warc function and does nothing to simplify how users obtain the WARC files to read.
Goal
Provide a simple, ergonomic way to access Common Crawl from Daft. For example:
This should “just work” without requiring users to know S3 bucket structure or individual segment URLs.
Additional Common Crawl background
Proposal
We can introduce the concept of Datasets to Daft and create abstractions for sources such as Hugging Face and Common Crawl, so that users do not need to call read_warc, read_parquet, etc. directly.
Common Crawl would be a fantastic first Dataset to test drive this idea. We can do something like:
daft.datasets.common_crawl
Load Common Crawl data as a DataFrame. This function automatically resolves the specified crawl and segment into the appropriate Common Crawl files and loads them as a DataFrame, handling the WARC reading process internally.
Arguments
Returns
A DataFrame containing the requested Common Crawl data.
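The resolution step described above (crawl and segment in, file list out) might be sketched as follows. This is not Daft's implementation. It assumes that each crawl publishes gzipped manifests named `warc.paths.gz`, `wet.paths.gz` and `wat.paths.gz`, and that `content` takes the hypothetical values "raw", "text" and "metadata". The function names and the sample manifest entries are made up for illustration.

```python
import gzip
import io
from itertools import islice
from typing import Iterable, Optional

# Assumed mapping from a user-facing content name to the Common Crawl file kind.
CONTENT_KINDS = {"raw": "warc", "text": "wet", "metadata": "wat"}


def manifest_key(crawl: str, content: str = "raw") -> str:
    """Key of the gzipped manifest that lists every file of one kind in a crawl."""
    kind = CONTENT_KINDS.get(content, content)
    if kind not in CONTENT_KINDS.values():
        raise ValueError(f"unknown content type: {content!r}")
    return f"crawl-data/{crawl}/{kind}.paths.gz"


def resolve_paths(
    manifest_lines: Iterable[str],
    segment: Optional[str] = None,
    num_files: Optional[int] = None,
) -> list[str]:
    """Turn manifest lines into s3:// URLs, filtering by segment lazily."""
    keys = (line.strip() for line in manifest_lines if line.strip())
    if segment is not None:
        keys = (k for k in keys if f"/segments/{segment}/" in k)
    urls = (f"s3://commoncrawl/{k}" for k in keys)
    # islice stops pulling manifest lines once num_files URLs have been produced.
    return list(islice(urls, num_files))


# Made-up manifest contents standing in for a downloaded warc.paths.gz.
sample = "\n".join(
    [
        "crawl-data/CC-MAIN-2025-33/segments/0000000000001.1/warc/example-00000.warc.gz",
        "crawl-data/CC-MAIN-2025-33/segments/0000000000002.1/warc/example-00000.warc.gz",
        "crawl-data/CC-MAIN-2025-33/segments/0000000000002.1/warc/example-00001.warc.gz",
    ]
).encode()
manifest = gzip.compress(sample)

with gzip.open(io.BytesIO(manifest), "rt") as lines:
    paths = resolve_paths(lines, segment="0000000000002.1", num_files=1)
print(manifest_key("CC-MAIN-2025-33"), paths)
```

Because the manifest is consumed as a stream, a small num_files means only the first part of the manifest has to be read, which is the idea behind pushing the limit into the manifest retrieval stage.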
Action items
common_crawl() to resolve crawl → dataframe
num_files, segment, content, io_config support
num_files limit into manifest retrieval stage)
common_crawl(...).show() (In particular, limit pushdowns into the WARC reader)
shards support
Open questions:
list_crawls or list_segments?