-
Notifications
You must be signed in to change notification settings - Fork 0
add dataloader with docs #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds a new TarDataLoader with comprehensive documentation and tests for streaming TAR file entries from webshart datasets.
Key changes:
- New
TarDataLoader
andBatchDataLoader
classes for streaming dataset entries - Streaming infrastructure for processing TAR files efficiently
- Complete test suite for dataset streaming functionality
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
src/dataloader.rs | Implements TarDataLoader and BatchDataLoader for streaming TAR entries with buffering |
src/streaming.rs | Core streaming infrastructure for local and remote TAR files |
tests/test_dataset_streaming.py | Comprehensive test suite covering streaming functionality and error handling |
src/lib.rs | Adds new dataloader exports and dependencies |
src/metadata.rs | Adds files() method to ShardMetadata for accessing file information |
python/webshart/init.py | Exports TarDataLoader for Python users |
README.md | Documents TarDataLoader usage example |
Cargo.toml | Adds rayon and pythonize dependencies |
src/extract.rs | Removes unused variable |
src/discovery.rs | Adds get_hf_token() method |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
impl TarStreamer for RemoteTarStreamer { | ||
fn stream_entries(&self) -> Result<Box<dyn Iterator<Item = Result<TarFileEntry>>>> { | ||
self.stream_entries() |
Copilot
AI
Sep 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Infinite recursion: the trait implementation calls itself instead of the actual implementation method. This should call RemoteTarStreamer::stream_entries(&self)
from line 254.
impl TarStreamer for RemoteTarStreamer { | |
fn stream_entries(&self) -> Result<Box<dyn Iterator<Item = Result<TarFileEntry>>>> { | |
self.stream_entries() | |
RemoteTarStreamer::stream_entries(self) |
Copilot uses AI. Check for mistakes.
entry_errors += 1; | ||
|
||
// If we're getting too many errors at the beginning, skip the shard | ||
if entry_idx < 10 && entry_errors > 5 { |
Copilot
AI
Sep 1, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Magic numbers 10 and 5 should be defined as constants. Consider const EARLY_ERROR_THRESHOLD: usize = 10
and const MAX_EARLY_ERRORS: usize = 5
.
Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <[email protected]>
No description provided.