@bghira (Owner) commented Sep 1, 2025

No description provided.

@bghira requested a review from Copilot on September 1, 2025, 19:43
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds a new TarDataLoader with comprehensive documentation and tests for streaming TAR file entries from webshart datasets.

Key changes:

  • New TarDataLoader and BatchDataLoader classes for streaming dataset entries
  • Streaming infrastructure for processing TAR files efficiently
  • Complete test suite for dataset streaming functionality

Reviewed Changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/dataloader.rs | Implements TarDataLoader and BatchDataLoader for streaming TAR entries with buffering |
| src/streaming.rs | Core streaming infrastructure for local and remote TAR files |
| tests/test_dataset_streaming.py | Comprehensive test suite covering streaming functionality and error handling |
| src/lib.rs | Adds new dataloader exports and dependencies |
| src/metadata.rs | Adds files() method to ShardMetadata for accessing file information |
| python/webshart/__init__.py | Exports TarDataLoader for Python users |
| README.md | Documents TarDataLoader usage example |
| Cargo.toml | Adds rayon and pythonize dependencies |
| src/extract.rs | Removes unused variable |
| src/discovery.rs | Adds get_hf_token() method |

Comment on lines +352 to +354
impl TarStreamer for RemoteTarStreamer {
    fn stream_entries(&self) -> Result<Box<dyn Iterator<Item = Result<TarFileEntry>>>> {
        self.stream_entries()
Copilot AI Sep 1, 2025

Infinite recursion: the trait implementation calls itself instead of the actual implementation method. This should call RemoteTarStreamer::stream_entries(&self) from line 254.

Suggested change
  impl TarStreamer for RemoteTarStreamer {
      fn stream_entries(&self) -> Result<Box<dyn Iterator<Item = Result<TarFileEntry>>>> {
-         self.stream_entries()
+         RemoteTarStreamer::stream_entries(self)
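
For context, here is a minimal sketch of how Rust resolves a call like this. The names `Streamer`, `Remote`, and `via_trait` are hypothetical stand-ins, not the types from src/streaming.rs. It assumes the struct has an inherent method with the same name as the trait method, which the comment's reference to line 254 suggests but the excerpt does not confirm. In that case both the method-call form and the explicit path form pick the inherent method, so the recursion only arises when no inherent method exists. The explicit path is still worth using because it states the intent without relying on lookup order.

```rust
// Hypothetical stand-ins for TarStreamer / RemoteTarStreamer.
trait Streamer {
    fn stream_entries(&self) -> &'static str;
}

struct Remote;

impl Remote {
    // Inherent method that shares its name with the trait method.
    fn stream_entries(&self) -> &'static str {
        "inherent"
    }
}

impl Streamer for Remote {
    fn stream_entries(&self) -> &'static str {
        // Explicit path form from the suggested change: inherent
        // associated functions take priority over trait ones, so this
        // delegates to the inherent method rather than to itself.
        Remote::stream_entries(self)
    }
}

// Calls through the trait object, so the trait impl is the entry point.
fn via_trait(s: &dyn Streamer) -> &'static str {
    s.stream_entries()
}

fn main() {
    let r = Remote;
    println!("{}", via_trait(&r));
    println!("{}", <Remote as Streamer>::stream_entries(&r));
}
```

If the inherent method were missing, `Remote::stream_entries(self)` would resolve to the trait method and recurse, so the delegation depends on the inherent method staying in place.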

entry_errors += 1;

// If we're getting too many errors at the beginning, skip the shard
if entry_idx < 10 && entry_errors > 5 {
Copilot AI Sep 1, 2025

Magic numbers 10 and 5 should be defined as constants. Consider const EARLY_ERROR_THRESHOLD: usize = 10 and const MAX_EARLY_ERRORS: usize = 5.
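
One way the suggestion could look, as a sketch only. The constant names come from the comment above; the helper `should_skip_shard` and the `usize` types for `entry_idx` and `entry_errors` are assumptions, since the surrounding loop is not shown in the excerpt.

```rust
/// Entries with an index below this count as "early" in the shard.
const EARLY_ERROR_THRESHOLD: usize = 10;
/// More than this many errors among the early entries skips the shard.
const MAX_EARLY_ERRORS: usize = 5;

/// Hypothetical helper mirroring the condition from the excerpt:
/// `entry_idx < 10 && entry_errors > 5`.
fn should_skip_shard(entry_idx: usize, entry_errors: usize) -> bool {
    entry_idx < EARLY_ERROR_THRESHOLD && entry_errors > MAX_EARLY_ERRORS
}

fn main() {
    // Six errors by the eighth entry (index 7) crosses the limit.
    println!("{}", should_skip_shard(7, 6));
}
```

Pulling the check into a named function also makes the boundary cases easy to unit test without constructing a real shard.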

@bghira merged commit 16b7ec1 into main on Sep 1, 2025
0 of 10 checks passed
@bghira deleted the feature/streaming-iterator branch on September 1, 2025, 20:00