Skip to content

fix: Add error handling for empty Parquet files while indexing and corresponding tests #601

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 21, 2025

Conversation

bhimrazy
Copy link
Collaborator

@bhimrazy bhimrazy commented May 21, 2025

What does this PR do ?

This PR adds error handling to the ParquetDir iterator to raise a clear exception when no Parquet files are found in the specified directory. Additionally, it includes tests to verify this behavior.

This change prevents unnecessary execution of the indexing process and avoids generating an empty index.json file when no data is available, and provides a clear error message to help the user identify and resolve the issue.

Resolves #600

@bhimrazy bhimrazy self-assigned this May 21, 2025
@bhimrazy bhimrazy changed the title fix: Add error handling for empty Parquet files and corresponding tests fix: Add error handling for empty Parquet files while indexing and corresponding tests May 21, 2025
@bhimrazy bhimrazy requested a review from Copilot May 21, 2025 08:40
@bhimrazy bhimrazy added enhancement New feature or request and removed bugfix labels May 21, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds explicit error handling for empty Parquet directories and corresponding tests to ensure a clear failure when no files are present.

  • Raise a RuntimeError in ParquetDir.__iter__ when self.files is empty.
  • Add test_no_parquet_files to verify that indexing fails with the new error for remote stores.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/litdata/utilities/parquet.py Check self.files and raise RuntimeError with guidance when no .parquet files exist
tests/streaming/test_parquet.py Add parameterized test_no_parquet_files to assert the new error in various backends
Comments suppressed due to low confidence (1)

tests/streaming/test_parquet.py:208

  • Consider adding a test case for a local directory with no .parquet files (e.g., using tmp_path / an empty temp folder) to ensure the error is raised for filesystem paths as well as remote URLs.
@pytest.mark.parametrize(

Copy link

codecov bot commented May 21, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (2402f45) to head (b38962c).
Report is 1 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #601   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        41     41           
  Lines      6136   6138    +2     
===================================
+ Hits       4836   4838    +2     
  Misses     1300   1300           
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Collaborator

@tchaton tchaton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice !

@tchaton tchaton merged commit 7194311 into Lightning-AI:main May 21, 2025
32 checks passed
@bhimrazy bhimrazy deleted the fix/empty-parquet-dirs branch May 21, 2025 09:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Streaming HF Dataset - No chunks found issue
2 participants