Conversation


@gitttt-1234 gitttt-1234 commented Jan 21, 2026

Summary

This PR implements parallel image caching to significantly speed up dataset preparation for large datasets. Using a ThreadPoolExecutor with thread-local video copies achieves a 3-4x speedup for datasets with 50+ frames.

  • Add ParallelCacheFiller class for parallel I/O operations using thread-local video copies
  • Add parallel_caching and cache_workers configuration options to DataConfig
  • Update BaseDataset and all dataset subclasses to support parallel caching
  • Add comprehensive tests for parallel caching with both HDF5Video and MediaVideo backends

Benchmark Results

| Dataset Size | Sequential | 4 Workers | Speedup |
|--------------|------------|-----------|---------|
| 50 frames    | 0.71s      | 0.19s     | 3.8x    |
| 100 frames   | 1.51s      | 0.36s     | 4.2x    |
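Speedups of this shape can be reproduced with a small timing harness along these lines. This is an illustrative sketch, not the benchmark script from the PR: it simulates per-frame I/O with `time.sleep()` rather than reading real video frames, and all function names here are hypothetical.

```python
# Hypothetical harness comparing sequential vs. threaded "frame reads".
# Real I/O is simulated with sleep(); actual numbers depend on the backend.
import time
from concurrent.futures import ThreadPoolExecutor


def fake_read(idx):
    time.sleep(0.002)  # stand-in for one frame decode/read (~2 ms)
    return idx


def sequential(n_frames):
    t0 = time.perf_counter()
    for i in range(n_frames):
        fake_read(i)
    return time.perf_counter() - t0


def parallel(n_frames, workers=4):
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(fake_read, range(n_frames)))
    return time.perf_counter() - t0
```

Because the simulated work is pure blocking I/O, threads overlap the waits and the parallel version finishes in roughly `n_frames / workers` sleep intervals.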

Key Implementation Details

  • Uses thread-local video copies via deepcopy() for thread safety (pattern from providers.py)
  • Automatically falls back to sequential caching for small datasets (<20 frames) where overhead exceeds benefit
  • Lock-protected dictionary updates for memory caching
  • Progress callback support for UI integration
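The bullets above can be sketched as a minimal class. This is a hedged illustration of the described pattern (thread-local `deepcopy()` of the video, a lock around cache-dict updates, a sequential fallback below 20 frames, and an optional progress callback); the class and method names are hypothetical and not the PR's actual `ParallelCacheFiller` API.

```python
# Illustrative sketch of the thread-local-copy caching pattern described
# above. Names are hypothetical; the real implementation lives in the PR.
import threading
from concurrent.futures import ThreadPoolExecutor
from copy import deepcopy


class ParallelCacheFillerSketch:
    def __init__(self, video, frame_indices, num_workers=4):
        self.video = video
        self.frame_indices = frame_indices
        self.num_workers = num_workers
        self.cache = {}
        self._lock = threading.Lock()    # protects self.cache updates
        self._local = threading.local()  # holds one video copy per thread

    def _get_thread_video(self):
        # Each worker thread gets its own deepcopy so backend file
        # handles are never shared across threads.
        if not hasattr(self._local, "video"):
            self._local.video = deepcopy(self.video)
        return self._local.video

    def _read_frame(self, idx):
        frame = self._get_thread_video()[idx]
        with self._lock:  # lock-protected dictionary update
            self.cache[idx] = frame
        return idx

    def fill(self, progress_callback=None):
        # Small datasets: thread startup overhead exceeds the I/O win,
        # so fall back to plain sequential reads.
        if len(self.frame_indices) < 20:
            for i, idx in enumerate(self.frame_indices):
                self._read_frame(idx)
                if progress_callback:
                    progress_callback(i + 1)
            return self.cache
        with ThreadPoolExecutor(max_workers=self.num_workers) as pool:
            done = pool.map(self._read_frame, self.frame_indices)
            for i, _ in enumerate(done):
                if progress_callback:
                    progress_callback(i + 1)
        return self.cache
```

The lock is strictly necessary only if cache writes are compound operations; plain `dict` assignment is atomic in CPython, but holding the lock keeps the pattern safe regardless of interpreter.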

Configuration Options

```python
DataConfig(
    parallel_caching=True,  # Enable parallel caching (default: True)
    cache_workers=4,        # Number of workers (default: 0 = auto)
)
```
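One way `cache_workers=0` ("auto") could resolve to a concrete worker count is sketched below. This is an assumption for illustration only; the heuristic name (`resolve_cache_workers`) and the cap are hypothetical and may differ from the PR's actual logic.

```python
# Hypothetical resolution of cache_workers=0 to a concrete count.
import os


def resolve_cache_workers(cache_workers: int = 0, cap: int = 8) -> int:
    """Return an explicit worker count, treating 0 as 'auto'."""
    if cache_workers > 0:
        return cache_workers
    # Auto mode: one thread per CPU, capped so the I/O thread pool
    # does not oversubscribe the machine.
    return min(os.cpu_count() or 1, cap)
```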

Test plan

  • All existing tests pass
  • New tests cover parallel caching with HDF5Video backend
  • New tests cover parallel caching with MediaVideo backend
  • New tests verify configuration options work correctly
  • New tests verify thread safety with thread-local video copies
  • Linting passes (black + ruff)
  • Manually verified on Windows / Linux / macOS

Closes #323

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.5 <[email protected]>

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 80.00000% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.51%. Comparing base (ff91433) to head (6878a1a).
⚠️ Report is 125 commits behind head on main.

| Files with missing lines         | Patch % | Lines         |
|----------------------------------|---------|---------------|
| sleap_nn/data/custom_datasets.py | 79.62%  | 22 Missing ⚠️ |

❌ Your patch check has failed because the patch coverage (80.00%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

❗ There is a different number of reports uploaded between BASE (ff91433) and HEAD (6878a1a). Click for more details.

HEAD has one fewer upload than BASE:

| Flag | BASE (ff91433) | HEAD (6878a1a) |
|------|----------------|----------------|
|      | 4              | 3              |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #432       +/-   ##
===========================================
- Coverage   95.28%   84.51%   -10.77%     
===========================================
  Files          49       74       +25     
  Lines        6765    10875     +4110     
===========================================
+ Hits         6446     9191     +2745     
- Misses        319     1684     +1365     

☔ View full report in Codecov by Sentry.


Development

Successfully merging this pull request may close these issues.

Parallelize data caching
