Optimized episode cache verification #2166

antoinedandi · 2025-10-10T13:35:45Z

Performance: Optimize episode index lookup in `_check_cached_episodes_sufficient`

Summary

Replaces a slow set comprehension with HuggingFace Dataset's `.unique()` method when checking cached episodes, giving a ~870x speedup in the benchmark below (43.62 s down to 0.05 s on ~80k episodes).

Problem

The original implementation used a set comprehension that iterated through every row of the dataset:

```python
available_episodes = {
    ep_idx.item() if isinstance(ep_idx, torch.Tensor) else ep_idx
    for ep_idx in self.hf_dataset["episode_index"]
}
```

This approach was inefficient because:

- `self.hf_dataset["episode_index"]` loads the entire column into memory row-by-row
- Each iteration involves type checking and potential tensor conversions
- No optimization for finding unique values

Solution

Use HuggingFace Dataset's built-in .unique() method:
```python
available_episodes = set(self.hf_dataset.unique("episode_index"))
```

Performance Impact

For large datasets (~80k episodes):

- Before: 43.62 seconds
- After: 0.05 seconds
- Speedup: ~872x
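
For reference, the speedup figure follows directly from the two timings. The helper names below (`time_call`, `speedup`) are hypothetical and only sketch how such a measurement might be taken; they are not part of this PR.

```python
import time


def time_call(fn, *args):
    """Run fn(*args) once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def speedup(before_s, after_s):
    """Ratio of the old wall-clock time to the new one."""
    return before_s / after_s


# Timings reported in this PR description.
print(f"~{speedup(43.62, 0.05):.0f}x")  # prints "~872x"
```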

Changes
- Modified `_check_cached_episodes_sufficient()` method to use `.unique()` instead of set comprehension
- Removed unnecessary type checking and tensor conversions
- Maintained identical functionality and return values

Testing
- Verified functionally equivalent behavior
- Tested on datasets with varying sizes
- Confirmed no regression in episode validation logic
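
An equivalence check along these lines could look like the sketch below. It is not the test used for this PR: `StubTensor` and `StubDataset` are hypothetical stand-ins that model only `.item()`, column access, and `.unique()`, so the comparison runs without `torch` or `datasets` installed.

```python
class StubTensor:
    """Hypothetical stand-in for a 0-d tensor; only .item() is modelled."""

    def __init__(self, value):
        self._value = value

    def item(self):
        return self._value


class StubDataset:
    """Hypothetical stand-in exposing only the two calls being compared."""

    def __init__(self, episode_index):
        self._column = list(episode_index)

    def __getitem__(self, column):
        if column != "episode_index":
            raise KeyError(column)
        # Mimic a tensor-formatted column: every element is wrapped.
        return [StubTensor(v) for v in self._column]

    def unique(self, column):
        if column != "episode_index":
            raise KeyError(column)
        # Distinct values in first-seen order, as plain Python ints.
        return list(dict.fromkeys(self._column))


def episodes_old(ds):
    # Shape of the original set comprehension, with StubTensor in place of torch.Tensor.
    return {
        ep.item() if isinstance(ep, StubTensor) else ep
        for ep in ds["episode_index"]
    }


def episodes_new(ds):
    # Shape of the replacement.
    return set(ds.unique("episode_index"))


# Both approaches should agree on empty, single-episode, and gapped columns.
for column in ([], [0], [0, 0, 1, 1, 2], [5, 5, 7]):
    ds = StubDataset(column)
    assert episodes_old(ds) == episodes_new(ds)
```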

Signed-off-by: Antoine <[email protected]>

Copilot

Pull Request Overview

This PR optimizes the episode cache verification process by replacing a slow set comprehension with HuggingFace Dataset's built-in `.unique()` method for significantly improved performance (~870x speedup in the reported benchmark).

- Replaces inefficient iteration through entire dataset column with optimized unique value extraction
- Maintains identical functionality while dramatically reducing execution time
- Removes unnecessary row-by-row processing for large datasets


src/lerobot/datasets/lerobot_dataset.py

Optimized episode cache verification

0d398a4

Signed-off-by: Antoine <[email protected]>

Copilot AI review requested due to automatic review settings October 10, 2025 13:35

Copilot AI reviewed Oct 10, 2025


Merge branch 'main' into patch-1

1cdb950
