antoinedandi
Performance: Optimize episode index lookup in _check_cached_episodes_sufficient

Summary

Replaces the slow set comprehension with HuggingFace Dataset's .unique() method, yielding an ~872x speedup when checking cached episodes.

Problem

The original implementation used a set comprehension that iterated through every row of the dataset:

```python
# Walks the full column row by row, converting each torch tensor
# to a Python int before adding it to the set.
available_episodes = {
    ep_idx.item() if isinstance(ep_idx, torch.Tensor) else ep_idx
    for ep_idx in self.hf_dataset["episode_index"]
}
```

This approach was inefficient because:

  • self.hf_dataset["episode_index"] materializes the entire column and then walks it row by row
  • each iteration performs an isinstance check and, for tensors, an .item() conversion (see the sketch below)
  • nothing in the loop is specialized for the actual goal of extracting unique values
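To make the per-row cost concrete, here is a minimal sketch, assuming a torch-formatted dataset as LeRobot uses; the dataset below is synthetic and the column name simply mirrors the PR:

```python
import torch
from datasets import Dataset

# Synthetic stand-in for self.hf_dataset, formatted to return torch tensors.
ds = Dataset.from_dict({"episode_index": [0, 0, 1]}).with_format("torch")

# Iterating the column yields 0-dim tensors, so the original comprehension
# pays an isinstance check plus an .item() conversion for every single row.
rows = [ep for ep in ds["episode_index"]]
print(isinstance(rows[0], torch.Tensor))  # True
print(rows[0].item())                     # 0
```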

Solution

Use HuggingFace Dataset's built-in .unique() method:

```python
available_episodes = set(self.hf_dataset.unique("episode_index"))
```
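For reference, a minimal usage sketch of Dataset.unique() on synthetic data (the column name mirrors the PR; this is not the actual LeRobot dataset):

```python
from datasets import Dataset

ds = Dataset.from_dict({"episode_index": [0, 0, 1, 2, 2, 2]})

# unique() computes distinct values on the Arrow backend and returns
# plain Python scalars, so no per-row tensor conversion is needed.
print(sorted(ds.unique("episode_index")))  # [0, 1, 2]
```

Wrapping the result in set() keeps downstream membership checks unchanged.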
Performance Impact

For large datasets (~80k episodes):

  • Before: 43.62 seconds
  • After: 0.05 seconds
  • Speedup: ~872x faster
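A micro-benchmark sketch that reproduces the comparison on synthetic data (sizes are scaled down from the PR's ~80k-episode dataset, and timings vary by machine, so treat the numbers as illustrative only):

```python
import time

import torch
from datasets import Dataset

# Synthetic dataset: 10k episodes x 100 rows each (scaled-down stand-in).
ds = Dataset.from_dict(
    {"episode_index": [i // 100 for i in range(1_000_000)]}
).with_format("torch")

t0 = time.perf_counter()
slow = {
    ep.item() if isinstance(ep, torch.Tensor) else ep
    for ep in ds["episode_index"]
}
t1 = time.perf_counter()
fast = set(ds.unique("episode_index"))
t2 = time.perf_counter()

assert slow == fast
print(f"set comprehension: {t1 - t0:.2f}s   unique(): {t2 - t1:.2f}s")
```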

Changes

  • Modified _check_cached_episodes_sufficient() to use .unique() instead of the set comprehension
  • Removed the now-unnecessary type checking and tensor conversions
  • Maintained identical functionality and return values

Testing

  • Verified functionally equivalent behavior (an example equivalence check is sketched below)
  • Tested on datasets of varying sizes
  • Confirmed no regression in episode validation logic
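A pytest-style equivalence check along these lines would back the first claim (the test and data are hypothetical sketches, not the PR's actual test suite):

```python
import torch
from datasets import Dataset


def test_unique_matches_comprehension():
    ds = Dataset.from_dict(
        {"episode_index": [3, 1, 3, 2, 1, 0]}
    ).with_format("torch")
    # Old path: row-by-row comprehension with tensor conversion.
    via_comprehension = {
        ep.item() if isinstance(ep, torch.Tensor) else ep
        for ep in ds["episode_index"]
    }
    # New path: Arrow-backed unique().
    via_unique = set(ds.unique("episode_index"))
    assert via_comprehension == via_unique == {0, 1, 2, 3}
```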

Copilot AI review requested due to automatic review settings · October 10, 2025 13:35

@Copilot (Contributor) left a comment:

Pull Request Overview

This PR optimizes the episode cache verification process by replacing a slow set comprehension with HuggingFace Dataset's built-in .unique() method for significantly improved performance (~1000x speedup).

  • Replaces inefficient iteration through entire dataset column with optimized unique value extraction
  • Maintains identical functionality while dramatically reducing execution time
  • Removes unnecessary row-by-row processing for large datasets
