
nicholas-maselli

There is a bug in src\lerobot\datasets\aggregate.py: the latest_duration key is updated incorrectly, which causes training on newly aggregated datasets to fail.

This occurs when aggregating 3+ datasets, and it also occurred when the video data exceeded the DEFAULT_VIDEO_FILE_SIZE_IN_MB size.

This PR fixes it by removing the unused episode duration and by setting latest_duration equal to timestamps_shift_s, instead of adding the ever-increasing timestamps_shift_s to it.

We also reset latest_duration to 0 when rotating to a new file / chunk, because the episode metadata parquet file requires the timestamps in a new file to begin at 0.
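
To illustrate why the old update only breaks from the third dataset onward, here is a minimal toy model. It is not the code from aggregate.py: the function `place_datasets`, the `file_limit_s` parameter (a duration-based stand-in for the size limit), and the decision to reset the running total on rotation are all assumptions made for this sketch. Only the names latest_duration and timestamps_shift_s come from the description above.

```python
def place_datasets(durations_s, file_limit_s, buggy=False):
    """Toy model: return (file_index, start_offset_s) for each source dataset.

    Hypothetical helper, not part of lerobot. `file_limit_s` stands in for
    the real size-based limit on a video file.
    """
    placements = []
    file_index = 0
    latest_duration = 0.0     # offset where the next dataset starts in the current file
    timestamps_shift_s = 0.0  # running total of video written to the current file

    for duration in durations_s:
        # Rotate when the current file cannot hold the next dataset.
        if timestamps_shift_s > 0 and timestamps_shift_s + duration > file_limit_s:
            file_index += 1
            timestamps_shift_s = 0.0  # assumption of this sketch
            if not buggy:
                latest_duration = 0.0  # a new file must start at timestamp 0

        placements.append((file_index, latest_duration))
        timestamps_shift_s += duration

        if buggy:
            # Adds a value that is already cumulative, so it double counts.
            latest_duration += timestamps_shift_s
        else:
            # The running total already is the correct offset.
            latest_duration = timestamps_shift_s

    return placements


if __name__ == "__main__":
    print(place_datasets([10.0, 20.0, 30.0], file_limit_s=1000.0))
    print(place_datasets([10.0, 20.0, 30.0], file_limit_s=1000.0, buggy=True))
```

In this model, two datasets get the same offsets either way, which is consistent with the failure only showing up at 3+ datasets. The third dataset lands at 40 s instead of 30 s with the additive update, and after a rotation the additive version carries a non-zero offset into the new file.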

The existing tests pass:

pytest tests/datasets/test_aggregate.py

A robust way to confirm the fix is to aggregate 3 datasets and check that the standard training script trains properly on the result.

A second test is to aggregate datasets whose video data pushes past DEFAULT_VIDEO_FILE_SIZE_IN_MB.
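
Before running a full training job, a quicker sanity check on the aggregated metadata might look like the sketch below. It works on plain dicts, and the field names (`chunk_index`, `file_index`, `episode_index`, `from_timestamp`, `to_timestamp`) are placeholders chosen for this example rather than the confirmed column names of the episode metadata parquet file. It also assumes that the episodes in one video file are stored back to back.

```python
from collections import defaultdict


def check_file_timestamps(episodes, tol=1e-6):
    """Return a list of problems found in per-file episode timestamps.

    Hypothetical helper: each episode is a dict with placeholder keys
    chunk_index, file_index, episode_index, from_timestamp, to_timestamp.
    """
    # Group episodes by the video file they were written to.
    by_file = defaultdict(list)
    for ep in episodes:
        by_file[(ep["chunk_index"], ep["file_index"])].append(ep)

    problems = []
    for key, eps in sorted(by_file.items()):
        eps.sort(key=lambda e: e["from_timestamp"])
        expected = 0.0  # every file must begin at timestamp 0
        for ep in eps:
            if abs(ep["from_timestamp"] - expected) > tol:
                problems.append(
                    f"file {key}: episode {ep['episode_index']} starts at "
                    f"{ep['from_timestamp']}, expected {expected}"
                )
            # The next episode should start where this one ends.
            expected = ep["to_timestamp"]
    return problems


if __name__ == "__main__":
    episodes = [
        {"chunk_index": 0, "file_index": 0, "episode_index": 0,
         "from_timestamp": 0.0, "to_timestamp": 10.0},
        {"chunk_index": 0, "file_index": 0, "episode_index": 1,
         "from_timestamp": 10.0, "to_timestamp": 30.0},
        {"chunk_index": 0, "file_index": 1, "episode_index": 2,
         "from_timestamp": 40.0, "to_timestamp": 70.0},
    ]
    for problem in check_file_timestamps(episodes):
        print(problem)
```

An empty result does not prove that training will succeed, but a non-empty one points directly at the file and episode whose offset is wrong.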

@pkooij added the dataset label (Issues regarding data inputs, processing, or datasets) on Oct 7, 2025
@michel-aractingi
Collaborator

Hey @nicholas-maselli you're right there is a bug in aggregate and we have a fix similar to what you do in this PR #2100

…y starts at 0 but the frames after go back to starting at large numbers (rather than properly offset by the total episode duration)
@nicholas-maselli
Author

nicholas-maselli commented Oct 8, 2025

> Hey @nicholas-maselli you're right there is a bug in aggregate and we have a fix similar to what you do in this PR #2100

Oh excellent! I actually just pushed an additional fix here that fixes another rotating episode bug.

Do you need help with any dataset tools? I would love to help if there are any timelines for the release. I have several extremely large datasets I can test all your tools on if you would like =)

@michel-aractingi
Collaborator

That would be great, @nicholas-maselli! We're planning to release it tomorrow, but it would still be great if you tested it and reported or pushed any fixes you find.
