@prathamk-tw
Contributor

When interleave_datasets is called with stopping_strategy="all_exhausted_without_replacement" and probabilities=None, it incorrectly fell into the undersampling branch, so it stopped after min(lengths) rounds instead of continuing until every dataset was exhausted.

This fix adds a specific branch to handle the all_exhausted_without_replacement case when probabilities=None. The new logic cycles through all datasets round by round, adding elements from each dataset until all are exhausted, ensuring each element appears exactly once.

Example fix:

  • Input: d1=[0,1,2], d2=[10,11,12,13], d3=[20,21,22]
  • Before: [0, 10, 20, 1, 11, 21, 2, 12, 22]
  • After: [0, 10, 20, 1, 11, 21, 2, 12, 22, 13]
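
The round-by-round behaviour described above can be sketched in plain Python. This is only an illustration of the intended ordering, not the code in this PR; the helper name `interleave_without_replacement` is hypothetical and it works on plain lists rather than `Dataset` objects.

```python
def interleave_without_replacement(*datasets):
    """Round-robin over the inputs until every one is exhausted.

    Hypothetical helper for illustration only: each element of each
    input appears exactly once in the result.
    """
    result = []
    # Number of rounds is set by the longest input, not the shortest.
    longest = max((len(d) for d in datasets), default=0)
    for round_idx in range(longest):
        for d in datasets:
            # Inputs that are already exhausted are skipped in later rounds.
            if round_idx < len(d):
                result.append(d[round_idx])
    return result


d1 = [0, 1, 2]
d2 = [10, 11, 12, 13]
d3 = [20, 21, 22]
print(interleave_without_replacement(d1, d2, d3))
# [0, 10, 20, 1, 11, 21, 2, 12, 22, 13]
```

Replacing `max` with `min` in the round count reproduces the "Before" output, which is the undersampling behaviour the bug fell back to.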

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@lhoestq lhoestq left a comment

Good catch, thanks! I took the liberty of switching to numpy for consistency with the rest of this function, hoping to make it fast for very large datasets.
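
For readers curious what a numpy-based version of this ordering could look like, here is a rough sketch. It is not the merged implementation; the function name `interleave_indices` is hypothetical, and it assumes at least one dataset and that the result is used as indices into the concatenation of all datasets.

```python
import numpy as np


def interleave_indices(lengths):
    """Indices into the concatenated datasets, in round-robin order.

    Hypothetical helper for illustration; assumes `lengths` is non-empty.
    """
    lengths = np.asarray(lengths)
    # Start position of each dataset inside the concatenation.
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    # One row per round, one column per dataset.
    rounds = np.arange(lengths.max())[:, None]
    valid = rounds < lengths[None, :]   # False once a dataset is exhausted
    grid = rounds + offsets[None, :]    # index in the concatenated data
    # Boolean masking reads row by row, i.e. round by round.
    return grid[valid]


print(interleave_indices([3, 4, 3]).tolist())
# [0, 3, 7, 1, 4, 8, 2, 5, 9, 6]
```

With d1, d2, d3 from the example concatenated as [0, 1, 2, 10, 11, 12, 13, 20, 21, 22], these indices select [0, 10, 20, 1, 11, 21, 2, 12, 22, 13]. The sketch builds a max(lengths) x n grid, so it trades memory for avoiding a Python-level loop.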

@lhoestq lhoestq merged commit 1f43ef8 into huggingface:main Jan 23, 2026
8 of 14 checks passed
@prathamk-tw prathamk-tw deleted the fix-interleave-all-exhausted-without-replacement branch January 24, 2026 02:55