Conversation

@KarelVesely84
Contributor

  • by default the batch size is computed as (num_cuts x longest_dur), however when cuts are concatenated it should be sum(utt_durs) (see the toy sketch below)
  • keeping the default behavior as it was before
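
To make the difference concrete, a toy illustration in plain Python (not lhotse code; the durations are made up):

    # Toy illustration of the two batch-size accountings described above.
    durations = [3.0, 5.0, 12.0, 2.0]  # utterance durations in one batch (seconds)

    # Default accounting: the padded batch costs num_cuts x longest_dur.
    padded_cost = len(durations) * max(durations)   # 4 * 12.0 = 48.0 s

    # With concatenated cuts there is (almost) no padding,
    # so the effective cost is just sum(utt_durs).
    concatenated_cost = sum(durations)              # 22.0 s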

@pzelasko
Collaborator

pzelasko commented Oct 7, 2025

But in your implementation it's currently not sum(utt_durs); it's only measuring the length of the last cut. Please add unit tests to demonstrate the desired behavior of this option before we proceed.

@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 7, 2025

Actually, TimeConstraint.current is accumulating the sum(utt_durs): sampling/base.py#L477
(I only understood that on the 2nd reading of the code: the 'current' refers to the current batch.)

Should I rename current to current_batch or similar?

I used that code, and waiting for a 200 s long utterance would most likely cause a GPU OOM during training.

Ok, a unit test is needed...
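
A rough sketch of the kind of test meant here (not the actual test from the PR); it assumes lhotse's dummy_cut() test helper and the new concatenate_cuts flag added in this PR:

    from lhotse.dataset.sampling.base import TimeConstraint
    from lhotse.testing.dummies import dummy_cut

    def test_time_constraint_accumulates_durations():
        # `concatenate_cuts` is the new option proposed in this PR.
        constraint = TimeConstraint(max_duration=100.0, concatenate_cuts=True)
        durations = [3.0, 5.0, 12.0]
        for i, dur in enumerate(durations):
            constraint.add(dummy_cut(i, duration=dur))
        # `current` tracks the running duration of the batch being assembled.
        assert constraint.current == sum(durations)
        assert not constraint.exceeded()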

@KarelVesely84 force-pushed the sampler_concatenate_cuts branch from b6feafc to 279d07e on October 8, 2025 10:26
@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 8, 2025

Unit test created; I also wrote docstrings in TimeConstraint for the new feature.

Right now, concatenate_cuts=True is supported by SimpleCutSampler,
but there are many other samplers... the question is whether to modify them too...

  • for Bucketing samplers it makes little sense to use ConcatenateCuts
  • for DynamicCutSampler the __init__() accepts constraint, so there is a way to create TimeConstraint externally with concatenate_cuts=True (see the sketch after the sampler list below)

@pzelasko how about the other samplers, should I add it to some of them too?

list of samplers, for reference:

    "BucketingSampler",
    "CutPairsSampler",
    "DynamicCutSampler",
    "DynamicBucketingSampler",
    "RoundRobinSampler",
    "SimpleCutSampler",
    "WeightedSimpleCutSampler",
    "StatelessSampler",
    "ZipSampler",

@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 10, 2025

Hi, is it fine done this way?
(I added a unit test of TimeConstraint, showing that the utterance durations are accumulated for the batch.)

@pzelasko
Collaborator

Hi Karel,
Sorry for such a long delay; I should be able to respond quicker now.
The 80% is quite arbitrary; can we remove it and tune max_duration instead?

- by default the batch size is computed as (num_cuts x longest_dur),
  however when cuts are concatenated it should be sum(utt_durs)
- keeping the default behavior as it was before
@KarelVesely84 force-pushed the sampler_concatenate_cuts branch from 552744b to f23f0f1 on November 25, 2025 13:44
…e_to_exceeding()` method

- for `concatenate_cuts=True` the behavior of the `exceeded()` and
  `close_to_exceeding()` methods becomes the same
- `close_to_exceeding()` seems to be used to decide whether the last batch of an
  epoch (an incomplete batch) should be used for training or discarded.
@KarelVesely84
Contributor Author

KarelVesely84 commented Nov 25, 2025

Hi Piotr,
okay, the 80% threshold in TimeConstraint::close_to_exceeding() is now replaced with max_duration.

Consequently, for concatenate_cuts=True the behavior of exceeded() and close_to_exceeding()
methods becomes the same.

From the code, close_to_exceeding() seems to be used to decide whether the last batch of an epoch (i.e. an incomplete batch) should
be used for training or discarded. And for ASR training it is okay to discard the last batch.
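
For what it's worth, a purely illustrative sketch of the duration part of the check after this change (my reading of the thread, not the actual lhotse code; the default-mode formula is an assumption):

    def close_to_exceeding(current: float, longest_seen: float, num_cuts: int,
                           max_duration: float, concatenate_cuts: bool) -> bool:
        if concatenate_cuts:
            # Cuts are glued together, so the batch cost is the plain sum of
            # durations; with the 80% margin replaced by max_duration this is
            # the same test as exceeded().
            return current >= max_duration
        # Default mode (assumed formula): one more cut as long as the longest
        # seen so far would overflow the padded batch budget.
        return (num_cuts + 1) * longest_seen >= max_duration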

K.
