Conversation

@KarelVesely84
Contributor

  • by default the batch size is computed as (num_cuts x longest_dur), however when cuts are concatenated it should be sum(utt_durs) (see the toy sketch below)
  • keeping the default behavior as it was before
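
To make the difference concrete, a toy illustration in plain Python (not lhotse code; the durations are made up):

    # Toy illustration of the two batch-size accountings described above.
    durations = [3.0, 5.0, 12.0, 2.0]  # utterance durations in one batch (seconds)

    # Default accounting: the padded batch costs num_cuts x longest_dur.
    padded_cost = len(durations) * max(durations)   # 4 * 12.0 = 48.0 s

    # With concatenated cuts there is (almost) no padding,
    # so the effective cost is just sum(utt_durs).
    concatenated_cost = sum(durations)              # 22.0 s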

@pzelasko
Collaborator

pzelasko commented Oct 7, 2025

But in your implementation it's currently not sum(utt_durs); it's only measuring the length of the last cut. Please add unit tests to demonstrate the desired behavior of this option before we proceed.

@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 7, 2025

Actually, TimeConstraint.current is accumulating the sum(utt_durs): sampling/base.py#L477
(I only understood that on the 2nd reading of the code: the 'current' refers to the current batch.)

Should I rename current to current_batch or similar?

I used that code, and waiting for a 200 s long utterance would most likely cause a GPU OOM during training.

Ok, a unit test is needed...
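
A rough sketch of the kind of test meant here (not the actual test from the PR); it assumes lhotse's dummy_cut() test helper and the new concatenate_cuts flag added in this PR:

    from lhotse.dataset.sampling.base import TimeConstraint
    from lhotse.testing.dummies import dummy_cut

    def test_time_constraint_accumulates_durations():
        # `concatenate_cuts` is the new option proposed in this PR.
        constraint = TimeConstraint(max_duration=100.0, concatenate_cuts=True)
        durations = [3.0, 5.0, 12.0]
        for i, dur in enumerate(durations):
            constraint.add(dummy_cut(i, duration=dur))
        # `current` tracks the running duration of the batch being assembled.
        assert constraint.current == sum(durations)
        assert not constraint.exceeded()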

@KarelVesely84 force-pushed the sampler_concatenate_cuts branch from b6feafc to 279d07e on October 8, 2025 10:26
@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 8, 2025

Unit test created; I also wrote docstrings in TimeConstraint for the new feature.

Right now, concatenate_cuts=True is supported by SimpleCutSampler,
but there are many other samplers... the question is whether to modify them too...

  • for Bucketing samplers it makes little sense to use ConcatenateCuts
  • for DynamicCutSampler the __init__() accepts constraint, so there is a way to create TimeConstraint externally with concatenate_cuts=True (see the sketch after the sampler list below)

@pzelasko how about the other samplers, should I add it to some of them too?

list of samplers, for reference:

    "BucketingSampler",
    "CutPairsSampler",
    "DynamicCutSampler",
    "DynamicBucketingSampler",
    "RoundRobinSampler",
    "SimpleCutSampler",
    "WeightedSimpleCutSampler",
    "StatelessSampler",
    "ZipSampler",

@KarelVesely84
Contributor Author

KarelVesely84 commented Oct 10, 2025

Hi, is it fine done this way?
(I added a unit test of TimeConstraint, showing that the utterance durations are accumulated for the batch.)

@pzelasko
Collaborator

Hi Karel,
Sorry for such a long delay; I should be able to respond quicker now.
The 80% is quite arbitrary; can we remove it and tune max_duration instead?

- by default the batch size is computed as (num_cuts x longest_dur),
  however when cuts are concatenated it should be sum(utt_durs)
- keeping the default behavior as it was before
@KarelVesely84 force-pushed the sampler_concatenate_cuts branch from 552744b to f23f0f1 on November 25, 2025 13:44
…e_to_exceeding()` method

- for `concatenate_cuts=True` the behavior of the `exceeded()` and
  `close_to_exceeding()` methods becomes the same
- `close_to_exceeding()` seems to be used to decide whether the last batch of an
  epoch (an incomplete batch) should be used for training or discarded.
@KarelVesely84
Contributor Author

KarelVesely84 commented Nov 25, 2025

Hi Piotr,
okay, the 80% threshold in TimeConstraint::close_to_exceeding() is now replaced with max_duration.

Consequently, for concatenate_cuts=True the behavior of exceeded() and close_to_exceeding()
methods becomes the same.

From the code, close_to_exceeding() seems to be used to decide whether the last batch of an epoch (i.e. an incomplete batch) should
be used for training or discarded. And for ASR training it is okay to discard the last batch.
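
For what it's worth, a purely illustrative sketch of the duration part of the check after this change (my reading of the thread, not the actual lhotse code; the default-mode formula is an assumption):

    def close_to_exceeding(current: float, longest_seen: float, num_cuts: int,
                           max_duration: float, concatenate_cuts: bool) -> bool:
        if concatenate_cuts:
            # Cuts are glued together, so the batch cost is the plain sum of
            # durations; with the 80% margin replaced by max_duration this is
            # the same test as exceeded().
            return current >= max_duration
        # Default mode (assumed formula): one more cut as long as the longest
        # seen so far would overflow the padded batch budget.
        return (num_cuts + 1) * longest_seen >= max_duration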

K.
