Support dynamic batch size #34406

@fzyzcjy

Feature request

Hi, thanks for the library! When training, I notice that if a micro-batch contains too few tokens, throughput is quite bad (i.e. the average time per token is large). However, I cannot simply increase the batch size, because the training data mixes long (e.g. 2000-token) and short (e.g. 500-token) sequences: a batch size that makes short sequences run fast makes long sequences OOM.

Therefore, I am proposing a dynamic (micro) batch size. For example, suppose batch_size=16. Before this proposal, we would use e.g. micro_batch_size=2 & grad_accum=8. After this proposal, a micro-batch of short sequences would hold 4 samples, while a micro-batch of long sequences would hold only 2. Once the accumulated micro-batches sum up to 16 samples, we compute the loss and consider the step done. A sketch of one possible implementation follows.
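To make this concrete, here is a minimal sketch of one way it could work (all names here are hypothetical, not existing transformers API): instead of a fixed sample count, cap each micro-batch by a padded-token budget, grouping samples of similar length together.

```python
# Hypothetical sketch: a length-aware batch sampler that packs samples
# into variable-size micro-batches under a fixed padded-token budget.
# An object like this can be passed to torch.utils.data.DataLoader via
# `batch_sampler=`, which accepts any iterable of index lists.

class TokenBudgetBatchSampler:
    """Yield micro-batches whose padded size (number of samples times the
    longest sequence in the micro-batch) stays under `max_tokens`."""

    def __init__(self, lengths, max_tokens):
        # `lengths[i]` is the token count of training sample i.
        self.batches = []
        # Sort by length so similarly sized sequences share a micro-batch;
        # a real implementation would shuffle within length buckets to
        # keep some randomness.
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        batch, longest = [], 0
        for idx in order:
            longest = max(longest, lengths[idx])
            # Padded cost if we added this sample to the current micro-batch.
            if batch and (len(batch) + 1) * longest > max_tokens:
                self.batches.append(batch)
                batch, longest = [], lengths[idx]
            batch.append(idx)
        if batch:
            self.batches.append(batch)

    def __iter__(self):
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)


if __name__ == "__main__":
    # Toy demo with the lengths from the example above: 500-token short
    # sequences and 2000-token long sequences, under a 4000-token budget.
    lengths = [500] * 8 + [2000] * 8
    for micro_batch in TokenBudgetBatchSampler(lengths, max_tokens=4000):
        padded = len(micro_batch) * max(lengths[i] for i in micro_batch)
        print(len(micro_batch), "samples,", padded, "padded tokens")
```

Under this scheme, gradient accumulation would count samples rather than micro-batches, stepping the optimizer once the accumulated micro-batches reach 16 samples.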

Motivation

(see above)

Your contribution

I am happy to submit a PR.
