
Conversation

@H-Huang (Member) commented on Mar 7, 2025:

Allow the input batch to be split on the sequence dimension in pipeline parallelism, removing the requirement that batch_size >= num stages.

Depends on pytorch/pytorch#148458.

The new config to set this is pipeline_parallel_batch_split_dim = 1.
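
For context, here is a minimal sketch (not from this PR) of the difference between splitting microbatches on the batch dimension and on the sequence dimension; the tensor shapes, variable names, and the use of torch.tensor_split are illustrative assumptions, not the PR's implementation:

```python
import torch

# Illustrative shapes only: a tiny batch of token ids.
batch_size, seq_len = 2, 8
inputs = torch.arange(batch_size * seq_len).reshape(batch_size, seq_len)

num_microbatches = 4

# Splitting on dim 0 (batch) would require batch_size >= num_microbatches,
# which fails here (2 < 4). Splitting on dim 1 (sequence) only requires
# seq_len >= num_microbatches.
microbatches = torch.tensor_split(inputs, num_microbatches, dim=1)
for mb in microbatches:
    print(mb.shape)  # each microbatch is (2, 2): full batch, a slice of the sequence
```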

A contributor commented on the new argument definition:

```python
of stages. Stages per rank are inferred from split points degree, and schedule.""",
)
self.parser.add_argument(
    "--experimental.pipeline_parallel_batch_split_dim",
```
It seems PP has been in experimental for a while. Do you think it's time to extract pipeline_parallel into a standalone section and move all its configs there?
This doesn't have to happen in this PR.

f"of stages ({num_total_stages}) which may result in a bubble in the pipeline."
)

# validate that the batch size is divisible by the number of microbatches otherwise we'll hang or error during training
I have several questions here:

  1. If pipeline_parallel_batch_split_dim == 0, what would happen if job_config.training.batch_size % num_total_stages != 0? (A sketch of this kind of check follows below.)
  2. If pipeline_parallel_batch_split_dim is the sequence dim or another dim, don't we need similar checks for the extreme cases, e.g. seq_len < num_stages?
  3. By the way, this divisibility requirement doesn't seem to be exactly the same as the "batch_size >= num stages" requirement mentioned in the PR summary.
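
A hypothetical sketch of the kind of validation these questions refer to; the function name, parameter names, and error message below are assumptions, not the PR's actual check:

```python
# Hypothetical validation sketch; not taken from the PR.
def validate_split(batch_size: int, seq_len: int, split_dim: int, num_microbatches: int) -> None:
    # Pick the size of whichever dimension the microbatches are split on.
    size = batch_size if split_dim == 0 else seq_len
    if size % num_microbatches != 0:
        raise ValueError(
            f"Dimension {split_dim} of size {size} is not divisible by "
            f"num_microbatches={num_microbatches}; training would hang or error."
        )
```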

@H-Huang marked this pull request as draft on March 25, 2025 at 14:47.