
Add Batching to Megatron-LM server #90

Open
KlaudiaTH opened this issue Aug 4, 2023 · 0 comments

@KlaudiaTH (Collaborator) commented:

Each model has a --micro-batch-size that is set in the server startup script for each checkpoint.

Currently, the size of the batches that the client sends to the server must match the micro-batch size, because the server forwards them to the model unchanged without checking whether they contain the number of samples the model expects.

If the server handled this correctly, the model would always receive exactly the number of samples it needs.

With the ALiBi model, it crashed when the number was not right; with the rotary model, even if it does not crash, I am not sure whether it computes the results correctly.

Here are some pointers on how the batching could be implemented:

The relevant variable is prompts:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L45

It is either a list of strings or a list of lists of token IDs.
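
For illustration, a minimal sketch of the two shapes prompts can take (the concrete strings and token IDs here are arbitrary example values, not taken from the code):

```python
# prompts as a list of strings ...
prompts = ["The capital of France is", "Berlin is located in"]

# ... or as a list of lists of token IDs (IDs here are made-up examples)
prompts = [[464, 3139, 286, 4881], [28478, 318, 5140, 287]]
```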

It is passed unchanged towards the model in the following try block (in both the if and the else branch):
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L200

The list prompts is tokenized if it is a list of strings, and then packed into a tensor:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation/tokenization.py#L88

The batch dimension of that tensor currently always equals the length of the list prompts.

One would now have to ensure, at a suitable point, that the list prompts is split into one or more lists of size args.micro_batch_size, and that these lists are then passed to the model one at a time.

If prompts is longer than args.micro_batch_size, it must be split.
If it is shorter (or its length is not a multiple of args.micro_batch_size), then artificial batch elements (strings or token-ID lists) must be added as padding.

After the model calls, the results must be recombined and the outputs for any added padding elements must be removed.

The try block is probably the best place to implement batching this way, i.e. with a for loop around the if/else.
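
A minimal sketch of what such a loop could look like. The helper names (run_in_micro_batches, generate_fn) are illustrative and not part of the Megatron-LM code; generate_fn stands in for the existing if/else body inside the try, and the choice of padding element is an assumption that may need more care in practice:

```python
def run_in_micro_batches(prompts, micro_batch_size, generate_fn):
    """Split prompts into chunks of exactly micro_batch_size, pad the last
    chunk, run each chunk through generate_fn, and return results for the
    real prompts only."""
    # Dummy element of the same kind as the real prompts (string vs. token IDs).
    # Assumption: what a safe padding element looks like (e.g. a minimal valid
    # token sequence) may need adjusting for the actual tokenization path.
    pad_element = "" if isinstance(prompts[0], str) else [0]

    results = []
    for start in range(0, len(prompts), micro_batch_size):
        chunk = list(prompts[start:start + micro_batch_size])
        n_real = len(chunk)
        # Pad so the model always sees exactly micro_batch_size samples.
        chunk += [pad_element] * (micro_batch_size - n_real)
        chunk_results = generate_fn(chunk)
        # Drop the outputs that belong to the padding elements.
        results.extend(chunk_results[:n_real])
    return results
```

Inside the try, the existing if/else body would play the role of generate_fn, and the recombined results would then be returned to the client as before.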

Splitting it up might work something like this:
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/utils.py#L68

However, that helper does not insert the padding elements.
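
A padded variant of such a chunking helper could look roughly like this (again only a sketch; the generator also yields how many elements of each chunk are real, so the caller can strip the padded results afterwards):

```python
def chunks_with_padding(items, size, pad_element):
    """Yield (chunk, n_real) pairs, where each chunk has exactly `size`
    elements and n_real says how many of them came from `items`."""
    for start in range(0, len(items), size):
        chunk = list(items[start:start + size])
        n_real = len(chunk)
        # Fill the last (possibly short) chunk up to the required size.
        chunk += [pad_element] * (size - n_real)
        yield chunk, n_real
```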

KlaudiaTH self-assigned this on Aug 4, 2023