Each model has a `--micro-batch-size` that is set in the server startup script for each checkpoint.
Currently, the batches the client sends to the server must match that micro-batch size exactly, because the server forwards them unchanged to the model without checking whether they contain the number of samples the model expects.
If the server handled this correctly, the model would always receive exactly the number of samples it needs.
With the ALiBi checkpoint, the server crashed when the batch size was wrong; with the rotary checkpoint, even though it should not crash, I am not sure whether it computes the results correctly.
Here are some pointers on how the batching could be implemented:
The relevant variable is `prompts`:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L45
It is either a list of strings or a list of lists of token IDs.
It is passed unchanged towards the model in the following `try` block (both in the `if` and the `else` branch):
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L200
If `prompts` is a list of strings, it is tokenized and then converted into a tensor:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation/tokenization.py#L88
The batch dimension of this tensor currently always has the same size as the `prompts` list.
At a suitable point, the `prompts` list would therefore have to be split into one or more lists of exactly `args.micro_batch_size` elements, and these lists would then be passed to the model.
If `prompts` is longer than `args.micro_batch_size`, it must be split.
If it is shorter (or its length is not a multiple of it), artificial batch elements (strings or token-ID lists) must be added as padding.
After the model call, the results must be recombined and the outputs belonging to the added padding elements must be removed.
The `try` block is probably the best place to implement this, i.e. a `for` loop around the `if`/`else`; a sketch follows below.
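A minimal sketch of what that loop could look like. `generate_in_micro_batches`, `run_model`, and `pad_element` are placeholder names, not part of the existing server code; `run_model` stands in for whatever currently happens inside the `try`'s `if`/`else`:

```python
# Hypothetical sketch, not the actual server code: split `prompts` into
# chunks of micro_batch_size, pad the last chunk so the model always sees
# a full micro-batch, and strip the outputs for the padding afterwards.

def generate_in_micro_batches(prompts, micro_batch_size, run_model, pad_element=""):
    """`run_model` stands in for the existing if/else inside the try block.

    `pad_element` would be an empty string for string prompts, or e.g. a short
    list of token IDs when `prompts` is a list of token-ID lists.
    """
    outputs = []
    for start in range(0, len(prompts), micro_batch_size):
        chunk = list(prompts[start:start + micro_batch_size])
        num_padding = micro_batch_size - len(chunk)
        if num_padding > 0:
            chunk.extend([pad_element] * num_padding)     # artificial batch elements
        chunk_outputs = run_model(chunk)                  # model only ever sees full micro-batches
        if num_padding > 0:
            chunk_outputs = chunk_outputs[:-num_padding]  # drop outputs for the padding
        outputs.extend(chunk_outputs)
    return outputs
```

The server response would then be built from `outputs`, which stays aligned with the original `prompts` list.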
Splitting it up might work something like this:
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/utils.py#L68
However, that helper does not insert the padding elements; a padded variant is sketched below.
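For illustration, a padded variant (hypothetical, not part of either repository) could yield the number of padding elements along with each chunk, so the caller knows how many outputs to drop afterwards:

```python
# Hypothetical helper in the spirit of lm_eval.utils.chunks, extended with padding.
def padded_chunks(items, chunk_size, pad_element=""):
    """Yield (chunk, num_padding) pairs; every chunk has exactly chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        chunk = list(items[start:start + chunk_size])
        num_padding = chunk_size - len(chunk)
        yield chunk + [pad_element] * num_padding, num_padding
```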