Each model has a `--micro-batch-size` that is set in the server startup script for each checkpoint.
Currently, the batches the client sends to the server must match that micro-batch size exactly, because the server forwards them unchanged to the model without checking whether they contain the number of samples the model expects.
If the server handled this correctly, the model would always receive exactly the number of samples it needs.
With the ALiBi checkpoint, the server crashed when the batch size was wrong; with the rotary checkpoint, even though it should not crash, I am not sure whether it computes the results correctly.
Here are some pointers on how the batching could be implemented:
The relevant variable is `prompts`:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L45
It is either a list of strings or a list of lists of token IDs.
It is passed unchanged towards the model in the following `try` block (both in the `if` and the `else` branch):
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation_server.py#L200
If `prompts` is a list of strings, it is tokenized and then converted into a tensor:
https://github.com/OpenGPTX/Megatron-LM/blob/megatron_lmeval_server/megatron/text_generation/tokenization.py#L88
The batch dimension of this tensor currently always has the same size as the `prompts` list.
At a suitable point, the `prompts` list would therefore have to be split into one or more lists of exactly `args.micro_batch_size` elements, and these lists would then be passed to the model.
If `prompts` is longer than `args.micro_batch_size`, it must be split.
If it is shorter (or its length is not a multiple of it), artificial batch elements (strings or token-ID lists) must be added as padding.
After the model call, the results must be recombined and the outputs belonging to the added padding elements must be removed.
The `try` block is probably the best place to implement this, i.e. a `for` loop around the `if`/`else`; a sketch follows below.
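A minimal sketch of what that loop could look like. `generate_in_micro_batches`, `run_model`, and `pad_element` are placeholder names, not part of the existing server code; `run_model` stands in for whatever currently happens inside the `try`'s `if`/`else`:

```python
# Hypothetical sketch, not the actual server code: split `prompts` into
# chunks of micro_batch_size, pad the last chunk so the model always sees
# a full micro-batch, and strip the outputs for the padding afterwards.

def generate_in_micro_batches(prompts, micro_batch_size, run_model, pad_element=""):
    """`run_model` stands in for the existing if/else inside the try block.

    `pad_element` would be an empty string for string prompts, or e.g. a short
    list of token IDs when `prompts` is a list of token-ID lists.
    """
    outputs = []
    for start in range(0, len(prompts), micro_batch_size):
        chunk = list(prompts[start:start + micro_batch_size])
        num_padding = micro_batch_size - len(chunk)
        if num_padding > 0:
            chunk.extend([pad_element] * num_padding)     # artificial batch elements
        chunk_outputs = run_model(chunk)                  # model only ever sees full micro-batches
        if num_padding > 0:
            chunk_outputs = chunk_outputs[:-num_padding]  # drop outputs for the padding
        outputs.extend(chunk_outputs)
    return outputs
```

The server response would then be built from `outputs`, which stays aligned with the original `prompts` list.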
Splitting it up might work something like this:
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/utils.py#L68
However, that helper does not insert the padding elements; a padded variant is sketched below.
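For illustration, a padded variant (hypothetical, not part of either repository) could yield the number of padding elements along with each chunk, so the caller knows how many outputs to drop afterwards:

```python
# Hypothetical helper in the spirit of lm_eval.utils.chunks, extended with padding.
def padded_chunks(items, chunk_size, pad_element=""):
    """Yield (chunk, num_padding) pairs; every chunk has exactly chunk_size elements."""
    for start in range(0, len(items), chunk_size):
        chunk = list(items[start:start + chunk_size])
        num_padding = chunk_size - len(chunk)
        yield chunk + [pad_element] * num_padding, num_padding
```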