At what batch size is it recommended to use LPLB? #6
The README mentions: "The solver takes ~100 µs for intra-node optimization (longer for inter-node), which may be non-negligible for small batches." What batch size is appropriate for using LPLB? Is it primarily designed for training at the moment, or can it also be used for inference? Looking forward to your response, thank you.
Replies: 1 comment
We consider LPLB beneficial for both the training and inference prefilling stages. The optimal batch size threshold depends on several factors, including your model size, expert load balancing (which is influenced by data distribution), and infrastructure efficiency.

We recommend first estimating the potential gains from improved load balance. It's worth trying this approach only if those gains significantly outweigh the ~100–200 µs overhead. Also, please note that the additional memory access introduced by redundant experts may reduce the actual performance improvement compared to theoretical expectations.
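As a rough sketch of the estimation suggested above, assuming step time is bounded by the busiest expert-parallel rank: all function names and numbers below are hypothetical placeholders, not part of the LPLB API, so substitute profiled values from your own workload.

```python
# Back-of-envelope check of whether LPLB's solver overhead is worth paying.
# All helpers and numbers here are illustrative assumptions -- profile your
# own model for real values.

def gain_per_layer_us(expert_compute_us, imbalance_before, imbalance_after):
    """Estimated time saved on one MoE layer when the max/mean expert-load
    ratio improves, assuming layer time scales with the busiest rank."""
    return expert_compute_us * (imbalance_before - imbalance_after)

def worth_trying(total_gain_us, solver_overhead_us=150.0, margin=3.0):
    """Heuristic: only try LPLB if the expected gain *significantly*
    outweighs the ~100-200 us solver cost. The margin discounts the extra
    memory traffic introduced by redundant experts."""
    return total_gain_us > margin * solver_overhead_us

# Hypothetical profile: 300 us of expert compute per layer, load imbalance
# dropping from 1.25x to 1.05x of the mean.
gain = gain_per_layer_us(300.0, 1.25, 1.05)  # ~60 us per layer
print(worth_trying(gain))        # one layer alone: False
print(worth_trying(gain * 20))   # amortized over 20 MoE layers: True
```

The takeaway is that the verdict flips with scale: a single layer's gain is dwarfed by the solver cost, but amortized across many MoE layers (or larger batches with worse imbalance) the overhead can pay for itself.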