Source code for the paper "Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference"
We present Occult, an algorithm-system co-design solution for communication-efficient expert parallelism.
- We merge the token replicas destined for the same GPU into a single copy to reduce all-to-all communication volume, and pair this dispatch with a refactored matrix-multiplication kernel tailored to it that avoids the redundant memory footprint (see the dispatch sketch after this list).
- We reschedule expert placement in expert parallelism using a profiling dataset, clustering frequently co-activated experts onto the same GPU so that the merged all-to-all stays efficient (a placement sketch accompanies the 16-way results below).
- Occult can be integrated into both training and inference of MoE-based LLMs to achieve wall-clock speedups under heavy workloads.
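The replica-merging dispatch can be pictured with a short sketch. This is a minimal illustration assuming top-k routing and a static expert-to-GPU map; the names (`build_dedup_dispatch`, `expert_to_gpu`) are ours, not the repository's actual API. Vanilla expert parallelism ships one replica per (token, expert) pair, so a token routed to two experts hosted on the same GPU crosses the wire twice; merging sends it once per destination GPU.

```python
import torch

def build_dedup_dispatch(tokens, topk_expert_ids, expert_to_gpu, num_gpus):
    """Send each token to every destination GPU at most once.

    tokens:          [T, d] token hidden states
    topk_expert_ids: [T, k] expert indices chosen by the router
    expert_to_gpu:   [E]    id of the GPU hosting each expert
    """
    # Destination GPU for every (token, expert) pair: [T, k]
    dest_gpus = expert_to_gpu[topk_expert_ids]

    send_token, send_gpu = [], []
    for t in range(tokens.size(0)):
        for g in torch.unique(dest_gpus[t]):   # merge replicas: one copy per GPU
            send_token.append(t)
            send_gpu.append(int(g))
    send_token = torch.tensor(send_token)
    send_gpu = torch.tensor(send_gpu)

    # Group the buffer by destination rank, since
    # torch.distributed.all_to_all_single expects contiguous per-rank chunks.
    order = torch.sort(send_gpu, stable=True).indices
    send_buffer = tokens[send_token[order]]
    send_counts = torch.bincount(send_gpu[order], minlength=num_gpus).tolist()
    return send_buffer, send_counts
```

Because each destination GPU now receives one replica that may feed several of its local experts, the grouped matrix multiplication must index into this shared buffer instead of materializing per-expert input copies, which is what the refactored kernel handles.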
We examine expert-parallelized training with 8- and 16-way expert parallelism using Occult, along with evaluations on downstream tasks to validate the effectiveness of collaboration pruning (sketched below).
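Collaboration pruning caps how many distinct GPUs a single token may be dispatched to, which is what the "Pruning within k GPUs" columns below vary. Here is a minimal sketch of the router-based strategy under the assumption that each token keeps only the experts hosted on its `max_gpus` highest-scoring GPUs and renormalizes the surviving routing weights; all names are hypothetical:

```python
import torch

def prune_routing(topk_expert_ids, topk_weights, expert_to_gpu,
                  num_gpus, max_gpus):
    """Router-based collaboration pruning (sketch).

    topk_expert_ids: [T, k] expert indices chosen by the router
    topk_weights:    [T, k] corresponding routing weights
    """
    dest = expert_to_gpu[topk_expert_ids]                # [T, k] GPU per choice
    # Aggregate each token's routing weight per destination GPU.
    per_gpu = torch.zeros(dest.size(0), num_gpus)
    per_gpu.scatter_add_(1, dest, topk_weights)
    keep_gpus = per_gpu.topk(max_gpus, dim=1).indices    # [T, max_gpus]
    # Drop experts hosted outside each token's kept GPU set.
    keep = (dest.unsqueeze(-1) == keep_gpus.unsqueeze(1)).any(-1)
    pruned = topk_weights * keep
    return pruned / pruned.sum(dim=1, keepdim=True)      # renormalize
```

The similarity-based rows in the tables below choose which experts to keep by a different criterion; see the paper for its exact definition.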
Devices: 8 x NVIDIA A6000 Ada
Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 8-way expert parallelism on a single node, compared with conventional expert parallelism (MegaBlocks).
DeepSeek-MoE Evaluation (8-way expert parallelism)

| Task | Strategy | No Tuning | Pruning within 1 GPU | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 35.04 | 40.41 | 41.34 | 41.43 | 41.19 | 38.66 |
|  | Similarity-based |  | 33.68 | 39.80 | 41.74 | 41.40 | 41.48 |  |
| OpenBookQA | Router-based | 32.20 | 33.80 | 36.20 | 37.20 | 37.80 | 37.20 | 34.20 |
|  | Similarity-based |  | 33.40 | 36.40 | 36.80 | 37.80 | 37.20 |  |
| MathQA | Router-based | 31.19 | 32.93 | 35.08 | 34.97 | 35.95 | 36.08 | 33.77 |
|  | Similarity-based |  | 33.17 | 34.94 | 35.51 | 35.24 | 35.61 |  |
| RACE | Router-based | 38.85 | 38.66 | 40.38 | 39.71 | 39.71 | 39.14 | 40.10 |
|  | Similarity-based |  | 37.80 | 38.85 | 39.23 | 39.71 | 39.90 |  |
| SST-2 | Router-based | 64.68 | 58.72 | 64.22 | 68.12 | 72.36 | 70.76 | 78.33 |
|  | Similarity-based |  | 61.70 | 59.75 | 70.64 | 71.56 | 70.53 |  |

The No Tuning and No Pruning baselines are independent of the pruning strategy, so each is listed once per task.
Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 8-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to those under 4-way expert parallelism.
Devices: 2 x 8 x NVIDIA A6000 Ada
Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 16-way expert parallelism on 2 nodes, compared with conventional expert parallelism (MegaBlocks). In this case, the 64 dynamically-routed experts are scattered across 16 GPUs (4 experts per GPU), and Occult reschedules their placement by clustering frequently co-activated experts, as sketched below.
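The placement rescheduling can be sketched for this 64-expert, 16-GPU setting: given a co-activation matrix counted over the profiling dataset, greedily pack experts into groups of 4 so that frequently co-activated experts share a GPU. The greedy heuristic below is an illustrative assumption, not necessarily the exact scheduling algorithm used by Occult.

```python
import torch

def cluster_experts(coact, num_gpus=16, experts_per_gpu=4):
    """coact[i, j] counts how often experts i and j fire for the same token."""
    unassigned = set(range(coact.size(0)))
    groups = []
    for _ in range(num_gpus):
        # Seed each group with the expert having the largest total co-activation.
        seed = max(unassigned, key=lambda e: coact[e].sum().item())
        group = [seed]
        unassigned.remove(seed)
        while len(group) < experts_per_gpu:
            # Add the expert most co-activated with the current group.
            best = max(unassigned,
                       key=lambda e: sum(coact[e, g].item() for g in group))
            group.append(best)
            unassigned.remove(best)
        groups.append(group)
    return groups  # groups[g] lists the experts placed on GPU g
```

The resulting groups define the expert-to-GPU map consumed by the dispatch sketch above.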
DeepSeek-MoE Evaluation (16-way expert parallelism)

| Task | Strategy | No Tuning | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 39.69 | 40.37 | 41.23 | 41.62 | 38.66 |
|  | Similarity-based |  | 39.23 | 40.25 | 41.31 | 41.61 |  |
| OpenBookQA | Router-based | 32.20 | 36.20 | 36.80 | 37.60 | 37.20 | 34.20 |
|  | Similarity-based |  | 36.20 | 36.40 | 37.80 | 38.60 |  |
| MathQA | Router-based | 31.19 | 35.61 | 35.14 | 35.21 | 35.78 | 33.77 |
|  | Similarity-based |  | 34.84 | 35.21 | 35.68 | 35.71 |  |
| RACE | Router-based | 38.85 | 38.66 | 39.04 | 39.90 | 38.95 | 40.10 |
|  | Similarity-based |  | 38.85 | 39.04 | 39.43 | 39.43 |  |
| SST-2 | Router-based | 64.68 | 71.90 | 66.74 | 75.00 | 70.64 | 78.33 |
|  | Similarity-based |  | 56.77 | 75.23 | 73.97 | 70.64 |  |
Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 16-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to those under 4-way expert parallelism.