Occult-MoE

Official source code for the ICML 2025 paper "Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference".

Overview

We present Occult, an algorithm-system co-design solution for communication-efficient expert parallelism.

  • We merge the token replicas transmitted to the same GPU into a single copy to reduce all-to-all communication volume, and pair this with a refactored matrix-multiplication kernel tailored to this communication strategy to eliminate the unnecessary memory footprint (see the first sketch after this list).
  • We reschedule the expert placement in expert parallelism using a profiling dataset, clustering frequently co-activated experts onto the same GPU to further reduce all-to-all communication (see the second sketch after this list).
  • Occult can be integrated into both training and inference of MoE-based LLMs to achieve wall-clock speedups under heavy workloads.
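To make the first bullet concrete, here is a minimal sketch (a hypothetical helper, not the repository's actual API) that assumes standard top-k routing with experts laid out contiguously across GPUs, and counts how many token copies a conventional dispatch sends versus an Occult-style dispatch that merges replicas headed to the same GPU:

```python
import torch

def dispatch_counts(topk_expert_ids: torch.Tensor, experts_per_gpu: int):
    """topk_expert_ids: [num_tokens, k] expert indices chosen by the router."""
    # GPU that owns each selected expert (contiguous expert layout assumed).
    gpu_ids = topk_expert_ids // experts_per_gpu                # [num_tokens, k]

    # Conventional dispatch: one replica per (token, expert) pair.
    replicas = topk_expert_ids.numel()

    # Occult-style dispatch: one copy per *distinct* (token, GPU) pair --
    # replicas headed to the same GPU are merged before the all-to-all.
    num_tokens, k = topk_expert_ids.shape
    token_ids = torch.arange(num_tokens).unsqueeze(1).expand(-1, k)
    pairs = torch.stack([token_ids.reshape(-1), gpu_ids.reshape(-1)], dim=1)
    merged = torch.unique(pairs, dim=0).shape[0]
    return replicas, merged

# Example: 4 tokens, top-2 routing, 2 experts per GPU.
ids = torch.tensor([[0, 1], [2, 5], [4, 5], [6, 7]])
print(dispatch_counts(ids, experts_per_gpu=2))  # (8, 5): 3 replicas merged away
```

The refactored matrix-multiplication kernel mentioned above operates directly on these merged copies, so the deduplication also shrinks the activation memory footprint, not just the all-to-all payload.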
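The second bullet can be illustrated with a toy greedy regrouping of experts from a profiled co-activation matrix; `cluster_experts` below is a hypothetical sketch, not the paper's exact placement algorithm:

```python
import numpy as np

def cluster_experts(coact: np.ndarray, experts_per_gpu: int):
    """coact[i, j]: how often experts i and j are activated by the same token."""
    unassigned = set(range(coact.shape[0]))
    placement = []                      # one group of expert indices per GPU
    while unassigned:
        # Seed each group with the expert carrying the largest remaining co-activation mass.
        seed = max(unassigned, key=lambda e: coact[e, list(unassigned)].sum())
        group = [seed]
        unassigned.remove(seed)
        while len(group) < experts_per_gpu and unassigned:
            # Greedily add the expert most co-activated with the current group.
            nxt = max(unassigned, key=lambda e: coact[e, group].sum())
            group.append(nxt)
            unassigned.remove(nxt)
        placement.append(group)
    return placement

# Example: 8 experts placed 2 per GPU using a toy symmetric co-activation matrix.
rng = np.random.default_rng(0)
m = rng.integers(0, 100, size=(8, 8))
coact = m + m.T
np.fill_diagonal(coact, 0)
print(cluster_experts(coact, experts_per_gpu=2))
```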

Experiments

We examine expert-parallel training with 8-way and 16-way expert parallelism using Occult, along with evaluations on downstream tasks to validate the effectiveness of collaboration pruning.

8-way expert parallelism (1 node)

Devices: 8 x NVIDIA A6000 Ada

Latency Analysis

Training Latency Analysis for DeepSeek-MoE with 8-way Expert Parallelism

Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 8-way expert parallelism on a single node, compared with MegaBlocks, a conventional expert-parallelism baseline. The label "Occult (Pruning, $m$ GPUs)" denotes the $N_d$ value, i.e., the expert collaboration of each token is pruned so that an individual token only activates experts within $m$ GPUs. We measure MegaBlocks with both block-sparse matrix multiplication and grouped GEMM.
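As a rough illustration of the router-based pruning strategy referenced in the tables below (an assumption about the exact criterion; the repository may implement it differently), the sketch keeps, for each token, only the experts residing on the $m$ GPUs that collect the largest total router score:

```python
import torch

def prune_to_m_gpus(topk_ids, topk_scores, experts_per_gpu, m):
    """topk_ids, topk_scores: [num_tokens, k]; returns a boolean keep-mask."""
    gpu_ids = topk_ids // experts_per_gpu                       # [T, k]
    num_gpus = int(gpu_ids.max().item()) + 1
    # Accumulate router scores per GPU for each token.
    per_gpu = torch.zeros(topk_ids.shape[0], num_gpus)
    per_gpu.scatter_add_(1, gpu_ids, topk_scores)
    # Keep only the m highest-scoring GPUs per token.
    kept_gpus = per_gpu.topk(m, dim=1).indices                  # [T, m]
    keep = (gpu_ids.unsqueeze(-1) == kept_gpus.unsqueeze(1)).any(-1)
    return keep                                                 # [T, k]

ids = torch.tensor([[0, 3, 5, 6]])              # top-4 experts for one token
scores = torch.tensor([[0.4, 0.3, 0.2, 0.1]])
print(prune_to_m_gpus(ids, scores, experts_per_gpu=2, m=2))
# tensor([[ True,  True, False, False]]): the experts on GPUs 2 and 3 are dropped
```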

Performance Analysis

DeepSeek-MoE Evaluation
| Task | Strategy | No Tuning | Pruning within 1 GPU | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 35.04 | 40.41 | 41.34 | 41.43 | 41.19 | 38.66 |
| | Similarity-based | | 33.68 | 39.80 | 41.74 | 41.40 | 41.48 | |
| OpenBookQA | Router-based | 32.20 | 33.8 | 36.2 | 37.2 | 37.8 | 37.2 | 34.20 |
| | Similarity-based | | 33.4 | 36.4 | 36.8 | 37.8 | 37.2 | |
| MathQA | Router-based | 31.19 | 32.93 | 35.08 | 34.97 | 35.95 | 36.08 | 33.77 |
| | Similarity-based | | 33.17 | 34.94 | 35.51 | 35.24 | 35.61 | |
| RACE | Router-based | 38.85 | 38.66 | 40.38 | 39.71 | 39.71 | 39.14 | 40.10 |
| | Similarity-based | | 37.8 | 38.85 | 39.23 | 39.71 | 39.9 | |
| SST-2 | Router-based | 64.68 | 58.72 | 64.22 | 68.12 | 72.36 | 70.76 | 78.33 |
| | Similarity-based | | 61.7 | 59.75 | 70.64 | 71.56 | 70.53 | |

Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 8-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks, including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism: pruning within 2 GPUs obtains performance comparable to standard SFT (no pruning) with greatly improved training efficiency.

16-way expert parallelism (2 nodes)

Devices: 2 x 8 x NVIDIA A6000 Ada

Latency Analysis

Training Latency Analysis for DeepSeek-MoE with 16-way Expert Parallelism

Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 16-way expert parallelism on 2 nodes, compared with MegaBlocks, a conventional expert-parallelism baseline. In this case, the 64 dynamically routed experts are scattered across 16 GPUs, i.e., 4 experts per GPU, which matches the expert-to-GPU ratio of DeepSeek-V3 (256 dynamically routed experts with 64-way expert parallelism).

Performance Analysis

DeepSeek-MoE Evaluation
| Task | Strategy | No Tuning | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 39.69 | 40.37 | 41.23 | 41.62 | 38.66 |
| | Similarity-based | | 39.23 | 40.25 | 41.31 | 41.61 | |
| OpenBookQA | Router-based | 32.20 | 36.2 | 36.8 | 37.6 | 37.2 | 34.20 |
| | Similarity-based | | 36.2 | 36.4 | 37.8 | 38.6 | |
| MathQA | Router-based | 31.19 | 35.61 | 35.14 | 35.21 | 35.78 | 33.77 |
| | Similarity-based | | 34.84 | 35.21 | 35.68 | 35.71 | |
| RACE | Router-based | 38.85 | 38.66 | 39.04 | 39.9 | 38.95 | 40.10 |
| | Similarity-based | | 38.85 | 39.04 | 39.43 | 39.43 | |
| SST-2 | Router-based | 64.68 | 71.9 | 66.74 | 75 | 70.64 | 78.33 |
| | Similarity-based | | 56.77 | 75.23 | 73.97 | 70.64 | |

Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 16-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks, including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism: pruning within 2 GPUs obtains performance comparable to standard SFT (no pruning) with greatly improved training efficiency.
