Occult-MoE

Official source code for the ICML 2025 paper "Occult: Optimizing Collaborative Communication across Experts for Accelerated Parallel MoE Training and Inference".

Overview

We present Occult, an algorithm-system co-design solution for communication-efficient expert parallelism.

  • We merge the token replicas transmitted to the same GPU into a single copy to reduce all-to-all communication volume, and pair this with a refactored matrix-multiplication kernel tailored to this communication strategy to eliminate the unnecessary memory footprint (see the first sketch after this list).
  • We reschedule the expert placement in expert parallelism using a profiling dataset, clustering frequently co-activated experts onto the same GPU to further reduce all-to-all communication (see the second sketch after this list).
  • Occult can be integrated into both training and inference of MoE-based LLMs to achieve wall-clock speedups under heavy workloads.
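To make the first bullet concrete, here is a minimal sketch (a hypothetical helper, not the repository's actual API) that assumes standard top-k routing with experts laid out contiguously across GPUs, and counts how many token copies a conventional dispatch sends versus an Occult-style dispatch that merges replicas headed to the same GPU:

```python
import torch

def dispatch_counts(topk_expert_ids: torch.Tensor, experts_per_gpu: int):
    """topk_expert_ids: [num_tokens, k] expert indices chosen by the router."""
    # GPU that owns each selected expert (contiguous expert layout assumed).
    gpu_ids = topk_expert_ids // experts_per_gpu                # [num_tokens, k]

    # Conventional dispatch: one replica per (token, expert) pair.
    replicas = topk_expert_ids.numel()

    # Occult-style dispatch: one copy per *distinct* (token, GPU) pair --
    # replicas headed to the same GPU are merged before the all-to-all.
    num_tokens, k = topk_expert_ids.shape
    token_ids = torch.arange(num_tokens).unsqueeze(1).expand(-1, k)
    pairs = torch.stack([token_ids.reshape(-1), gpu_ids.reshape(-1)], dim=1)
    merged = torch.unique(pairs, dim=0).shape[0]
    return replicas, merged

# Example: 4 tokens, top-2 routing, 2 experts per GPU.
ids = torch.tensor([[0, 1], [2, 5], [4, 5], [6, 7]])
print(dispatch_counts(ids, experts_per_gpu=2))  # (8, 5): 3 replicas merged away
```

The refactored matrix-multiplication kernel mentioned above operates directly on these merged copies, so the deduplication also shrinks the activation memory footprint, not just the all-to-all payload.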
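The second bullet can be illustrated with a toy greedy regrouping of experts from a profiled co-activation matrix; `cluster_experts` below is a hypothetical sketch, not the paper's exact placement algorithm:

```python
import numpy as np

def cluster_experts(coact: np.ndarray, experts_per_gpu: int):
    """coact[i, j]: how often experts i and j are activated by the same token."""
    unassigned = set(range(coact.shape[0]))
    placement = []                      # one group of expert indices per GPU
    while unassigned:
        # Seed each group with the expert carrying the largest remaining co-activation mass.
        seed = max(unassigned, key=lambda e: coact[e, list(unassigned)].sum())
        group = [seed]
        unassigned.remove(seed)
        while len(group) < experts_per_gpu and unassigned:
            # Greedily add the expert most co-activated with the current group.
            nxt = max(unassigned, key=lambda e: coact[e, group].sum())
            group.append(nxt)
            unassigned.remove(nxt)
        placement.append(group)
    return placement

# Example: 8 experts placed 2 per GPU using a toy symmetric co-activation matrix.
rng = np.random.default_rng(0)
m = rng.integers(0, 100, size=(8, 8))
coact = m + m.T
np.fill_diagonal(coact, 0)
print(cluster_experts(coact, experts_per_gpu=2))
```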

Experiments

We examine expert-parallel training with 8-way and 16-way expert parallelism using Occult, along with evaluations on downstream tasks to validate the effectiveness of collaboration pruning.

8-way expert parallelism (1 node)

Devices: 8 x NVIDIA A6000 Ada

Latency Analysis

Training Latency Analysis for DeepSeek-MoE with 8-way Expert Parallelism

Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 8-way expert parallelism on a single node, compared with MegaBlocks, a conventional expert-parallelism baseline. The label "Occult (Pruning, $m$ GPUs)" denotes the $N_d$ value, i.e., the expert collaboration of each token is pruned so that an individual token only activates experts within $m$ GPUs. We measure MegaBlocks with both block-sparse matrix multiplication and grouped GEMM.
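As a rough illustration of the router-based pruning strategy referenced in the tables below (an assumption about the exact criterion; the repository may implement it differently), the sketch keeps, for each token, only the experts residing on the $m$ GPUs that collect the largest total router score:

```python
import torch

def prune_to_m_gpus(topk_ids, topk_scores, experts_per_gpu, m):
    """topk_ids, topk_scores: [num_tokens, k]; returns a boolean keep-mask."""
    gpu_ids = topk_ids // experts_per_gpu                       # [T, k]
    num_gpus = int(gpu_ids.max().item()) + 1
    # Accumulate router scores per GPU for each token.
    per_gpu = torch.zeros(topk_ids.shape[0], num_gpus)
    per_gpu.scatter_add_(1, gpu_ids, topk_scores)
    # Keep only the m highest-scoring GPUs per token.
    kept_gpus = per_gpu.topk(m, dim=1).indices                  # [T, m]
    keep = (gpu_ids.unsqueeze(-1) == kept_gpus.unsqueeze(1)).any(-1)
    return keep                                                 # [T, k]

ids = torch.tensor([[0, 3, 5, 6]])              # top-4 experts for one token
scores = torch.tensor([[0.4, 0.3, 0.2, 0.1]])
print(prune_to_m_gpus(ids, scores, experts_per_gpu=2, m=2))
# tensor([[ True,  True, False, False]]): the experts on GPUs 2 and 3 are dropped
```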

Performance Analysis

DeepSeek-MoE Evaluation
| Task | Strategy | No Tuning | Pruning within 1 GPU | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 35.04 | 40.41 | 41.34 | 41.43 | 41.19 | 38.66 |
| | Similarity-based | | 33.68 | 39.80 | 41.74 | 41.40 | 41.48 | |
| OpenBookQA | Router-based | 32.20 | 33.8 | 36.2 | 37.2 | 37.8 | 37.2 | 34.20 |
| | Similarity-based | | 33.4 | 36.4 | 36.8 | 37.8 | 37.2 | |
| MathQA | Router-based | 31.19 | 32.93 | 35.08 | 34.97 | 35.95 | 36.08 | 33.77 |
| | Similarity-based | | 33.17 | 34.94 | 35.51 | 35.24 | 35.61 | |
| RACE | Router-based | 38.85 | 38.66 | 40.38 | 39.71 | 39.71 | 39.14 | 40.10 |
| | Similarity-based | | 37.8 | 38.85 | 39.23 | 39.71 | 39.9 | |
| SST-2 | Router-based | 64.68 | 58.72 | 64.22 | 68.12 | 72.36 | 70.76 | 78.33 |
| | Similarity-based | | 61.7 | 59.75 | 70.64 | 71.56 | 70.53 | |

Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 8-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks, including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism: pruning within 2 GPUs obtains performance comparable to standard SFT (no pruning) with greatly improved training efficiency.

16-way expert parallelism (2 nodes)

Devices: 2 x 8 x NVIDIA A6000 Ada

Latency Analysis

Training Latency Analysis for DeepSeek-MoE with 16-way Expert Parallelism

Caption: We demonstrate the training efficiency of Occult on DeepSeek-MoE (16B) with 16-way expert parallelism on 2 nodes, compared with MegaBlocks, a conventional expert-parallelism baseline. In this case, the 64 dynamically routed experts are scattered across 16 GPUs, i.e., 4 experts per GPU, which matches the expert-to-GPU ratio of DeepSeek-V3 (256 dynamically routed experts with 64-way expert parallelism).

Performance Analysis

DeepSeek-MoE Evaluation
| Task | Strategy | No Tuning | Pruning within 2 GPUs | Pruning within 3 GPUs | Pruning within 4 GPUs | Pruning within 5 GPUs | No Pruning |
|---|---|---|---|---|---|---|---|
| MMLU | Router-based | 37.95 | 39.69 | 40.37 | 41.23 | 41.62 | 38.66 |
| | Similarity-based | | 39.23 | 40.25 | 41.31 | 41.61 | |
| OpenBookQA | Router-based | 32.20 | 36.2 | 36.8 | 37.6 | 37.2 | 34.20 |
| | Similarity-based | | 36.2 | 36.4 | 37.8 | 38.6 | |
| MathQA | Router-based | 31.19 | 35.61 | 35.14 | 35.21 | 35.78 | 33.77 |
| | Similarity-based | | 34.84 | 35.21 | 35.68 | 35.71 | |
| RACE | Router-based | 38.85 | 38.66 | 39.04 | 39.9 | 38.95 | 40.10 |
| | Similarity-based | | 38.85 | 39.04 | 39.43 | 39.43 | |
| SST-2 | Router-based | 64.68 | 71.9 | 66.74 | 75 | 70.64 | 78.33 |
| | Similarity-based | | 56.77 | 75.23 | 73.97 | 70.64 | |

Caption: We validate the effectiveness of the proposed collaboration pruning algorithms in Occult for 16-way expert parallelism with DeepSeek-MoE by evaluating on popular benchmarks, including MMLU, OpenBookQA, MathQA, RACE, and SST-2. The conclusions are similar to 4-way expert parallelism: pruning within 2 GPUs obtains performance comparable to standard SFT (no pruning) with greatly improved training efficiency.
