
Commit 3851d53

add ENABLE_EXPERT_PARALLEL engine arg for MoE models (#239)
* enable expert parallel arg for moe models
* add ENABLE_EXPERT_PARALLEL to hub config
1 parent c896438 commit 3851d53

File tree

3 files changed

+12
-0
lines changed


.runpod/hub.json

Lines changed: 10 additions & 0 deletions

@@ -929,6 +929,16 @@
       "advanced": true
     }
   },
+  {
+    "key": "ENABLE_EXPERT_PARALLEL",
+    "input": {
+      "name": "Enable Expert Parallel",
+      "type": "boolean",
+      "description": "Enable Expert Parallel for MoE models",
+      "default": false,
+      "advanced": true
+    }
+  },
   {
     "key": "MODEL_REVISION",
     "input": {

docs/configuration.md

Lines changed: 1 addition & 0 deletions

@@ -85,6 +85,7 @@ Complete guide to all environment variables and configuration options for worker
 | `ENFORCE_EAGER` | False | `bool` | Always use eager-mode PyTorch. If False(`0`), will use eager mode and CUDA graph in hybrid for maximal performance and flexibility. |
 | `MAX_SEQ_LEN_TO_CAPTURE` | `8192` | `int` | Maximum context length covered by CUDA graphs. When a sequence has context length larger than this, we fall back to eager mode. |
 | `DISABLE_CUSTOM_ALL_REDUCE` | `0` | `int` | Enables or disables custom all reduce. |
+| `ENABLE_EXPERT_PARALLEL` | `False` | `bool` | Enable Expert Parallel for MoE models |

 ## Tokenizer Settings
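Note that the flag is read from the environment as a string, so only the literal value `true` (case-insensitive) enables it. A minimal sketch of that parsing behavior (the `env_bool` helper is hypothetical, not part of this repo; it mirrors the expression used in `src/engine_args.py`):

```python
import os

def env_bool(name: str, default: str = "False") -> bool:
    # Mirrors the parsing style in src/engine_args.py: the flag is on
    # only when the variable equals "true", case-insensitively.
    return os.getenv(name, default).lower() == "true"

os.environ["ENABLE_EXPERT_PARALLEL"] = "True"
print(env_bool("ENABLE_EXPERT_PARALLEL"))  # True

# Caveat: "1" is not recognized as truthy under this scheme.
os.environ["ENABLE_EXPERT_PARALLEL"] = "1"
print(env_bool("ENABLE_EXPERT_PARALLEL"))  # False
```

So deployments should set `ENABLE_EXPERT_PARALLEL=true` rather than `ENABLE_EXPERT_PARALLEL=1`.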

src/engine_args.py

Lines changed: 1 addition & 0 deletions

@@ -80,6 +80,7 @@
     "guided_decoding_backend": os.getenv('GUIDED_DECODING_BACKEND', 'outlines'),
     "speculative_model": os.getenv('SPECULATIVE_MODEL', None),
     "speculative_draft_tensor_parallel_size": int(os.getenv('SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE', 0)) or None,
+    "enable_expert_parallel": bool(os.getenv('ENABLE_EXPERT_PARALLEL', 'False').lower() == 'true'),
     "num_speculative_tokens": int(os.getenv('NUM_SPECULATIVE_TOKENS', 0)) or None,
     "speculative_max_model_len": int(os.getenv('SPECULATIVE_MAX_MODEL_LEN', 0)) or None,
     "speculative_disable_by_batch_size": int(os.getenv('SPECULATIVE_DISABLE_BY_BATCH_SIZE', 0)) or None,
