Enable multinode training for HF speculative decoding #922

yeyu-nvidia wants to merge 7 commits into main from

Conversation
Signed-off-by: Ye Yu <[email protected]>
📝 Walkthrough

The pull request removes Medusa-related CLI options from the training launch script, introduces multi-node distributed training support through new configuration options (NUM_NODES, HEAD_NODE_IP), and adds a new Slurm batch submission script. The launch script now exclusively supports Eagle3 mode and computes total GPU allocation across nodes for resource allocation calculations.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/speculative_decoding/launch_train.sh`:
- Around line 197-203: When NUM_NODES != 1, add explicit validation for required
multi-node environment variables (HEAD_NODE_IP and SLURM_PROCID) before building
MULTI_NODE_ARGS: verify HEAD_NODE_IP is non-empty and SLURM_PROCID is set (and
numeric if desired), and if either is missing print a clear error message
referencing the missing variable(s) and exit non-zero so the script fails fast;
update the block that constructs MULTI_NODE_ARGS to run only after these checks
succeed (use the existing NUM_NODES, TOTAL_GPU, MULTI_NODE_ARGS names to locate
where to insert the checks).
- Around line 123-129: The script currently risks divide-by-zero and producing
DEFAULT_SAVE_STEPS=0 when nvidia-smi returns 0 or when TOTAL_GPU is large;
update the logic around NUM_NODES, GPU_PER_NODE, TOTAL_GPU and
DEFAULT_SAVE_STEPS so GPU_PER_NODE falls back to 1 if nvidia-smi fails or
reports 0, ensure TOTAL_GPU is computed as max(1, NUM_NODES * GPU_PER_NODE), and
then compute DEFAULT_SAVE_STEPS using that safe TOTAL_GPU and finally clamp
DEFAULT_SAVE_STEPS to a minimum of 1 so the trainer never receives 0 (refer to
the variables NUM_NODES, GPU_PER_NODE, TOTAL_GPU, DEFAULT_SAVE_STEPS).
- Around line 147-149: Validate CP_SIZE after it's set: ensure it's a positive
integer, <= TOTAL_GPU, and that TOTAL_GPU is divisible by CP_SIZE (so
DP_SHARD_SIZE isn't zero/truncated). Replace the current DP_SHARD_SIZE
assignment with either a strict check that exits with a clear error when
TOTAL_GPU % CP_SIZE != 0, or compute DP_SHARD_SIZE using integer ceiling (e.g.,
DP_SHARD_SIZE=$(((TOTAL_GPU + CP_SIZE - 1) / CP_SIZE))) and ensure DP_SHARD_SIZE
>= 1; check variables CP_SIZE, DP_SHARD_SIZE, and TOTAL_GPU in the same block
and produce a clear error message and exit on invalid values.
In `@examples/speculative_decoding/slurm.sh`:
- Around line 9-32: The CMD uses an undefined lowercased variable head_node_ip;
define a HEAD_NODE_IP variable (or derive it from SLURM, e.g., primary node IP)
before constructing CMD and then use that exact variable name in CMD (replace
--head_node_ip $head_node_ip with --head_node_ip $HEAD_NODE_IP) to ensure
consistent casing and a non-empty value during multi-node rendezvous; update any
other references to use HEAD_NODE_IP consistently (symbols: CMD, head_node_ip,
HEAD_NODE_IP, SLURM_NNODES).
ℹ️ Review info
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (2)
- examples/speculative_decoding/launch_train.sh
- examples/speculative_decoding/slurm.sh
```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
-DEFAULT_SAVE_STEPS=$((8192 / GPU_COUNT))
+DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
```
Guard against zero GPUs and zero save_steps.

If nvidia-smi returns 0 GPUs (or fails), TOTAL_GPU becomes 0 and the script will divide by zero. Also, large TOTAL_GPU values can drive DEFAULT_SAVE_STEPS to 0, which is invalid for trainers.

🛡️ Suggested fix

```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
+if (( TOTAL_GPU <= 0 )); then
+  echo "No GPUs detected. Set GPU_PER_NODE/NUM_NODES explicitly."
+  exit 1
+fi
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
 DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
+if (( DEFAULT_SAVE_STEPS < 1 )); then
+  DEFAULT_SAVE_STEPS=1
+fi
```
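The fallback-and-clamp logic the reviewer describes can be sketched end to end as below. This is a hedged illustration, not the PR's exact code: it redirects nvidia-smi errors, defaults GPU_PER_NODE to 1, and clamps save_steps to at least 1.

```shell
# Sketch only: safe GPU detection and save_steps clamping.
NUM_NODES=${NUM_NODES:-1}
GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | wc -l)}
(( GPU_PER_NODE > 0 )) || GPU_PER_NODE=1              # guard: nvidia-smi failed or found 0 GPUs
TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
(( DEFAULT_SAVE_STEPS >= 1 )) || DEFAULT_SAVE_STEPS=1 # clamp: trainers reject save_steps=0
echo "TOTAL_GPU=$TOTAL_GPU DEFAULT_SAVE_STEPS=$DEFAULT_SAVE_STEPS"
```

With this structure the script degrades gracefully on GPU-less login nodes instead of dividing by zero.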
```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((GPU_COUNT/CP_SIZE))}
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
 LOG_STEPS=${LOG_STEPS:-100}
```
Validate CP_SIZE so DP_SHARD_SIZE is never 0 or truncated.

DP_SHARD_SIZE is computed via integer division. If CP_SIZE doesn't evenly divide TOTAL_GPU (or exceeds it), DP_SHARD_SIZE becomes incorrect or 0, which will break training.

✅ Suggested fix

```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( TOTAL_GPU % CP_SIZE != 0 )); then
+  echo "CP_SIZE ($CP_SIZE) must evenly divide TOTAL_GPU ($TOTAL_GPU)."
+  exit 1
+fi
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( DP_SHARD_SIZE < 1 )); then
+  echo "DP_SHARD_SIZE must be >= 1."
+  exit 1
+fi
```
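As a quick illustration of the strict divisibility check, here is a hedged sketch using assumed values (TOTAL_GPU=8, CP_SIZE=2) rather than the script's real environment:

```shell
# Sketch only: validate CP_SIZE before deriving DP_SHARD_SIZE.
TOTAL_GPU=8   # assumed value for the demo
CP_SIZE=2     # assumed value for the demo
if (( CP_SIZE < 1 || CP_SIZE > TOTAL_GPU || TOTAL_GPU % CP_SIZE != 0 )); then
    echo "CP_SIZE ($CP_SIZE) must be in [1, TOTAL_GPU] and divide TOTAL_GPU ($TOTAL_GPU)." >&2
    exit 1
fi
DP_SHARD_SIZE=$((TOTAL_GPU / CP_SIZE))
echo "DP_SHARD_SIZE=$DP_SHARD_SIZE"
```

With 8 GPUs and a context-parallel size of 2, the data-parallel shard size comes out to 4; an invalid CP_SIZE (e.g. 3) exits with a clear message instead of silently truncating.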
```sh
if [[ "$NUM_NODES" != 1 ]]; then
    MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
        --num_machines $NUM_NODES \
        --machine_rank $SLURM_PROCID \
        --rdzv_backend c10d \
        --main_process_ip $HEAD_NODE_IP \
        --main_process_port 29500"
```
Fail fast when required multi-node vars are missing.

When NUM_NODES != 1, the script relies on HEAD_NODE_IP and SLURM_PROCID. If either is empty, the accelerate rendezvous will fail with confusing errors.

🚦 Suggested fix

```diff
 if [[ "$NUM_NODES" != 1 ]]; then
+  if [[ -z "$HEAD_NODE_IP" ]]; then
+    echo "HEAD_NODE_IP is required when NUM_NODES > 1."
+    exit 1
+  fi
+  if [[ -z "$SLURM_PROCID" ]]; then
+    echo "SLURM_PROCID is required when NUM_NODES > 1."
+    exit 1
+  fi
   MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
     --num_machines $NUM_NODES \
     --machine_rank $SLURM_PROCID \
     --rdzv_backend c10d \
     --main_process_ip $HEAD_NODE_IP \
     --main_process_port 29500"
 else
   MULTI_NODE_ARGS=""
 fi
```
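The repeated "is the variable set?" checks can also be factored into a small helper. The `require_var` function below is hypothetical (not part of the PR) and just illustrates the fail-fast pattern the review asks for; it uses bash indirect expansion.

```shell
# Hypothetical helper: abort with a named error when a required variable is empty.
require_var() {
    local name="$1"
    if [ -z "${!name:-}" ]; then      # bash indirect expansion: value of the variable named $name
        echo "$name is required when NUM_NODES > 1." >&2
        return 1
    fi
}

HEAD_NODE_IP="10.0.0.1"   # placeholder value for the demo
require_var HEAD_NODE_IP && echo "HEAD_NODE_IP ok"
```

Calling `require_var SLURM_PROCID` before building MULTI_NODE_ARGS would then surface a clear error instead of a confusing rendezvous failure.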
```sh
CONTAINER_IMAGE={container_image}
WORK_DIR={path_to_modelopt}

CONTAINER_MOUNT="${WORK_DIR}:/modelopt"

OUTPUT_DIR={path_to_output_dir}
MODEL={path_to_model_dir}
DATA={path_to_data_dir}
OFFLINE_DATA={path_to_offline_data_dir}

CMD="./launch_train.sh --model $MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_epochs 1 \
    --train_bs 1 \
    --lr 1e-4 \
    --eagle_config eagle_config.json \
    --training_seq_len 4096 \
    --save_steps 1000 \
    --estimate_ar True \
    --disable_tqdm True \
    --offline-data $OFFLINE_DATA \
    --num_nodes $SLURM_NNODES \
    --head_node_ip $head_node_ip \
```
Define HEAD_NODE_IP (and use consistent casing) before passing it.

`--head_node_ip $head_node_ip` references an undefined variable, so multi-node runs will pass an empty IP and fail to rendezvous. Please define it (placeholder or derive from SLURM) and use a consistent variable name.

🔧 Suggested fix

```diff
 CONTAINER_IMAGE={container_image}
 WORK_DIR={path_to_modelopt}
+HEAD_NODE_IP={head_node_ip}
 CONTAINER_MOUNT="${WORK_DIR}:/modelopt"
 OUTPUT_DIR={path_to_output_dir}
 MODEL={path_to_model_dir}
 DATA={path_to_data_dir}
 OFFLINE_DATA={path_to_offline_data_dir}
+
+if [[ "${SLURM_NNODES:-1}" -gt 1 && -z "$HEAD_NODE_IP" ]]; then
+  echo "HEAD_NODE_IP is required for multi-node runs."
+  exit 1
+fi
 CMD="./launch_train.sh --model $MODEL \
   --output_dir $OUTPUT_DIR \
   --data $DATA \
   --num_epochs 1 \
   --train_bs 1 \
   --lr 1e-4 \
   --eagle_config eagle_config.json \
   --training_seq_len 4096 \
   --save_steps 1000 \
   --estimate_ar True \
   --disable_tqdm True \
   --offline-data $OFFLINE_DATA \
   --num_nodes $SLURM_NNODES \
-  --head_node_ip $head_node_ip \
+  --head_node_ip $HEAD_NODE_IP \
 "
```
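A common way to derive the head node address inside the sbatch script, rather than hard-coding a placeholder, is sketched below. This is a hedged example: it assumes standard SLURM tools (`scontrol`, `srun`) and GNU `hostname`, and falls back to the first local IP (or loopback) so the snippet also runs outside a cluster.

```shell
# Sketch only: derive HEAD_NODE_IP from SLURM when available.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
    HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
else
    HEAD_NODE_IP=$(hostname -I 2>/dev/null | awk '{print $1}')
fi
HEAD_NODE_IP=${HEAD_NODE_IP:-127.0.0.1}   # last-resort fallback for the demo
echo "HEAD_NODE_IP=$HEAD_NODE_IP"
```

Exporting the derived HEAD_NODE_IP before building CMD keeps the casing consistent and guarantees a non-empty rendezvous address.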
🧰 Tools
🪛 Shellcheck (0.11.0)
- SC1083 (warning, lines 9-10 and 14-17): This { / } is literal. Check expression (missing ;/\n?) or quote it.
- SC2154 (warning, line 32): head_node_ip is referenced but not assigned.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   73.11%   73.11%   -0.01%
==========================================
  Files         205      205
  Lines       22281    22294      +13
==========================================
+ Hits        16291    16300       +9
- Misses       5990     5994       +4
```

☔ View full report in Codecov by Sentry.
What does this PR do?

Type of change: New example

Overview:
- Modify the launch_train.sh script to enable multi-node training.
- Provide a Slurm template script.

Usage

Add the required fields in slurm.sh, then use the command below to submit a multi-node job:

Testing

Before your PR is "Ready for review"

Additional Information
Improvements