Enable multinode training for HF speculative decoding #922

yeyu-nvidia wants to merge 7 commits into main from

Conversation
Signed-off-by: Ye Yu <[email protected]>
📝 Walkthrough

The pull request removes Medusa-related CLI options from the training launch script, introduces multi-node distributed training support through new configuration options (NUM_NODES, HEAD_NODE_IP), and adds a new Slurm batch submission script. The launch script now exclusively supports Eagle3 mode and computes total GPU allocation across nodes for resource allocation calculations.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@examples/speculative_decoding/launch_train.sh`:
- Around line 197-203: When NUM_NODES != 1, add explicit validation for required
multi-node environment variables (HEAD_NODE_IP and SLURM_PROCID) before building
MULTI_NODE_ARGS: verify HEAD_NODE_IP is non-empty and SLURM_PROCID is set (and
numeric if desired), and if either is missing print a clear error message
referencing the missing variable(s) and exit non-zero so the script fails fast;
update the block that constructs MULTI_NODE_ARGS to run only after these checks
succeed (use the existing NUM_NODES, TOTAL_GPU, MULTI_NODE_ARGS names to locate
where to insert the checks).
- Around line 123-129: The script currently risks divide-by-zero and producing
DEFAULT_SAVE_STEPS=0 when nvidia-smi returns 0 or when TOTAL_GPU is large;
update the logic around NUM_NODES, GPU_PER_NODE, TOTAL_GPU and
DEFAULT_SAVE_STEPS so GPU_PER_NODE falls back to 1 if nvidia-smi fails or
reports 0, ensure TOTAL_GPU is computed as max(1, NUM_NODES * GPU_PER_NODE), and
then compute DEFAULT_SAVE_STEPS using that safe TOTAL_GPU and finally clamp
DEFAULT_SAVE_STEPS to a minimum of 1 so the trainer never receives 0 (refer to
the variables NUM_NODES, GPU_PER_NODE, TOTAL_GPU, DEFAULT_SAVE_STEPS).
- Around line 147-149: Validate CP_SIZE after it's set: ensure it's a positive
integer, <= TOTAL_GPU, and that TOTAL_GPU is divisible by CP_SIZE (so
DP_SHARD_SIZE isn't zero/truncated). Replace the current DP_SHARD_SIZE
assignment with either a strict check that exits with a clear error when
TOTAL_GPU % CP_SIZE != 0, or compute DP_SHARD_SIZE using integer ceiling (e.g.,
DP_SHARD_SIZE=$(((TOTAL_GPU + CP_SIZE - 1) / CP_SIZE))) and ensure DP_SHARD_SIZE
>= 1; check variables CP_SIZE, DP_SHARD_SIZE, and TOTAL_GPU in the same block
and produce a clear error message and exit on invalid values.
In `@examples/speculative_decoding/slurm.sh`:
- Around line 9-32: The CMD uses an undefined lowercased variable head_node_ip;
define a HEAD_NODE_IP variable (or derive it from SLURM, e.g., primary node IP)
before constructing CMD and then use that exact variable name in CMD (replace
--head_node_ip $head_node_ip with --head_node_ip $HEAD_NODE_IP) to ensure
consistent casing and a non-empty value during multi-node rendezvous; update any
other references to use HEAD_NODE_IP consistently (symbols: CMD, head_node_ip,
HEAD_NODE_IP, SLURM_NNODES).
ℹ️ Review info
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (2)
- examples/speculative_decoding/launch_train.sh
- examples/speculative_decoding/slurm.sh
```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
-DEFAULT_SAVE_STEPS=$((8192 / GPU_COUNT))
+DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
```
Guard against zero GPUs and zero save_steps.

If nvidia-smi returns 0 GPUs (or fails), TOTAL_GPU becomes 0 and the script will divide by zero. Also, large TOTAL_GPU values can drive DEFAULT_SAVE_STEPS to 0, which is invalid for trainers.

🛡️ Suggested fix

```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
+if (( TOTAL_GPU <= 0 )); then
+  echo "No GPUs detected. Set GPU_PER_NODE/NUM_NODES explicitly."
+  exit 1
+fi
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
 DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
+if (( DEFAULT_SAVE_STEPS < 1 )); then
+  DEFAULT_SAVE_STEPS=1
+fi
```
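The fallback-and-clamp logic the reviewer describes can be sketched end to end as below. This is a hedged illustration, not the PR's exact code: it redirects nvidia-smi errors, defaults GPU_PER_NODE to 1, and clamps save_steps to at least 1.

```shell
# Sketch only: safe GPU detection and save_steps clamping.
NUM_NODES=${NUM_NODES:-1}
GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | wc -l)}
(( GPU_PER_NODE > 0 )) || GPU_PER_NODE=1              # guard: nvidia-smi failed or found 0 GPUs
TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
(( DEFAULT_SAVE_STEPS >= 1 )) || DEFAULT_SAVE_STEPS=1 # clamp: trainers reject save_steps=0
echo "TOTAL_GPU=$TOTAL_GPU DEFAULT_SAVE_STEPS=$DEFAULT_SAVE_STEPS"
```

With this structure the script degrades gracefully on GPU-less login nodes instead of dividing by zero.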
```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((GPU_COUNT/CP_SIZE))}
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
 LOG_STEPS=${LOG_STEPS:-100}
```
Validate CP_SIZE so DP_SHARD_SIZE is never 0 or truncated.

DP_SHARD_SIZE is computed via integer division. If CP_SIZE doesn't evenly divide TOTAL_GPU (or exceeds it), DP_SHARD_SIZE becomes incorrect or 0, which will break training.

✅ Suggested fix

```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( TOTAL_GPU % CP_SIZE != 0 )); then
+  echo "CP_SIZE ($CP_SIZE) must evenly divide TOTAL_GPU ($TOTAL_GPU)."
+  exit 1
+fi
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( DP_SHARD_SIZE < 1 )); then
+  echo "DP_SHARD_SIZE must be >= 1."
+  exit 1
+fi
```
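As a quick illustration of the strict divisibility check, here is a hedged sketch using assumed values (TOTAL_GPU=8, CP_SIZE=2) rather than the script's real environment:

```shell
# Sketch only: validate CP_SIZE before deriving DP_SHARD_SIZE.
TOTAL_GPU=8   # assumed value for the demo
CP_SIZE=2     # assumed value for the demo
if (( CP_SIZE < 1 || CP_SIZE > TOTAL_GPU || TOTAL_GPU % CP_SIZE != 0 )); then
    echo "CP_SIZE ($CP_SIZE) must be in [1, TOTAL_GPU] and divide TOTAL_GPU ($TOTAL_GPU)." >&2
    exit 1
fi
DP_SHARD_SIZE=$((TOTAL_GPU / CP_SIZE))
echo "DP_SHARD_SIZE=$DP_SHARD_SIZE"
```

With 8 GPUs and a context-parallel size of 2, the data-parallel shard size comes out to 4; an invalid CP_SIZE (e.g. 3) exits with a clear message instead of silently truncating.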
```sh
if [[ "$NUM_NODES" != 1 ]]; then
    MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
        --num_machines $NUM_NODES \
        --machine_rank $SLURM_PROCID \
        --rdzv_backend c10d \
        --main_process_ip $HEAD_NODE_IP \
        --main_process_port 29500"
```
Fail fast when required multi-node vars are missing.

When NUM_NODES != 1, the script relies on HEAD_NODE_IP and SLURM_PROCID. If either is empty, the accelerate rendezvous will fail with confusing errors.

🚦 Suggested fix

```diff
 if [[ "$NUM_NODES" != 1 ]]; then
+  if [[ -z "$HEAD_NODE_IP" ]]; then
+    echo "HEAD_NODE_IP is required when NUM_NODES > 1."
+    exit 1
+  fi
+  if [[ -z "$SLURM_PROCID" ]]; then
+    echo "SLURM_PROCID is required when NUM_NODES > 1."
+    exit 1
+  fi
   MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
     --num_machines $NUM_NODES \
     --machine_rank $SLURM_PROCID \
     --rdzv_backend c10d \
     --main_process_ip $HEAD_NODE_IP \
     --main_process_port 29500"
 else
   MULTI_NODE_ARGS=""
 fi
```
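The repeated "is the variable set?" checks can also be factored into a small helper. The `require_var` function below is hypothetical (not part of the PR) and just illustrates the fail-fast pattern the review asks for; it uses bash indirect expansion.

```shell
# Hypothetical helper: abort with a named error when a required variable is empty.
require_var() {
    local name="$1"
    if [ -z "${!name:-}" ]; then      # bash indirect expansion: value of the variable named $name
        echo "$name is required when NUM_NODES > 1." >&2
        return 1
    fi
}

HEAD_NODE_IP="10.0.0.1"   # placeholder value for the demo
require_var HEAD_NODE_IP && echo "HEAD_NODE_IP ok"
```

Calling `require_var SLURM_PROCID` before building MULTI_NODE_ARGS would then surface a clear error instead of a confusing rendezvous failure.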
```sh
CONTAINER_IMAGE={container_image}
WORK_DIR={path_to_modelopt}

CONTAINER_MOUNT="${WORK_DIR}:/modelopt"

OUTPUT_DIR={path_to_output_dir}
MODEL={path_to_model_dir}
DATA={path_to_data_dir}
OFFLINE_DATA={path_to_offline_data_dir}

CMD="./launch_train.sh --model $MODEL \
    --output_dir $OUTPUT_DIR \
    --data $DATA \
    --num_epochs 1 \
    --train_bs 1 \
    --lr 1e-4 \
    --eagle_config eagle_config.json \
    --training_seq_len 4096 \
    --save_steps 1000 \
    --estimate_ar True \
    --disable_tqdm True \
    --offline-data $OFFLINE_DATA \
    --num_nodes $SLURM_NNODES \
    --head_node_ip $head_node_ip \
```
Define HEAD_NODE_IP (and use consistent casing) before passing it.

`--head_node_ip $head_node_ip` references an undefined variable, so multi-node runs will pass an empty IP and fail to rendezvous. Please define it (placeholder or derive from SLURM) and use a consistent variable name.

🔧 Suggested fix

```diff
 CONTAINER_IMAGE={container_image}
 WORK_DIR={path_to_modelopt}
+HEAD_NODE_IP={head_node_ip}
 CONTAINER_MOUNT="${WORK_DIR}:/modelopt"
 OUTPUT_DIR={path_to_output_dir}
 MODEL={path_to_model_dir}
 DATA={path_to_data_dir}
 OFFLINE_DATA={path_to_offline_data_dir}
+
+if [[ "${SLURM_NNODES:-1}" -gt 1 && -z "$HEAD_NODE_IP" ]]; then
+  echo "HEAD_NODE_IP is required for multi-node runs."
+  exit 1
+fi
 CMD="./launch_train.sh --model $MODEL \
   --output_dir $OUTPUT_DIR \
   --data $DATA \
   --num_epochs 1 \
   --train_bs 1 \
   --lr 1e-4 \
   --eagle_config eagle_config.json \
   --training_seq_len 4096 \
   --save_steps 1000 \
   --estimate_ar True \
   --disable_tqdm True \
   --offline-data $OFFLINE_DATA \
   --num_nodes $SLURM_NNODES \
-  --head_node_ip $head_node_ip \
+  --head_node_ip $HEAD_NODE_IP \
 "
```
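A common way to derive the head node address inside the sbatch script, rather than hard-coding a placeholder, is sketched below. This is a hedged example: it assumes standard SLURM tools (`scontrol`, `srun`) and GNU `hostname`, and falls back to the first local IP (or loopback) so the snippet also runs outside a cluster.

```shell
# Sketch only: derive HEAD_NODE_IP from SLURM when available.
if command -v scontrol >/dev/null 2>&1 && [ -n "${SLURM_JOB_NODELIST:-}" ]; then
    head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n1)
    HEAD_NODE_IP=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
else
    HEAD_NODE_IP=$(hostname -I 2>/dev/null | awk '{print $1}')
fi
HEAD_NODE_IP=${HEAD_NODE_IP:-127.0.0.1}   # last-resort fallback for the demo
echo "HEAD_NODE_IP=$HEAD_NODE_IP"
```

Exporting the derived HEAD_NODE_IP before building CMD keeps the casing consistent and guarantees a non-empty rendezvous address.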
🧰 Tools
🪛 Shellcheck (0.11.0)
- SC1083 (warning, lines 9-10 and 14-17): This { / } is literal. Check expression (missing ;/\n?) or quote it.
- SC2154 (warning, line 32): head_node_ip is referenced but not assigned.
Codecov Report

✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   73.11%   73.11%   -0.01%
==========================================
  Files         205      205
  Lines       22281    22294      +13
==========================================
+ Hits        16291    16300       +9
- Misses       5990     5994       +4
```

☔ View full report in Codecov by Sentry.
What does this PR do?

Type of change: New example

Overview:
- Modify the launch_train.sh script to enable multi-node training.
- Provide a Slurm template script.

Usage

Add the required fields in slurm.sh, then use the command below to submit a multi-node job:

Testing

Before your PR is "Ready for review"

Additional Information
Improvements