
Enable multinode training for HF speculative decoding#922

Open
yeyu-nvidia wants to merge 7 commits into main from yeyu/multi_node

Conversation

@yeyu-nvidia yeyu-nvidia commented Feb 23, 2026

What does this PR do?

Type of change:
New example

Overview:
Modify the launch_train.sh script to enable multi-node training.
Provide a Slurm template script.

Usage

Fill in the required fields in slurm.sh, then submit the multi-node job with:

bash slurm.sh
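As a rough sketch of what a filled-in template might look like (every path and the container image below are assumed placeholders, not values from the PR, and the actual srun invocation is shown only as a comment since it needs a Slurm cluster):

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --time=04:00:00

# Assumed example values; replace with your own image and paths.
CONTAINER_IMAGE=/images/modelopt.sqsh
WORK_DIR=$HOME/modelopt
OUTPUT_DIR=$HOME/results/eagle3

CONTAINER_MOUNT="${WORK_DIR}:/modelopt"

# SLURM_NNODES is set by Slurm at run time; default to 1 when testing locally.
NUM_NODES=${SLURM_NNODES:-1}

CMD="./launch_train.sh --output_dir $OUTPUT_DIR --num_nodes $NUM_NODES"
echo "$CMD"
# In the real template, $CMD is executed on every node via something like:
#   srun --container-image "$CONTAINER_IMAGE" --container-mounts "$CONTAINER_MOUNT" bash -c "$CMD"
```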

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

  • New Features

    • Added multi-node training support with new configuration options for distributed setups.
    • Introduced Slurm batch script for streamlined job submission to cluster environments.
  • Improvements

    • Enhanced GPU resource management for multi-GPU and distributed training configurations.
    • Updated speculative decoding model support and validation handling.

Signed-off-by: Ye Yu <[email protected]>
@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner February 23, 2026 23:09

coderabbitai bot commented Feb 23, 2026

📝 Walkthrough

The pull request removes Medusa-related CLI options from the training launch script, introduces multi-node distributed training support through new configuration options (NUM_NODES, HEAD_NODE_IP), and adds a new Slurm batch submission script. The launch script now exclusively supports Eagle3 mode and computes total GPU allocation across nodes for resource allocation calculations.

Changes

  • Training Launch Script (examples/speculative_decoding/launch_train.sh): Removed Medusa mode support and CLI options; added multi-node configuration (NUM_NODES, HEAD_NODE_IP); replaced GPU_COUNT with TOTAL_GPU calculations across nodes; updated mode validation to support only Eagle3; added MULTI_NODE_ARGS for distributed execution via accelerate launch.
  • Slurm Job Submission (examples/speculative_decoding/slurm.sh): New Slurm batch script with SBATCH directives for job configuration (account, node count, GPU allocation, partition, time limit); sets environment variables for containerization and paths; constructs and executes training command via srun with container mounts and MPI configuration.
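To make the resource math concrete, here is a minimal illustration of the TOTAL_GPU / MULTI_NODE_ARGS wiring the walkthrough describes (variable names follow the PR; the node and GPU counts are assumed example values):

```shell
# Assumed example topology: 2 nodes with 8 GPUs each.
NUM_NODES=2
GPU_PER_NODE=8
TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))   # total worker processes across nodes

if [[ "$NUM_NODES" != 1 ]]; then
  # Extra flags handed to `accelerate launch` for the distributed run.
  MULTI_NODE_ARGS="--num_processes $TOTAL_GPU --num_machines $NUM_NODES"
else
  MULTI_NODE_ARGS=""
fi
echo "$MULTI_NODE_ARGS"
```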

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed (check skipped: CodeRabbit’s high-level summary is enabled).
  • Title Check: ✅ Passed. The title clearly and concisely summarizes the main change: enabling multi-node training for HF speculative decoding, which aligns with the primary modifications to launch_train.sh and addition of slurm.sh.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@examples/speculative_decoding/launch_train.sh`:
- Around line 197-203: When NUM_NODES != 1, add explicit validation for required
multi-node environment variables (HEAD_NODE_IP and SLURM_PROCID) before building
MULTI_NODE_ARGS: verify HEAD_NODE_IP is non-empty and SLURM_PROCID is set (and
numeric if desired), and if either is missing print a clear error message
referencing the missing variable(s) and exit non-zero so the script fails fast;
update the block that constructs MULTI_NODE_ARGS to run only after these checks
succeed (use the existing NUM_NODES, TOTAL_GPU, MULTI_NODE_ARGS names to locate
where to insert the checks).
- Around line 123-129: The script currently risks divide-by-zero and producing
DEFAULT_SAVE_STEPS=0 when nvidia-smi returns 0 or when TOTAL_GPU is large;
update the logic around NUM_NODES, GPU_PER_NODE, TOTAL_GPU and
DEFAULT_SAVE_STEPS so GPU_PER_NODE falls back to 1 if nvidia-smi fails or
reports 0, ensure TOTAL_GPU is computed as max(1, NUM_NODES * GPU_PER_NODE), and
then compute DEFAULT_SAVE_STEPS using that safe TOTAL_GPU and finally clamp
DEFAULT_SAVE_STEPS to a minimum of 1 so the trainer never receives 0 (refer to
the variables NUM_NODES, GPU_PER_NODE, TOTAL_GPU, DEFAULT_SAVE_STEPS).
- Around line 147-149: Validate CP_SIZE after it's set: ensure it's a positive
integer, <= TOTAL_GPU, and that TOTAL_GPU is divisible by CP_SIZE (so
DP_SHARD_SIZE isn't zero/truncated). Replace the current DP_SHARD_SIZE
assignment with either a strict check that exits with a clear error when
TOTAL_GPU % CP_SIZE != 0, or compute DP_SHARD_SIZE using integer ceiling (e.g.,
DP_SHARD_SIZE=$(((TOTAL_GPU + CP_SIZE - 1) / CP_SIZE))) and ensure DP_SHARD_SIZE
>= 1; check variables CP_SIZE, DP_SHARD_SIZE, and TOTAL_GPU in the same block
and produce a clear error message and exit on invalid values.

In `@examples/speculative_decoding/slurm.sh`:
- Around line 9-32: The CMD uses an undefined lowercased variable head_node_ip;
define a HEAD_NODE_IP variable (or derive it from SLURM, e.g., primary node IP)
before constructing CMD and then use that exact variable name in CMD (replace
--head_node_ip $head_node_ip with --head_node_ip $HEAD_NODE_IP) to ensure
consistent casing and a non-empty value during multi-node rendezvous; update any
other references to use HEAD_NODE_IP consistently (symbols: CMD, head_node_ip,
HEAD_NODE_IP, SLURM_NNODES).
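Taken together, the four findings amount to a fail-fast preamble for the scripts. A sketch of what that could look like in one place (illustrative only; it reuses the variable names above with assumed defaults, not the PR's exact code):

```shell
NUM_NODES=${NUM_NODES:-1}
GPU_PER_NODE=${GPU_PER_NODE:-1}                 # fall back to 1 if GPU detection fails
TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
(( TOTAL_GPU >= 1 )) || { echo "No GPUs detected" >&2; exit 1; }

# Clamp save_steps so the trainer never receives 0.
DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
(( DEFAULT_SAVE_STEPS >= 1 )) || DEFAULT_SAVE_STEPS=1

# CP_SIZE must evenly divide TOTAL_GPU so DP_SHARD_SIZE is never 0 or truncated.
CP_SIZE=${CP_SIZE:-1}
if (( TOTAL_GPU % CP_SIZE != 0 )); then
  echo "CP_SIZE ($CP_SIZE) must evenly divide TOTAL_GPU ($TOTAL_GPU)" >&2
  exit 1
fi
DP_SHARD_SIZE=$((TOTAL_GPU / CP_SIZE))

# Multi-node runs need a rendezvous address and a machine rank.
if (( NUM_NODES > 1 )); then
  : "${HEAD_NODE_IP:?HEAD_NODE_IP is required when NUM_NODES > 1}"
  : "${SLURM_PROCID:?SLURM_PROCID is required when NUM_NODES > 1}"
fi
```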

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 52e662d and e36d0df.

📒 Files selected for processing (2)
  • examples/speculative_decoding/launch_train.sh
  • examples/speculative_decoding/slurm.sh

Comment on lines +123 to 129

```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
-DEFAULT_SAVE_STEPS=$((8192 / GPU_COUNT))
+DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
```

⚠️ Potential issue | 🟠 Major

Guard against zero GPUs and zero save_steps.

If nvidia-smi returns 0 GPUs (or fails), TOTAL_GPU becomes 0 and the script will divide by zero. Also, large TOTAL_GPU values can drive DEFAULT_SAVE_STEPS to 0, which is invalid for trainers.

🛡️ Suggested fix

```diff
 NUM_NODES=${NUM_NODES:-1}
 GPU_PER_NODE=${GPU_PER_NODE:-$(nvidia-smi --query-gpu=name --format=csv,noheader | wc -l)}
 TOTAL_GPU=$((NUM_NODES * GPU_PER_NODE))
+if (( TOTAL_GPU <= 0 )); then
+  echo "No GPUs detected. Set GPU_PER_NODE/NUM_NODES explicitly."
+  exit 1
+fi
 echo "Total GPUs: $TOTAL_GPU (NUM_NODES: $NUM_NODES, GPU_PER_NODE: $GPU_PER_NODE)"
 # Calculate save_steps
 DEFAULT_SAVE_STEPS=$((8192 / TOTAL_GPU))
+if (( DEFAULT_SAVE_STEPS < 1 )); then
+  DEFAULT_SAVE_STEPS=1
+fi
```

Comment on lines 147 to 149

```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((GPU_COUNT/CP_SIZE))}
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
 LOG_STEPS=${LOG_STEPS:-100}
```

⚠️ Potential issue | 🟠 Major

Validate CP_SIZE so DP_SHARD_SIZE is never 0 or truncated.

DP_SHARD_SIZE is computed via integer division. If CP_SIZE doesn’t evenly divide TOTAL_GPU (or exceeds it), DP_SHARD_SIZE becomes incorrect or 0, which will break training.

✅ Suggested fix

```diff
 CP_SIZE=${CP_SIZE:-1}
-DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( TOTAL_GPU % CP_SIZE != 0 )); then
+  echo "CP_SIZE ($CP_SIZE) must evenly divide TOTAL_GPU ($TOTAL_GPU)."
+  exit 1
+fi
+DP_SHARD_SIZE=${DP_SHARD_SIZE:-$((TOTAL_GPU/CP_SIZE))}
+if (( DP_SHARD_SIZE < 1 )); then
+  echo "DP_SHARD_SIZE must be >= 1."
+  exit 1
+fi
```

Comment on lines +197 to +203

```shell
if [[ "$NUM_NODES" != 1 ]]; then
  MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
                   --num_machines $NUM_NODES \
                   --machine_rank $SLURM_PROCID \
                   --rdzv_backend c10d \
                   --main_process_ip $HEAD_NODE_IP \
                   --main_process_port 29500"
```

⚠️ Potential issue | 🟠 Major

Fail fast when required multi-node vars are missing.

When NUM_NODES != 1, the script relies on HEAD_NODE_IP and SLURM_PROCID. If either is empty, the accelerate rendezvous will fail with confusing errors.

🚦 Suggested fix

```diff
 if [[ "$NUM_NODES" != 1 ]]; then
+  if [[ -z "$HEAD_NODE_IP" ]]; then
+    echo "HEAD_NODE_IP is required when NUM_NODES > 1."
+    exit 1
+  fi
+  if [[ -z "$SLURM_PROCID" ]]; then
+    echo "SLURM_PROCID is required when NUM_NODES > 1."
+    exit 1
+  fi
   MULTI_NODE_ARGS="--num_processes $TOTAL_GPU \
                    --num_machines $NUM_NODES \
                    --machine_rank $SLURM_PROCID \
                    --rdzv_backend c10d \
                    --main_process_ip $HEAD_NODE_IP \
                    --main_process_port 29500"
 else
   MULTI_NODE_ARGS=""
 fi
```

Comment on lines +9 to +32
CONTAINER_IMAGE={container_image}
WORK_DIR={path_to_modelopt}

CONTAINER_MOUNT="${WORK_DIR}:/modelopt"

OUTPUT_DIR={path_to_output_dir}
MODEL={path_to_model_dir}
DATA={path_to_data_dir}
OFFLINE_DATA={path_to_offline_data_dir}

CMD="./launch_train.sh --model $MODEL \
--output_dir $OUTPUT_DIR \
--data $DATA \
--num_epochs 1 \
--train_bs 1 \
--lr 1e-4 \
--eagle_config eagle_config.json \
--training_seq_len 4096 \
--save_steps 1000 \
--estimate_ar True \
--disable_tqdm True \
--offline-data $OFFLINE_DATA \
--num_nodes $SLURM_NNODES \
--head_node_ip $head_node_ip \

⚠️ Potential issue | 🟠 Major

Define HEAD_NODE_IP (and use consistent casing) before passing it.

--head_node_ip $head_node_ip references an undefined variable, so multi-node runs will pass an empty IP and fail to rendezvous. Please define it (placeholder or derive from SLURM) and use a consistent variable name.

🔧 Suggested fix

```diff
 CONTAINER_IMAGE={container_image}
 WORK_DIR={path_to_modelopt}
+HEAD_NODE_IP={head_node_ip}

 CONTAINER_MOUNT="${WORK_DIR}:/modelopt"

 OUTPUT_DIR={path_to_output_dir}
 MODEL={path_to_model_dir}
 DATA={path_to_data_dir}
 OFFLINE_DATA={path_to_offline_data_dir}
+
+if [[ "${SLURM_NNODES:-1}" -gt 1 && -z "$HEAD_NODE_IP" ]]; then
+  echo "HEAD_NODE_IP is required for multi-node runs."
+  exit 1
+fi

 CMD="./launch_train.sh --model $MODEL \
             --output_dir $OUTPUT_DIR \
             --data $DATA \
             --num_epochs 1 \
             --train_bs 1 \
             --lr 1e-4 \
             --eagle_config eagle_config.json \
             --training_seq_len 4096 \
             --save_steps 1000 \
             --estimate_ar True \
             --disable_tqdm True \
             --offline-data $OFFLINE_DATA \
             --num_nodes $SLURM_NNODES \
-            --head_node_ip $head_node_ip \
+            --head_node_ip $HEAD_NODE_IP \
 "
```
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] SC1083 (lines 9, 10, 14, 15, 16, 17): This { / } is literal. Check expression (missing ;/\n?) or quote it.

[warning] SC2154 (line 32): head_node_ip is referenced but not assigned.


codecov bot commented Feb 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.11%. Comparing base (02fa362) to head (e36d0df).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #922      +/-   ##
==========================================
- Coverage   73.11%   73.11%   -0.01%     
==========================================
  Files         205      205              
  Lines       22281    22294      +13     
==========================================
+ Hits        16291    16300       +9     
- Misses       5990     5994       +4     

☔ View full report in Codecov by Sentry.