Description
System Info
Environment
- verl version: 0.5.0.dev
- training script: GRPO multi-GPU training with vLLM rollout
- hardware: 8× 5090 GPUs (each 32GB), 64 CPU cores
- cluster scheduler: SLURM
- OS: Ubuntu 22.04.5 LTS
- Python: 3.10
- CUDA version: 12.8
- ray==2.52.0
- grpcio==1.76.0 (I also tested 1.51.1)
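Since the raylet error further below points at a possible grpcio mismatch, here is a minimal way to confirm which grpcio and ray versions the verl environment actually loads (nothing verl-specific, just the standard version attributes):

# print the pinned versions from pip and the versions importable at runtime
pip freeze | grep -E '^(grpcio|ray)=='
python -c "import grpc, ray; print('grpcio', grpc.__version__, 'ray', ray.__version__)"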
What happened
When launching multi-GPU training with verl (single node, 8 GPUs), Ray repeatedly crashes before the first iteration. The key error:
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
2025-11-23 10:34:55,610 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
warnings.warn(
ray init kwargs: {'num_cpus': 32, 'runtime_env': {'env_vars': {'TOKENIZERS_PARALLELISM': 'true', 'NCCL_DEBUG': 'WARN', 'VLLM_LOGGING_LEVEL': 'WARN', 'VLLM_ALLOW_RUNTIME_LORA_UPDATING': 'true', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_CUMEM_ENABLE': '0'}, 'working_dir': None}}
(raylet) The node with node id: 7b510278eb9733fa1e95c9d513a9ce0ffd2a022568d89169efb76b93 and address: 10.254.32.19 and node name: 10.254.32.19 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a (1) raylet crashes unexpectedly (OOM, etc.)
(2) raylet has lagging heartbeats due to slow network or busy workload.
...
Traceback (most recent call last):
File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 42, in main
run_ppo(config)
File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 85, in run_ppo
ray.get(runner.run.remote(config))
File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2972, in get
values, debugger_breakpoint = worker.get_objects(
File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 1033, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TaskRunner
actor_id: 36b5d3c0c2a6f745e1d45cee01000000
namespace: f88edbf6-78c5-4c3d-8a57-b4ec71aa6629
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(raylet) [2025-11-23 10:34:57,911 E 3960861 3960930] (raylet) agent_manager.cc:87: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
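The raylet message above lists three possible causes. A minimal sketch of how each can be checked on the compute node, assuming Ray's default /tmp/ray temp directory (adjust if --temp-dir is set):

# 1) grpcio version actually installed in the environment
pip freeze | grep grpcio
# 2) agent logs from the last Ray session (dashboard_agent / runtime_env_agent)
cat /tmp/ray/session_latest/logs/dashboard_agent.log
cat /tmp/ray/session_latest/logs/runtime_env_agent.log
# 3) whether the OS OOM killer terminated the agent or raylet (may need root)
dmesg -T | grep -iE 'out of memory|killed process' | tail -n 20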
/var/spool/slurmd/job62881/slurm_script: line 62: 3962932 Segmentation fault (core dumped) watch -n 0.5 nvidia-smi

This is my launch script:
#!/bin/bash
#SBATCH --gpus=8
ulimit -n 65535
module load miniforge/25.3.0-3
source activate verl
cd /data/home/scvi921/run/verl
python3 -m verl.trainer.main_ppo \
algorithm.adv_estimator=grpo \
data.train_files=/data/home/scvi921/run/verl/data/finetune/connect_train.parquet \
data.val_files=/data/home/scvi921/run/verl/data/finetune/connect_test.parquet \
data.train_batch_size=320 \
data.max_prompt_length=512 \
data.max_response_length=4096 \
data.filter_overlong_prompts=True \
data.truncation='error' \
actor_rollout_ref.model.path=/data/home/scvi921/run/verl/merge/global_step_50 \
actor_rollout_ref.model.lora_rank=16 \
actor_rollout_ref.model.lora_alpha=32 \
actor_rollout_ref.model.target_modules=all-linear \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.model.use_shm=False \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.model.use_remove_padding=True \
actor_rollout_ref.model.enable_gradient_checkpointing=True \
actor_rollout_ref.actor.ppo_mini_batch_size=160 \
actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
actor_rollout_ref.actor.use_kl_loss=True \
actor_rollout_ref.actor.kl_loss_coef=0.001 \
actor_rollout_ref.actor.kl_loss_type=low_var_kl \
actor_rollout_ref.actor.entropy_coeff=0 \
actor_rollout_ref.actor.strategy=fsdp2 \
actor_rollout_ref.actor.fsdp_config.fsdp_size=2 \
actor_rollout_ref.actor.fsdp_config.param_offload=True \
actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
actor_rollout_ref.rollout.name=vllm \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
actor_rollout_ref.rollout.max_num_seqs=256 \
actor_rollout_ref.rollout.dtype=float16 \
actor_rollout_ref.rollout.n=4 \
actor_rollout_ref.rollout.max_model_len=4096 \
actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
actor_rollout_ref.rollout.load_format=safetensors \
actor_rollout_ref.rollout.layered_summon=True \
actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
actor_rollout_ref.ref.fsdp_config.param_offload=True \
actor_rollout_ref.ref.strategy=fsdp2 \
algorithm.use_kl_in_reward=False \
trainer.critic_warmup=0 \
trainer.logger=['console','wandb'] \
trainer.project_name='connect-sft1-glm4' \
trainer.experiment_name='connect-sft1-GLM-4-9B-0414-8gpu' \
trainer.n_gpus_per_node=2 \
trainer.nnodes=1 \
trainer.save_freq=100 \
trainer.test_freq=3 \
trainer.log_val_generations=5 \
trainer.total_epochs=200

Expected behavior
I expect the multi-GPU training to start normally and run without crashing.
Ray actors should initialize correctly, and the training should proceed through the first iteration.
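For comparison, a bare Ray actor on the same node (outside verl) would be a useful smoke test to isolate whether the Ray runtime itself stays alive; this is only a minimal sketch, not part of the run above:

python - <<'EOF'
# minimal Ray smoke test: local instance, one trivial actor, no GPUs
import ray
ray.init(num_cpus=2, num_gpus=0)

@ray.remote
class Ping:
    def ping(self):
        return "ok"

actor = Ping.remote()
print(ray.get(actor.ping.remote()))  # expected: ok
ray.shutdown()
EOF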