
Multi-GPU training crashes with ActorDiedError because Ray dashboard agent fails #4242

Description

System Info

Environment

  • verl version: 0.5.0.dev
  • training script: GRPO multi-GPU training with vLLM rollout
  • hardware: 8× 5090 GPUs (32 GB each), 64 CPU cores
  • cluster scheduler: SLURM
  • OS: Ubuntu 22.04.5 LTS
  • Python: 3.10
  • CUDA version: 12.8
  • ray==2.52.0
  • grpcio==1.76.0 (I also tested 1.51.1)

What happened

When launching multi-GPU training with verl (single node, 8 GPUs), Ray repeatedly crashes before the first training iteration: the raylet exits because its dashboard agent fails (see the log under Reproduction), which marks the driver's node dead and kills the TaskRunner actor. The key error is:

The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

2025-11-23 10:34:55,610 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
ray init kwargs: {'num_cpus': 32, 'runtime_env': {'env_vars': {'TOKENIZERS_PARALLELISM': 'true', 'NCCL_DEBUG': 'WARN', 'VLLM_LOGGING_LEVEL': 'WARN', 'VLLM_ALLOW_RUNTIME_LORA_UPDATING': 'true', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_CUMEM_ENABLE': '0'}, 'working_dir': None}}
(raylet) The node with node id: 7b510278eb9733fa1e95c9d513a9ce0ffd2a022568d89169efb76b93 and address: 10.254.32.19 and node name: 10.254.32.19 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
	(1) raylet crashes unexpectedly (OOM, etc.)
	(2) raylet has lagging heartbeats due to slow network or busy workload.
...
Traceback (most recent call last):
  File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 85, in run_ppo
    ray.get(runner.run.remote(config))
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2972, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 1033, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: TaskRunner
	actor_id: 36b5d3c0c2a6f745e1d45cee01000000
	namespace: f88edbf6-78c5-4c3d-8a57-b4ec71aa6629
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(raylet) [2025-11-23 10:34:57,911 E 3960861 3960930] (raylet) agent_manager.cc:87: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
/var/spool/slurmd/job62881/slurm_script: line 62: 3962932 Segmentation fault      (core dumped) watch -n 0.5 nvidia-smi
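
Following the raylet's own hints above (wrong grpcio, agent startup failure or port conflict, OOM kill), the checks below are what seem to apply; this is a minimal diagnostic sketch, not yet run on the failing node, and it assumes Ray's default session directory under /tmp/ray on the node that ran the job:

# 1. Which grpcio is actually installed in the environment Ray runs in?
pip freeze | grep grpcio

# 2. Why did the agents that the raylet fate-shares with die?
cat /tmp/ray/session_latest/logs/dashboard_agent.log
cat /tmp/ray/session_latest/logs/runtime_env_agent.log

# 3. Did the OS kill the agent or raylet (OOM)? May need privileges on shared SLURM nodes.
dmesg -T | grep -iE "killed process|out of memory" | tail -n 20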

This is my SLURM batch script:

#!/bin/bash
#SBATCH --gpus=8
ulimit -n 65535
module load miniforge/25.3.0-3
source activate verl

cd /data/home/scvi921/run/verl
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/data/home/scvi921/run/verl/data/finetune/connect_train.parquet \
    data.val_files=/data/home/scvi921/run/verl/data/finetune/connect_test.parquet \
    data.train_batch_size=320 \
    data.max_prompt_length=512 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=/data/home/scvi921/run/verl/merge/global_step_50 \
    actor_rollout_ref.model.lora_rank=16 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.target_modules=all-linear \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.model.use_shm=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=160 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=2 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.max_num_seqs=256 \
    actor_rollout_ref.rollout.dtype=float16 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.max_model_len=4096 \
    actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.strategy=fsdp2 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='connect-sft1-glm4' \
    trainer.experiment_name='connect-sft1-GLM-4-9B-0414-8gpu' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=3 \
    trainer.log_val_generations=5 \
    trainer.total_epochs=200 
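
As an isolation step I have not tried yet: start a bare Ray head on the allocated node, without verl, and see whether the dashboard agent already dies on its own. The --num-cpus=32 below just mirrors the num_cpus that verl passes to ray.init in the log above; the log path again assumes the default session directory.

# Untested isolation sketch: bring up Ray alone and watch the dashboard agent.
ray stop
ray start --head --num-cpus=32
ray status                                              # node should stay listed if the raylet survived
tail -n 100 /tmp/ray/session_latest/logs/dashboard_agent.log
ray stop

If the agent dies even in this bare setup, that would point at the environment (grpcio build, ports, memory) rather than the verl launch options above.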

Expected behavior

I expect the multi-GPU training to start normally and run without crashing.
Ray actors should initialize correctly, and the training should proceed through the first iteration.
