
Multi-GPU training crashes with ActorDiedError because Ray dashboard agent fails #4242

Description

System Info

Environment

  • verl version: 0.5.0.dev
  • training script: GRPO multi-GPU training with vLLM rollout
  • hardware: 8× 5090 GPUs (32 GB each), 64 CPU cores
  • cluster scheduler: SLURM
  • OS: Ubuntu 22.04.5 LTS
  • Python: 3.10
  • CUDA version: 12.8
  • ray==2.52.0
  • grpcio==1.76.0 (I also tested 1.51.1)

What happened

When launching multi-GPU training with verl (single node, 8 GPUs), Ray repeatedly crashes before the first training iteration: the raylet exits because its dashboard agent fails (see the log under Reproduction), which marks the driver's node dead and kills the TaskRunner actor. The key error is:

The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

2025-11-23 10:34:55,610 INFO worker.py:2014 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py:2062: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
  warnings.warn(
ray init kwargs: {'num_cpus': 32, 'runtime_env': {'env_vars': {'TOKENIZERS_PARALLELISM': 'true', 'NCCL_DEBUG': 'WARN', 'VLLM_LOGGING_LEVEL': 'WARN', 'VLLM_ALLOW_RUNTIME_LORA_UPDATING': 'true', 'CUDA_DEVICE_MAX_CONNECTIONS': '1', 'NCCL_CUMEM_ENABLE': '0'}, 'working_dir': None}}
(raylet) The node with node id: 7b510278eb9733fa1e95c9d513a9ce0ffd2a022568d89169efb76b93 and address: 10.254.32.19 and node name: 10.254.32.19 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a
	(1) raylet crashes unexpectedly (OOM, etc.)
	(2) raylet has lagging heartbeats due to slow network or busy workload.
...
Traceback (most recent call last):
  File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 42, in main
    run_ppo(config)
  File "/data/run01/scvi921/verl/verl/trainer/main_ppo.py", line 85, in run_ppo
    ray.get(runner.run.remote(config))
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 2972, in get
    values, debugger_breakpoint = worker.get_objects(
  File "/data/home/scvi921/.conda/envs/verl/lib/python3.10/site-packages/ray/_private/worker.py", line 1033, in get_objects
    raise value
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
	class_name: TaskRunner
	actor_id: 36b5d3c0c2a6f745e1d45cee01000000
	namespace: f88edbf6-78c5-4c3d-8a57-b4ec71aa6629
The actor is dead because its owner has died. Owner Id: 01000000ffffffffffffffffffffffffffffffffffffffffffffffff Owner Ip address: 10.254.32.19 Owner worker exit type: SYSTEM_ERROR Worker exit detail: Owner's node has crashed.
The actor never ran - it was cancelled before it started running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
(raylet) [2025-11-23 10:34:57,911 E 3960861 3960930] (raylet) agent_manager.cc:87: The raylet exited immediately because one Ray agent failed, agent_name = dashboard_agent.
(raylet) The raylet fate shares with the agent. This can happen because
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/logs/{dashboard_agent|runtime_env_agent}.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
/var/spool/slurmd/job62881/slurm_script: line 62: 3962932 Segmentation fault      (core dumped) watch -n 0.5 nvidia-smi
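
Following the raylet's own hints above (wrong grpcio, agent startup failure or port conflict, OOM kill), the checks below are what seem to apply; this is a minimal diagnostic sketch, not yet run on the failing node, and it assumes Ray's default session directory under /tmp/ray on the node that ran the job:

# 1. Which grpcio is actually installed in the environment Ray runs in?
pip freeze | grep grpcio

# 2. Why did the agents that the raylet fate-shares with die?
cat /tmp/ray/session_latest/logs/dashboard_agent.log
cat /tmp/ray/session_latest/logs/runtime_env_agent.log

# 3. Did the OS kill the agent or raylet (OOM)? May need privileges on shared SLURM nodes.
dmesg -T | grep -iE "killed process|out of memory" | tail -n 20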

This is my SLURM batch script:

#!/bin/bash
#SBATCH --gpus=8
ulimit -n 65535
module load miniforge/25.3.0-3
source activate verl

cd /data/home/scvi921/run/verl
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=/data/home/scvi921/run/verl/data/finetune/connect_train.parquet \
    data.val_files=/data/home/scvi921/run/verl/data/finetune/connect_test.parquet \
    data.train_batch_size=320 \
    data.max_prompt_length=512 \
    data.max_response_length=4096 \
    data.filter_overlong_prompts=True \
    data.truncation='error' \
    actor_rollout_ref.model.path=/data/home/scvi921/run/verl/merge/global_step_50 \
    actor_rollout_ref.model.lora_rank=16 \
    actor_rollout_ref.model.lora_alpha=32 \
    actor_rollout_ref.model.target_modules=all-linear \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.model.use_shm=False \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=160 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=16 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.actor.entropy_coeff=0 \
    actor_rollout_ref.actor.strategy=fsdp2 \
    actor_rollout_ref.actor.fsdp_config.fsdp_size=2 \
    actor_rollout_ref.actor.fsdp_config.param_offload=True \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=True \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.max_num_seqs=256 \
    actor_rollout_ref.rollout.dtype=float16 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.rollout.max_model_len=4096 \
    actor_rollout_ref.rollout.max_num_batched_tokens=4096 \
    actor_rollout_ref.rollout.load_format=safetensors \
    actor_rollout_ref.rollout.layered_summon=True \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=8 \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    actor_rollout_ref.ref.strategy=fsdp2 \
    algorithm.use_kl_in_reward=False \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='connect-sft1-glm4' \
    trainer.experiment_name='connect-sft1-GLM-4-9B-0414-8gpu' \
    trainer.n_gpus_per_node=2 \
    trainer.nnodes=1 \
    trainer.save_freq=100 \
    trainer.test_freq=3 \
    trainer.log_val_generations=5 \
    trainer.total_epochs=200 
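
As an isolation step I have not tried yet: start a bare Ray head on the allocated node, without verl, and see whether the dashboard agent already dies on its own. The --num-cpus=32 below just mirrors the num_cpus that verl passes to ray.init in the log above; the log path again assumes the default session directory.

# Untested isolation sketch: bring up Ray alone and watch the dashboard agent.
ray stop
ray start --head --num-cpus=32
ray status                                              # node should stay listed if the raylet survived
tail -n 100 /tmp/ray/session_latest/logs/dashboard_agent.log
ray stop

If the agent dies even in this bare setup, that would point at the environment (grpcio build, ports, memory) rather than the verl launch options above.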

Expected behavior

I expect the multi-GPU training to start normally and run without crashing.
Ray actors should initialize correctly, and the training should proceed through the first iteration.
