online_dpo: RuntimeError: no running event loop #4220

@direction-yxf

Description

System Info

bash recipe/spin/run_spin.sh

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

CUDA_VISIBLE_DEVICES=${VISIBLE_DEVICES} python3 -m recipe.spin.main_spin \
  data.train_files=$HOME/data/gsm8k_pre/train.parquet \
  data.val_files=$HOME/data/gsm8k_pre/test.parquet \
  data.train_batch_size=1024 \
  data.max_prompt_length=1024 \
  data.max_response_length=1024 \
  actor_rollout_ref.model.path=$HOME/model/Qwen3-0.6B \
  actor_rollout_ref.actor.optim.lr=1e-6 \
  actor_rollout_ref.actor.ppo_mini_batch_size=64 \
  actor_rollout_ref.actor.ppo_micro_batch_size=8 \
  actor_rollout_ref.rollout.log_prob_micro_batch_size=64 \
  actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
  actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
  actor_rollout_ref.ref.log_prob_micro_batch_size=64 \
  algorithm.kl_ctrl.kl_coef=0.001 \
  trainer.logger=console \
  trainer.val_before_train=True \
  trainer.n_gpus_per_node=1 \
  trainer.nnodes=1 \
  trainer.save_freq=-1 \
  trainer.test_freq=1 \
  +trainer.log_freq=1 \
  trainer.ref_update_freq=1 \
  trainer.total_epochs=1000 2>&1 | tee verl_demo.log

Expected behavior

`ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_init_model() (pid=136203, ip=10.237.176.198, actor_id=c06a5e7d66d8be504804547701000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f33ab701930>)
  File "/workspace/_model/code/verl/verl/single_controller/ray/base.py", line 700, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/_model/code/verl/verl/single_controller/base/decorator.py", line 442, in inner
    return func(*args, **kwargs)
  File "/workspace/_model/code/verl/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
    return func(*args, **kwargs)
  File "/workspace/_model/code/verl/recipe/spin/fsdp_workers.py", line 131, in init_model
    self._build_rollout(trust_remote_code=self.config.model.get("trust_remote_code", False))
  File "/workspace/_model/code/verl/verl/workers/fsdp_workers.py", line 605, in _build_rollout
    self.rollout = get_rollout_class(rollout_config.name, rollout_config.mode)(
  File "/workspace/_model/code/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 537, in __init__
    self.address = self._init_zeromq()
  File "/workspace/_model/code/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 575, in _init_zeromq
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop`
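
For context on the last frame: `asyncio.get_running_loop()` raises `RuntimeError: no running event loop` whenever it is called from code that is not executing inside a coroutine or callback of a running loop. The traceback suggests `_init_zeromq` is reached through a synchronous path (`init_model` -> `_build_rollout` -> `__init__`), which would explain the error. The sketch below only reproduces that behavior in isolation; `has_running_loop` and `check_inside_loop` are hypothetical helper names, not part of verl, and this is not a proposed fix.

```python
import asyncio


def has_running_loop() -> bool:
    """Report whether the caller is executing inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # Raised when no event loop is running in the current thread,
        # which is the same error shown in the traceback above.
        return False
    return True


async def check_inside_loop() -> bool:
    # Inside a coroutine driven by asyncio.run(), a loop is running.
    return has_running_loop()


if __name__ == "__main__":
    print("sync call:", has_running_loop())
    print("async call:", asyncio.run(check_inside_loop()))
```

If this matches what happens in the worker, the synchronous `init_model` path would need either to run inside an event loop or to obtain a loop some other way; which of these the project intends is not stated in this report.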

Labels: bug
