Open
Labels
bug: Something isn't working
Description
System Info
bash recipe/spin/run_spin.sh
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
CUDA_VISIBLE_DEVICES=${VISIBLE_DEVICES} python3 -m recipe.spin.main_spin \
data.train_files=$HOME/data/gsm8k_pre/train.parquet \
data.val_files=$HOME/data/gsm8k_pre/test.parquet \
data.train_batch_size=1024 \
data.max_prompt_length=1024 \
data.max_response_length=1024 \
actor_rollout_ref.model.path=$HOME/model/Qwen3-0.6B \
actor_rollout_ref.actor.optim.lr=1e-6 \
actor_rollout_ref.actor.ppo_mini_batch_size=64 \
actor_rollout_ref.actor.ppo_micro_batch_size=8 \
actor_rollout_ref.rollout.log_prob_micro_batch_size=64 \
actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
actor_rollout_ref.ref.log_prob_micro_batch_size=64 \
algorithm.kl_ctrl.kl_coef=0.001 \
trainer.logger=console \
trainer.val_before_train=True \
trainer.n_gpus_per_node=1 \
trainer.nnodes=1 \
trainer.save_freq=-1 \
trainer.test_freq=1 \
+trainer.log_freq=1 \
trainer.ref_update_freq=1 \
trainer.total_epochs=1000 2>&1 | tee verl_demo.log
Expected behavior
`ray.exceptions.RayTaskError(RuntimeError): ray::WorkerDict.actor_rollout_init_model() (pid=136203, ip=10.237.176.198, actor_id=c06a5e7d66d8be504804547701000000, repr=<verl.single_controller.ray.base.WorkerDict object at 0x7f33ab701930>)
  File "/workspace/_model/code/verl/verl/single_controller/ray/base.py", line 700, in func
    return getattr(self.worker_dict[key], name)(*args, **kwargs)
  File "/workspace/_model/code/verl/verl/single_controller/base/decorator.py", line 442, in inner
    return func(*args, **kwargs)
  File "/workspace/_model/code/verl/verl/utils/transferqueue_utils.py", line 199, in dummy_inner
    return func(*args, **kwargs)
  File "/workspace/_model/code/verl/recipe/spin/fsdp_workers.py", line 131, in init_model
    self._build_rollout(trust_remote_code=self.config.model.get("trust_remote_code", False))
  File "/workspace/_model/code/verl/verl/workers/fsdp_workers.py", line 605, in _build_rollout
    self.rollout = get_rollout_class(rollout_config.name, rollout_config.mode)(
  File "/workspace/_model/code/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 537, in __init__
    self.address = self._init_zeromq()
  File "/workspace/_model/code/verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py", line 575, in _init_zeromq
    loop = asyncio.get_running_loop()
RuntimeError: no running event loop`
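For context, the last frame fails because `asyncio.get_running_loop()` raises `RuntimeError` whenever it is called from code that is not executing inside a running event loop, which appears to be the case when the SPIN recipe's `init_model` builds the rollout synchronously. The standalone sketch below reproduces that behavior outside verl and shows one common fallback pattern. The helper names `get_or_create_loop`, `reproduce`, and `inside_coroutine` are hypothetical and not part of verl; whether `_init_zeromq` can safely use a freshly created loop is an assumption I have not verified.

```python
import asyncio


def get_or_create_loop() -> asyncio.AbstractEventLoop:
    """Return the running loop if there is one, else create and register a new one.

    Hypothetical helper for illustration only; not verl code.
    """
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread, which is the situation
        # shown in the traceback above.
        loop = asyncio.new_event_loop()
        asyncio.set_event_loop(loop)
        return loop


def reproduce() -> str:
    """Call get_running_loop() from plain synchronous code and capture the error."""
    try:
        asyncio.get_running_loop()
    except RuntimeError as exc:
        return str(exc)
    return "a loop was running"


async def inside_coroutine() -> bool:
    # Inside a coroutine the helper returns the loop that is already running.
    return get_or_create_loop() is asyncio.get_running_loop()


if __name__ == "__main__":
    print(reproduce())                      # same message as in the traceback
    loop = get_or_create_loop()             # falls back to a new, idle loop
    print(loop.is_running())
    loop.close()
    print(asyncio.run(inside_coroutine()))
```

If the rollout class is meant to be constructed only from an async worker, the other direction would be to make the SPIN recipe call it from within a coroutine instead of patching the loop lookup. I don't know which of the two the maintainers intend.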