vllm0.8.5.postv1: dual-GPU deployment of a model exported after training and merging fails #8790
Unanswered
xiaoheiyue asked this question in Q&A
INFO 07-30 17:25:44 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_3490520c'), local_subscribe_addr='ipc:///tmp/f6abc4c0-f035-4926-9703-737ca36f0a77', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 07-30 17:25:49 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-30 17:25:49 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 07-30 17:25:49 [__init__.py:239] Automatically detected platform cuda.
INFO 07-30 17:25:49 [__init__.py:239] Automatically detected platform cuda.
WARNING 07-30 17:25:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7407deeac2d0>
WARNING 07-30 17:25:52 [utils.py:2522] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x73c08d186f90>
(VllmWorker rank=0 pid=592277) INFO 07-30 17:25:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_559ccad2'), local_subscribe_addr='ipc:///tmp/92b4d35f-f888-4416-ab69-aa12982a5093', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=592278) INFO 07-30 17:25:52 [shm_broadcast.py:266] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_16036050'), local_subscribe_addr='ipc:///tmp/a82b0cb4-5e4d-4e3f-894c-10e0f9844885', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=0 pid=592277) INFO 07-30 17:25:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=592277) INFO 07-30 17:25:53 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=592278) INFO 07-30 17:25:53 [utils.py:1055] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=592278) INFO 07-30 17:25:53 [pynccl.py:69] vLLM is using nccl==2.21.5
qyht-virtual-machine:592277:592277 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592277:592277 [0] NCCL INFO Bootstrap : Using ens34:192.168.0.115<0>
qyht-virtual-machine:592277:592277 [0] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
qyht-virtual-machine:592277:592277 [0] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
qyht-virtual-machine:592277:592277 [0] NCCL INFO NET/Plugin: Using internal network plugin.
qyht-virtual-machine:592277:592277 [0] NCCL INFO cudaDriverVersion 12060
NCCL version 2.21.5+cuda12.4
qyht-virtual-machine:592278:592278 [1] NCCL INFO cudaDriverVersion 12060
qyht-virtual-machine:592278:592278 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592278:592278 [1] NCCL INFO Bootstrap : Using ens34:192.168.0.115<0>
qyht-virtual-machine:592278:592278 [1] NCCL INFO NET/Plugin: No plugin found (libnccl-net.so)
qyht-virtual-machine:592278:592278 [1] NCCL INFO NET/Plugin: Plugin load returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory : when loading libnccl-net.so
qyht-virtual-machine:592278:592278 [1] NCCL INFO NET/Plugin: Using internal network plugin.
qyht-virtual-machine:592277:592277 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qyht-virtual-machine:592278:592278 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
qyht-virtual-machine:592277:592277 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592278:592278 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592277:592277 [0] NCCL INFO NET/IB : No device found.
qyht-virtual-machine:592277:592277 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592278:592278 [1] NCCL INFO NET/IB : No device found.
qyht-virtual-machine:592278:592278 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to ens34
qyht-virtual-machine:592277:592277 [0] NCCL INFO NET/Socket : Using [0]ens34:192.168.0.115<0>
qyht-virtual-machine:592278:592278 [1] NCCL INFO NET/Socket : Using [0]ens34:192.168.0.115<0>
qyht-virtual-machine:592277:592277 [0] NCCL INFO Using non-device net plugin version 0
qyht-virtual-machine:592277:592277 [0] NCCL INFO Using network Socket
qyht-virtual-machine:592278:592278 [1] NCCL INFO Using non-device net plugin version 0
qyht-virtual-machine:592278:592278 [1] NCCL INFO Using network Socket
qyht-virtual-machine:592278:592278 [1] NCCL INFO ncclCommInitRank comm 0x3b1b9380 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 24000 commId 0x845c6655bcb0a8da - Init START
qyht-virtual-machine:592277:592277 [0] NCCL INFO ncclCommInitRank comm 0x4c118fd0 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 23000 commId 0x845c6655bcb0a8da - Init START
qyht-virtual-machine:592277:592277 [0] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
qyht-virtual-machine:592278:592278 [1] NCCL INFO NCCL_P2P_LEVEL set by environment to LOC
ERROR 07-30 17:25:56 [core.py:396] EngineCore failed to start.
ERROR 07-30 17:25:56 [core.py:396] Traceback (most recent call last):
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
ERROR 07-30 17:25:56 [core.py:396] engine_core = EngineCoreProc(*args, **kwargs)
ERROR 07-30 17:25:56 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
ERROR 07-30 17:25:56 [core.py:396] super().__init__(vllm_config, executor_class, log_stats,
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 64, in __init__
ERROR 07-30 17:25:56 [core.py:396] self.model_executor = executor_class(vllm_config)
ERROR 07-30 17:25:56 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__
ERROR 07-30 17:25:56 [core.py:396] self._init_executor()
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
ERROR 07-30 17:25:56 [core.py:396] self.workers = WorkerProc.wait_for_ready(unready_workers)
ERROR 07-30 17:25:56 [core.py:396] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 07-30 17:25:56 [core.py:396] File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
ERROR 07-30 17:25:56 [core.py:396] raise e from None
ERROR 07-30 17:25:56 [core.py:396] Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Process EngineCore_0:
Traceback (most recent call last):
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 400, in run_engine_core
raise e
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 387, in run_engine_core
engine_core = EngineCoreProc(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 329, in __init__
super().__init__(vllm_config, executor_class, log_stats,
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 64, in __init__
self.model_executor = executor_class(vllm_config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 52, in __init__
self._init_executor()
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 91, in _init_executor
self.workers = WorkerProc.wait_for_ready(unready_workers)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 370, in wait_for_ready
raise e from None
Exception: WorkerProc initialization failed due to an exception in a background process. See stack trace for root cause.
Traceback (most recent call last):
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/weakref.py", line 666, in _exitfunc
f()
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/executor/multiproc_executor.py", line 228, in shutdown
for w in self.workers:
^^^^^^^^^^^^
AttributeError: 'MultiprocExecutor' object has no attribute 'workers'
Traceback (most recent call last):
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/bin/llamafactory-cli", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/qyht/projects/LLaMA-Factory/src/llamafactory/cli.py", line 151, in main
COMMAND_MAP[command]()
File "/home/qyht/projects/LLaMA-Factory/src/llamafactory/api/app.py", line 146, in run_api
chat_model = ChatModel()
^^^^^^^^^^^
File "/home/qyht/projects/LLaMA-Factory/src/llamafactory/chat/chat_model.py", line 55, in __init__
self.engine: BaseEngine = VllmEngine(model_args, data_args, finetuning_args, generating_args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/projects/LLaMA-Factory/src/llamafactory/chat/vllm_engine.py", line 97, in __init__
self.model = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(**engine_args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 684, in from_engine_args
return async_engine_cls.from_vllm_config(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 150, in from_vllm_config
return cls(
^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/async_llm.py", line 118, in __init__
self.engine_core = core_client_class(
^^^^^^^^^^^^^^^^^^
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 642, in __init__
super().__init__(
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 398, in __init__
self._wait_for_engine_startup()
File "/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/site-packages/vllm/v1/engine/core_client.py", line 430, in _wait_for_engine_startup
raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above.
/home/qyht/softwares/anaconda3/envs/qwen2.5/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 2 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '