
Deploying GLM-4.5V with vLLM: Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. #208

@xns0318

Description


System Info / 系統信息

INFO 10-14 11:32:11 [init.py:216] Automatically detected platform cuda.
(APIServer pid=47974) INFO 10-14 11:32:14 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=47974) INFO 10-14 11:32:14 [utils.py:328] non-default args: {'model_tag': '/opt/host_data/models/GLM-4.5V', 'port': 8722, 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm45', 'model': '/opt/host_data/models/GLM-4.5V', 'allowed_local_media_path': '/', 'served_model_name': ['glm-4.5v'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 4, 'media_io_kwargs': {'video': {'num_frames': -1}}}
(APIServer pid=47974) INFO 10-14 11:32:21 [init.py:742] Resolved architecture: Glm4vMoeForConditionalGeneration
(APIServer pid=47974) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=47974) INFO 10-14 11:32:21 [init.py:1815] Using max model len 65536
(APIServer pid=47974) INFO 10-14 11:32:21 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 10-14 11:32:27 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/opt/host_data/models/GLM-4.5V', speculative_config=None, tokenizer='/opt/host_data/models/GLM-4.5V', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='glm45'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=glm-4.5v, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=48260) WARNING 10-14 11:32:30 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_7a308a56'), local_subscribe_addr='ipc:///tmp/1b31e8da-48b5-4f24-9154-38afeba97f2a', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_806680cf'), local_subscribe_addr='ipc:///tmp/549461c5-6c2c-46fb-b01c-e8169727bb80', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3fdd42d1'), local_subscribe_addr='ipc:///tmp/cbd9094e-86fb-4cd0-a6db-f4099c0ad09e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_402564bd'), local_subscribe_addr='ipc:///tmp/45ad49d1-3b28-46d4-87b3-fd6bc026dcc4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_53ea996f'), local_subscribe_addr='ipc:///tmp/f6cd80bb-c5de-4ef8-8595-7e39ea458ddc', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W1014 11:32:38.497863747 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.527111444 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.534238635 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.539133097 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_7aaa55b2'), local_subscribe_addr='ipc:///tmp/d67ff548-284b-4886-8c0b-5f60c197b543', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=48397) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=48394) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=48396) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=48394) Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s]
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP1 pid=48395) WARNING 10-14 11:33:04 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=48397) INFO 10-14 11:33:44 [default_loader.py:268] Loading weights took 41.11 seconds
(Worker_TP3 pid=48397) INFO 10-14 11:33:45 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 41.580987 seconds
(Worker_TP0 pid=48394) Loading safetensors checkpoint shards: 100% Completed | 46/46 [00:42<00:00, 1.09it/s]
(Worker_TP0 pid=48394) INFO 10-14 11:33:46 [default_loader.py:268] Loading weights took 42.20 seconds
(Worker_TP0 pid=48394) INFO 10-14 11:33:46 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 42.690719 seconds
(Worker_TP1 pid=48395) INFO 10-14 11:33:55 [default_loader.py:268] Loading weights took 50.66 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:55 [default_loader.py:268] Loading weights took 51.90 seconds
(Worker_TP1 pid=48395) INFO 10-14 11:33:56 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 51.112266 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:56 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 52.377807 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP1 pid=48395) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP0 pid=48394) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP3 pid=48397) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP2 pid=48396) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_2_0/backbone for vLLM's torch.compile
(Worker_TP2 pid=48396) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.38 s
(Worker_TP1 pid=48395) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_1_0/backbone for vLLM's torch.compile
(Worker_TP1 pid=48395) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.46 s
(Worker_TP3 pid=48397) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_3_0/backbone for vLLM's torch.compile
(Worker_TP3 pid=48397) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.55 s
(Worker_TP0 pid=48394) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=48394) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.60 s
(Worker_TP2 pid=48396) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.208 s
(Worker_TP1 pid=48395) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.237 s
(Worker_TP3 pid=48397) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.260 s
(Worker_TP0 pid=48394) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.209 s
(Worker_TP2 pid=48396) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP1 pid=48395) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP3 pid=48397) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP0 pid=48394) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP1 pid=48395) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.46 s in total
(Worker_TP2 pid=48396) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.38 s in total
(Worker_TP3 pid=48397) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.55 s in total
(Worker_TP0 pid=48394) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.60 s in total
(Worker_TP2 pid=48396) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP3 pid=48397) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP1 pid=48395) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP0 pid=48394) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(Worker_TP0 pid=48394) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0%| | 0/67 [00:00<?, ?it/s]
(Worker_TP0 pid=48394) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:15<00:00, 4.40it/s]
(Worker_TP1 pid=48395) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP0 pid=48394) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP3 pid=48397) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP2 pid=48396) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP3 pid=48397) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP3 pid=48397) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP1 pid=48395) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP1 pid=48395) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP2 pid=48396) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP2 pid=48396) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP0 pid=48394) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP0 pid=48394) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:38 [core.py:218] init engine (profile, create kv cache, warmup model) took 42.06 seconds
(EngineCore_DP0 pid=48260) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=47974) INFO 10-14 11:35:01 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 20879
(APIServer pid=47974) INFO 10-14 11:35:01 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=47974) INFO 10-14 11:35:01 [api_server.py:1692] Supported_tasks: ['generate']
(APIServer pid=47974) WARNING 10-14 11:35:01 [init.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_responses.py:130] Using default chat sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_responses.py:159] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_chat.py:97] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_chat.py:137] Using default chat sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_completion.py:76] Using default completion sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8722
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:36] Available routes are:
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=47974) INFO: Started server process [47974]
(APIServer pid=47974) INFO: Waiting for application startup.
(APIServer pid=47974) INFO: Application startup complete.
(APIServer pid=47974) INFO: 172.16.20.25:49563 - "GET /v1/health HTTP/1.1" 404 Not Found
(APIServer pid=47974) INFO: 172.16.20.25:49566 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=47974) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=47974) INFO 10-14 13:40:11 [chat_utils.py:538] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:40:24 [loggers.py:123] Engine 000: Avg prompt throughput: 1915.2 tokens/s, Avg generation throughput: 47.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:40:34 [loggers.py:123] Engine 000: Avg prompt throughput: 319.8 tokens/s, Avg generation throughput: 38.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO 10-14 13:40:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO 10-14 13:42:04 [loggers.py:123] Engine 000: Avg prompt throughput: 5749.8 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.3%, Prefix cache hit rate: 2.8%
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:42:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.7%, Prefix cache hit rate: 2.6%
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) ERROR 10-14 13:42:23 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
(Worker_TP0 pid=48394) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=48396) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=48394) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=48395) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=48397) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=48396) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] AsyncLLM output_handler failed.
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] Traceback (most recent call last):
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]   File "/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]     outputs = await engine_core.get_output_async()
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]   File "/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]     raise self._format_exception(outputs) from None
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(Worker_TP3 pid=48397) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=48395) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=47974) INFO 10-14 13:42:24 [loggers.py:123] Engine 000: Avg prompt throughput: 5654.2 tokens/s, Avg generation throughput: 31.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.6%
(APIServer pid=47974) INFO: Shutting down
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=47974) INFO: Waiting for application shutdown.
(APIServer pid=47974) INFO: Application shutdown complete.
(APIServer pid=47974) INFO: Finished server process [47974]
/opt/miniconda3/envs/py10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

export CUDA_VISIBLE_DEVICES=4,5,6,7
nohup ./.venv/bin/vllm serve /opt/host_data/models/GLM-4.5V \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.5v \
  --allowed-local-media-path / \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8722 \
  >> ./logs/vllm_glm45v_$(date '+%Y%m%d-%H%M%S').log 2>&1 &
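
The failing traffic goes through the OpenAI-compatible /v1/chat/completions endpoint. Below is a minimal sketch of such a request with curl, using the port and served model name from the command above; the image URL and prompt are placeholders, since the exact payloads sent by the team's AI tools are not captured in the log:

curl http://localhost:8722/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.5v",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]}
    ],
    "max_tokens": 512
  }'

Several requests like this are issued concurrently by the tools, and the engine core dies after the last few of them.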

Expected behavior / 期待表现

People on my team use AI tools that send multiple requests to the server, and on the last few requests the server errors out. It feels like an OOM to me, but I'm not sure whether this is a vLLM problem. I tried two vLLM versions, 0.10.2 and 0.11.0, and both show the same issue. Could someone help me figure out where the problem lies? If it turns out to be a vLLM problem, I'll go file an issue with vLLM.
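
Since the log above only shows EngineCore_DP0 dying without a traceback of its own, a first step is to confirm whether the process was killed by the Linux OOM killer (host RAM pressure, which vLLM cannot report itself) or hit a GPU-side error. The following is a minimal sketch of post-crash checks with standard Linux/NVIDIA tools; the grep patterns are assumptions about typical kernel messages, not output guaranteed by vLLM:

# Did the kernel OOM killer terminate the engine-core or a worker process?
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process' | tail -n 20

# Any GPU Xid errors logged around the crash time?
dmesg -T | grep -i 'xid' | tail -n 20

# Current GPU memory headroom on the visible devices
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Re-run the server with more verbose logging so the engine core prints its own error before dying
export VLLM_LOGGING_LEVEL=DEBUG

If dmesg shows the EngineCore or worker PID being killed, the crash is host-RAM exhaustion rather than GPU OOM; assuming "num_frames": -1 disables the video frame cap in this version, long videos could be decoded in full on the CPU side, so capping num_frames to a bounded value or reducing request concurrency would be the first mitigation to try.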
