
Deploying GLM-4.5V with vLLM: Engine core proc EngineCore_DP0 died unexpectedly, shutting down client. #208

@xns0318

Description


System Info / 系統信息

INFO 10-14 11:32:11 [init.py:216] Automatically detected platform cuda.
(APIServer pid=47974) INFO 10-14 11:32:14 [api_server.py:1896] vLLM API server version 0.10.2
(APIServer pid=47974) INFO 10-14 11:32:14 [utils.py:328] non-default args: {'model_tag': '/opt/host_data/models/GLM-4.5V', 'port': 8722, 'enable_auto_tool_choice': True, 'tool_call_parser': 'glm45', 'model': '/opt/host_data/models/GLM-4.5V', 'allowed_local_media_path': '/', 'served_model_name': ['glm-4.5v'], 'reasoning_parser': 'glm45', 'tensor_parallel_size': 4, 'media_io_kwargs': {'video': {'num_frames': -1}}}
(APIServer pid=47974) INFO 10-14 11:32:21 [init.py:742] Resolved architecture: Glm4vMoeForConditionalGeneration
(APIServer pid=47974) torch_dtype is deprecated! Use dtype instead!
(APIServer pid=47974) INFO 10-14 11:32:21 [init.py:1815] Using max model len 65536
(APIServer pid=47974) INFO 10-14 11:32:21 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 10-14 11:32:27 [init.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='/opt/host_data/models/GLM-4.5V', speculative_config=None, tokenizer='/opt/host_data/models/GLM-4.5V', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend='glm45'), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=glm-4.5v, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":1,"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=48260) WARNING 10-14 11:32:30 [multiproc_worker_utils.py:273] Reducing Torch parallelism from 128 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=48260) INFO 10-14 11:32:30 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1, 2, 3], buffer_handle=(4, 16777216, 10, 'psm_7a308a56'), local_subscribe_addr='ipc:///tmp/1b31e8da-48b5-4f24-9154-38afeba97f2a', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:34 [init.py:216] Automatically detected platform cuda.
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_806680cf'), local_subscribe_addr='ipc:///tmp/549461c5-6c2c-46fb-b01c-e8169727bb80', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_3fdd42d1'), local_subscribe_addr='ipc:///tmp/cbd9094e-86fb-4cd0-a6db-f4099c0ad09e', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_402564bd'), local_subscribe_addr='ipc:///tmp/45ad49d1-3b28-46d4-87b3-fd6bc026dcc4', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 10-14 11:32:38 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_53ea996f'), local_subscribe_addr='ipc:///tmp/f6cd80bb-c5de-4ef8-8595-7e39ea458ddc', remote_subscribe_addr=None, remote_addr_ipv6=False)
[W1014 11:32:38.497863747 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.527111444 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.534238635 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[W1014 11:32:38.539133097 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [init.py:1433] Found nccl from library libnccl.so.2
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:38 [pynccl.py:70] vLLM is using nccl==2.27.3
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [custom_all_reduce.py:35] Skipping P2P check and trusting the driver's P2P report.
INFO 10-14 11:32:39 [shm_broadcast.py:289] vLLM message queue communication handle: Handle(local_reader_ranks=[1, 2, 3], buffer_handle=(3, 4194304, 6, 'psm_7aaa55b2'), local_subscribe_addr='ipc:///tmp/d67ff548-284b-4886-8c0b-5f60c197b543', remote_subscribe_addr=None, remote_addr_ipv6=False)
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 1 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 1, EP rank 1
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 2, EP rank 2
INFO 10-14 11:32:39 [parallel_state.py:1165] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, TP rank 3, EP rank 3
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
WARNING 10-14 11:32:40 [topk_topp_sampler.py:69] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=48397) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP3 pid=48397) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=48394) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP0 pid=48394) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP2 pid=48396) WARNING 10-14 11:33:03 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP2 pid=48396) INFO 10-14 11:33:03 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP0 pid=48394) Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s]
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [gpu_model_runner.py:2338] Starting to load model /opt/host_data/models/GLM-4.5V...
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [gpu_model_runner.py:2370] Loading model from scratch...
(Worker_TP1 pid=48395) WARNING 10-14 11:33:04 [cuda.py:217] Current vllm-flash-attn has a bug inside vision module, so we use xformers backend instead. You can run pip install flash-attn to use flash-attention backend.
(Worker_TP1 pid=48395) INFO 10-14 11:33:04 [cuda.py:362] Using Flash Attention backend on V1 engine.
(Worker_TP3 pid=48397) INFO 10-14 11:33:44 [default_loader.py:268] Loading weights took 41.11 seconds
(Worker_TP3 pid=48397) INFO 10-14 11:33:45 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 41.580987 seconds
(Worker_TP0 pid=48394) Loading safetensors checkpoint shards: 100% Completed | 46/46 [00:42<00:00, 1.09it/s]
(Worker_TP0 pid=48394) INFO 10-14 11:33:46 [default_loader.py:268] Loading weights took 42.20 seconds
(Worker_TP0 pid=48394) INFO 10-14 11:33:46 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 42.690719 seconds
(Worker_TP1 pid=48395) INFO 10-14 11:33:55 [default_loader.py:268] Loading weights took 50.66 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:55 [default_loader.py:268] Loading weights took 51.90 seconds
(Worker_TP1 pid=48395) INFO 10-14 11:33:56 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 51.112266 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:56 [gpu_model_runner.py:2392] Model loading took 50.3142 GiB and 52.377807 seconds
(Worker_TP2 pid=48396) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP1 pid=48395) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP0 pid=48394) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP3 pid=48397) INFO 10-14 11:33:56 [gpu_model_runner.py:3000] Encoder cache will be initialized with a budget of 30970 tokens, and profiled with 1 video items of the maximum feature size.
(Worker_TP2 pid=48396) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_2_0/backbone for vLLM's torch.compile
(Worker_TP2 pid=48396) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.38 s
(Worker_TP1 pid=48395) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_1_0/backbone for vLLM's torch.compile
(Worker_TP1 pid=48395) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.46 s
(Worker_TP3 pid=48397) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_3_0/backbone for vLLM's torch.compile
(Worker_TP3 pid=48397) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.55 s
(Worker_TP0 pid=48394) INFO 10-14 11:34:10 [backends.py:539] Using cache directory: /root/.cache/vllm/torch_compile_cache/232ff56f65/rank_0_0/backbone for vLLM's torch.compile
(Worker_TP0 pid=48394) INFO 10-14 11:34:10 [backends.py:550] Dynamo bytecode transform time: 11.60 s
(Worker_TP2 pid=48396) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.208 s
(Worker_TP1 pid=48395) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.237 s
(Worker_TP3 pid=48397) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.260 s
(Worker_TP0 pid=48394) INFO 10-14 11:34:17 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 6.209 s
(Worker_TP2 pid=48396) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP1 pid=48395) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP3 pid=48397) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP0 pid=48394) WARNING 10-14 11:34:19 [fused_moe.py:727] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=352,device_name=NVIDIA_A800-SXM4-80GB.json']
(Worker_TP1 pid=48395) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.46 s in total
(Worker_TP2 pid=48396) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.38 s in total
(Worker_TP3 pid=48397) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.55 s in total
(Worker_TP0 pid=48394) INFO 10-14 11:34:20 [monitor.py:34] torch.compile takes 11.60 s in total
(Worker_TP2 pid=48396) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP3 pid=48397) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP1 pid=48395) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(Worker_TP0 pid=48394) INFO 10-14 11:34:21 [gpu_worker.py:298] Available KV cache memory: 14.66 GiB
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:864] GPU KV cache size: 334,064 tokens
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:21 [kv_cache_utils.py:868] Maximum concurrency for 65,536 tokens per request: 5.10x
(Worker_TP0 pid=48394) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 0%| | 0/67 [00:00<?, ?it/s]
(Worker_TP0 pid=48394) Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:15<00:00, 4.40it/s]
(Worker_TP1 pid=48395) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP0 pid=48394) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP3 pid=48397) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP2 pid=48396) INFO 10-14 11:34:37 [custom_all_reduce.py:203] Registering 6164 cuda graph addresses
(Worker_TP3 pid=48397) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP3 pid=48397) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP1 pid=48395) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP1 pid=48395) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP2 pid=48396) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP2 pid=48396) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(Worker_TP0 pid=48394) INFO 10-14 11:34:38 [gpu_model_runner.py:3118] Graph capturing finished in 17 secs, took 5.97 GiB
(Worker_TP0 pid=48394) INFO 10-14 11:34:38 [gpu_worker.py:391] Free memory on device (78.84/79.33 GiB) on startup. Desired GPU memory utilization is (0.9, 71.39 GiB). Actual usage is 50.31 GiB for weight, 5.7 GiB for peak activation, 0.72 GiB for non-torch memory, and 5.97 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with --kv-cache-memory=9167749734 to fit into requested memory, or --kv-cache-memory=17161968640 to fully utilize gpu memory. Current kv cache memory in use is 15736029798 bytes.
(EngineCore_DP0 pid=48260) INFO 10-14 11:34:38 [core.py:218] init engine (profile, create kv cache, warmup model) took 42.06 seconds
(EngineCore_DP0 pid=48260) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=47974) INFO 10-14 11:35:01 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 20879
(APIServer pid=47974) INFO 10-14 11:35:01 [async_llm.py:180] Torch profiler disabled. AsyncLLM CPU traces will not be collected.
(APIServer pid=47974) INFO 10-14 11:35:01 [api_server.py:1692] Supported_tasks: ['generate']
(APIServer pid=47974) WARNING 10-14 11:35:01 [init.py:1695] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with --generation-config vllm.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_responses.py:130] Using default chat sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_responses.py:159] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_chat.py:97] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_chat.py:137] Using default chat sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [serving_completion.py:76] Using default completion sampling params from model: {'top_k': 1, 'top_p': 0.0001}
(APIServer pid=47974) INFO 10-14 11:35:01 [api_server.py:1971] Starting vLLM API server 0 on http://0.0.0.0:8722
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:36] Available routes are:
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /docs, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /redoc, Methods: HEAD, GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /health, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /load, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /ping, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /ping, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /tokenize, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /detokenize, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/models, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /version, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/chat/completions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/completions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/embeddings, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /pooling, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /classify, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /score, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/score, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/audio/transcriptions, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/audio/translations, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v1/rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /v2/rerank, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /invocations, Methods: POST
(APIServer pid=47974) INFO 10-14 11:35:01 [launcher.py:44] Route: /metrics, Methods: GET
(APIServer pid=47974) INFO: Started server process [47974]
(APIServer pid=47974) INFO: Waiting for application startup.
(APIServer pid=47974) INFO: Application startup complete.
(APIServer pid=47974) INFO: 172.16.20.25:49563 - "GET /v1/health HTTP/1.1" 404 Not Found
(APIServer pid=47974) INFO: 172.16.20.25:49566 - "GET /health HTTP/1.1" 200 OK
(APIServer pid=47974) Using a slow image processor as use_fast is unset and a slow processor was saved with this model. use_fast=True will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with use_fast=False.
(APIServer pid=47974) INFO 10-14 13:40:11 [chat_utils.py:538] Detected the chat template content format to be 'openai'. You can set --chat-template-content-format to override this.
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:40:24 [loggers.py:123] Engine 000: Avg prompt throughput: 1915.2 tokens/s, Avg generation throughput: 47.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 2.8%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO: 172.16.20.25:62312 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:40:34 [loggers.py:123] Engine 000: Avg prompt throughput: 319.8 tokens/s, Avg generation throughput: 38.3 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO 10-14 13:40:44 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=47974) INFO 10-14 13:42:04 [loggers.py:123] Engine 000: Avg prompt throughput: 5749.8 tokens/s, Avg generation throughput: 19.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 17.3%, Prefix cache hit rate: 2.8%
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) INFO 10-14 13:42:14 [loggers.py:123] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 12.7%, Prefix cache hit rate: 2.6%
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=47974) ERROR 10-14 13:42:23 [core_client.py:564] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
(Worker_TP0 pid=48394) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=48396) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP0 pid=48394) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=48395) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP3 pid=48397) INFO 10-14 13:42:23 [multiproc_executor.py:546] Parent process exited, terminating worker
(Worker_TP2 pid=48396) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] AsyncLLM output_handler failed.
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] Traceback (most recent call last):
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]   File "/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 444, in output_handler
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]     outputs = await engine_core.get_output_async()
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]   File "/workspace/project/vllm_deploy/GLM4_5/.venv/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 845, in get_output_async
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485]     raise self._format_exception(outputs) from None
(APIServer pid=47974) ERROR 10-14 13:42:23 [async_llm.py:485] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(Worker_TP3 pid=48397) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(Worker_TP1 pid=48395) INFO 10-14 13:42:23 [multiproc_executor.py:587] WorkerProc shutting down.
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=47974) INFO 10-14 13:42:24 [loggers.py:123] Engine 000: Avg prompt throughput: 5654.2 tokens/s, Avg generation throughput: 31.9 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 2.6%
(APIServer pid=47974) INFO: Shutting down
(APIServer pid=47974) INFO: 172.16.20.25:50061 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=47974) INFO: Waiting for application shutdown.
(APIServer pid=47974) INFO: Application shutdown complete.
(APIServer pid=47974) INFO: Finished server process [47974]
/opt/miniconda3/envs/py10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

export CUDA_VISIBLE_DEVICES=4,5,6,7
nohup ./.venv/bin/vllm serve /opt/host_data/models/GLM-4.5V \
  --tensor-parallel-size 4 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --enable-auto-tool-choice \
  --served-model-name glm-4.5v \
  --allowed-local-media-path / \
  --media-io-kwargs '{"video": {"num_frames": -1}}' \
  --port 8722 \
  >> ./logs/vllm_glm45v_$(date '+%Y%m%d-%H%M%S').log 2>&1 &
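
The failing traffic goes through the OpenAI-compatible /v1/chat/completions endpoint. Below is a minimal sketch of such a request with curl, using the port and served model name from the command above; the image URL and prompt are placeholders, since the exact payloads sent by the team's AI tools are not captured in the log:

curl http://localhost:8722/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-4.5v",
    "messages": [
      {"role": "user", "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/test.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]}
    ],
    "max_tokens": 512
  }'

Several requests like this are issued concurrently by the tools, and the engine core dies after the last few of them.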

Expected behavior / 期待表现

People on my team use AI tools that send multiple requests to the server, and on the last few requests the server errors out. It feels like an OOM to me, but I'm not sure whether this is a vLLM problem. I tried two vLLM versions, 0.10.2 and 0.11.0, and both show the same issue. Could someone help me figure out where the problem lies? If it turns out to be a vLLM problem, I'll go file an issue with vLLM.
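
Since the log above only shows EngineCore_DP0 dying without a traceback of its own, a first step is to confirm whether the process was killed by the Linux OOM killer (host RAM pressure, which vLLM cannot report itself) or hit a GPU-side error. The following is a minimal sketch of post-crash checks with standard Linux/NVIDIA tools; the grep patterns are assumptions about typical kernel messages, not output guaranteed by vLLM:

# Did the kernel OOM killer terminate the engine-core or a worker process?
dmesg -T | grep -i -E 'out of memory|oom-killer|killed process' | tail -n 20

# Any GPU Xid errors logged around the crash time?
dmesg -T | grep -i 'xid' | tail -n 20

# Current GPU memory headroom on the visible devices
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

# Re-run the server with more verbose logging so the engine core prints its own error before dying
export VLLM_LOGGING_LEVEL=DEBUG

If dmesg shows the EngineCore or worker PID being killed, the crash is host-RAM exhaustion rather than GPU OOM; assuming "num_frames": -1 disables the video frame cap in this version, long videos could be decoded in full on the CPU side, so capping num_frames to a bounded value or reducing request concurrency would be the first mitigation to try.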
