
Merge 'rhds/main' into 'rhds/rhoai-2.23' #62


Open · wants to merge 5 commits into base: rhoai-2.23

Conversation

vaibhavjainwiz (Contributor)

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as the test command to run.
  • The test results, such as a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

mgoin and others added 5 commits on July 15, 2025 at 13:38
An issue was reported with the Mistral model on Blackwell (B200) hardware,
with the error below:
<details>
<summary>Error log from pod</summary>

```
INFO 07-15 15:17:43 [__init__.py:244] Automatically detected platform cuda.
INFO 07-15 15:17:45 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-15 15:17:45 [cli_args.py:325] non-default args: {'uvicorn_log_level': 'debug', 'model': 'RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', 'trust_remote_code': True, 'max_model_len': 10000, 'limit_mm_per_prompt': {'image': 5, 'video': 5}, 'enable_chunked_prefill': True}
INFO 07-15 15:17:50 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-15 15:17:50 [config.py:1472] Using max model len 10000
INFO 07-15 15:17:50 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-15 15:17:52 [core.py:526] Waiting for init message from front-end.
INFO 07-15 15:17:52 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', speculative_config=None, tokenizer='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-15 15:17:53 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-15 15:17:56 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-15 15:17:56 [gpu_model_runner.py:1770] Starting to load model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16...
INFO 07-15 15:17:56 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-15 15:17:56 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 07-15 15:17:57 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-15 15:17:57 [weight_utils.py:292] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  4.68it/s]

Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:01,  1.87it/s]

Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.51it/s]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.56it/s]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.67it/s]

INFO 07-15 15:18:00 [default_loader.py:272] Loading weights took 2.45 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:1801] Model loading took 14.0460 GiB and 6.938856 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
CUDA error (/mnt/work-dir/xformers-0.0.30/xformers-0.0.30/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 666, in __init__
    super().__init__(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 403, in __init__
    with launch_core_engines(vllm_config, executor_class,
  File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
    wait_for_engine_startup(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
</details>
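
For context, the `no kernel image is available for execution on the device` failure generally means a CUDA extension (here, the flash-attention kernels bundled with the xformers build) was compiled without code for the running GPU's compute capability. The sketch below is a quick local check, not part of this PR; it is illustrative only and inspects the PyTorch build as a proxy for separately compiled extensions:

```python
# Minimal sketch, assuming PyTorch is available inside the serving image.
import torch

# Compute capability of the visible GPU, e.g. (10, 0) on B200 (Blackwell).
major, minor = torch.cuda.get_device_capability()
# Architectures compiled into this torch build, e.g. ['sm_80', 'sm_90', ...].
compiled = torch.cuda.get_arch_list()

print(f"device compute capability: sm_{major}{minor}")
print(f"torch build arch list: {compiled}")

if f"sm_{major}{minor}" not in compiled:
    print("torch itself has no kernels for this GPU; separately built "
          "extensions (e.g. xformers/flash-attention) may be missing them too.")
```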

Image built with this PR: quay.io/vllm/automation-vllm:cuda-16300391547
Manual testing on Blackwell was successful. For details, see the comments in
https://issues.redhat.com/browse/INFERENG-1126
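
For illustration only (the actual validation steps are recorded in the Jira comments above), a smoke test against the served model could look like the sketch below; the endpoint, port, and prompt are assumptions, not taken from the PR:

```python
# Hypothetical smoke test against a vLLM OpenAI-compatible endpoint.
# Assumes the image above is deployed and listening on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```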

A100 ocp-test validation is green (i.e.,
https://github.com/neuralmagic/nm-cicd/actions/runs/16303950486)
Accept-sync:
CUDA: https://github.com/neuralmagic/nm-cicd/actions/runs/16304501784
ROCM: https://github.com/neuralmagic/nm-cicd/actions/runs/16304505641