
Merge 'rhds/main' into 'rhds/rhoai-2.23' #62


Open · wants to merge 5 commits into base: rhoai-2.23

Conversation

vaibhavjainwiz (Contributor)

Essential Elements of an Effective PR Description Checklist

  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as the test command to run.
  • The test results, such as a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.

Purpose

Test Plan

Test Result

(Optional) Documentation Update

BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)

mgoin and others added 5 commits on July 15, 2025 at 13:38
An issue was reported with the Mistral model on Blackwell (B200) hardware,
with the error below:
<details>
<summary>Error log from pod</summary>

```
INFO 07-15 15:17:43 [__init__.py:244] Automatically detected platform cuda.
INFO 07-15 15:17:45 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-15 15:17:45 [cli_args.py:325] non-default args: {'uvicorn_log_level': 'debug', 'model': 'RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', 'trust_remote_code': True, 'max_model_len': 10000, 'limit_mm_per_prompt': {'image': 5, 'video': 5}, 'enable_chunked_prefill': True}
INFO 07-15 15:17:50 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-15 15:17:50 [config.py:1472] Using max model len 10000
INFO 07-15 15:17:50 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-15 15:17:52 [core.py:526] Waiting for init message from front-end.
INFO 07-15 15:17:52 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', speculative_config=None, tokenizer='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-15 15:17:53 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-15 15:17:56 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-15 15:17:56 [gpu_model_runner.py:1770] Starting to load model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16...
INFO 07-15 15:17:56 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-15 15:17:56 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 07-15 15:17:57 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-15 15:17:57 [weight_utils.py:292] Using model weights format ['*.safetensors']

Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]

Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  4.68it/s]

Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:01,  1.87it/s]

Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00,  1.51it/s]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.56it/s]

Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00,  1.67it/s]

INFO 07-15 15:18:00 [default_loader.py:272] Loading weights took 2.45 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:1801] Model loading took 14.0460 GiB and 6.938856 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
CUDA error (/mnt/work-dir/xformers-0.0.30/xformers-0.0.30/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 666, in __init__
    super().__init__(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 403, in __init__
    with launch_core_engines(vllm_config, executor_class,
  File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
    wait_for_engine_startup(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```
</details>
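
For context, the `no kernel image is available for execution on the device` failure generally means a CUDA extension (here, the flash-attention kernels bundled with the xformers build) was compiled without code for the running GPU's compute capability. The sketch below is a quick local check, not part of this PR; it is illustrative only and inspects the PyTorch build as a proxy for separately compiled extensions:

```python
# Minimal sketch, assuming PyTorch is available inside the serving image.
import torch

# Compute capability of the visible GPU, e.g. (10, 0) on B200 (Blackwell).
major, minor = torch.cuda.get_device_capability()
# Architectures compiled into this torch build, e.g. ['sm_80', 'sm_90', ...].
compiled = torch.cuda.get_arch_list()

print(f"device compute capability: sm_{major}{minor}")
print(f"torch build arch list: {compiled}")

if f"sm_{major}{minor}" not in compiled:
    print("torch itself has no kernels for this GPU; separately built "
          "extensions (e.g. xformers/flash-attention) may be missing them too.")
```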

Image built with this PR: quay.io/vllm/automation-vllm:cuda-16300391547
Manual testing on Blackwell was successful. For details, see the comments in
https://issues.redhat.com/browse/INFERENG-1126
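
For illustration only (the actual validation steps are recorded in the Jira comments above), a smoke test against the served model could look like the sketch below; the endpoint, port, and prompt are assumptions, not taken from the PR:

```python
# Hypothetical smoke test against a vLLM OpenAI-compatible endpoint.
# Assumes the image above is deployed and listening on localhost:8000.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16",
        "messages": [{"role": "user", "content": "Reply with one short sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```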

A100 ocp-test validation is green (i.e.,
https://github.com/neuralmagic/nm-cicd/actions/runs/16303950486)
Accept-sync:
CUDA: https://github.com/neuralmagic/nm-cicd/actions/runs/16304501784
ROCM: https://github.com/neuralmagic/nm-cicd/actions/runs/16304505641