Merge 'rhds/main' into 'rhds/rhoai-2.23' #62
Open
vaibhavjainwiz wants to merge 5 commits into rhoai-2.23 from main
Conversation
Signed-off-by: mgoin <[email protected]>
An issue was reported with the Mistral model on Blackwell (B200) hardware, failing with the error below:

<details>
<summary>Error log from pod</summary>

```
INFO 07-15 15:17:43 [__init__.py:244] Automatically detected platform cuda.
INFO 07-15 15:17:45 [api_server.py:1395] vLLM API server version 0.9.2
INFO 07-15 15:17:45 [cli_args.py:325] non-default args: {'uvicorn_log_level': 'debug', 'model': 'RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', 'trust_remote_code': True, 'max_model_len': 10000, 'limit_mm_per_prompt': {'image': 5, 'video': 5}, 'enable_chunked_prefill': True}
INFO 07-15 15:17:50 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-15 15:17:50 [config.py:1472] Using max model len 10000
INFO 07-15 15:17:50 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=8192.
INFO 07-15 15:17:52 [core.py:526] Waiting for init message from front-end.
INFO 07-15 15:17:52 [core.py:69] Initializing a V1 LLM engine (v0.9.2) with config: model='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', speculative_config=None, tokenizer='RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=10000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"max_capture_size":512,"local_cache_dir":null}
INFO 07-15 15:17:53 [parallel_state.py:1076] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
INFO 07-15 15:17:56 [topk_topp_sampler.py:49] Using FlashInfer for top-p & top-k sampling.
INFO 07-15 15:17:56 [gpu_model_runner.py:1770] Starting to load model RedHatAI/Mistral-Small-3.1-24B-Instruct-2503-quantized.w4a16...
INFO 07-15 15:17:56 [gpu_model_runner.py:1775] Loading model from scratch...
INFO 07-15 15:17:56 [compressed_tensors_wNa16.py:95] Using MarlinLinearKernel for CompressedTensorsWNA16
INFO 07-15 15:17:57 [cuda.py:284] Using Flash Attention backend on V1 engine.
INFO 07-15 15:17:57 [weight_utils.py:292] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00, 4.68it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:00<00:01, 1.87it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:01<00:00, 1.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.56it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:02<00:00, 1.67it/s]
INFO 07-15 15:18:00 [default_loader.py:272] Loading weights took 2.45 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:1801] Model loading took 14.0460 GiB and 6.938856 seconds
INFO 07-15 15:18:04 [gpu_model_runner.py:2238] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 image items of the maximum feature size.
CUDA error (/mnt/work-dir/xformers-0.0.30/xformers-0.0.30/third_party/flash-attention/hopper/flash_fwd_launch_template.h:175): no kernel image is available for execution on the device
Traceback (most recent call last):
  File "/opt/app-root/bin/vllm", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/main.py", line 65, in main
    args.dispatch_function(args)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/cli/serve.py", line 55, in cmd
    uvloop.run(run_server(args))
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 109, in run
    return __asyncio.run(
           ^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 194, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/opt/app-root/lib64/python3.12/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
           ^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1431, in run_server
    await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 1451, in run_server_worker
    async with build_async_engine_client(args, client_config) as engine_client:
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 158, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib64/python3.12/contextlib.py", line 210, in __aenter__
    return await anext(self.gen)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
    async_llm = AsyncLLM.from_vllm_config(
                ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 162, in from_vllm_config
    return cls(
           ^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 124, in __init__
    self.engine_core = EngineCoreClient.make_async_mp_client(
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 96, in make_async_mp_client
    return AsyncMPClient(*client_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 666, in __init__
    super().__init__(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/core_client.py", line 403, in __init__
    with launch_core_engines(vllm_config, executor_class,
  File "/usr/lib64/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 434, in launch_core_engines
    wait_for_engine_startup(
  File "/opt/app-root/lib64/python3.12/site-packages/vllm/v1/engine/utils.py", line 484, in wait_for_engine_startup
    raise RuntimeError("Engine core initialization failed. "
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
```

</details>

Image built with this PR: quay.io/vllm/automation-vllm:cuda-16300391547

Manual test on Blackwell was successful. For details, see the comments in https://issues.redhat.com/browse/INFERENG-1126

A100 ocp-test validation is green (i.e. https://github.com/neuralmagic/nm-cicd/actions/runs/16303950486)

Accept-sync:
- CUDA: https://github.com/neuralmagic/nm-cicd/actions/runs/16304501784
- ROCM: https://github.com/neuralmagic/nm-cicd/actions/runs/16304505641
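The "no kernel image is available for execution on the device" error points to CUDA kernels that were not compiled for the architecture of the GPU they run on. As a quick sanity check on an affected pod, the minimal sketch below (not part of this PR, and assuming a standard PyTorch-with-CUDA environment) compares the device's compute capability against the architectures the installed PyTorch build targets; a Blackwell B200 reports compute capability 10.0 (sm_100).

```python
# Minimal diagnostic sketch (not part of this PR): compare the GPU's compute
# capability with the CUDA architectures the installed PyTorch build targets.
# This only inspects PyTorch itself -- bundled extensions (e.g. xformers'
# flash-attention kernels, which raised the error above) carry their own
# architecture lists -- but a mismatch here usually reflects the same root cause.
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")
    print(f"PyTorch compiled for: {torch.cuda.get_arch_list()}")
else:
    print("CUDA is not available in this environment")
```

If the device's sm_XX value is absent from the reported list, the image (and any GPU extensions it bundles) needs to be rebuilt with that architecture included, for example by adding it to `TORCH_CUDA_ARCH_LIST` at build time.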
Signed-off-by: Kourosh Hakhamaneshi <[email protected]>
…at completions] (#19126) Signed-off-by: Alex-Brooks <[email protected]>
Essential Elements of an Effective PR Description Checklist
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS ABOVE HAVE BEEN CONSIDERED.
Purpose
Test Plan
Test Result
(Optional) Documentation Update
BEFORE SUBMITTING, PLEASE READ https://docs.vllm.ai/en/latest/contributing (anything written below this line will be removed by GitHub Actions)