Hi all,
there is currently no quantized multimodal model format (e.g. INT4 or FP8 W8A8) that supports concurrent requests.
I tried to run this model, but when I make a request I never get anything back.
The log from Docker:
INFO 06-24 10:54:10 [loggers.py:118] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO 06-24 11:56:05 [logger.py:43] Received request chatcmpl-f6be9ea61b154be2a46a2a99f2ec559f: prompt: '<|user|>Write a poem about summer<|end|><|assistant|>', params: SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4088, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None, guided_decoding=None, extra_args=None), prompt_token_ids: None, prompt_embeds shape: None, lora_request: None, prompt_adapter_request: None.
INFO 06-24 11:56:05 [async_llm.py:271] Added request chatcmpl-f6be9ea61b154be2a46a2a99f2ec559f.
INFO 06-24 11:56:21 [loggers.py:118] Engine 000: Avg prompt throughput: 0.8 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
My Docker setup:
sudo docker run --runtime nvidia --gpus all \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HUGGING_FACE_HUB_TOKEN=" \
-p 8000:8000 \
--ipc=host \
--name phi4 \
vllm/vllm-openai:v0.9.1 \
--model microsoft/Phi-4-multimodal-instruct \
--gpu_memory_utilization=0.80 \
--max_model_len=4096 \
--trust-remote-code
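
For reference, the request that never comes back is a plain chat completion call. A minimal sketch of it (the port and model name follow from the Docker command above, and the prompt is the one visible in the log):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "microsoft/Phi-4-multimodal-instruct", "messages": [{"role": "user", "content": "Write a poem about summer"}]}'

Note that the logged request was admitted with max_tokens=4088, so at the observed ~0.1 tokens/s generation throughput a full response would take hours to complete, which matches the request appearing to hang.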