Description
I run vLLM on an RTX 4090 with the following command:
CUDA_VISIBLE_DEVICES=3 vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 19999
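For reference, this is the kind of quick sanity check I run before benchmarking (my own sketch, assuming the standard OpenAI-compatible /v1 routes that vllm serve exposes; not part of genai-bench):

```python
# Minimal sanity check (my own sketch): confirm the vLLM OpenAI-compatible
# server on port 19999 responds and reports token usage before benchmarking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:19999/v1", api_key="test")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
print(resp.usage)  # prompt / completion token counts as counted by the server
```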
And the command I used to start genai-bench is:
genai-bench benchmark --api-backend openai \
  --api-base "http://localhost:19999" \
  --api-key "test" \
  --api-model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --model-tokenizer "Qwen/Qwen2.5-1.5B-Instruct" \
  --task text-to-text \
  --max-time-per-run 15 \
  --max-requests-per-run 300 \
  --server-engine "vLLM" \
  --server-version "v0.9.2"
But I got a huge output throughput, on the order of 200k tokens/s, as shown in the screenshot below:
As you can see, the metric in the "Out throughput" panel is about 100k~200k tokens/s.
Why is it so huge? It looks abnormal.
In the "Output Latency vs Output Throughput of Server" sub-dashboard, the "Output Throughput of Server" value is 126 tokens/s, which matches vLLM's log well (about 131 tokens/s).
So, is this an issue with the "Out throughput" sub-dashboard?
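For what it's worth, this is how I reason about the two numbers. The record fields and aggregations below are my own assumptions for illustration, not genai-bench's actual schema or implementation: a server-level output throughput should be total output tokens divided by the wall-clock span of the run (that is the ~126 tokens/s figure that matches vLLM's log), while aggregating per-request token rates across overlapping requests can produce figures that are orders of magnitude larger.

```python
# Rough cross-check of two throughput views (RequestRecord and both
# aggregations are my own illustration, not genai-bench's actual code).
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float      # request start time, seconds since run start
    end_s: float        # request end time, seconds since run start
    output_tokens: int  # completion tokens returned for this request


def server_output_throughput(records: list[RequestRecord]) -> float:
    """Server-level view: total output tokens / wall-clock span of the run."""
    span = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / span


def summed_per_request_throughput(records: list[RequestRecord]) -> float:
    """Per-request rates summed up: over-counts when requests overlap."""
    return sum(r.output_tokens / (r.end_s - r.start_s) for r in records)


if __name__ == "__main__":
    # 300 overlapping requests, 130 output tokens each, ~1 s of decode time,
    # spread over a ~300 s window (arbitrary numbers, purely for illustration).
    records = [RequestRecord(start_s=i, end_s=i + 1.0, output_tokens=130)
               for i in range(300)]
    print(f"total tokens / wall clock: {server_output_throughput(records):9.1f} tok/s")
    print(f"sum of per-request rates : {summed_per_request_throughput(records):9.1f} tok/s")
```

If the "Out throughput" panel is built from per-request (or per-chunk) rates rather than from total tokens over wall-clock time, that might explain why it can sit in the 100k~200k range while the server-level panel agrees with vLLM, but I have not checked the genai-bench source to confirm this.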