Description
I run vLLM on an RTX 4090 with the following command:
CUDA_VISIBLE_DEVICES=3 vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 19999
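For reference, this is the kind of quick sanity check I run before benchmarking (my own sketch, assuming the standard OpenAI-compatible /v1 routes that vllm serve exposes; not part of genai-bench):

```python
# Minimal sanity check (my own sketch): confirm the vLLM OpenAI-compatible
# server on port 19999 responds and reports token usage before benchmarking.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:19999/v1", api_key="test")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
print(resp.usage)  # prompt / completion token counts as counted by the server
```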
And the command I used to start genai-bench is:
genai-bench benchmark --api-backend openai \
  --api-base "http://localhost:19999" \
  --api-key "test" \
  --api-model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --model-tokenizer "Qwen/Qwen2.5-1.5B-Instruct" \
  --task text-to-text \
  --max-time-per-run 15 \
  --max-requests-per-run 300 \
  --server-engine "vLLM" \
  --server-version "v0.9.2"
But I got a huge output throughput, on the order of 200k tokens/s, as shown in the screenshot below:
As you can see, the metric in the "Out throughput" panel is about 100k~200k tokens/s.
Why is it so huge? It looks abnormal.
In the "Output Latency vs Output Throughput of Server" sub-dashboard, the "Output Throughput of Server" value is 126 tokens/s, which matches vLLM's log well (about 131 tokens/s).
So, is this an issue with the "Out throughput" sub-dashboard?
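For what it's worth, this is how I reason about the two numbers. The record fields and aggregations below are my own assumptions for illustration, not genai-bench's actual schema or implementation: a server-level output throughput should be total output tokens divided by the wall-clock span of the run (that is the ~126 tokens/s figure that matches vLLM's log), while aggregating per-request token rates across overlapping requests can produce figures that are orders of magnitude larger.

```python
# Rough cross-check of two throughput views (RequestRecord and both
# aggregations are my own illustration, not genai-bench's actual code).
from dataclasses import dataclass


@dataclass
class RequestRecord:
    start_s: float      # request start time, seconds since run start
    end_s: float        # request end time, seconds since run start
    output_tokens: int  # completion tokens returned for this request


def server_output_throughput(records: list[RequestRecord]) -> float:
    """Server-level view: total output tokens / wall-clock span of the run."""
    span = max(r.end_s for r in records) - min(r.start_s for r in records)
    return sum(r.output_tokens for r in records) / span


def summed_per_request_throughput(records: list[RequestRecord]) -> float:
    """Per-request rates summed up: over-counts when requests overlap."""
    return sum(r.output_tokens / (r.end_s - r.start_s) for r in records)


if __name__ == "__main__":
    # 300 overlapping requests, 130 output tokens each, ~1 s of decode time,
    # spread over a ~300 s window (arbitrary numbers, purely for illustration).
    records = [RequestRecord(start_s=i, end_s=i + 1.0, output_tokens=130)
               for i in range(300)]
    print(f"total tokens / wall clock: {server_output_throughput(records):9.1f} tok/s")
    print(f"sum of per-request rates : {summed_per_request_throughput(records):9.1f} tok/s")
```

If the "Out throughput" panel is built from per-request (or per-chunk) rates rather than from total tokens over wall-clock time, that might explain why it can sit in the 100k~200k range while the server-level panel agrees with vLLM, but I have not checked the genai-bench source to confirm this.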