
Huge "Out throughput" values #31

Open

Description

@gimphammer

I run vLLM on an RTX 4090 with this command:

```
CUDA_VISIBLE_DEVICES=3 vllm serve Qwen/Qwen2.5-1.5B-Instruct --port 19999
```

And this is the command I used to start genai-bench:

```
genai-bench benchmark --api-backend openai \
  --api-base "http://localhost:19999" \
  --api-key "test" \
  --api-model-name "Qwen/Qwen2.5-1.5B-Instruct" \
  --model-tokenizer "Qwen/Qwen2.5-1.5B-Instruct" \
  --task text-to-text \
  --max-time-per-run 15 \
  --max-requests-per-run 300 \
  --server-engine "vLLM" \
  --server-version "v0.9.2"
```

But I got a huge "Out throughput" on the order of 200k tokens/s, as in the screenshot below:

[Screenshot: genai-bench dashboard, "Out throughput" panel showing roughly 100k–200k tokens/s]

As you can see, the "Out throughput" metric is about 100k–200k tokens/s.
Why is it so huge? That seems abnormal.

In the "Output Latency vs Output Throughput of Server" sub-dashboard, the "Output Throughput of Server" value is 126 tokens/s. This matches vLLM's log well, which reports about 131 tokens/s.

So, is this an issue with the "Out throughput" sub-dashboard?
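
For reference, one way this kind of discrepancy can happen in general: if a panel aggregates per-request token rates (summing each request's tokens divided by its own generation time) instead of dividing total generated tokens by wall-clock time, the result can be orders of magnitude larger than the server-level throughput. The sketch below is purely illustrative and is not genai-bench's actual code; `RequestRecord`, `decode_time_s`, and both function names are hypothetical.

```python
# Hypothetical illustration of two "output throughput" aggregations.
# This is NOT genai-bench's implementation; all names here are made up.
from dataclasses import dataclass

@dataclass
class RequestRecord:
    output_tokens: int    # tokens generated for this request
    decode_time_s: float  # this request's own generation time

def server_output_throughput(records, wall_time_s):
    # Total generated tokens over wall-clock time: the server-level view,
    # comparable to what vLLM logs (~131 tokens/s in this run).
    return sum(r.output_tokens for r in records) / wall_time_s

def summed_per_request_throughput(records):
    # Sum of each request's individual tokens/s. With hundreds of
    # concurrent requests this can be far larger than the server number.
    return sum(r.output_tokens / r.decode_time_s for r in records)

# Example: 300 requests, 200 tokens each, 2 s of generation time, 60 s run.
records = [RequestRecord(output_tokens=200, decode_time_s=2.0)] * 300
print(server_output_throughput(records, wall_time_s=60.0))  # 1000.0 tokens/s
print(summed_per_request_throughput(records))               # 30000.0 tokens/s
```

Whatever the actual cause, the fact that the server-level panel agrees with vLLM's log while the "Out throughput" panel is about three orders of magnitude larger suggests the two panels aggregate differently.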
