feat: Add benchmark for inference request performance (TTFT, TPOT, throughput) #1773
This script introduces a new benchmark for evaluating the performance of inference requests in MaxText. It measures:
Time To First Token (TTFT)
Time Per Output Token (TPOT)
Request throughput (requests per second)
The benchmark supports both standard prefill and chunked prefill, allowing for a comprehensive analysis of different prefill strategies. It initializes a MaxEngine, loads model parameters, and sends a configurable number of requests to measure these key performance indicators.
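The three metrics can be derived from per-request timestamps. The sketch below is a minimal illustration of how such a summary could be computed, not MaxText's actual implementation; the `RequestTiming` container and `summarize` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float            # when the request was submitted (seconds)
    first_token: float      # when the first output token arrived
    end: float              # when the last output token arrived
    num_output_tokens: int  # total tokens generated for this request


def summarize(timings):
    """Aggregate mean TTFT, mean TPOT, and request throughput."""
    # TTFT: delay from submission to the first generated token.
    ttft = sum(t.first_token - t.start for t in timings) / len(timings)
    # TPOT: decode time spread over the remaining tokens (first token excluded).
    tpot = sum(
        (t.end - t.first_token) / max(t.num_output_tokens - 1, 1)
        for t in timings
    ) / len(timings)
    # Throughput: completed requests over the whole wall-clock window.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    throughput = len(timings) / wall
    return ttft * 1000, tpot * 1000, throughput  # ms, ms, requests/s
```

With this definition, a TPOT of ~88 ms at per-device batch size 8 is consistent with the measured ~2 requests/s.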
Tests
python -m MaxText.benchmark_inference_request MaxText/configs/inference.yml --request_num=90 tokenizer_path=assets/tokenizer.mistral-v1 max_prefill_predict_length=1024 max_target_length=1280 model_name=mixtral-8x7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=8 scan_layers=true weight_dtype=bfloat16 per_device_batch_size=8 megablox=False quantization=int8 quantize_kvcache=False checkpoint_is_quantized=True capacity_factor=1 attention=dot_product model_call_mode=inference sparse_matmul=False use_chunked_prefill=true prefill_chunk_size=128
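The `use_chunked_prefill=true prefill_chunk_size=128` flags process each prompt's prefill in fixed-size chunks rather than one pass over the full 1024-token window. A minimal sketch of the chunking idea (the `chunk_prompt` helper is hypothetical, not MaxText's API):

```python
def chunk_prompt(token_ids, chunk_size=128):
    """Split a prompt's token IDs into fixed-size chunks for chunked prefill.

    The final chunk may be shorter when the prompt length is not a
    multiple of chunk_size.
    """
    return [
        token_ids[i:i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]
```

Each chunk is then prefilled against the KV cache accumulated from the previous chunks, which bounds per-step prefill compute and lets decode steps interleave between chunks.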
Result:
TTFT: 94.060 ms, TPOT: 88.579 ms, Requests/s: 2.038
For comparison, the JetStream serving benchmark was run on the same 90 requests with:
--min-input-length 900
--max-input-length 1024
--max-output-length 256
--dataset openorca
Request throughput: 1.77 requests/s
Mean TTFT: 954.47 ms
Mean TPOT: 86.27 ms
In the serving benchmark, TTFT is higher because prefill requests are blocked behind in-flight generation steps, whereas this benchmark script runs prefill without that contention. TPOT and request throughput are similar between the two.
Checklist
Before submitting this PR, please make sure (put X in square brackets):