feat: Add benchmark for inference request performance (TTFT, TPOT, throughput) #1773
This script introduces a new benchmark for evaluating the performance of inference requests in MaxText. It measures:
Time To First Token (TTFT)
Time Per Output Token (TPOT)
Request throughput (requests per second)
The benchmark supports both standard prefill and chunked prefill, allowing for a comprehensive analysis of different prefill strategies. It initializes a MaxEngine, loads model parameters, and sends a configurable number of requests to measure these key performance indicators.
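The three metrics can be derived from per-request timestamps. The sketch below is a minimal illustration of how such a summary could be computed, not MaxText's actual implementation; the `RequestTiming` container and `summarize` helper are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RequestTiming:
    start: float            # when the request was submitted (seconds)
    first_token: float      # when the first output token arrived
    end: float              # when the last output token arrived
    num_output_tokens: int  # total tokens generated for this request


def summarize(timings):
    """Aggregate mean TTFT, mean TPOT, and request throughput."""
    # TTFT: delay from submission to the first generated token.
    ttft = sum(t.first_token - t.start for t in timings) / len(timings)
    # TPOT: decode time spread over the remaining tokens (first token excluded).
    tpot = sum(
        (t.end - t.first_token) / max(t.num_output_tokens - 1, 1)
        for t in timings
    ) / len(timings)
    # Throughput: completed requests over the whole wall-clock window.
    wall = max(t.end for t in timings) - min(t.start for t in timings)
    throughput = len(timings) / wall
    return ttft * 1000, tpot * 1000, throughput  # ms, ms, requests/s
```

With this definition, a TPOT of ~88 ms at per-device batch size 8 is consistent with the measured ~2 requests/s.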
Tests
python -m MaxText.benchmark_inference_request MaxText/configs/inference.yml --request_num=90 tokenizer_path=assets/tokenizer.mistral-v1 max_prefill_predict_length=1024 max_target_length=1280 model_name=mixtral-8x7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=8 scan_layers=true weight_dtype=bfloat16 per_device_batch_size=8 megablox=False quantization=int8 quantize_kvcache=False checkpoint_is_quantized=True capacity_factor=1 attention=dot_product model_call_mode=inference sparse_matmul=False use_chunked_prefill=true prefill_chunk_size=128
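The `use_chunked_prefill=true prefill_chunk_size=128` flags process each prompt's prefill in fixed-size chunks rather than one pass over the full 1024-token window. A minimal sketch of the chunking idea (the `chunk_prompt` helper is hypothetical, not MaxText's API):

```python
def chunk_prompt(token_ids, chunk_size=128):
    """Split a prompt's token IDs into fixed-size chunks for chunked prefill.

    The final chunk may be shorter when the prompt length is not a
    multiple of chunk_size.
    """
    return [
        token_ids[i:i + chunk_size]
        for i in range(0, len(token_ids), chunk_size)
    ]
```

Each chunk is then prefilled against the KV cache accumulated from the previous chunks, which bounds per-step prefill compute and lets decode steps interleave between chunks.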
Result:
TTFT: 94.060 ms, TPOT: 88.579 ms, Requests/s: 2.038
For comparison, the JetStream serving benchmark was run on the same 90 requests with:
--min-input-length 900
--max-input-length 1024
--max-output-length 256
--dataset openorca
Request throughput: 1.77 requests/s
Mean TTFT: 954.47 ms
Mean TPOT: 86.27 ms
In the serving benchmark, TTFT is higher because prefill requests are blocked behind in-flight generation steps, whereas this benchmark script runs prefill without that contention. TPOT and request throughput are similar between the two.
Checklist
Before submitting this PR, please make sure (put X in square brackets):