feat: Add benchmark for inference request performance (TTFT, TPOT, throughput) #1773

Open · wants to merge 1 commit into main
Conversation

yuyanpeng-google
Collaborator


This PR adds a script that benchmarks inference request performance in MaxText. It measures:

  • Time To First Token (TTFT)
  • Time Per Output Token (TPOT)
  • Request throughput (requests per second)

The benchmark supports both standard prefill and chunked prefill, so the two prefill strategies can be compared directly. It initializes a MaxEngine, loads model parameters, and sends a configurable number of requests to measure these key performance indicators.
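
As a rough illustration of how these metrics relate to per-request timing (a minimal sketch, not the script's actual implementation; the `RequestTiming` record and `summarize` helper are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
  """Hypothetical per-request timing record (seconds)."""
  start: float            # request submitted
  first_token: float      # first output token produced (end of prefill)
  end: float              # last output token produced
  num_output_tokens: int  # tokens generated for this request

def summarize(timings: list[RequestTiming]) -> dict:
  """Aggregate TTFT, TPOT, and request throughput over a batch of requests."""
  mean_ttft = sum(t.first_token - t.start for t in timings) / len(timings)
  # TPOT covers decode only, i.e. time per token after the first one.
  mean_tpot = sum(
      (t.end - t.first_token) / max(t.num_output_tokens - 1, 1) for t in timings
  ) / len(timings)
  wall_clock = max(t.end for t in timings) - min(t.start for t in timings)
  return {
      "ttft_ms": mean_ttft * 1e3,
      "tpot_ms": mean_tpot * 1e3,
      "requests_per_s": len(timings) / wall_clock,
  }
```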

Tests

python -m MaxText.benchmark_inference_request MaxText/configs/inference.yml --request_num=90 tokenizer_path=assets/tokenizer.mistral-v1 max_prefill_predict_length=1024 max_target_length=1280 model_name=mixtral-8x7b ici_fsdp_parallelism=1 ici_autoregressive_parallelism=1 ici_tensor_parallelism=8 scan_layers=true weight_dtype=bfloat16 per_device_batch_size=8 megablox=False quantization=int8 quantize_kvcache=False checkpoint_is_quantized=True capacity_factor=1 attention=dot_product model_call_mode=inference sparse_matmul=False use_chunked_prefill=true prefill_chunk_size=128

Result:
TTFT: 94.060 ms, TPOT: 88.579 ms, Requests/s: 2.038
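
The command above enables chunked prefill (use_chunked_prefill=true, prefill_chunk_size=128), which processes the prompt in fixed-size slices instead of one long prefill call. A minimal sketch of the chunking step (hypothetical helper, not the MaxEngine API):

```python
def split_prompt_into_chunks(prompt_tokens: list[int], chunk_size: int = 128) -> list[list[int]]:
  """Split a tokenized prompt into fixed-size chunks; the last chunk may be shorter."""
  return [prompt_tokens[i:i + chunk_size] for i in range(0, len(prompt_tokens), chunk_size)]

# With max_prefill_predict_length=1024 and chunk_size=128, a full-length prompt
# becomes eight prefill calls, each reusing the KV cache built by earlier chunks.
```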

For comparison, JetStream's benchmark serving was run with 90 requests and the following settings:
--min-input-length 900
--max-input-length 1024
--max-output-length 256
--dataset openorca

Request throughput: 1.77 requests/s
Mean TTFT: 954.47 ms
Mean TPOT: 86.27 ms

In the JetStream serving benchmark, TTFT is inflated because generate steps block incoming prefill requests, whereas this benchmark script does not include that queuing effect.
The TPOT and request throughput are similar between the two.

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.
