server: add request aggregation functionality #10660
Open
+333
−28
This PR adds request aggregation to the /completion API. Instead of processing each request individually as it arrives, the server groups requests for more efficient processing. Aggregation is enabled with the aggregate parameter (-ag or --aggregate).
When aggregation is enabled, requests are collected until a defined time window elapses or the specified buffer size (-bs, --buffer-size) is reached. The buffered requests are then organized into an array and processed either all together or in smaller chunks, depending on the block size set on the command line (-bks, --block-size).
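The buffering behaviour described above can be sketched roughly as follows. This is a minimal Python illustration, not the C++ code in this PR: the class `RequestAggregator`, its methods, and the `window_s` and `clock` parameters are hypothetical names chosen for the example. Only the buffer-size and block-size concepts come from the description.

```python
import time
from typing import Callable, List, Optional


class RequestAggregator:
    """Hypothetical sketch: buffer prompts, then release them in blocks."""

    def __init__(self, buffer_size: int, block_size: int, window_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if buffer_size < 1 or block_size < 1:
            raise ValueError("buffer_size and block_size must be >= 1")
        self.buffer_size = buffer_size   # analogous to --buffer-size
        self.block_size = block_size     # analogous to --block-size
        self.window_s = window_s         # assumed time window, in seconds
        self.clock = clock               # injectable so the window is testable
        self.buffer: List[str] = []
        self.window_start: Optional[float] = None

    def add(self, prompt: str) -> List[List[str]]:
        """Buffer one prompt; return blocks to process if a flush triggers."""
        if self.window_start is None:
            # The window starts when the first prompt enters an empty buffer.
            self.window_start = self.clock()
        self.buffer.append(prompt)
        return self.poll()

    def poll(self) -> List[List[str]]:
        """Flush if the buffer is full or the time window has elapsed."""
        if not self.buffer:
            return []
        full = len(self.buffer) >= self.buffer_size
        expired = self.clock() - self.window_start >= self.window_s
        return self.flush() if (full or expired) else []

    def flush(self) -> List[List[str]]:
        """Empty the buffer and split its contents into blocks."""
        pending, self.buffer = self.buffer, []
        self.window_start = None
        return [pending[i:i + self.block_size]
                for i in range(0, len(pending), self.block_size)]


# Usage: a buffer of 4 prompts released in blocks of 2.
agg = RequestAggregator(buffer_size=4, block_size=2, window_s=60.0)
for p in ["a", "b", "c"]:
    agg.add(p)               # buffer not yet full, nothing is released
blocks = agg.add("d")        # fourth prompt fills the buffer
print(blocks)                # [['a', 'b'], ['c', 'd']]
```

In this sketch the time window is only checked when `add` or `poll` is called. A real server would presumably need a timer or background thread so that a partially filled buffer is flushed even when no further requests arrive.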
We conducted multiple experiments to compare the performance of this method with standard request processing. Results demonstrate up to a 20% reduction in average processing time, particularly with smaller block sizes.
The two main advantages of this method come from:
This is one of the experiments we conducted to evaluate different thread counts and aggregation modes. "Length sorting" refers to a configuration where the batch size is set to 1. We used a variety of prompt sizes to simulate real-world workloads.