
server: add request aggregation functionality #10660

Open · wants to merge 1 commit into master

Conversation

kalabYibeltal

This PR introduces request aggregation for the /completion API. Instead of processing each request individually as it arrives, this functionality (enabled with the -ag or --aggregate parameter) groups requests for more efficient processing.

When aggregation is enabled, requests are collected until a defined time window elapses or the specified buffer size (-bs, --buffer-size) is reached. The buffered requests are then organized into an array and processed together, or in smaller chunks, depending on the block size given on the command line (-bks, --block-size).
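For concreteness, here is a minimal sketch of the flush policy described above. The names (`pending_request`, `should_flush`, the window and buffer parameters) are hypothetical and not the PR's actual code:

```cpp
#include <chrono>
#include <string>
#include <vector>

// A pending /completion request, reduced to what matters for aggregation.
struct pending_request {
    int         id;
    std::string prompt;
};

// Decide whether the buffered requests should be flushed for processing:
// either the aggregation window has elapsed or the buffer is full.
bool should_flush(const std::vector<pending_request> & pending,
                  std::chrono::steady_clock::time_point window_start,
                  std::chrono::milliseconds window,
                  size_t buffer_size) {
    const bool window_elapsed = std::chrono::steady_clock::now() - window_start >= window;
    const bool buffer_full    = pending.size() >= buffer_size;
    return !pending.empty() && (window_elapsed || buffer_full);
}
```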

We conducted multiple experiments to compare the performance of this method with standard request processing. Results demonstrate up to a 20% reduction in average processing time, particularly with smaller block sizes.

The two main advantages of this method come from:

  1. Optimized Request Sorting: During the time window, sorting requests by input prompt size reduces the overall average processing duration.
  2. Efficient Grouping: Grouping prompts of similar sizes into the same block (array) further improves processing efficiency (a sketch of both steps follows this list).
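
A minimal sketch of these two steps, again with hypothetical names rather than the PR's actual code: sort the flushed buffer by prompt length, then slice it into blocks of at most `block_size` requests so that each block holds prompts of similar size.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct pending_request {
    int         id;
    std::string prompt;
};

// Sort the flushed buffer by prompt length, then split it into blocks of at
// most block_size requests, so each block groups prompts of similar length.
std::vector<std::vector<pending_request>> make_blocks(std::vector<pending_request> buffered,
                                                      size_t block_size) {
    std::sort(buffered.begin(), buffered.end(),
              [](const pending_request & a, const pending_request & b) {
                  return a.prompt.size() < b.prompt.size();
              });

    std::vector<std::vector<pending_request>> blocks;
    for (size_t i = 0; i < buffered.size(); i += block_size) {
        const size_t end = std::min(i + block_size, buffered.size());
        blocks.emplace_back(buffered.begin() + i, buffered.begin() + end);
    }
    return blocks;
}

int main() {
    std::vector<pending_request> buffered = {
        {0, "a very very long prompt ..."}, {1, "short"}, {2, "medium prompt"},
    };
    for (const auto & block : make_blocks(buffered, 2)) {
        printf("block of %zu request(s)\n", block.size());
    }
}
```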

[Figure: benchmark results comparing thread counts and aggregation modes]

This is one of the experiments we conducted to evaluate different thread counts and modes of aggregation. "Length sorting" refers to a configuration where the batch size is set to 1. We used a variety of prompt sizes to simulate real-world workloads.

@ngxson
Collaborator

ngxson commented Dec 4, 2024

I had a quick look at the implementation. The idea sounds good, but this introduces quite a lot of unnecessary complexity to our code base by adding more std::lock and std::future to the code. This can be quite risky when it comes to thread safety.

Also, I would say that we already do kind of aggregate requests internally by having multiple slots, and we also do continuous batching, which further ensures that the batch is full when the number of requests is high. I'm not sure what kind of benchmark allows you to gain that 20% performance boost (how many users? how many requests? length of each request? on which hardware?), can you explain more about that?
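
For readers following along, here is a toy illustration of the slot / continuous-batching idea being referred to. It is a simplified model with made-up names, not the server's actual implementation: each slot holds one in-flight request, every decode step batches one token from each active slot, and a freed slot is immediately refilled from the queue.

```cpp
#include <cstdio>
#include <deque>
#include <utility>
#include <vector>

// Toy model: each slot holds one in-flight request with some tokens left to decode.
struct slot {
    int  request_id  = -1;
    int  tokens_left = 0;
    bool active() const { return tokens_left > 0; }
};

int main() {
    std::deque<std::pair<int, int>> queue = { {1, 3}, {2, 1}, {3, 2}, {4, 2} }; // (id, tokens to generate)
    std::vector<slot> slots(2);                                                 // e.g. two parallel slots

    while (true) {
        // Refill free slots from the queue (this is where new requests "join" the batch).
        for (auto & s : slots) {
            if (!s.active() && !queue.empty()) {
                s.request_id  = queue.front().first;
                s.tokens_left = queue.front().second;
                queue.pop_front();
            }
        }
        // One decode step: the batch contains one token per active slot.
        int batch = 0;
        for (auto & s : slots) {
            if (s.active()) {
                s.tokens_left--;
                batch++;
            }
        }
        if (batch == 0) break;
        printf("decode step with batch of %d token(s)\n", batch);
    }
}
```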

@kalabYibeltal
Author

Thank you for your response.

  1. The code could still work without std::future. I suspect it might be a little slower without it, but I would have to test it to confirm (a minimal sketch of the promise/future hand-off follows this list).
  2. We used std::lock to make sure we do not introduce RAW, WAR, or WAW hazards between threads on the shared variables (buffers). We could limit the use of std::lock to when aggregation is enabled, so that the unmodified server's behavior stays intact (also sketched below).
  3. Sorry for the unclear statement. We used an Intel E5-2683 v3 CPU with 14 cores and 8 GB of RAM (a server from CloudLab). For the benchmarks we did not use official benchmark suites; instead we used sample workloads/prompts gathered from open-source sources such as the Stanford Question Answering Dataset. For clients, we used custom Python scripts that call the API to imitate real users.
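
A minimal sketch of points 1 and 2 above, with hypothetical names and not the PR's actual code: the std::mutex is taken only on the aggregation path, and each request carries a std::promise whose std::future the HTTP handler can wait on.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

struct pending_request {
    std::string               prompt;
    std::promise<std::string> result; // the HTTP handler waits on the matching future
};

struct aggregator {
    bool                         enabled = false; // set from -ag / --aggregate
    std::mutex                   mtx;
    std::vector<pending_request> pending;

    // Called from the HTTP handler thread. Returns a future the handler can
    // block on; the mutex is taken only when aggregation is enabled, so the
    // non-aggregated path is untouched.
    std::future<std::string> submit(std::string prompt) {
        pending_request req;
        req.prompt = std::move(prompt);
        auto fut = req.result.get_future();

        if (enabled) {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push_back(std::move(req));
        } else {
            // Non-aggregated path: answer immediately (placeholder completion).
            req.result.set_value("completion for: " + req.prompt);
        }
        return fut;
    }

    // Called from the worker thread once a buffered block has been processed.
    void flush() {
        std::vector<pending_request> batch;
        {
            std::lock_guard<std::mutex> lock(mtx);
            batch.swap(pending);
        }
        for (auto & req : batch) {
            req.result.set_value("completion for: " + req.prompt);
        }
    }
};
```

In this shape the non-aggregated path never touches the mutex, which is what we mean by keeping the unmodified server's behavior intact.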

One last thing to note: although we achieved a good average duration across all requests in a single run, the advantage comes with a latency trade-off for some requests, particularly long-prompt requests that arrive at the server first.

@ngxson
Collaborator

ngxson commented Dec 4, 2024

I would say that we don't expect to add more std::mutex any time soon, since it adds too much complexity and may introduce deadlock situations in the future.

Also, I believe that this feature can be implemented with much less code and complexity, but we need to verify whether it really increases performance or not.

Regarding testing, we're still missing a lot of information to conclude that this really increases performance. You should include a reproducible script somewhere. Also, just a reminder: there are some server parameters that can be tweaked to (maybe) achieve the same result, especially batch size and number of slots.
