
server: add request aggregation functionality #10660

Open · wants to merge 1 commit into master

Conversation

kalabYibeltal

This PR introduces request aggregation for the /completion API. Instead of processing each request individually as it arrives, this functionality (enabled with the -ag or --aggregate parameter) groups requests for more efficient processing.

When aggregation is enabled, requests are collected until a defined time window elapses or the specified buffer size (-bs, --buffer-size) is reached. The buffered requests are then organized into an array and processed together, or in smaller chunks, depending on the block size given on the command line (-bks, --block-size).
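For concreteness, here is a minimal sketch of the flush policy described above. The names (`pending_request`, `should_flush`, the window and buffer parameters) are hypothetical and not the PR's actual code:

```cpp
#include <chrono>
#include <string>
#include <vector>

// A pending /completion request, reduced to what matters for aggregation.
struct pending_request {
    int         id;
    std::string prompt;
};

// Decide whether the buffered requests should be flushed for processing:
// either the aggregation window has elapsed or the buffer is full.
bool should_flush(const std::vector<pending_request> & pending,
                  std::chrono::steady_clock::time_point window_start,
                  std::chrono::milliseconds window,
                  size_t buffer_size) {
    const bool window_elapsed = std::chrono::steady_clock::now() - window_start >= window;
    const bool buffer_full    = pending.size() >= buffer_size;
    return !pending.empty() && (window_elapsed || buffer_full);
}
```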

We conducted multiple experiments to compare the performance of this method with standard request processing. Results demonstrate up to a 20% reduction in average processing time, particularly with smaller block sizes.

The two main advantages of this method come from:

  1. Optimized Request Sorting: During the time window, sorting requests by input prompt size reduces the overall average processing duration.
  2. Efficient Grouping: Grouping prompts of similar sizes into the same block (array) further improves processing efficiency (a sketch of both steps follows this list).
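
A minimal sketch of these two steps, again with hypothetical names rather than the PR's actual code: sort the flushed buffer by prompt length, then slice it into blocks of at most `block_size` requests so that each block holds prompts of similar size.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

struct pending_request {
    int         id;
    std::string prompt;
};

// Sort the flushed buffer by prompt length, then split it into blocks of at
// most block_size requests, so each block groups prompts of similar length.
std::vector<std::vector<pending_request>> make_blocks(std::vector<pending_request> buffered,
                                                      size_t block_size) {
    std::sort(buffered.begin(), buffered.end(),
              [](const pending_request & a, const pending_request & b) {
                  return a.prompt.size() < b.prompt.size();
              });

    std::vector<std::vector<pending_request>> blocks;
    for (size_t i = 0; i < buffered.size(); i += block_size) {
        const size_t end = std::min(i + block_size, buffered.size());
        blocks.emplace_back(buffered.begin() + i, buffered.begin() + end);
    }
    return blocks;
}

int main() {
    std::vector<pending_request> buffered = {
        {0, "a very very long prompt ..."}, {1, "short"}, {2, "medium prompt"},
    };
    for (const auto & block : make_blocks(buffered, 2)) {
        printf("block of %zu request(s)\n", block.size());
    }
}
```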

[Figure: benchmark results comparing thread counts and aggregation modes]

This is one of the experiments we conducted to evaluate different thread counts and modes of aggregation. "Length sorting" refers to a configuration where the batch size is set to 1. We used a variety of prompt sizes to simulate real-world workloads.

@ngxson
Collaborator

ngxson commented Dec 4, 2024

I had a quick look at the implementation. The idea sounds good, but this introduces quite a lot of unnecessary complexity to our code base by adding more std::lock and std::future to the code. This can be quite risky when it comes to thread safety.

Also, I would say that we already do kind of aggregate requests internally by having multiple slots, and we also do continuous batching, which further ensures that the batch is full when the number of requests is high. I'm not sure what kind of benchmark allows you to gain that 20% performance boost (how many users? how many requests? length of each request? on which hardware?), can you explain more about that?
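
For readers following along, here is a toy illustration of the slot / continuous-batching idea being referred to. It is a simplified model with made-up names, not the server's actual implementation: each slot holds one in-flight request, every decode step batches one token from each active slot, and a freed slot is immediately refilled from the queue.

```cpp
#include <cstdio>
#include <deque>
#include <utility>
#include <vector>

// Toy model: each slot holds one in-flight request with some tokens left to decode.
struct slot {
    int  request_id  = -1;
    int  tokens_left = 0;
    bool active() const { return tokens_left > 0; }
};

int main() {
    std::deque<std::pair<int, int>> queue = { {1, 3}, {2, 1}, {3, 2}, {4, 2} }; // (id, tokens to generate)
    std::vector<slot> slots(2);                                                 // e.g. two parallel slots

    while (true) {
        // Refill free slots from the queue (this is where new requests "join" the batch).
        for (auto & s : slots) {
            if (!s.active() && !queue.empty()) {
                s.request_id  = queue.front().first;
                s.tokens_left = queue.front().second;
                queue.pop_front();
            }
        }
        // One decode step: the batch contains one token per active slot.
        int batch = 0;
        for (auto & s : slots) {
            if (s.active()) {
                s.tokens_left--;
                batch++;
            }
        }
        if (batch == 0) break;
        printf("decode step with batch of %d token(s)\n", batch);
    }
}
```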

@kalabYibeltal
Author

Thank you for your response.

  1. The code could still work without std::future. I suspect it might be a little slower without it, but I would have to test it to confirm (a minimal sketch of the promise/future hand-off follows this list).
  2. We used std::lock to make sure we do not introduce RAW, WAR, or WAW hazards between threads on the shared variables (buffers). We could limit the use of std::lock to when aggregation is enabled, so that the unmodified server's behavior stays intact (also sketched below).
  3. Sorry for the unclear statement. We used an Intel E5-2683 v3 CPU with 14 cores and 8 GB of RAM (a server from CloudLab). For the benchmarks we did not use official benchmark suites; instead we used sample workloads/prompts gathered from open-source sources such as the Stanford Question Answering Dataset. For clients, we used custom Python scripts that call the API to imitate real users.
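
A minimal sketch of points 1 and 2 above, with hypothetical names and not the PR's actual code: the std::mutex is taken only on the aggregation path, and each request carries a std::promise whose std::future the HTTP handler can wait on.

```cpp
#include <future>
#include <mutex>
#include <string>
#include <vector>

struct pending_request {
    std::string               prompt;
    std::promise<std::string> result; // the HTTP handler waits on the matching future
};

struct aggregator {
    bool                         enabled = false; // set from -ag / --aggregate
    std::mutex                   mtx;
    std::vector<pending_request> pending;

    // Called from the HTTP handler thread. Returns a future the handler can
    // block on; the mutex is taken only when aggregation is enabled, so the
    // non-aggregated path is untouched.
    std::future<std::string> submit(std::string prompt) {
        pending_request req;
        req.prompt = std::move(prompt);
        auto fut = req.result.get_future();

        if (enabled) {
            std::lock_guard<std::mutex> lock(mtx);
            pending.push_back(std::move(req));
        } else {
            // Non-aggregated path: answer immediately (placeholder completion).
            req.result.set_value("completion for: " + req.prompt);
        }
        return fut;
    }

    // Called from the worker thread once a buffered block has been processed.
    void flush() {
        std::vector<pending_request> batch;
        {
            std::lock_guard<std::mutex> lock(mtx);
            batch.swap(pending);
        }
        for (auto & req : batch) {
            req.result.set_value("completion for: " + req.prompt);
        }
    }
};
```

In this shape the non-aggregated path never touches the mutex, which is what we mean by keeping the unmodified server's behavior intact.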

One last thing to note: although we achieved a good average duration across all requests in a single run, the advantage comes with a latency trade-off for some requests, particularly long-prompt requests that arrive at the server first.

@ngxson
Collaborator

ngxson commented Dec 4, 2024

I would say that we don't expect to add more std::mutex any time soon, since it adds too much complexity and may introduce deadlock situations in the future.

Also, I believe that this feature can be implemented with much less code and complexity, but we need to verify whether it really increases performance or not.

Regarding testing, we're still missing a lot of information to conclude that this really increases performance. You should include a reproducible script somewhere. Also, just a reminder: there are some server parameters that can be tweaked to (maybe) achieve the same result, especially batch size and number of slots.
