Heterogeneous parallel processing to avoid CPU & GPU Idle time #258
base: main
Conversation
hi @Usama3059, appreciate the PR! However, one thing is not clear:
Regarding the diagram you attached above: we batch the 8 requests so the GPU will process them in parallel, but your diagram shows them sequentially.
Hello @aniketmaurya, consider a total of 32 requests with max_batch_size = 8; that gives 4 sequential batches, right? In complex AI systems there is a combination of CPU- and GPU-based workloads. CPU-based preprocessing involves, for example, OpenCV image transformations, filtering, downloading external data, etc.; in NLP, for example, external API calls, vector search, prompt formatting or filtering, collecting data from DBs, etc. In the bottom part of the diagram, both the CPU and GPU stay busy, which means lower cost, better resource efficiency, and faster responses. Let me know if you need more information.
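A minimal, self-contained sketch of the idea described above (not the PR's implementation): while the GPU-bound stage works on batch N, a separate worker already preprocesses batch N+1. The function names and timings are invented for illustration, and threads stand in for processes only because the fake work is time.sleep.

# Illustrative sketch only: fake CPU/GPU work via time.sleep, threads instead of processes.
import queue
import threading
import time

NUM_BATCHES = 4        # e.g. 32 requests with max_batch_size = 8
PREPROCESS_TIME = 0.5  # stand-in for CPU work (image transforms, tokenization, ...)
INFERENCE_TIME = 0.5   # stand-in for GPU work (model forward pass)

def cpu_preprocess(batch_id):
    time.sleep(PREPROCESS_TIME)
    return f"batch-{batch_id}"

def gpu_infer(batch):
    time.sleep(INFERENCE_TIME)
    return f"prediction for {batch}"

def run_sequential():
    # Top part of the diagram: each batch is preprocessed, then inferred, one after another.
    start = time.monotonic()
    for i in range(NUM_BATCHES):
        gpu_infer(cpu_preprocess(i))
    return time.monotonic() - start

def run_pipelined():
    # Bottom part of the diagram: preprocessing of the next batch overlaps with inference.
    ready = queue.Queue(maxsize=1)

    def producer():
        for i in range(NUM_BATCHES):
            ready.put(cpu_preprocess(i))
        ready.put(None)  # sentinel: no more batches

    start = time.monotonic()
    threading.Thread(target=producer, daemon=True).start()
    while (batch := ready.get()) is not None:
        gpu_infer(batch)
    return time.monotonic() - start

if __name__ == "__main__":
    print(f"sequential: {run_sequential():.1f}s, pipelined: {run_pipelined():.1f}s")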
Sounds exciting @Usama3059! Can you show benchmarks to see how it compares with the current master? The speed impact needs to be very clear and meaningful for us to consider merging it, given the added internal complexity 😊. But it sounds promising! cc @lantiga
Hey @Usama3059, great contribution. I was actually thinking about options to get the same effect (overlapping preprocessing and GPU compute). What I had in mind originally was to have decode_request executed prior; however, your idea probably goes in a better direction. I just need to think about it a bit more. The UX could probably be improved, and we also need to refine the performance profile. Overall, though, this is a good direction to take; let's explore it further!
Sounds great @Usama3059, thanks for the explanation!! Looking forward to it.
Thanks, @williamFalcon. Yes, I will soon share detailed benchmarks for different scenarios, since the speed improvements depend on them.
@lantiga @aniketmaurya I have added the initial code for separate process workers. Now we can process multiple request batches in multiple workers on both the CPU and GPU simultaneously, keeping both the CPU and GPU busy and improving memory usage. Please let me know if there is anything I've missed. Also, @aniketmaurya for …
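A rough, standalone sketch of the two-stage worker layout being described (the worker names mirror the diff further down, but the bodies and queue wiring here are simplified assumptions, not the PR's code):

# Simplified sketch: a CPU-side preprocess worker feeding a GPU-side inference worker via queues.
import multiprocessing as mp

def preprocess_worker(request_queue, preprocess_queue):
    # CPU-bound stage: pull raw request batches, transform them, hand them to the inference stage.
    while True:
        batch = request_queue.get()
        if batch is None:                            # sentinel -> shut down and propagate it
            preprocess_queue.put(None)
            return
        preprocess_queue.put([x * 2 for x in batch])  # stand-in for real preprocessing

def inference_worker(preprocess_queue, response_queue):
    # GPU-bound stage: consume preprocessed batches and produce predictions.
    while True:
        batch = preprocess_queue.get()
        if batch is None:
            return
        response_queue.put([x + 1 for x in batch])    # stand-in for model.predict

if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    request_queue, preprocess_queue, response_queue = ctx.Queue(), ctx.Queue(), ctx.Queue()
    workers = [
        ctx.Process(target=preprocess_worker, args=(request_queue, preprocess_queue)),
        ctx.Process(target=inference_worker, args=(preprocess_queue, response_queue)),
    ]
    for w in workers:
        w.start()
    request_queue.put([1, 2, 3, 4])
    request_queue.put(None)
    print(response_queue.get())                       # [3, 5, 7, 9]
    for w in workers:
        w.join()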
So here's my proposal to clarify the API:
We can't call the regular … I believe this would make it simpler for users to think about a … What do you think @Usama3059 @aniketmaurya @bhimrazy?
@lantiga I think the proposed API is a great idea for initial simplicity at this stage, and I also agree with …
No, preprocess will take the same batch size as predict; otherwise it'll get complicated for users. The lifecycle will be as sketched below. As a side note, I'm imagining that if you are implementing preprocess you might leave batch with the default implementation.
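A hedged reconstruction of that lifecycle, assuming the hook order decode_request → batch → preprocess → predict → unbatch → encode_response (preprocess is the hook proposed in this PR, not an already-released LitServe API; the method bodies are placeholders):

# Sketch of the proposed lifecycle; `preprocess` is what this PR adds, the rest are existing hooks.
from litserve import LitAPI

class PipelinedAPI(LitAPI):
    def setup(self, device):
        # runs on inference workers; load the real model here
        self.model = lambda batch: [len(x) for x in batch]  # placeholder model

    def decode_request(self, request):
        # per-request, CPU side
        return request["text"]

    # `batch` / `unbatch` left at their default implementations, as suggested above.

    def preprocess(self, inputs):
        # CPU-bound work on the same batch that `predict` will receive
        return [x.strip().lower() for x in inputs]

    def predict(self, inputs):
        # GPU-bound work
        return self.model(inputs)

    def encode_response(self, output):
        # per-response, CPU side
        return {"length": output}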
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
##            main    #258    +/-  ##
=====================================
- Coverage     95%     89%     -6%
=====================================
  Files         14      14
  Lines       1082    1246   +164
=====================================
+ Hits        1025    1106    +81
- Misses        57     140    +83
Looks good @Usama3059!
- Can we rename ready_to_inference_queue to something like preprocess_queue? The process is called preprocess_process, so I think it makes sense to use this name.
- CI is failing with the following issue; could you please look into it:
tests/test_loops.py::test_run_streaming_loop_timeout
/opt/conda/lib/python3.10/site-packages/_pytest/threadexception.py:82: PytestUnhandledThreadExceptionWarning: Exception in thread Thread-10 (run_streaming_loop)
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/opt/conda/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/opt/conda/lib/python3.10/site-packages/litserve/loops.py", line 286, in run_streaming_loop
time.monotonic() - timestamp > lit_api.request_timeout
TypeError: unsupported operand type(s) for -: 'float' and 'NoneType'
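For context, the traceback points at lit_api.request_timeout being None when the streaming loop computes the elapsed time. A minimal, hypothetical sketch of the kind of guard that avoids the TypeError (this is not LitServe's actual code; the helper name and signature are invented):

# Hypothetical helper: only compare elapsed time when a numeric timeout is configured.
import time

def is_timed_out(timestamp: float, request_timeout) -> bool:
    # The loop in the traceback does `time.monotonic() - timestamp > lit_api.request_timeout`
    # directly, which raises TypeError when request_timeout is None.
    if request_timeout is None:
        return False
    return time.monotonic() - timestamp > request_timeout

print(is_timed_out(time.monotonic() - 100.0, None))   # False: no timeout configured
print(is_timed_out(time.monotonic() - 100.0, 30.0))   # True: 100s elapsed > 30s timeout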
src/litserve/server.py (outdated)
        process_list.append(process)
    else:
        for worker_id, device in enumerate(self.inference_workers):
            self.ready_to_inference_queue = None
Why set it on self (self.ready_to_inference_queue)? It could be a local variable too.
@aniketmaurya I think this warning …
                target=preprocess_worker,
                args=(
                    self.lit_api,
                    self.lit_spec,
                    device,
                    worker_id,
                    self.request_queue,
                    self.preprocess_queue,
                    self.response_queues,
                    self.max_batch_size,
                    self.batch_timeout,
                ),
            )
            process.start()
            process_list.append(process)

        for worker_id, device in enumerate(self.inference_workers):
            if len(device) == 1:
                device = device[0]
            worker_id = f"inference_{worker_id}"
            self.workers_setup_status[worker_id] = False
            ctx = mp.get_context("spawn")
            process = ctx.Process(
                target=inference_worker,
All good except one concern @lantiga @Usama3059: if we start a new process with litapi, then we also have to call litapi.setup, which will load the model on the "non-inference worker" process too.
@aniketmaurya Based on the discussion, we're considering preprocess workers as optional and not requiring any setup, for simplicity. The initialized LitAPI will be passed to the preprocess workers, and setup will only happen on the inference workers. Let me know if I'm missing any information.
Let's consider the following example where we create the tokenizer and model in the setup method. With preprocess enabled, it won't be able to access self.tokenizer unless we set up the LitAPI in the preprocess process as well.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from litserve import LitAPI, LitServer
class BERTLitAPI(LitAPI):
def setup(self, device):
self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
def decode_request(self, request):
return request["text"]
def preprocess(self, inputs):
return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)
....
if __name__ == "__main__":
api = BERTLitAPI()
server = LitServer(api, accelerator='cuda', devices=1)
server.run(port=8000)
@ankitsharma07 Could we provide an error/warning to the user, as shown in the attached image? Your thoughts on this?
Yes, a warning solves one part: users know what is wrong. Another issue is that setup would be called num_preprocess_workers times, so the model would be initialized more times than expected, possibly leading to OOM.
@aniketmaurya For simplicity, by 'warning' I mean it will prevent launch_inference_worker from starting and raise this warning so the user can correct the code. We can place this check after enable_process_worker, since preprocess workers are optional and will only be used if needed. Let me know if you have any other ideas.
It's a valid concern @aniketmaurya. We could have an optional setup_preprocessing hook as well in the future.
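Purely as an illustration of that idea, a hypothetical sketch of what an optional setup_preprocessing hook could look like (the hook does not exist in LitServe; the split of responsibilities is an assumption based on this thread):

# Hypothetical `setup_preprocessing` hook: CPU-side objects created on preprocess workers,
# while `setup` keeps loading the model on inference workers only.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from litserve import LitAPI

class BERTLitAPI(LitAPI):
    def setup(self, device):
        # inference workers only: load the (GPU) model
        self.model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

    def setup_preprocessing(self, device):
        # preprocess workers only: lightweight CPU-side objects such as the tokenizer
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def decode_request(self, request):
        return request["text"]

    def preprocess(self, inputs):
        return self.tokenizer(inputs, return_tensors="pt", padding=True, truncation=True, max_length=512)

    def predict(self, batch):
        return self.model(**batch).logits.argmax(dim=-1).tolist()

    def encode_response(self, output):
        return {"label": output}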
@aniketmaurya is there anything else that needs to be worked on for this PR?
What does this PR do?
Processing requires efficient use of both the CPU and GPU. Complex inference pipelines require preprocessing and transformations on the CPU before model inference on the GPU; running these stages in parallel, batch by batch, keeps both devices busy and therefore speeds up inference.
GOOD:
This will help separate the work per device and therefore use each device efficiently to handle large-scale processing pipelines.
BAD:
The implementation can be done much better, so that postprocessing can also be parallelized on the CPU, and a better API interface could also be introduced; I just want to put the idea out first to get reviews from the repo owners.