Production deployment via HuggingFace #227

kdaniel21 · 2024-11-04T08:51:22Z

kdaniel21
Nov 4, 2024

Hi, first of all, thank you for the project, I've been testing it in the last few days, and it seems perform great!

We'd love to try Docling in production, but we were wondering about what the best way would be to deploy it. It doesn't need to "hyperscale" to hundreds of documents per minute, but we still want to make sure that it can deal with a dozen of documents per minute, and it doesn't crash for smaller spikes.

The "intuitive" idea i had was deploying it on a "beefy" VM with dedicated GPUs, and running it from there as usual (probably with a queue to avoid overloading it). However, I was wondering if it'd possible to deploy the models (EasyOCR, layout recognition, table extraction etc.) separately e.g. on SageMaker/AzureML through their HuggingFace integration, and run the rest of the pipeline run on a "regular" CPU-only VM. This would allow scaling the computationally heavy tasks independently from the code that's responsible for executing the pipeline, and thus could be easily integrated into an existing service/backend as one would only have to install the docling package, deploy the models, and use the "managed models" for inference.

Additionally, this would also make it more accessible to use it locally, as one could just deploy the pre-trained models remotely, and use them even with tighter hardware constraints (despite docling having a great performance even on consumer grade hardware!).

I also saw that there was a docling-ibm-models repo, which seems to contain the actual code for the models, and also docling-serve, which seems to be a simple API wrapper for docling.

Thanks for taking the time, and looking forward to using Docling in production!

cau-git · 2024-11-05T20:32:10Z

cau-git
Nov 5, 2024
Maintainer

Hi @kdaniel21, great to see you thinking towards production usage of docling! I am happy to discuss further with you on how docling can be leveraged in production scenarios. As a starting point, let me share with you a few pieces that may give you relevant context.

Some background: We have distilled the docling package out of Deep Search, a sophisticated platform for document knowledge extraction, which we have been engineering over the past several years in IBM Research. It is a research asset which can be licensed commercially, and has built-in all the distributed scaling capabilities, can deploy models on large-scale cloud infrastucture, and more. It is the result of years of systems and AI engineering research, which we have also published about here: https://arxiv.org/abs/2206.00785.
One important realization from this work is that the I/O and network overhead of leveraging distributed/remote models can be very significant, and a more monolithic solution with all models running locally where the input documents first arrive, will in many scenarios beat a highly distributed processing scheme. This is at least valid if you have highly optimized and computationally moderately expensive models, such as we provide in docling so far (OCR is the exception here).
You can get a good idea of how to run docling at massive scale by checking out IBM's data-prep-kit. data-prep-kit is an open-source toolkit to scale data pre-processing (e.g. for LLM training) with so-called transforms, such as this one using docling. It supports running either locally or on a ray cluster, where each ray worker will own an independent instance of a docling converter. We are successfully using it to bulk-process PDF libraries across hundreds of nodes.

1 reply

kdaniel21 Nov 6, 2024
Author

Hi @cau-git, thank you for elaborate response!
First of all, thanks for the paper, I'll give it a read. Your research seems to be indeed pretty interesting and relevant. :)

To be a bit more specific with our use case, we have two priorities:

We want to integrate Docling into our architecture/infrastructure as simply as possible. We have a monolith in a language other than Python, and therefore the easiest would be:
1. Creating a "binding" for the docling (Python) library to do everything that's "computationally cheap" as part of our monolith (likely in the same process).
2. Deploying the models on a serverless inference platform (SageMaker, HF, AzureML), so that the Docling binding from point 1 can call these.
Performance to the extent where the service runs reliably and doesn't crash even if a longer document comes in. Having a few seconds of network overhead doesn't matter, and we could partially solve this with a e.g. a queue.

I can see how a monolithic approach leads to better performance in this case, but we fear that in our case, this comes at the cost of infrastructure complexity:

Either we'd run the whole thing in the same process as our monolith, and have to allocate more resources, and risk e.g. crashes*.
We'd have to deploy a service resembling docling-serve that can scale independently from our monolith (with more resources). This would work, just requires us to maintain/monitor/manage one more service.

*: My performance concerns also come from my local experience. I'm using normal M2, and with ~8 short (1-5 pages) documents, the process constantly crashed. This could be mitigated with queuing, but our production instances are running with even fewer resources (AWS ECS, scale vertically if needed). This makes me think that it could cause problems there (especially with peaks), and having the option to simply use hosted models would make local development also easier.

While this is specific to us, I believe that this is actually a common use case where people want to add document extraction to their existing system as easily as possible. The scale is not big enough to justify a completely separate service, but at the same time running it in the "main" service isn't a realistic option either, especially with OCR (e.g. for a "regular REST API"). Then, using a managed service is probably the cheapest option (almost no maintenance, high reliability).

Thanks again for taking the time! I wanted to give some more context to our use case and why we believe that deploying it on a serverless platform could be a common use case.
Hence the question is:

do you have any recommendations to integrate Docling that's reliable and simple?

As I said in the other GH issue, we'd be ready to try to contribute depending on the scope.

PeterStaar-IBM · 2024-11-06T07:49:37Z

PeterStaar-IBM
Nov 6, 2024
Maintainer

@kdaniel21 To follow uo with what @cau-git said, we are currently writing a 2nd version of the technical report, which will include a lot more benchmarking on a variety of hardware. Could you inform us where you intend to run it, specifically the specs (pure cpu, GPU accelerated, etc). This could give us a good indication on what to benchmark.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Production deployment via HuggingFace #227

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Production deployment via HuggingFace #227

kdaniel21 Nov 4, 2024

Replies: 2 comments · 1 reply

cau-git Nov 5, 2024 Maintainer

kdaniel21 Nov 6, 2024 Author

PeterStaar-IBM Nov 6, 2024 Maintainer

kdaniel21
Nov 4, 2024

Replies: 2 comments 1 reply

cau-git
Nov 5, 2024
Maintainer

kdaniel21 Nov 6, 2024
Author

PeterStaar-IBM
Nov 6, 2024
Maintainer