Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add option to parallelize pdfs #17

Closed
awalker4 opened this issue Dec 6, 2023 · 5 comments · Fixed by #41
Closed

feat: add option to parallelize pdfs #17

awalker4 opened this issue Dec 6, 2023 · 5 comments · Fixed by #41

Comments

@awalker4
Copy link
Collaborator

awalker4 commented Dec 6, 2023

We aren’t planning to bring server side parallel mode back, but this can be a game changer for big documents. We should have the client optionally break a pdf into pages and send these off async. See this example.

We need to see if this it's possible to hook into the autogenerated code for something like this.

@cragwolfe
Copy link
Contributor

This should likely be the default. IMO, Unstructured-IO/unstructured#2461 is a blocker.

@awalker4
Copy link
Collaborator Author

awalker4 commented Feb 6, 2024

Note that #32 has established the pattern for hooking into the autogenerated code. Our custom logic will live in the new decorator dir, and we'll wrap the endpoint function. When Speakeasy makes a new pr, we'll just add the one liners back (we'll have a script to do this soon).

@hubert-rutkowski85
Copy link
Contributor

It's not easy with new client parameter split_pdf_page (boolean), because the python client is now generated based on the openapi.yaml , which in turn is generated based on code by unstructured-api. So, if we add there a new param, it will be also visible in the Swagger preview (like http://127.0.0.1:8000/general/docs) of endpoints, misleading users that backend API takes such parameter. But it's not, as clearly the param is for client only.

I looked for ways of hiding the endpoint param in Openapi spec, but it's not supported even for the more basic case of endpoints OAI/OpenAPI-Specification#433 Perhaps some clever hacking would work here, but it might not be pretty. So I wanted to ask if it's ok for now (or forever), if we have a parameter which will work only on client side, and will be ignored backend side? We could rename it to client_split_pdf_page.

@awalker4
Copy link
Collaborator Author

awalker4 commented Feb 13, 2024

Yeah, that's a good question. Adding something on the server for this does feel messy. I wonder if we could establish a better pattern for configuring these sorts of things on the client - some sort of Configuration object that we pass in to the sdk constructor. We can see what Speakeasy thinks about this, or figure out how to support the behavior on our own.

I think, for now, we can add the logic and assume that it will be the default behavior. We could add configuration via env variable if users really want to be able to turn it off.

@hubert-rutkowski85 hubert-rutkowski85 linked a pull request Feb 14, 2024 that will close this issue
@hubert-rutkowski85
Copy link
Contributor

Ok I asked the Speakeasy support and they suggested a way for adding the parameter only on client side, through overlay. This is relatively simple and looks like the best solution. The draft implementation is already in PR.

hubert-rutkowski85 added a commit that referenced this issue Mar 5, 2024
This is implementation of #17

It adds a boolean `split_pdf_page` to PartitionParameters, which if
True, causes the pdf to be split at client side to 1-page chunks, and
send to API. The returned elements are joined to a single result list.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants