-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add option to parallelize pdfs #41
feat: add option to parallelize pdfs #41
Conversation
fef5ccc
to
0ab389a
Compare
…k for api readiness.
…tation about development in this repo
…ion tests, and each type can be easily run.
@hubert-rutkowski85, Can you provide an explanation in the description of the PR the changes you're making? I will start testing these changes locally, but it'd be best to also provide instructions for reviewers on how to test in the future. |
@Coniferish just updated description for how to test it, tomorrow will update add description of changes. In my prev project the issues contained description of what the change was about, and PR was only code (to not duplicate the content). If here are different habits here, I will gladly adjust. |
@hubert-rutkowski85, there's no reference to an open issue in the PR, but it would be good to generally include some kind of summary similar to the template for PRs in Unstructured (another thing I just added to the list for #43) |
I've created the branch from issue page, so it's automatically connected. But I do agree that on github UI this information is weakly presented. Gitlab does this in much better way. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a few notes but approving in advance.
- I think I'd prefer to see the openapi.json stay in the api repo, otherwise it becomes difficult to know where the source of truth is. As long as we maintain this file outside of FastAPI, I think it makes sense to just add
pdf_page_split
as a param over there. - We may need to add deepdiff and pypdf to the
make install
- The tests didn't run for me at first - the PYTHONPATH needs to be
src
- The last speakeasy pr added hooks, which will be our way to insert logic without having to edit the prs every time. It's probably worth a new pr, but keep in mind that we'll want to move our logic over soon.
# Conflicts: # src/unstructured_client/general.py
Done.
Yeah I also had problem with running tests at beginning, but don't remember exactly what was the reason and solution. However, make unit-test works for me with PYTHONPATH=src and .
Good to know. I won't make such changes in this PR, as I want to close it ASAP. |
LGTM! |
This is implementation of #17
It adds a boolean
split_pdf_page
to PartitionParameters, which if True, causes the pdf to be split at client side to 1-page chunks, and send to API. The returned elements are joined to a single result list.In order to run new tests that have been added for it, run
make test-integration-docker