
Commit 1ceee21

feat: add option to parallelize pdfs (#41)
This implements #17. It adds a boolean `split_pdf_page` to `PartitionParameters` which, if True, causes the PDF to be split on the client side into 1-page chunks that are sent to the API. The returned elements are joined into a single result list.
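
The split-and-join flow described above can be sketched as follows. This is a hedged illustration, not the SDK's actual implementation: `partition_pages`, `send_page`, and `fake_send_page` are hypothetical names, the stand-in function replaces the real API call, and the sketch assumes the PDF has already been split into per-page byte chunks.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Element = Dict[str, object]


def partition_pages(
    pages: List[bytes],
    send_page: Callable[[bytes, int], List[Element]],
    max_workers: int = 5,
) -> List[Element]:
    """Send each single-page chunk concurrently and join the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, regardless of which
        # call finishes first, so the joined list keeps the page order.
        per_page = pool.map(send_page, pages, range(1, len(pages) + 1))
        return [element for elements in per_page for element in elements]


def fake_send_page(content: bytes, page_number: int) -> List[Element]:
    """Hypothetical stand-in for an API call; returns one element per page."""
    return [{"text": content.decode(), "metadata": {"page_number": page_number}}]


elements = partition_pages([b"first", b"second", b"third"], fake_send_page)
print([e["metadata"]["page_number"] for e in elements])
```

Because each page is processed without knowledge of the others, anything that needs cross-page context (such as `parent_id`) cannot be reconstructed by this kind of join.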
1 parent b677352 commit 1ceee21

17 files changed, +413 -56 lines changed

.github/workflows/ci.yaml

+4-4
Original file line number | Diff line number | Diff line change
@@ -19,22 +19,22 @@ jobs:
1919
python-version: ["3.9","3.10","3.11"]
2020
runs-on: ubuntu-latest
2121
steps:
22-
- uses: actions/checkout@v3
22+
- uses: actions/checkout@v4
2323
- name: Set up Python ${{ matrix.python-version }}
24-
uses: actions/setup-python@v4
24+
uses: actions/setup-python@v5
2525
with:
2626
python-version: ${{ matrix.python-version }}
2727
- name: Install dependencies
2828
run: make install-test
2929
- name: Run unit tests
3030
run: |
3131
pip install .
32-
make test
32+
make test-unit
3333
3434
lint:
3535
runs-on: ubuntu-latest
3636
steps:
37-
- uses: actions/checkout@v3
37+
- uses: actions/checkout@v4
3838
- name: Install dependencies
3939
run: pip install .[dev]
4040
- name: Lint

.github/workflows/speakeasy_sdk_generation.yml

+2
@@ -25,6 +25,8 @@ jobs:
2525
openapi_doc_auth_header: x-api-key
2626
openapi_docs: |
2727
- https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json
28+
overlay_docs: |
29+
- ../../overlay_client.yaml
2830
publish_python: true
2931
speakeasy_version: latest
3032
secrets:

.gitignore

+3
@@ -7,3 +7,6 @@ __pycache__/
77

88
# human-added ignore files
99
.ipynb_checkpoints/
10+
.idea/
11+
openapi.json
12+
openapi_client.json

Makefile

+32-5
@@ -1,15 +1,15 @@
11
PACKAGE_NAME := unstructured-python-client
22
CURRENT_DIR := $(shell pwd)
33
ARCH := $(shell uname -m)
4+
DOCKER_IMAGE ?= downloads.unstructured.io/unstructured-io/unstructured-api:latest
45

56
###########
67
# Install #
78
###########
89

910
.PHONY: install-test
1011
install-test:
11-
pip install pytest
12-
pip install requests_mock
12+
pip install pytest requests_mock pypdf deepdiff
1313

1414
.PHONY: install-dev
1515
install-dev:
@@ -25,14 +25,41 @@ install: install-test install-dev
2525
#################
2626

2727
.PHONY: test
28-
test:
29-
PYTHONPATH=. pytest \
30-
_test_unstructured_client
28+
test: test-unit test-integration-docker
29+
30+
.PHONY: test-unit
31+
test-unit:
32+
PYTHONPATH=. pytest _test_unstructured_client -v -k "unit"
33+
34+
# Assumes you have unstructured-api running on localhost:8000
35+
.PHONY: test-integration
36+
test-integration:
37+
PYTHONPATH=. pytest _test_unstructured_client -v -k "integration"
38+
39+
# Runs the unstructured-api in docker for tests
40+
.PHONY: test-integration-docker
41+
test-integration-docker:
42+
-docker stop unstructured-api && docker kill unstructured-api
43+
docker run --name unstructured-api -p 8000:8000 -d --rm ${DOCKER_IMAGE} --host 0.0.0.0 && \
44+
curl -s -o /dev/null --retry 10 --retry-delay 5 --retry-all-errors http://localhost:8000/general/docs && \
45+
PYTHONPATH=. pytest _test_unstructured_client -v -k "integration" && \
46+
docker kill unstructured-api
3147

3248
.PHONY: lint
3349
lint:
3450
pylint --rcfile=pylintrc src
3551

52+
#############
53+
# Speakeasy #
54+
#############
55+
56+
.PHONY: client-generate
57+
client-generate:
58+
wget -nv -q -O openapi.json https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json
59+
speakeasy overlay validate -o ./overlay_client.yaml
60+
speakeasy overlay apply -s ./openapi.json -o ./overlay_client.yaml > ./openapi_client.json
61+
speakeasy generate sdk -s ./openapi_client.json -o ./ -l python
62+
3663
###########
3764
# Jupyter #
3865
###########

README.md

+68-17
@@ -23,10 +23,8 @@ pip install unstructured-client
2323
```
2424
<!-- End SDK Installation [installation] -->
2525

26-
/docs/models/shared/partitionparameters.md
27-
2826
## Usage
29-
Only the `files` parameter is required. See the [general partition](/docs/models/shared/partitionparameters.md) page for all available parameters. 
27+
Only the `files` parameter is required.
3028

3129
```python
3230
from unstructured_client import UnstructuredClient
@@ -35,7 +33,7 @@ from unstructured_client.models.errors import SDKError
3533

3634
s = UnstructuredClient(api_key_auth="YOUR_API_KEY")
3735

38-
filename = "sample-docs/layout-parser-paper-fast.pdf"
36+
filename = "_sample_docs/layout-parser-paper-fast.pdf"
3937

4038
with open(filename, "rb") as f:
4139
# Note that this currently only supports a single file
@@ -56,20 +54,26 @@ try:
5654
print(resp.elements[0])
5755
except SDKError as e:
5856
print(e)
59-
60-
# {
61-
# 'type': 'UncategorizedText',
62-
# 'element_id': 'fc550084fda1e008e07a0356894f5816',
63-
# 'metadata': {
64-
# 'filename': 'layout-parser-paper-fast.pdf',
65-
# 'filetype': 'application/pdf',
66-
# 'languages': ['eng'],
67-
# 'page_number': 1
68-
# }
69-
# }
57+
```
58+
59+
Result:
60+
61+
```json
62+
{
63+
'type': 'UncategorizedText',
64+
'element_id': 'fc550084fda1e008e07a0356894f5816',
65+
'metadata': {
66+
'filename': 'layout-parser-paper-fast.pdf',
67+
'filetype': 'application/pdf',
68+
'languages': ['eng'],
69+
'page_number': 1
70+
}
71+
}
7072
```
7173

72-
## Change the base URL
74+
### UnstructuredClient
75+
76+
#### Change the base URL
7377

7478
If you are self hosting the API, or developing locally, you can change the server URL when setting up the client.
7579

@@ -86,6 +90,24 @@ s = unstructured_client.UnstructuredClient(
8690
api_key_auth=api_key,
8791
)
8892
```
93+
94+
### PartitionParameters
95+
96+
See the [general partition](/docs/models/shared/partitionparameters.md) page for all available parameters. 
97+
98+
#### Splitting PDF by pages
99+
100+
In order to speed up processing of long PDF files, set `split_pdf_page=True`. This causes the PDF
101+
to be split page by page on the client side before it is sent to the API, with the individual responses
102+
combined into a single result. This works only for PDF files, so don't set it for other file types.
103+
104+
Warning: this feature causes the `parent_id` metadata generation in elements to be disabled, as that
105+
requires having context of multiple pages.
106+
107+
The number of threads used for sending individual PDF pages is controlled by the
108+
`UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS` env var. By default it equals 5.
109+
It can't be more than 15, to avoid excessive resource usage and costs.
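
One way to read the thread setting described above is sketched below. This is an assumption-laden illustration rather than the SDK's code: `split_call_threads` is a hypothetical helper, and both clamping to the 1..15 range and falling back to the default on unparsable values are assumptions; the README only states the default of 5 and the upper bound of 15.

```python
import os
from typing import Mapping, Optional

DEFAULT_THREADS = 5  # default stated in the README
MAX_THREADS = 15     # upper bound stated in the README


def split_call_threads(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the number of threads to use for per-page calls."""
    env = os.environ if env is None else env
    raw = env.get("UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS")
    if raw is None:
        return DEFAULT_THREADS
    try:
        requested = int(raw)
    except ValueError:
        # Assumption: an unparsable value falls back to the default.
        return DEFAULT_THREADS
    # Assumption: out-of-range values are clamped rather than rejected.
    return max(1, min(requested, MAX_THREADS))


print(split_call_threads({"UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS": "50"}))
```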
110+
89111
<!-- No SDK Example Usage -->
90112
<!-- No SDK Available Operations -->
91113
<!-- No Pagination -->
@@ -119,9 +141,38 @@ This SDK is in beta, and there may be breaking changes between versions without
119141
to a specific package version. This way, you can install the same version each time without breaking changes unless you are intentionally
120142
looking for the latest version.
121143

144+
### Installation Instructions for Local Development
145+
146+
The following instructions are intended to help you get up and running with `unstructured-python-client` locally if you are planning to contribute to the project.
147+
148+
* Using `pyenv` to manage virtualenv's is recommended but not necessary
149+
* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
150+
* `brew install pyenv-virtualenv`
151+
* `pyenv install 3.10`
152+
* Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).
153+
154+
* Create a virtualenv to work in and activate it, e.g. for one named `unstructured-python-client`:
155+
156+
`pyenv virtualenv 3.10 unstructured-python-client`
157+
`pyenv activate unstructured-python-client`
158+
159+
* Run `make install` and `make test`
160+
122161
### Contributions
123162

124-
While we value open-source contributions to this SDK, this library is generated programmatically.
163+
While we value open-source contributions to this SDK, this library is generated programmatically by Speakeasy. In order to start working with this repo, you need to:
164+
1. Install Speakeasy client locally https://github.com/speakeasy-api/speakeasy#installation
165+
2. Run `speakeasy auth login`
166+
3. Run `make client-generate`. This lets you iterate on development with the Python client.
167+
168+
There are two important files used by `make client-generate`:
169+
1. `openapi.json` which is actually not stored here, [but fetched from unstructured-api](https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json), represents the API that is supported on backend.
170+
2. `overlay_client.yaml` is a handcrafted diff that when applied over above, produces `openapi_client.json`
171+
which is used to generate SDK.
172+
173+
Once a PR with changes is merged, GitHub CI will autogenerate the Speakeasy client in a new PR, using
174+
the `openapi.json` and `overlay_client.yaml`. You will have to manually bring back the human-created lines in it.
175+
125176
Feel free to open a PR or a Github issue as a proof of concept and we'll do our best to include it in a future release!
126177

127178
### SDK Created by [Speakeasy](https://www.speakeasyapi.dev/docs/sdk-design/python/methodology-python)

_sample_docs/fake.doc

18 KB
Binary file not shown.

_sample_docs/layout-parser-paper.pdf

4.47 MB
Binary file not shown.

_sample_docs/list-item-example-1.pdf

46.6 KB
Binary file not shown.

_test_unstructured_client/test__decorators.py

+86-8
@@ -1,9 +1,12 @@
1+
import os
2+
import pypdf
13
import pytest
4+
import requests
5+
from deepdiff import DeepDiff
26

37
from unstructured_client import UnstructuredClient
48
from unstructured_client.models import shared
5-
from unstructured_client.models.errors import SDKError
6-
9+
from unstructured_client.models.errors import SDKError, HTTPValidationError
710

811
FAKE_KEY = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"
912

@@ -20,7 +23,7 @@
2023
"unstructured-000mock.api.unstructuredapp.io/general/v0/general",
2124
],
2225
)
23-
def test_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
26+
def test_unit_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
2427
client = UnstructuredClient(
2528
server_url=server_url,
2629
api_key_auth=FAKE_KEY,
@@ -42,20 +45,20 @@ def test_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
4245
"http://localhost:8000/general/v0/general",
4346
],
4447
)
45-
def test_clean_server_url_fixes_malformed_localhost_url(server_url: str):
48+
def test_unit_clean_server_url_fixes_malformed_localhost_url(server_url: str):
4649
client = UnstructuredClient(
4750
server_url=server_url,
4851
api_key_auth=FAKE_KEY,
4952
)
5053
assert client.general.sdk_configuration.server_url == "http://localhost:8000"
5154

5255

53-
def test_clean_server_url_returns_empty_string_given_empty_string():
56+
def test_unit_clean_server_url_returns_empty_string_given_empty_string():
5457
client = UnstructuredClient( server_url="", api_key_auth=FAKE_KEY)
5558
assert client.general.sdk_configuration.server_url == ""
5659

5760

58-
def test_clean_server_url_returns_None_given_no_server_url():
61+
def test_unit_clean_server_url_returns_None_given_no_server_url():
5962
client = UnstructuredClient(
6063
api_key_auth=FAKE_KEY,
6164
)
@@ -71,7 +74,7 @@ def test_clean_server_url_returns_None_given_no_server_url():
7174
"unstructured-000mock.api.unstructuredapp.io/general/v0/general",
7275
],
7376
)
74-
def test_clean_server_url_fixes_malformed_urls_with_positional_arguments(
77+
def test_unit_clean_server_url_fixes_malformed_urls_with_positional_arguments(
7578
server_url: str,
7679
):
7780
client = UnstructuredClient(
@@ -85,7 +88,7 @@ def test_clean_server_url_fixes_malformed_urls_with_positional_arguments(
8588
)
8689

8790

88-
def test_suggest_defining_url_issues_a_warning_on_a_401():
91+
def test_unit_suggest_defining_url_issues_a_warning_on_a_401():
8992
client = UnstructuredClient(
9093
api_key_auth=FAKE_KEY,
9194
)
@@ -108,3 +111,78 @@ def test_suggest_defining_url_issues_a_warning_on_a_401():
108111
match="If intending to use the paid API, please define `server_url` in your request.",
109112
):
110113
client.general.partition(req)
114+
115+
116+
@pytest.mark.parametrize("call_threads", [1, 2, 5])
117+
@pytest.mark.parametrize(
118+
"filename, expected_ok",
119+
[
120+
("_sample_docs/list-item-example-1.pdf", True), # 1 page
121+
("_sample_docs/layout-parser-paper-fast.pdf", True), # 2 pages
122+
("_sample_docs/layout-parser-paper.pdf", True), # 16 pages
123+
("_sample_docs/fake.doc", True),
124+
("_sample_docs/fake.doc", False), # This will append .pdf to filename to fool first line of filetype detection, to simulate decoding error
125+
],
126+
)
127+
def test_integration_split_pdf_has_same_output_as_non_split(
128+
call_threads: int,
129+
filename: str,
130+
expected_ok: bool,
131+
caplog
132+
):
133+
"""
134+
Tests that output that we get from the split-by-page pdf is the same as from non-split.
135+
136+
Requires unstructured-api running in bg. See Makefile for how to run it.
137+
Doesn't check for raw_response as there's no clear pattern for how it changes with the number of pages / call_threads.
138+
"""
139+
try:
140+
response = requests.get("http://localhost:8000/general/docs")
141+
assert response.status_code == 200, "The unstructured-api is not running on localhost:8000"
142+
except requests.exceptions.ConnectionError:
143+
assert False, "The unstructured-api is not running on localhost:8000"
144+
145+
client = UnstructuredClient(
146+
api_key_auth=FAKE_KEY,
147+
server_url="localhost:8000"
148+
)
149+
150+
with open(filename, "rb") as f:
151+
files = shared.Files(
152+
content=f.read(),
153+
file_name=filename,
154+
)
155+
156+
if not expected_ok:
157+
files.file_name += ".pdf"
158+
159+
req = shared.PartitionParameters(
160+
files=files,
161+
strategy='fast',
162+
languages=["eng"],
163+
split_pdf_page=True,
164+
)
165+
166+
os.environ["UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS"] = str(call_threads)
167+
168+
try:
169+
resp_split = client.general.partition(req)
170+
except (HTTPValidationError, AttributeError) as exc:
171+
if not expected_ok:
172+
assert "error arose when splitting by pages" in caplog.text
173+
assert "File does not appear to be a valid PDF" in str(exc)
174+
return
175+
else:
176+
pytest.exit("unexpected error", returncode=1)
177+
178+
req.split_pdf_page = False
179+
resp_single = client.general.partition(req)
180+
181+
assert len(resp_split.elements) == len(resp_single.elements)
182+
assert resp_split.content_type == resp_single.content_type
183+
assert resp_split.status_code == resp_single.status_code
184+
185+
# Difference in the parent_id is expected, because parent_ids are assigned when element crosses page boundary
186+
diff = DeepDiff(t1=resp_split.elements, t2=resp_single.elements,
187+
exclude_regex_paths=r"root\[\d+\]\['metadata'\]\['parent_id'\]")
188+
assert len(diff) == 0
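
The comparison in the test above ignores `parent_id` when checking split output against non-split output. A dependency-free sketch of the same idea follows; it is an illustration only, with a hypothetical helper `without_parent_id` and made-up sample elements, and it is not a replacement for the `DeepDiff` call used in the test.

```python
from copy import deepcopy
from typing import Any, Dict, List


def without_parent_id(elements: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of the elements with metadata['parent_id'] removed."""
    cleaned = deepcopy(elements)  # leave the caller's data untouched
    for element in cleaned:
        element.get("metadata", {}).pop("parent_id", None)
    return cleaned


# Made-up sample data: the split result lacks parent_id, the single call has it.
split = [
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Body", "metadata": {"page_number": 2}},
]
single = [
    {"type": "Title", "text": "Intro", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Body",
     "metadata": {"page_number": 2, "parent_id": "abc"}},
]

print(split == single)
print(without_parent_id(split) == without_parent_id(single))
```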
