feat: add option to parallelize pdfs #41

Merged
merged 28 commits into from
Mar 5, 2024
28 commits
339a62c
docs: speakeasy local development
hubert-rutkowski85 Feb 13, 2024
579f2eb
docs: speakeasy local development - added overlays for having client-…
hubert-rutkowski85 Feb 14, 2024
0ab389a
ci: added overlay. Updated openapi.json to newest version.
hubert-rutkowski85 Feb 15, 2024
b50969d
chore: Reverted old openapi.json, re-generated client.
hubert-rutkowski85 Feb 19, 2024
b4106a5
feat: working splitting of PDF file by page at client.
hubert-rutkowski85 Feb 20, 2024
6053b01
test: simple integration test for split pdf functionality. Added two …
hubert-rutkowski85 Feb 20, 2024
e3cb54e
test: add some more parametrization. Handle case of doc parsing. Chec…
hubert-rutkowski85 Feb 20, 2024
664932d
doc: add documentation about usage of split_pdf_page. Improve documen…
hubert-rutkowski85 Feb 20, 2024
ecea6d4
chore: cleaning.
hubert-rutkowski85 Feb 20, 2024
f1b9a3e
test: add makefile docker-test
hubert-rutkowski85 Feb 21, 2024
5c1c6e1
test: refactor the structure, so there's separation for unit/integrat…
hubert-rutkowski85 Feb 21, 2024
c4dde92
chore: lint fixes.
hubert-rutkowski85 Feb 21, 2024
dec63f9
chore: improve typing and structure of code.
hubert-rutkowski85 Feb 21, 2024
704d9a2
chore: small clean.
hubert-rutkowski85 Feb 21, 2024
31fd9f6
chore: review fixes part 1.
hubert-rutkowski85 Feb 22, 2024
db5cd65
chore: review fixes part 2.
hubert-rutkowski85 Feb 22, 2024
fb0350a
chore: review fix: support kwarg for request.
hubert-rutkowski85 Feb 26, 2024
0154602
chore: refactor env parsing to separate function.
hubert-rutkowski85 Feb 26, 2024
1f14a7a
chore: refactor for better structure, add typing, more consts. Deepco…
hubert-rutkowski85 Feb 26, 2024
65ae850
chore: improve error handling when getting non-pdf files.
hubert-rutkowski85 Feb 26, 2024
1574a88
Merge branch 'main' into 17-feat-add-option-to-parallelize-pdfs
hubert-rutkowski85 Feb 28, 2024
ac377c1
ci: disable docker tests to fix out of space error caused by download…
hubert-rutkowski85 Feb 28, 2024
fcbae0d
ci: fix linting error.
hubert-rutkowski85 Feb 28, 2024
581de28
ci: bump actions used, to get rid of Github warnings.
hubert-rutkowski85 Feb 28, 2024
eaee0d5
chore: revert generated code changes.
hubert-rutkowski85 Mar 5, 2024
2885205
Merge branch 'main' into 17-feat-add-option-to-parallelize-pdfs
hubert-rutkowski85 Mar 5, 2024
3adb6d8
doc: improve readme and review fix.
hubert-rutkowski85 Mar 5, 2024
519a646
chore: review change to openapi.json storage and sdk generation.
hubert-rutkowski85 Mar 5, 2024
8 changes: 4 additions & 4 deletions .github/workflows/ci.yaml
@@ -19,22 +19,22 @@ jobs:
python-version: ["3.9","3.10","3.11"]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
uses: actions/setup-python@v5
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: make install-test
- name: Run unit tests
run: |
pip install .
make test
make test-unit

lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- uses: actions/checkout@v4
- name: Install dependencies
run: pip install .[dev]
- name: Lint
2 changes: 2 additions & 0 deletions .github/workflows/speakeasy_sdk_generation.yml
@@ -25,6 +25,8 @@ jobs:
openapi_doc_auth_header: x-api-key
openapi_docs: |
- https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json
overlay_docs: |
- ../../overlay_client.yaml
publish_python: true
speakeasy_version: latest
secrets:
3 changes: 3 additions & 0 deletions .gitignore
@@ -7,3 +7,6 @@ __pycache__/

# human-added ignore files
.ipynb_checkpoints/
.idea/
openapi.json
openapi_client.json
37 changes: 32 additions & 5 deletions Makefile
@@ -1,15 +1,15 @@
PACKAGE_NAME := unstructured-python-client
CURRENT_DIR := $(shell pwd)
ARCH := $(shell uname -m)
DOCKER_IMAGE ?= downloads.unstructured.io/unstructured-io/unstructured-api:latest

###########
# Install #
###########

.PHONY: install-test
install-test:
pip install pytest
pip install requests_mock
pip install pytest requests_mock pypdf deepdiff

.PHONY: install-dev
install-dev:
@@ -25,14 +25,41 @@ install: install-test install-dev
#################

.PHONY: test
test:
PYTHONPATH=. pytest \
_test_unstructured_client
test: test-unit test-integration-docker

.PHONY: test-unit
test-unit:
PYTHONPATH=. pytest _test_unstructured_client -v -k "unit"

# Assumes you have unstructured-api running on localhost:8000
.PHONY: test-integration
test-integration:
PYTHONPATH=. pytest _test_unstructured_client -v -k "integration"

# Runs the unstructured-api in docker for tests
.PHONY: test-integration-docker
test-integration-docker:
-docker stop unstructured-api && docker kill unstructured-api
docker run --name unstructured-api -p 8000:8000 -d --rm ${DOCKER_IMAGE} --host 0.0.0.0 && \
curl -s -o /dev/null --retry 10 --retry-delay 5 --retry-all-errors http://localhost:8000/general/docs && \
PYTHONPATH=. pytest _test_unstructured_client -v -k "integration" && \
docker kill unstructured-api

.PHONY: lint
lint:
pylint --rcfile=pylintrc src

#############
# Speakeasy #
#############

.PHONY: client-generate
client-generate:
wget -nv -q -O openapi.json https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json
speakeasy overlay validate -o ./overlay_client.yaml
speakeasy overlay apply -s ./openapi.json -o ./overlay_client.yaml > ./openapi_client.json
speakeasy generate sdk -s ./openapi_client.json -o ./ -l python

###########
# Jupyter #
###########
85 changes: 68 additions & 17 deletions README.md
@@ -23,10 +23,8 @@ pip install unstructured-client
```
<!-- End SDK Installation [installation] -->

/docs/models/shared/partitionparameters.md

## Usage
Only the `files` parameter is required. See the [general partition](/docs/models/shared/partitionparameters.md) page for all available parameters. 
Only the `files` parameter is required.

```python
from unstructured_client import UnstructuredClient
@@ -35,7 +33,7 @@ from unstructured_client.models.errors import SDKError

s = UnstructuredClient(api_key_auth="YOUR_API_KEY")

filename = "sample-docs/layout-parser-paper-fast.pdf"
filename = "_sample_docs/layout-parser-paper-fast.pdf"

with open(filename, "rb") as f:
# Note that this currently only supports a single file
@@ -56,20 +54,26 @@ try:
print(resp.elements[0])
except SDKError as e:
print(e)

# {
# 'type': 'UncategorizedText',
# 'element_id': 'fc550084fda1e008e07a0356894f5816',
# 'metadata': {
# 'filename': 'layout-parser-paper-fast.pdf',
# 'filetype': 'application/pdf',
# 'languages': ['eng'],
# 'page_number': 1
# }
# }
```

Result:

```json
{
  "type": "UncategorizedText",
  "element_id": "fc550084fda1e008e07a0356894f5816",
  "metadata": {
    "filename": "layout-parser-paper-fast.pdf",
    "filetype": "application/pdf",
    "languages": ["eng"],
    "page_number": 1
  }
}
```

## Change the base URL
### UnstructuredClient

#### Change the base URL

If you are self-hosting the API, or developing locally, you can change the server URL when setting up the client.

@@ -86,6 +90,24 @@ s = unstructured_client.UnstructuredClient(
api_key_auth=api_key,
)
```

### PartitionParameters

See the [general partition](/docs/models/shared/partitionparameters.md) page for all available parameters. 

#### Splitting PDF by pages

To speed up processing of long PDF files, set `split_pdf_page=True`. The client then splits the PDF
page by page before sending it to the API, and combines the individual responses into a single
result. This works only for PDF files, so don't set it for other file types.
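
The split-and-combine flow described above can be sketched with the standard library alone. This is an illustrative sketch, not the SDK's implementation: `partition_pages_in_parallel`, `send_page`, and `fake_send_page` are hypothetical names, the "pages" are plain byte strings rather than real one-page PDFs, and rewriting `page_number` after the merge is an assumption about how a combined result could stay consistent with a non-split call.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

Element = Dict[str, Any]


def partition_pages_in_parallel(
    pages: List[bytes],
    send_page: Callable[[bytes], List[Element]],
    max_workers: int = 5,
) -> List[Element]:
    """Send each page separately and merge the results in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, even if later pages finish first.
        per_page = list(executor.map(send_page, pages))

    combined: List[Element] = []
    for page_number, elements in enumerate(per_page, start=1):
        for element in elements:
            # Assumption: each request saw a one-page document, so the real
            # page number is restored here on the client side.
            metadata = dict(element.get("metadata", {}))
            metadata["page_number"] = page_number
            combined.append({**element, "metadata": metadata})
    return combined


def fake_send_page(page: bytes) -> List[Element]:
    """Hypothetical stand-in for the HTTP call; returns one element per page."""
    return [{"type": "NarrativeText", "text": page.decode(), "metadata": {"page_number": 1}}]


result = partition_pages_in_parallel(
    [b"first", b"second", b"third"], fake_send_page, max_workers=2
)
print([(e["text"], e["metadata"]["page_number"]) for e in result])
```

Because every page is sent as its own request, no single request sees neighbouring pages, which is why anything that depends on cross-page context cannot be produced this way.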

Warning: this feature disables generation of the `parent_id` metadata in elements, because that
requires context from multiple pages.

The number of threads used for sending individual PDF pages is controlled by the
`UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS` env var. It defaults to 5 and can't be more than 15,
to avoid excessive resource usage and costs.
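
One way to read the thread limit from the environment is sketched below. The function name `get_split_call_threads` and the constants are hypothetical, and the handling of out-of-range or non-numeric values (capping at 15, falling back to 5) is an assumption; the text above only states the default and the upper bound.

```python
import os

ENV_VAR = "UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS"
DEFAULT_THREADS = 5   # default stated in the README text
MAX_THREADS = 15      # upper bound stated in the README text


def get_split_call_threads() -> int:
    """Return the number of worker threads to use for split-PDF calls."""
    raw = os.getenv(ENV_VAR)
    if raw is None:
        return DEFAULT_THREADS
    try:
        requested = int(raw)
    except ValueError:
        # Assumption: a non-numeric value falls back to the default.
        return DEFAULT_THREADS
    if requested < 1:
        # Assumption: zero or negative values also fall back to the default.
        return DEFAULT_THREADS
    # Assumption: values above the limit are capped rather than rejected.
    return min(requested, MAX_THREADS)


os.environ[ENV_VAR] = "40"
print(get_split_call_threads())
```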

<!-- No SDK Example Usage -->
<!-- No SDK Available Operations -->
<!-- No Pagination -->
@@ -119,9 +141,38 @@ This SDK is in beta, and there may be breaking changes between versions without
to a specific package version. This way, you can install the same version each time without breaking changes unless you are intentionally
looking for the latest version.

### Installation Instructions for Local Development

The following instructions are intended to help you get up and running with `unstructured-python-client` locally if you are planning to contribute to the project.

* Using `pyenv` to manage virtualenvs is recommended but not necessary
* Mac install instructions. See [here](https://github.com/Unstructured-IO/community#mac--homebrew) for more detailed instructions.
* `brew install pyenv-virtualenv`
* `pyenv install 3.10`
* Linux instructions are available [here](https://github.com/Unstructured-IO/community#linux).

* Create a virtualenv to work in and activate it, e.g. for one named `unstructured-python-client`:

`pyenv virtualenv 3.10 unstructured-python-client`
`pyenv activate unstructured-python-client`

* Run `make install` and `make test`

### Contributions

While we value open-source contributions to this SDK, this library is generated programmatically.
While we value open-source contributions to this SDK, this library is generated programmatically by Speakeasy. To start working with this repo, you need to:
1. Install the Speakeasy client locally: https://github.com/speakeasy-api/speakeasy#installation
2. Run `speakeasy auth login`
3. Run `make client-generate`. This lets you iterate on development with the Python client.

There are two important files used by `make client-generate`:
1. `openapi.json`, which is not stored here [but fetched from unstructured-api](https://raw.githubusercontent.com/Unstructured-IO/unstructured-api/main/openapi.json), represents the API that is supported on the backend.
2. `overlay_client.yaml` is a handcrafted diff that, when applied over the above, produces `openapi_client.json`,
which is used to generate the SDK.

Once a PR with changes is merged, GitHub CI will autogenerate the Speakeasy client in a new PR, using
`openapi.json` and `overlay_client.yaml`. You will have to manually bring back the human-created lines in it.

Feel free to open a PR or a Github issue as a proof of concept and we'll do our best to include it in a future release!

### SDK Created by [Speakeasy](https://www.speakeasyapi.dev/docs/sdk-design/python/methodology-python)
Binary file added _sample_docs/fake.doc
Binary file not shown.
Binary file added _sample_docs/layout-parser-paper.pdf
Binary file not shown.
Binary file added _sample_docs/list-item-example-1.pdf
Binary file not shown.
94 changes: 86 additions & 8 deletions _test_unstructured_client/test__decorators.py
@@ -1,9 +1,12 @@
import os
import pypdf
import pytest
import requests
from deepdiff import DeepDiff

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

from unstructured_client.models.errors import SDKError, HTTPValidationError

FAKE_KEY = "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"

@@ -20,7 +23,7 @@
"unstructured-000mock.api.unstructuredapp.io/general/v0/general",
],
)
def test_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
def test_unit_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
client = UnstructuredClient(
server_url=server_url,
api_key_auth=FAKE_KEY,
@@ -42,20 +45,20 @@ def test_clean_server_url_fixes_malformed_paid_api_url(server_url: str):
"http://localhost:8000/general/v0/general",
],
)
def test_clean_server_url_fixes_malformed_localhost_url(server_url: str):
def test_unit_clean_server_url_fixes_malformed_localhost_url(server_url: str):
client = UnstructuredClient(
server_url=server_url,
api_key_auth=FAKE_KEY,
)
assert client.general.sdk_configuration.server_url == "http://localhost:8000"


def test_clean_server_url_returns_empty_string_given_empty_string():
def test_unit_clean_server_url_returns_empty_string_given_empty_string():
client = UnstructuredClient( server_url="", api_key_auth=FAKE_KEY)
assert client.general.sdk_configuration.server_url == ""


def test_clean_server_url_returns_None_given_no_server_url():
def test_unit_clean_server_url_returns_None_given_no_server_url():
client = UnstructuredClient(
api_key_auth=FAKE_KEY,
)
@@ -71,7 +74,7 @@ def test_clean_server_url_returns_None_given_no_server_url():
"unstructured-000mock.api.unstructuredapp.io/general/v0/general",
],
)
def test_clean_server_url_fixes_malformed_urls_with_positional_arguments(
def test_unit_clean_server_url_fixes_malformed_urls_with_positional_arguments(
server_url: str,
):
client = UnstructuredClient(
@@ -85,7 +88,7 @@ def test_clean_server_url_fixes_malformed_urls_with_positional_arguments(
)


def test_suggest_defining_url_issues_a_warning_on_a_401():
def test_unit_suggest_defining_url_issues_a_warning_on_a_401():
client = UnstructuredClient(
api_key_auth=FAKE_KEY,
)
@@ -108,3 +111,78 @@ def test_suggest_defining_url_issues_a_warning_on_a_401():
match="If intending to use the paid API, please define `server_url` in your request.",
):
client.general.partition(req)


@pytest.mark.parametrize("call_threads", [1, 2, 5])
@pytest.mark.parametrize(
    "filename, expected_ok",
    [
        ("_sample_docs/list-item-example-1.pdf", True),  # 1 page
        ("_sample_docs/layout-parser-paper-fast.pdf", True),  # 2 pages
        ("_sample_docs/layout-parser-paper.pdf", True),  # 16 pages
        ("_sample_docs/fake.doc", True),
        ("_sample_docs/fake.doc", False),  # This will append .pdf to filename to fool first line of filetype detection, to simulate decoding error
    ],
)
def test_integration_split_pdf_has_same_output_as_non_split(
    call_threads: int,
    filename: str,
    expected_ok: bool,
    caplog
):
    """
    Tests that output that we get from the split-by-page pdf is the same as from non-split.

    Requires unstructured-api running in bg. See Makefile for how to run it.
    Doesn't check for raw_response as there's no clear pattern for how it changes with the number of pages / call_threads.
    """
    try:
        response = requests.get("http://localhost:8000/general/docs")
        assert response.status_code == 200, "The unstructured-api is not running on localhost:8000"
    except requests.exceptions.ConnectionError:
        assert False, "The unstructured-api is not running on localhost:8000"

    client = UnstructuredClient(
        api_key_auth=FAKE_KEY,
        server_url="localhost:8000"
    )

    with open(filename, "rb") as f:
        files = shared.Files(
            content=f.read(),
            file_name=filename,
        )

    if not expected_ok:
        files.file_name += ".pdf"

    req = shared.PartitionParameters(
        files=files,
        strategy='fast',
        languages=["eng"],
        split_pdf_page=True,
    )

    os.environ["UNSTRUCTURED_CLIENT_SPLIT_CALL_THREADS"] = str(call_threads)

    try:
        resp_split = client.general.partition(req)
    except (HTTPValidationError, AttributeError) as exc:
        if not expected_ok:
            assert "error arose when splitting by pages" in caplog.text
            assert "File does not appear to be a valid PDF" in str(exc)
            return
        else:
            pytest.exit("unexpected error", returncode=1)

    req.split_pdf_page = False
    resp_single = client.general.partition(req)

    assert len(resp_split.elements) == len(resp_single.elements)
    assert resp_split.content_type == resp_single.content_type
    assert resp_split.status_code == resp_single.status_code

    # Difference in the parent_id is expected, because parent_ids are assigned when element crosses page boundary
    diff = DeepDiff(t1=resp_split.elements, t2=resp_single.elements,
                    exclude_regex_paths=r"root\[\d+\]\['metadata'\]\['parent_id'\]")
    assert len(diff) == 0
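
The exclusion pattern passed to `DeepDiff` in the test above can be checked in isolation with the standard `re` module. This is a small sketch: `is_excluded` is a hypothetical helper, and the sample path strings only imitate the `root[i]['key']` form that the pattern in the test is written against.

```python
import re

# Same pattern as in the test: the parent_id field under metadata of any list item.
PARENT_ID_PATH = re.compile(r"root\[\d+\]\['metadata'\]\['parent_id'\]")


def is_excluded(path: str) -> bool:
    """Return True if a path string refers to an element's parent_id."""
    return PARENT_ID_PATH.search(path) is not None


print(is_excluded("root[0]['metadata']['parent_id']"))
print(is_excluded("root[12]['metadata']['page_number']"))
```

Only `parent_id` is skipped, so any other metadata difference between the split and non-split responses would still fail the comparison.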