The unstructured_api_tools library includes utilities for converting pipeline notebooks into
REST API applications. unstructured_api_tools is intended for use in conjunction with
pipeline repos. See pipeline-sec-filings
for an example of a repo that uses unstructured_api_tools.
To install the library, run pip install unstructured_api_tools.
- Using pyenv to manage virtualenvs is recommended.
- Create a virtualenv to work in and activate it, e.g. for one named unstructured_api_tools:
  pyenv virtualenv 3.8.15 unstructured_api_tools
  pyenv activate unstructured_api_tools
- Run make install-project-local
Use the CLI command to convert pipeline notebooks to scripts, for example:
unstructured_api_tools convert-pipeline-notebooks \
--input-directory pipeline-family-sec-filings/pipeline-notebooks \
--output-directory pipeline-family-sec-filings/prepline_sec_filings/api \
--pipeline-family sec-filings \
--semver 0.2.1
If you do not provide the pipeline-family and semver arguments, those values are parsed from
preprocessing-pipeline-family.yaml. You can provide the preprocessing-pipeline-family.yaml file
explicitly with --config-filename or the PIPELINE_FAMILY_CONFIG environment variable. If neither
of those is specified, the fallback is to use the preprocessing-pipeline-family.yaml file in the
current working directory.
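The fallback order described above can be sketched in a few lines of Python. This is only an illustration, not the library's actual code: the function name resolve_config_path and its parameters are hypothetical, and the sketch assumes that an explicit --config-filename value takes precedence over the PIPELINE_FAMILY_CONFIG environment variable, which the text above does not state.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

DEFAULT_CONFIG_NAME = "preprocessing-pipeline-family.yaml"


def resolve_config_path(
    config_filename: Optional[str] = None,
    environ: Optional[Mapping[str, str]] = None,
    cwd: Optional[Path] = None,
) -> Path:
    """Hypothetical helper: pick the config file using the documented fallbacks."""
    environ = os.environ if environ is None else environ
    cwd = Path.cwd() if cwd is None else cwd

    # 1. An explicit --config-filename value (assumed to win over the env var).
    if config_filename:
        return Path(config_filename)

    # 2. The PIPELINE_FAMILY_CONFIG environment variable.
    env_value = environ.get("PIPELINE_FAMILY_CONFIG")
    if env_value:
        return Path(env_value)

    # 3. The default file in the current working directory.
    return cwd / DEFAULT_CONFIG_NAME
```

Passing environ and cwd as arguments keeps the sketch easy to exercise without touching the real process environment.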
The API file undergoes black, flake8 and mypy checks after being generated. If you want
flake8 to ignore specific errors, you can specify them through the CLI with
--flake8-ignore F401, E402.
See the flake8 docs
for a full list of error codes.
The command described in Usage generates a FastAPI API route for each pipeline_api
function defined in the notebook. The signature of the pipeline_api function determines what
parameters the generated FastAPI route accepts.
Currently, only plain text file uploads are supported and as such the first argument must always be
text, but support for multiple files and binary files is coming soon!
In addition, any number of string array parameters may be specified. Any kwarg beginning with
m_ indicates a multi-value string parameter that is accepted by the FastAPI API.
For example, in a notebook containing:
def pipeline_api(text, m_subject=[], m_name=[]):
text represents the content of a file posted to the FastAPI API, and the m_subject and m_name
keyword args represent optional parameters that may be posted to the API as well, both allowing
multiple string parameters. A curl request against such an API could look like this:
curl -X 'POST' \
'https://<hostname>/<pipeline-family-name>/<pipeline-family-version>/<api-name>' \
-H 'accept: application/json' \
-H 'Content-Type: multipart/form-data' \
-F '[email protected]' \
-F 'subject=art' \
-F 'subject=history' \
-F 'subject=math' \
-F 'name=feynman'
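To make the naming convention concrete, here is a minimal sketch of how the m_ keyword arguments in a signature could be mapped to the form field names used in the request above (m_subject becomes subject, m_name becomes name). The helper multi_value_fields is hypothetical and is not part of unstructured_api_tools; it only illustrates the convention using the standard inspect module.

```python
import inspect
from typing import Callable, List

MULTI_VALUE_PREFIX = "m_"


def multi_value_fields(func: Callable) -> List[str]:
    """Hypothetical helper: form-field names for every kwarg starting with m_."""
    params = inspect.signature(func).parameters
    return [
        name[len(MULTI_VALUE_PREFIX):]  # drop the m_ prefix
        for name in params
        if name.startswith(MULTI_VALUE_PREFIX)
    ]


def pipeline_api(text, m_subject=[], m_name=[]):
    # Body omitted; only the signature matters for this sketch.
    return text


print(multi_value_fields(pipeline_api))  # ['subject', 'name']
```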
In addition, you can specify the response type if pipeline_api can support both "application/json"
and "text/csv" as return types.
For example, in a notebook containing a kwarg response_type:
def pipeline_api(text, response_type="text/csv", m_subject=[], m_name=[]):
The consumer of the API may then request either response content type with the usual
HTTP Accept header, e.g. Accept: application/json or Accept: text/csv.
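As a rough illustration of the content negotiation involved, the sketch below picks a response type from an Accept header value. It is an assumption-laden simplification rather than the generated API's actual logic: the function pick_response_type is hypothetical, q-values are ignored, and falling back to application/json when no header is sent is an assumed default.

```python
from typing import Optional, Sequence

SUPPORTED_TYPES = ("application/json", "text/csv")


def pick_response_type(
    accept_header: Optional[str],
    supported: Sequence[str] = SUPPORTED_TYPES,
) -> Optional[str]:
    """Hypothetical helper: first supported media type named in the Accept header.

    Simplification: q-values are ignored and header order is the preference order.
    """
    if not accept_header:
        return supported[0]  # assumed default when no Accept header is given
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
        if media_type == "*/*":
            return supported[0]
    return None  # nothing acceptable; a server would typically answer 406
```

The selected value could then be passed to pipeline_api as its response_type keyword argument.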
See our security policy for information on how to report security vulnerabilities.
| Section | Description |
|---|---|
| Company Website | Unstructured.io product and company info |
