Binary array encoding (i.e. json+binref) not supported by Python SDK #423

@jpbrodrick89

Summary

Currently, the from_image and from_tesseract_api methods of the Python SDK accept output_format="json+binref", but _decode_array does not support this format (once #422 is merged, the option will be explicitly removed for clarity). The current implementation of _decode_array explicitly supports only json+base64 encoding and otherwise falls back to raw list handling:

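import base64

import numpy as np


def _decode_array(encoded_arr: dict) -> np.ndarray:
    if "data" in encoded_arr:
        if encoded_arr["data"]["encoding"] == "base64":
            # json+base64: the buffer is a base64 string holding the raw bytes
            data = base64.b64decode(encoded_arr["data"]["buffer"])
            arr = np.frombuffer(data, dtype=encoded_arr["dtype"])
        else:
            # Every other encoding, binref included, is treated as a raw list
            arr = np.array(encoded_arr["data"]["buffer"], dtype=encoded_arr["dtype"])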

To support json+binref, we need the following:

  1. Add failing tests
  2. Allow output_path to be passed as an argument to from_url (or query it with a request)
  3. Ensure output_path is provided when using json+binref encoding (or query it with a request if absent)
  4. Add fsspec to the SDK dependencies so that files in cloud storage can be read (for an interim local-only solution, Python's built-in open is enough)
  5. After safely reading the file, convert the bytes to a NumPy array in _decode_array using np.frombuffer
  6. Double-check that this works properly with from_url (it is unclear whether the binaries are downloaded along with the request; if not, we might need to request them directly).
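Step 5 could look roughly like the sketch below, for the interim local-file case only. This is not the SDK's implementation: the function name decode_array_sketch, the base_dir parameter (standing in for output_path), the presence of a "shape" key, and the assumption that a binref buffer is a relative file path optionally followed by ":<byte offset>" are all illustrative guesses that would need checking against the actual json+binref output.

```python
import base64
from pathlib import Path

import numpy as np


def decode_array_sketch(encoded_arr: dict, base_dir: str = ".") -> np.ndarray:
    """Hypothetical decoder; names and the binref layout are assumptions."""
    data = encoded_arr["data"]
    dtype = np.dtype(encoded_arr["dtype"])
    shape = tuple(encoded_arr["shape"])  # ASSUMPTION: payload carries a shape

    if data["encoding"] == "base64":
        raw = base64.b64decode(data["buffer"])
        arr = np.frombuffer(raw, dtype=dtype)
    elif data["encoding"] == "binref":
        # ASSUMPTION: buffer looks like "file.bin" or "file.bin:<offset>"
        path, sep, offset = data["buffer"].rpartition(":")
        if not sep:  # no offset given, the whole string is the path
            path, offset = data["buffer"], "0"
        n_items = int(np.prod(shape, dtype=np.int64))
        with open(Path(base_dir) / path, "rb") as f:
            f.seek(int(offset))
            raw = f.read(n_items * dtype.itemsize)
        arr = np.frombuffer(raw, dtype=dtype)
    else:
        # Fall back to raw (nested) list handling, as the current code does
        arr = np.array(data["buffer"], dtype=dtype)

    return arr.reshape(shape)
```

Reading only n_items * itemsize bytes from the offset lets several arrays share one binary file. Swapping the built-in open for fsspec.open would be the natural next step for cloud storage, but that is left out here to keep the sketch dependency-free beyond NumPy.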

Why is this needed?

The performance improvement of json+binref is unlikely to be material until output sizes reach somewhere between 100 MB and 5 GB (depending on network speed). In these cases, the

  1. 25% lower memory footprint of pure binaries over base64,
  2. more manageable json file sizes,
  3. minimal encode/decode time (the least material advantage as base64 encoding/decoding is already highly optimized)

could add up to a material improvement in serialization plus transfer time.
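The 25% figure follows from base64 emitting 4 output characters for every 3 input bytes. A quick way to confirm it with the standard library (the helper name base64_size_ratio is made up for this illustration):

```python
import base64


def base64_size_ratio(n_bytes: int) -> float:
    """Return len(base64(payload)) / len(payload) for a payload of n_bytes."""
    raw = bytes(n_bytes)  # zero-filled payload; the content does not affect size
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw)


ratio = base64_size_ratio(3_000_000)  # base64 is about 4/3 the size of raw
saving = 1 - 1 / ratio                # so raw binary is about 25% smaller
```

Put the other way round, base64 is about 33% larger than the raw bytes, which is the same fact viewed from the binary side.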

Happy to address when we determine there is sufficient advantage for this or to perform minimal tests on base64 vs binref encoding/decoding time.
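As a starting point for such a minimal test, something like the following could compare base64 decoding with reading the same bytes back from a local file. It is only a rough single-shot measurement under assumed conditions (local disk, warm file cache, hypothetical helper name time_decode_paths) and says nothing about network transfer, which is where most of the difference is expected.

```python
import base64
import os
import tempfile
import time


def time_decode_paths(n_bytes: int = 1_000_000) -> dict:
    """Time base64 decoding vs. a plain file read for the same payload."""
    raw = os.urandom(n_bytes)
    encoded = base64.b64encode(raw)

    # Path 1: json+base64 style, decode the base64 text back to bytes
    start = time.perf_counter()
    from_b64 = base64.b64decode(encoded)
    t_b64 = time.perf_counter() - start

    # Path 2: binref style, read the raw bytes from a side file
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "payload.bin")
        with open(path, "wb") as f:
            f.write(raw)
        start = time.perf_counter()
        with open(path, "rb") as f:
            from_file = f.read()
        t_file = time.perf_counter() - start

    # Both paths must reproduce the original payload exactly
    assert from_b64 == raw == from_file
    return {
        "base64_decode_s": t_b64,
        "file_read_s": t_file,
        "encoded_bytes": len(encoded),
    }
```

For numbers worth acting on, the measurement would need repeating over several payload sizes and runs (for example with timeit), and the JSON parsing cost of the much larger base64 document would need to be included.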

Usage example

from tesseract_core import Tesseract

with Tesseract.from_image(..., output_format="json+binref"):
    ...
