Description
Summary
Currently, the Python SDK's from_image and from_tesseract_api accept output_format="json+binref", but this format is not supported by _decode_array (after #422 is merged, the option will be explicitly removed for clarity). The current implementation of _decode_array only handles json+base64 encoding explicitly and otherwise falls back to raw list handling:
def _decode_array(encoded_arr: dict) -> np.ndarray:
    if "data" in encoded_arr:
        if encoded_arr["data"]["encoding"] == "base64":
            data = base64.b64decode(encoded_arr["data"]["buffer"])
            arr = np.frombuffer(data, dtype=encoded_arr["dtype"])
        else:
            arr = np.array(encoded_arr["data"]["buffer"], dtype=encoded_arr["dtype"])

To support json+binref, we need the following:
- Add failing tests
- Allow output_path to be provided as argument to from_url (or to query this with a request)
- Ensure output_path is provided when using json+binref encoding (or queried with a request if absent)
- Add fsspec to the SDK dependencies to allow reading from cloud file storage (as an interim local solution, we can just use Python's built-in open)
- After safely reading the file, convert the bytes to a numpy array in _decode_array using np.frombuffer
- Double-check that this works properly with from_url (it is not clear whether the binaries get downloaded with the request; if not, we might need to request them directly).
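The decoding step in the list above could be sketched roughly as follows. This is a minimal local-only illustration, not the SDK's implementation: the function name decode_array, the base_dir parameter, the "binref" encoding label, and the assumption that the buffer field holds a reference of the form "file.bin:offset" next to a "shape" field are all hypothetical and would need to be checked against the actual serialization format. It uses Python's built-in open as the interim solution; fsspec would replace that call for cloud storage.

```python
import base64
import math
from pathlib import Path

import numpy as np


def decode_array(encoded_arr: dict, base_dir=None) -> np.ndarray:
    """Hypothetical decoder covering base64, binref, and raw-list payloads."""
    data = encoded_arr["data"]
    dtype = np.dtype(encoded_arr["dtype"])
    shape = tuple(encoded_arr["shape"])  # assumed to be present in the payload

    if data["encoding"] == "base64":
        raw = base64.b64decode(data["buffer"])
        arr = np.frombuffer(raw, dtype=dtype)
    elif data["encoding"] == "binref":
        if base_dir is None:
            raise ValueError("binref encoding requires the output path (base_dir)")
        # Assumed reference layout: "<relative file name>:<byte offset>"
        ref, _, offset = data["buffer"].partition(":")
        n_bytes = math.prod(shape) * dtype.itemsize
        with open(Path(base_dir) / ref, "rb") as f:
            f.seek(int(offset) if offset else 0)
            raw = f.read(n_bytes)
        arr = np.frombuffer(raw, dtype=dtype)
    else:
        # Fallback: buffer is a plain (nested) list
        arr = np.array(data["buffer"], dtype=dtype)

    return arr.reshape(shape)
```

Note that np.frombuffer returns a read-only view of the bytes, so callers that need to mutate the result would have to copy it first.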
Why is this needed?
The performance improvement of json+binref is unlikely to be material until output sizes exceed somewhere between 100 MB and 5 GB (depending on network speed). In these cases the
- 25% lower memory footprint of pure binaries over base64,
- more manageable json file sizes,
- minimal encode/decode time (the least material advantage as base64 encoding/decoding is already highly optimized)
could add up to a material improvement in serialization plus transfer time.
Happy to address when we determine there is sufficient advantage for this or to perform minimal tests on base64 vs binref encoding/decoding time.
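As a starting point for the minimal tests mentioned above, something like the following could compare raw bytes against base64 for a given array. It is only a rough sketch: compare_encodings is a hypothetical helper, it measures in-memory encode/decode only (no file or network I/O), and single-run wall-clock timings will be noisy.

```python
import base64
import time

import numpy as np


def compare_encodings(arr: np.ndarray) -> dict:
    """Compare payload size and base64 encode/decode time for one array."""
    raw = arr.tobytes()

    t0 = time.perf_counter()
    encoded = base64.b64encode(raw)
    t1 = time.perf_counter()
    decoded = base64.b64decode(encoded)
    t2 = time.perf_counter()

    if decoded != raw:
        raise RuntimeError("base64 round trip did not reproduce the input")

    return {
        "raw_bytes": len(raw),
        "base64_bytes": len(encoded),
        "size_ratio": len(raw) / len(encoded),  # 0.75 means raw is 25% smaller
        "encode_s": t1 - t0,
        "decode_s": t2 - t1,
    }


stats = compare_encodings(np.zeros(300_000, dtype="float64"))
print(stats)
```

The size ratio follows from base64 mapping every 3 input bytes to 4 output characters, which is where the 25% figure comes from; the timings are what would need measuring on realistic output sizes.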
Usage example
from tesseract_core import Tesseract

with Tesseract.from_image(..., output_format="json+binref"):
    ...