ArynSDK partition_file() hangs when run under VCR recording #958

Open
alexaryn opened this issue Oct 19, 2024 · 0 comments

Describe the bug
The partition_file() call works fine on its own, but wrapping it in a with vcr.use_cassette() block causes it to hang indefinitely.

To Reproduce
Try a script like this:

import vcr
...
with vcr.use_cassette("/tmp/vcr_cassette.yaml"):
    with open(pdf, "rb") as fp:
        partitioned_doc = partition_file(fp, ARYN_API_KEY)

Expected behavior
The script terminates.

Additional context
I used tcpdump to capture network traffic. Without vcr, I see FIN packets originating from the client side, which indicates that the client closes the connection once it has finished reading the response. With vcr, there are no FINs at all. I think the relevant behavior is in the iterator here:

for part in resp.iter_content(None):

I suspected that vcr doesn't handle streaming responses very well. I changed stream from True to False in this line:

resp = requests.post(aps_url, files=files, headers=http_header, stream=True, verify=ssl_verify)

And now vcr works fine and I see FIN packets. I'm not suggesting that we disable streaming, but it seems like useful diagnostic information.

Another thing I tried was to keep stream=True and set the HTTP header "Connection: close" here:

http_header = {"Authorization": "Bearer {}".format(aryn_config.api_key())}

The end result was still a hang, but this time I saw FIN packets, and they originated on the server side. I also noticed the Python process using 100% of a CPU while hanging.
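
For anyone repeating this experiment, here is a minimal sketch of how the header could be built with the extra field. The helper name build_headers and its close_connection parameter are hypothetical and are not part of the SDK; only the "Authorization" format is taken from the line quoted above.

```python
def build_headers(api_key: str, close_connection: bool = False) -> dict:
    """Build request headers for the experiment described above.

    Hypothetical helper: `build_headers` and `close_connection` are
    illustrative names, not part of the SDK.
    """
    # Same Authorization format as the existing http_header line.
    headers = {"Authorization": "Bearer {}".format(api_key)}
    if close_connection:
        # Ask the server to close the connection after the response.
        headers["Connection"] = "close"
    return headers
```

The resulting dict would be passed as the headers argument of the existing requests.post() call. As noted above, this changed which side sent the FIN but did not fix the hang.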

I suspect the best workaround here is to disable streaming when vcr is involved. When I think about it, streaming doesn't make much sense in the context of cached responses. We use streaming for two reasons: (1) to provide timely feedback on progress, and (2) to avoid idle connections that a firewall might shut down. If the client disables streaming, I think we lose (1) but keep (2), since the server always sends messages periodically.
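
One possible shape for that workaround, as a rough sketch only: let the caller opt out of streaming through an environment variable. The variable name ARYN_DISABLE_STREAMING and the function should_stream are assumptions made up for illustration; nothing like this exists in the SDK today.

```python
import os


def should_stream(env=None) -> bool:
    """Decide whether to pass stream=True to requests.post().

    ARYN_DISABLE_STREAMING is a hypothetical variable name used only
    for this sketch; streaming stays on unless it is set to a truthy value.
    """
    if env is None:
        env = os.environ
    flag = env.get("ARYN_DISABLE_STREAMING", "").strip().lower()
    return flag not in ("1", "true", "yes")
```

The call site would then use stream=should_stream() in place of the hard-coded stream=True, and a test that records with vcr would set the variable before calling partition_file(). An explicit keyword argument on partition_file() would work just as well; the environment variable is only one option.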
