Describe the bug
The partition_file() call works fine alone, but wrapping it in with vcr.use_cassette()
causes it to hang indefinitely.
To Reproduce
Try a script like this:
import vcr
...
with vcr.use_cassette("/tmp/vcr_cassette.yaml"):
    with open(pdf, "rb") as fp:
        partitioned_doc = partition_file(fp, ARYN_API_KEY)
Expected behavior
The script terminates.
Additional context
I used tcpdump to capture network traffic. Without vcr, I see FIN packets, indicating closure of the connection. With vcr, there are no FINs. The FIN packets originate from the client side. When it's done reading the response, it appears to close. I think the behavior is in the iterator here:
for part in resp.iter_content(None):
I suspected that vcr doesn't handle streaming responses very well. I changed stream from True to False in this line:
resp = requests.post(aps_url, files=files, headers=http_header, stream=True, verify=ssl_verify)
And now vcr works fine and I see FIN packets. I'm not suggesting that we disable streaming, but it seems like useful diagnostic information.
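To make the stream=True / stream=False comparison easier to repeat, here is a minimal sketch of a helper that takes the streaming flag as a parameter and always closes the response when it is done. The name fetch_body and the injectable post argument are hypothetical and not part of the SDK; in practice post would be requests.post. Whether an explicit close() changes the behavior under vcr is untested.

```python
def fetch_body(post, url, stream=True, **kwargs):
    """Post to `url` and return the whole response body as bytes.

    `post` is any callable with the same shape as requests.post
    (hypothetical injection point, so both modes can be compared).
    """
    resp = post(url, stream=stream, **kwargs)
    try:
        # iter_content(None) yields data as it arrives when stream=True,
        # and the already-downloaded body in one piece when stream=False.
        return b"".join(part for part in resp.iter_content(None) if part)
    finally:
        # Release the connection explicitly instead of relying on the
        # iterator running to exhaustion.
        resp.close()
```

For example, fetch_body(requests.post, aps_url, stream=False, files=files, headers=http_header, verify=ssl_verify) would mirror the non-streaming experiment described above.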
Another thing I tried was to keep stream=True and set the HTTP header "Connection: close" here:
http_header = {"Authorization": "Bearer {}".format(aryn_config.api_key())}
The end result was still a hang, but this time I did see FIN packets, and they originated on the server side. I also noticed the Python process using 100% of a CPU while it hung.
I suspect the best workaround here is to disable streaming when vcr is involved. When I think about it, streaming doesn't make much sense in the context of cached responses. The reasons we use streaming are twofold: (1) provide timely feedback on progress, and (2) avoid idle connections that a firewall might shut down. If the client disables streaming, I think we lose (1) but keep (2), since the server always sends messages periodically.
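One way the workaround could look is a small opt-out switch that test code sets before recording or replaying cassettes. This is only a sketch: the function should_stream and the environment variable ARYN_DISABLE_STREAMING are hypothetical names that do not exist in the SDK today.

```python
import os


def should_stream(env=None):
    """Return False when streaming has been switched off for this process.

    ARYN_DISABLE_STREAMING is a hypothetical variable name used only for
    illustration; `env` defaults to os.environ and can be overridden.
    """
    env = os.environ if env is None else env
    value = env.get("ARYN_DISABLE_STREAMING", "")
    return value.strip().lower() not in ("1", "true", "yes")
```

The request line would then pass stream=should_stream() instead of stream=True, so normal callers keep progress feedback while vcr-based tests get a fully buffered response.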