fix(storage): Fix SocketTimeoutException when executing a long multi-part upload #2973
Issue #, if available: N/A
Problem: Multi-part uploads that took a long time (e.g. a large file or a slow network connection) would eventually cause `SocketTimeoutException`s. After a long investigation, we found that the issue was creating a lot of `CoroutineWorker`s and enqueuing them all at the same time. Because the S3 `uploadPart` call is a suspending function, `WorkManager` would actually start ALL of the pending part uploads at once. So a part upload might "start" at T:0, get suspended because 20 other parts are trying to upload at the same time, and then not actually continue until 5 minutes later. And since its socket connection was already open, it would time out because nothing got sent.
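For illustration, here's a minimal sketch of the problematic shape (the `uploadPart` helper and the `partNumber` input key are hypothetical stand-ins for the real S3 client call and worker inputs):

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.WorkerParameters

// Sketch of the old, problematic shape: because doWork() is a suspend
// function, WorkManager can "start" many of these workers at once. Each
// one opens its connection, suspends at the upload call, and may not
// resume for minutes -- long enough for the idle socket to time out.
class PartUploadCoroutineWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params) {

    override suspend fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // Hypothetical suspending call standing in for the real S3
        // uploadPart; the coroutine suspends here while the other
        // queued parts are all "running" in the same state.
        uploadPart(partNumber)
        return Result.success()
    }

    // Placeholder for the real suspending S3 client call.
    private suspend fun uploadPart(partNumber: Int) { /* ... */ }
}
```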
Description of change:
The fix was to change the underlying `Worker` for the `PartUploader` from a `CoroutineWorker` (suspending) to a plain `Worker` (blocking). This fixes the issue: I was able to upload a 400MB file over a simulated 4G connection (so about a half hour lol). Looking at the logs and the network, only 3-4 Workers were active at a time.
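A sketch of the fixed shape, under the same assumptions (a hypothetical blocking `uploadPartBlocking` helper in place of the real S3 call):

```kotlin
import android.content.Context
import androidx.work.Worker
import androidx.work.WorkerParameters

// Sketch of the fix: a plain Worker occupies one of WorkManager's
// executor threads for the whole upload, so only as many parts as
// there are executor threads are ever in flight, and no connection
// sits open while idle.
class PartUploadWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params) {

    override fun doWork(): Result {
        val partNumber = inputData.getInt("partNumber", -1)
        // Hypothetical blocking call; wrapping the suspending S3 client
        // in runBlocking would give the same thread-holding behavior.
        uploadPartBlocking(partNumber)
        return Result.success()
    }

    // Placeholder for a synchronous part-upload call.
    private fun uploadPartBlocking(partNumber: Int) { /* ... */ }
}
```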
Had to get creative with abstraction because of the `RouterWorker` that's used to route a work item to the appropriate subclassed `*Worker`. `BaseTransferWorker` got converted from an `abstract class` to an `interface`, and the shared logic moved into the abstract classes `SuspendingTransferWorker` and `BlockingTransferWorker` as appropriate.
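The rough shape of that refactor, sketched with the class names from this PR (the `maxRetryCount` member is a made-up example of shared logic):

```kotlin
import android.content.Context
import androidx.work.CoroutineWorker
import androidx.work.Worker
import androidx.work.WorkerParameters

// Shared contract both worker families implement; interface defaults
// stand in for whatever logic lived on the old abstract class.
interface BaseTransferWorker {
    val maxRetryCount: Int
        get() = 3
}

// Base for transfer workers that can safely suspend.
abstract class SuspendingTransferWorker(
    context: Context,
    params: WorkerParameters
) : CoroutineWorker(context, params), BaseTransferWorker

// Base for transfer workers that must hold a thread, like the
// part uploader above.
abstract class BlockingTransferWorker(
    context: Context,
    params: WorkerParameters
) : Worker(context, params), BaseTransferWorker
```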
How did you test these changes?
Besides testing multi-part uploads, I regression tested all of the child Workers that got touched. Test cases:

- `WorkManager` could continue pending Work.

Documentation update required?
General Checklist
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.