Description
- Currently there is only a multithreaded option for S3Transfer multipart upload/download (when use_threads=True is passed). I've found that multiprocessing (where a new session/client is created in each fork, and a shared memory buffer is referenced across the process border) can be 2X faster.
- If there isn't already one, there should be a way to pass down the HttpConnection write buffer size (see the blocksize parameter of http.client.HTTPConnection), so Python threads don't spend a bunch of time repeatedly writing small amounts of data and then waiting on the GIL. This results in a 2X speed increase for me, even after applying optimization 1. You can do it outside of S3Transfer in a really hacky way like this:
```python
from http.client import HTTPConnection

# the following logic changes S3Transfer library from 50MB/s to 100MB/s on a toy environment
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 1024 * 1024 for x in HTTPConnection.__init__.__defaults__
)
```
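One caveat about that workaround: __defaults__ is consulted each time HTTPConnection.__init__ is called, so the patch has to run before any connections are opened, and it raises the default blocksize for every HTTPConnection created in the process, not only the ones botocore opens.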
Use Case
S3 multipart upload/download is slower than it needs to be for all users.
Proposed Solution
- add a multiprocess implementation of the same interface (it only needs an option for how to obtain a session/client in the new process); a rough sketch follows this list:
  - use multiprocessing.Queue and start processes just-in-time
  - pass SharedMemory across processes to avoid a chunk copy
  - instantiate a new session and client in each subprocess, and upload parts concurrently
- provide some way for users to set the underlying HttpConnection's buffer size
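
To make the first item concrete, here is a minimal sketch of what the multiprocess path could look like. It assumes the multipart upload has already been created (bucket, key, upload_id) and that the parts are available as in-memory chunks; names such as part_worker and upload_parts are illustrative, not part of the s3transfer API, and real code would still need create/complete_multipart_upload calls, error handling, and part-size validation.

```python
import multiprocessing as mp
from multiprocessing import shared_memory

import boto3


def part_worker(tasks, results, bucket, key, upload_id):
    # Each subprocess builds its own session and client instead of sharing one across forks.
    s3 = boto3.session.Session().client("s3")
    for part_number, shm_name, size in iter(tasks.get, None):  # None is the stop sentinel
        # Attach to the shared memory block by name; only (part_number, name, size)
        # travels over the queue, not the chunk itself.
        shm = shared_memory.SharedMemory(name=shm_name)
        try:
            resp = s3.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number,
                Body=shm.buf[:size].tobytes(),  # one local copy for the HTTP body
            )
            results.put({"PartNumber": part_number, "ETag": resp["ETag"]})
        finally:
            shm.close()


def upload_parts(bucket, key, upload_id, chunks, num_workers=4):
    tasks, results = mp.Queue(), mp.Queue()
    workers = [
        mp.Process(target=part_worker, args=(tasks, results, bucket, key, upload_id))
        for _ in range(num_workers)
    ]
    for w in workers:
        w.start()
    shms = []
    for part_number, chunk in enumerate(chunks, start=1):
        # Copy each chunk into a SharedMemory segment once; workers reference it by name.
        shm = shared_memory.SharedMemory(create=True, size=len(chunk))
        shm.buf[:len(chunk)] = chunk
        shms.append(shm)
        tasks.put((part_number, shm.name, len(chunk)))
    for _ in workers:
        tasks.put(None)
    for w in workers:
        w.join()
    parts = sorted((results.get() for _ in chunks), key=lambda p: p["PartNumber"])
    for shm in shms:
        shm.close()
        shm.unlink()
    return parts  # suitable for complete_multipart_upload(..., MultipartUpload={"Parts": parts})
```

On platforms that use the spawn start method the calling code would also need the usual if __name__ == "__main__" guard; the point of the per-process client is that each worker drives its own connection pool without contending on the parent's GIL.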
Other Information
No response
Acknowledgements
- I may be able to implement this feature request
- This feature might incur a breaking change
SDK version used
boto3 1.28.52, s3transfer 0.6.0
Environment details (OS name and version, etc.)
linux 5.19.0-0