Currently, S3Transfer only offers a multithreaded option for multipart upload/download (when use_threads=True is passed). I've found that a multiprocessing approach (where a new session/client is created in each forked process and a shared-memory buffer is referenced across the process boundary) can be 2X faster.
If there isn't already a way, there should be a way to pass down the HTTPConnection write buffer size (see the blocksize parameter of http.client.HTTPConnection), so Python threads don't spend a lot of time repeatedly writing small amounts of data and then waiting on the GIL. This gives me another 2X speed increase, even after applying optimization 1. You can do it outside of S3Transfer in a really hacky way like this:
from http.client import HTTPConnection

# Hack: bump HTTPConnection's default blocksize from 8 KiB to 1 MiB so each
# socket write moves more data per GIL acquisition.
# The following change alone took S3Transfer from ~50 MB/s to ~100 MB/s in a toy environment.
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 1024 * 1024 for x in HTTPConnection.__init__.__defaults__
)
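For reference, this is roughly how I use the patch with the existing managed transfer API (the bucket, key, file path, and config values below are placeholders, not part of the proposal):

import boto3
from boto3.s3.transfer import TransferConfig

# Apply the HTTPConnection patch above before any transfer starts; the larger
# write buffer then applies to every part the thread pool sends.
s3 = boto3.client("s3")
config = TransferConfig(
    multipart_chunksize=16 * 1024 * 1024,  # 16 MiB parts (placeholder value)
    max_concurrency=10,
    use_threads=True,
)
s3.upload_file("/tmp/large-object.bin", "my-bucket", "large-object.bin", Config=config)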
Use Case
S3 multipart upload/download is slower than it needs to be for all users.
Proposed Solution
Add a multiprocess implementation of the same interface (the only new option needed is how to obtain a session/client in the new process):
-- use multiprocessing.Queue and start processes just-in-time
-- pass SharedMemory across processes to avoid a chunk copy
-- instantiate a new session and client in each subprocess, and upload parts concurrently
(A rough sketch of this approach follows after this list.)
Provide some way for users to set the underlying HTTPConnection's buffer size.
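To make the first item concrete, here is a minimal sketch of the kind of implementation I have in mind. This is not s3transfer code; the bucket, key, file path, part size, and worker count are placeholders, and a real implementation would bound the number of in-flight parts and handle worker failures.

import os
from multiprocessing import Process, Queue
from multiprocessing.shared_memory import SharedMemory

import boto3

PART_SIZE = 64 * 1024 * 1024  # 64 MiB parts (placeholder value)


def worker(tasks: Queue, results: Queue, bucket: str, key: str, upload_id: str) -> None:
    # New session + client per process; clients are not shared across the fork.
    client = boto3.session.Session().client("s3")
    while True:
        task = tasks.get()
        if task is None:  # sentinel: no more parts
            break
        part_number, shm_name, size = task
        shm = SharedMemory(name=shm_name)  # attach by name; no chunk crosses the Queue
        try:
            resp = client.upload_part(
                Bucket=bucket, Key=key, UploadId=upload_id,
                PartNumber=part_number,
                # bytes() copies inside the worker for simplicity; a real
                # implementation would wrap the buffer in a file-like object.
                Body=bytes(shm.buf[:size]),
            )
            results.put({"PartNumber": part_number, "ETag": resp["ETag"]})
        finally:
            shm.close()


def multiprocess_upload(filename: str, bucket: str, key: str, num_workers: int = 4) -> None:
    client = boto3.client("s3")
    upload_id = client.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]

    tasks: Queue = Queue()
    results: Queue = Queue()
    procs = [Process(target=worker, args=(tasks, results, bucket, key, upload_id))
             for _ in range(num_workers)]
    for p in procs:
        p.start()

    # Parent copies each chunk into a SharedMemory segment and passes only its name.
    shms, part_count = [], 0
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(PART_SIZE)
            if not chunk:
                break
            part_count += 1
            shm = SharedMemory(create=True, size=len(chunk))
            shm.buf[:len(chunk)] = chunk
            shms.append(shm)
            tasks.put((part_count, shm.name, len(chunk)))
    for _ in procs:
        tasks.put(None)

    parts = sorted((results.get() for _ in range(part_count)),
                   key=lambda p: p["PartNumber"])
    for p in procs:
        p.join()
    for shm in shms:
        shm.close()
        shm.unlink()

    client.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )


if __name__ == "__main__":
    multiprocess_upload("/tmp/large-object.bin", "my-bucket", "large-object.bin")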
Other Information
No response
Acknowledgements
I may be able to implement this feature request
This feature might incur a breaking change
SDK version used
boto3 1.28.52, s3transfer 0.6.0
Environment details (OS name and version, etc.)
linux 5.19.0-0