Multipart optimizations #3049

Open
@codyohl

Description

  1. Currently, S3Transfer multipart upload/download offers only a multithreaded option (enabled when use_threads=True is passed). I've found that multiprocessing (where a new session/client is created in each fork, and a shared memory buffer is referenced across the process boundary) can be 2X faster.

  2. If there isn't one already, there should be a way to pass down the HTTPConnection write buffer size (the blocksize argument of http.client.HTTPConnection, which defaults to 8192 bytes), so Python threads don't spend a bunch of time repeatedly writing small amounts of data and then waiting on the GIL. This gives me a 2X speed increase, even after applying optimization 1. You can do it outside of S3Transfer in a really hacky way like this:

from http.client import HTTPConnection

# the following logic changes S3Transfer library from 50MB/s to 100MB/s on a toy environment
HTTPConnection.__init__.__defaults__ = tuple(
    x if x != 8192 else 1024 * 1024 for x in HTTPConnection.__init__.__defaults__
)
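A slightly more contained variant of the same hack, as a sketch only: it looks up the position of the blocksize default by name instead of matching on the value 8192, and restores the original defaults when the block exits. The name http_blocksize is hypothetical and not part of boto3, botocore, or s3transfer. Whether a given HTTP stack actually picks up this default depends on how it constructs its connections, so any speedup is an assumption that needs measuring.

```python
import contextlib
import inspect
from http.client import HTTPConnection


@contextlib.contextmanager
def http_blocksize(blocksize):
    """Temporarily change the default blocksize of HTTPConnection (hypothetical helper)."""
    init = HTTPConnection.__init__
    original = init.__defaults__
    # __defaults__ lists the defaults of positional parameters in order,
    # so find where "blocksize" sits among the parameters that have one.
    positional = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    names = [
        p.name
        for p in inspect.signature(init).parameters.values()
        if p.kind in positional and p.default is not inspect.Parameter.empty
    ]
    patched = list(original)
    patched[names.index("blocksize")] = blocksize
    init.__defaults__ = tuple(patched)
    try:
        yield
    finally:
        # Restore the interpreter-wide default no matter what happened.
        init.__defaults__ = original
```

Connections created inside `with http_blocksize(1024 * 1024):` get the larger block size; connections created afterwards get the standard default again. This is still a process-wide monkeypatch, so it affects every thread that creates a connection while the block is active.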

Use Case

S3 multipart upload/download is slower than it needs to be for all users.

Proposed Solution

  1. Add a multiprocess implementation of the same interface (it needs only an option for how to obtain a session/client in the new process):
    -- use multiprocessing.Queue and start processes just-in-time
    -- pass SharedMemory across processes to avoid a chunk copy
    -- instantiate a new session and client in each subprocess, and upload parts concurrently

  2. Provide some way for users to set the underlying HTTPConnection's buffer size.
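The first item can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the proposed s3transfer implementation: handle_part is a hypothetical stand-in that computes a CRC32 where a real worker would create its own boto3 session/client and upload the part, the "fork" start method assumes a POSIX platform, and workers are started up front rather than just-in-time. All function names here are made up for the example.

```python
import multiprocessing as mp
import zlib
from multiprocessing.shared_memory import SharedMemory


def handle_part(view):
    """Hypothetical stand-in for uploading one part; returns its CRC32."""
    # A real worker would create its own session/client once per process
    # and send `view` as the part body. That is left out of this sketch.
    return zlib.crc32(view)


def worker(shm_name, tasks, results):
    shm = SharedMemory(name=shm_name)  # attach to the parent's buffer
    try:
        while True:
            task = tasks.get()
            if task is None:  # sentinel: no more parts for this worker
                break
            part_number, offset, size = task
            # Slice the shared buffer without copying the chunk.
            with shm.buf[offset:offset + size] as view:
                results.put((part_number, handle_part(view)))
    finally:
        shm.close()


def process_parts(data, part_size, num_workers=2):
    ctx = mp.get_context("fork")  # assumption: fork is available
    shm = SharedMemory(create=True, size=max(len(data), 1))
    try:
        shm.buf[:len(data)] = data  # one copy into shared memory
        tasks, results = ctx.Queue(), ctx.Queue()
        procs = [
            ctx.Process(target=worker, args=(shm.name, tasks, results))
            for _ in range(num_workers)
        ]
        for proc in procs:
            proc.start()
        offsets = range(0, len(data), part_size)
        for number, offset in enumerate(offsets, start=1):
            tasks.put((number, offset, min(part_size, len(data) - offset)))
        for _ in procs:
            tasks.put(None)
        # Drain results before joining so workers never block on a full pipe.
        collected = dict(results.get(timeout=60) for _ in offsets)
        for proc in procs:
            proc.join()
        return collected
    finally:
        shm.close()
        shm.unlink()
```

Only small (part_number, offset, size) tuples cross the process boundary through the queue; the payload itself stays in the shared buffer, which is the point of the SharedMemory suggestion.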

Other Information

No response

Acknowledgements

  • I may be able to implement this feature request
  • This feature might incur a breaking change

SDK version used

boto3 1.28.52, s3transfer 0.6.0

Environment details (OS name and version, etc.)

linux 5.19.0-0

Metadata

Assignees

No one assigned

    Labels

    feature-request (This issue requests a feature.), p2 (This is a standard priority issue), s3transfer