Network related deadlock, stuck in re_upload phase #1123

@bvinc

This issue explains, for anyone experiencing deadlocks or hangs related to re_upload requests, what the underlying problem is and how it can be fixed. Because Buck2 uses tonic and h2 and, by default, allows up to 1024 concurrent requests to the CAS, it is likely to trigger the following bug in the h2 crate:

hyperium/h2#860

  • Actions that require uploads go through a re_upload phase, in which they call find_missing_blobs and then perform uploads.
  • Buck2 has a cas_semaphore that allows up to 1024 actions to be in the upload phase at a time.
  • This creates up to 1024 HTTP/2 streams on a single connection to the CAS gRPC server.
  • HTTP/2 servers limit the maximum number of concurrent streams. If the client attempts to create more streams than the remote server allows, the extra streams are created only locally, put into a state called pending_open, and accept locally buffered data.
  • In HTTP/2, each individual stream has a window, and the connection as a whole has a window, limiting the amount of unacknowledged data that can be sent. In h2, connection-level window capacity is assigned to streams that have data ready to send.
  • Due to the h2 bug linked above, connection-level window capacity gets assigned to streams that are still in the pending_open state.
  • Once all of the connection-level capacity has been given to streams in the pending_open state, nothing more gets sent, and the connection stalls forever.
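
The stall described in the list above can be illustrated with a small toy model. This is only a sketch under stated assumptions: it is not the h2 crate's code, and the names `Stream`, `assign_capacity`, `sendable`, and `simulate`, the window size, and the queue order are all hypothetical choices made for illustration. It assumes two pending_open streams happen to sit ahead of one open stream in the queue of streams with buffered data.

```rust
// Toy model of HTTP/2 connection-level window assignment.
// NOT the h2 crate's implementation; all names here are hypothetical.

#[derive(Clone, Copy, PartialEq, Debug)]
enum State {
    Open,        // accepted by the server, may send DATA frames
    PendingOpen, // created locally only, cannot send anything yet
}

struct Stream {
    state: State,
    buffered: u32, // bytes queued locally on this stream
    assigned: u32, // connection-level capacity reserved for this stream
}

/// Hand out connection-level capacity to streams with buffered data, in queue order.
/// With `skip_pending == false`, pending_open streams also receive capacity,
/// which models the behaviour described in the list above.
fn assign_capacity(streams: &mut [Stream], conn_window: &mut u32, skip_pending: bool) {
    for s in streams.iter_mut() {
        if skip_pending && s.state == State::PendingOpen {
            continue;
        }
        let want = s.buffered - s.assigned;
        let grant = want.min(*conn_window);
        s.assigned += grant;
        *conn_window -= grant;
    }
}

/// Bytes that can actually go on the wire: only open streams can use their capacity.
fn sendable(streams: &[Stream]) -> u32 {
    streams
        .iter()
        .filter(|s| s.state == State::Open)
        .map(|s| s.assigned)
        .sum()
}

/// Run one round of assignment and report how many bytes could be sent.
fn simulate(skip_pending: bool) -> u32 {
    // Assumed scenario: the server allows one concurrent stream, the client made three.
    let mut streams = vec![
        Stream { state: State::PendingOpen, buffered: 65_535, assigned: 0 },
        Stream { state: State::PendingOpen, buffered: 65_535, assigned: 0 },
        Stream { state: State::Open, buffered: 65_535, assigned: 0 },
    ];
    let mut conn_window: u32 = 65_535; // assumed connection-level window
    assign_capacity(&mut streams, &mut conn_window, skip_pending);
    sendable(&streams)
}

fn main() {
    println!("capacity given to pending_open streams: {} bytes sendable", simulate(false));
    println!("pending_open streams skipped:           {} bytes sendable", simulate(true));
}
```

In this model, once a pending_open stream has absorbed the whole connection window, the only stream that could actually send has no capacity left. Because the capacity is never used, no data is sent and nothing replenishes the window, which matches the permanent stall described above.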
