Framer's sequential reads of frame header then payload can leave underlying async stream's @read_buffer in a corrupted state #14

fables-tales · 2023-11-03T13:10:39Z

Consider a call flow like:

1. @read_buffer in our underlying async/io/stream contains exactly 9 bytes.
2. read_frame takes 9 bytes off the underlying @read_buffer (in consume_read_buffer), which completely drains @read_buffer.
3. read_header successfully gets the header.
4. The payload read times out while @read_buffer is still empty; we do not parse the frame, and exit the call flow.
5. Retry read_frame with a higher timeout.
6. We enter read_header again, which calls fill_read_buffer (filling the buffer with ~thousands of bytes).
7. @read_buffer in the underlying stream now starts with the payload of the previous frame instead of a valid frame header, and we get a protocol error.

I think in this case the "right" thing to do is put the 9 bytes back in the read buffer, or hold the frame header and retry reading the payload, instead of trying to read the header out of what is certainly payload.

def read_frame(maximum_frame_size = MAXIMUM_ALLOWED_FRAME_SIZE)
  # Read the header:
  # (The second time we come here, we're reading payload bytes, not header bytes.)
  length, type, flags, stream_id = read_header

  # Async.logger.debug(self) {"read_frame: length=#{length} type=#{type} flags=#{flags} stream_id=#{stream_id} -> klass=#{@frames[type].inspect}"}

  # Allocate the frame:
  klass = @frames[type] || Frame
  frame = klass.new(stream_id, flags, type, length)

  # Read the payload:
  # (The timeout occurs here.)
  frame.read(@stream, maximum_frame_size)

  # Async.logger.debug(self, name: "read") {frame.inspect}

  return frame
end
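
As a rough illustration of the second option above (hold the frame header and retry reading the payload), here is a minimal sketch that runs without any gems. Everything in it is hypothetical: `SketchTimeout` stands in for Async::TimeoutError, and `HeaderHoldingFramer` and `FlakyStream` are made-up names, not part of protocol-http2. It also assumes the payload read is all-or-nothing, which a real stream would not guarantee.

```ruby
# Hypothetical stand-in for Async::TimeoutError so the sketch needs no gems.
class SketchTimeout < StandardError; end

# Hypothetical framer that keeps a parsed header across a failed payload read.
class HeaderHoldingFramer
  HEADER_SIZE = 9

  def initialize(stream)
    @stream = stream
    @pending_header = nil
  end

  def read_frame
    # Reuse the header from a previous attempt whose payload read timed out.
    @pending_header ||= parse_header(@stream.read(HEADER_SIZE))
    length, type, flags, stream_id = @pending_header

    # May raise SketchTimeout; @pending_header survives, so a retry resumes here.
    payload = @stream.read(length)

    @pending_header = nil
    {type: type, flags: flags, stream_id: stream_id, payload: payload}
  end

  private

  # HTTP/2 frame header: 24-bit length, 8-bit type, 8-bit flags, 31-bit stream id.
  def parse_header(bytes)
    length_high, length_low, type, flags, stream_id = bytes.unpack("CnCCN")
    [(length_high << 16) | length_low, type, flags, stream_id & 0x7FFFFFFF]
  end
end

# Stub stream (ignores the requested size): hands out pre-cut chunks in order,
# and raises when it reaches a :timeout marker.
class FlakyStream
  def initialize(chunks)
    @chunks = chunks
  end

  def read(_size)
    chunk = @chunks.shift
    raise SketchTimeout if chunk == :timeout
    chunk
  end
end

header = [0, 13, 0, 0, 401].pack("CnCCN")
framer = HeaderHoldingFramer.new(FlakyStream.new([header, :timeout, "a" * 13]))

begin
  framer.read_frame
rescue SketchTimeout
  # The header is retained, so trying again is safe.
end

frame = framer.read_frame
```

The second `read_frame` call skips the header read and goes straight to the payload, so payload bytes are never misinterpreted as a header.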


penelope-stripe · 2023-11-03T14:00:23Z

require "protocol/http2/data_frame"
require "stringio"
require "protocol/http2/framer"
require "async/reactor"

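# Fake IO that serves a frame header, then times out once, then serves the payload.
class FunkyIO
  def initialize
    # A DATA frame on stream 401 with a 13-byte payload.
    @f = Protocol::HTTP2::DataFrame.new(401, 0, Protocol::HTTP2::DataFrame::TYPE, 13, "a" * 13)
    @sio = StringIO.new
    @f.write_header(@sio)
    @sio.rewind

    @state = :yield_first_header
  end

  def read(size, buf = nil)
    case @state
    when :yield_first_header
      # First read: hand back the frame header.
      res = @sio.read(size, buf)
      @state = :now_timeout
      res
    when :now_timeout
      # Second read (the payload read): simulate a timeout.
      @state = :write_payload
      raise Async::TimeoutError
    when :write_payload
      # Later reads: the payload is now what sits at the front of the wire.
      @sio = StringIO.new
      @f.write_payload(@sio)
      @sio.rewind
      @sio.read(size, buf)
    end
  end
end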

f = Protocol::HTTP2::Framer.new(FunkyIO.new)
Async::Reactor.run do
  begin
    p(f.read_frame)
  rescue Async::TimeoutError
    # try again
  end

  p(f.read_frame)
end

This script minimally reproduces the bug.

ioquatix · 2023-11-03T21:40:25Z

Wow, nice find, I'll sort this out right away! Thanks!

ioquatix · 2024-01-23T20:05:20Z

> retry read_frame with a higher timeout

Do you mind explaining in what situation you are retrying? I would assume that if the operation failed, you'd give up completely.

maruth-stripe · 2024-01-23T21:20:50Z

> Do you mind explaining in what situation you are retrying? I would assume that if the operation failed, you'd give up completely.

If the operation fails, we would have to throw away the connection, since the connection is left in a corrupted state (there's now an H/2 payload on the wire with no header, which is garbage for all intents and purposes).

Currently, Request 1 timing out on reading a payload off the wire means any future reads off the wire are wrecked. We want to use the same connection for as long as possible. Having to re-establish a connection every time a read times out is quite toilsome.

Shopify has also seen this precise bug occur while using async-http (cc @dwdking)

We have had the patches in the PRs I've made deployed at Stripe for a couple of months now. Before the patch we were experiencing an incredibly high number of errors from this issue every day, the patch brought it down to ~0.

ioquatix · 2024-01-23T21:53:53Z

That makes sense and I understand the value of the related PRs. However, if timeout is a problem, why not increase the timeout too? It sounds like you are having a timeout while reading the frame, then retrying if timeout occurs. Maybe it would be more logical to increase the timeout, e.g. in your case 2x or 3x? At least my intention with the timeout is as a last ditch effort and retrying the operation would not make sense after a timeout occurred (the connection could be in a funky state as you've correctly outlined).

maruth-stripe · 2024-01-24T18:09:09Z

Increasing the timeout is not always feasible, since the timeout is determined by the set of constraints the system must meet.

There are a couple of use-cases for wanting the connection to remain in a healthy state after timeout:

- Multiple streams: If we have multiple streams on a connection, one stream exceeding its timeout should not result in the connection being abandoned, which penalizes all the other streams that may have completed within their respective timeouts.
- Correctness, strong invariants: Reasoning about correctness becomes significantly easier, since you get a strong invariant regarding connection corruption. (I'll come back to this again in a bit.)
- Retry with backoff: Sometimes one may want to retry a request after some backoff, in case the server we're making requests to is experiencing pressure, ideally without re-establishing a connection.

The current behavior basically ends up necessitating throwing away the connection upon timeout. However, this is currently (1) not explicit in documentation, (2) not clear from the API, and (3) not something the library protects against. What we end up with is every callsite of read_frame becoming:

begin
  stream.read_frame
rescue Async::TimeoutError
  # throw away connection
end

which is (a) very toilsome and ugly, and (b) non-trivial to get right.

In summary, the problems faced were the following:

- We experienced correctness issues that took significant developer effort to debug. The error manifested as FrameSizeErrors, and tracing that back to a connection corrupted by timeouts is not obvious.
- Always giving up on the connection is not the best for performance.
- Timeouts being (implicitly) fatal makes development and debugging difficult.

The proposed fixes provide a strong correctness invariant, while allowing connections to persist.
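
One way to picture such an invariant, going only by the titles of the linked PRs ("peek n bytes off the wire" and "ensure wire always contains a full H/2 frame"), is to consume nothing until a whole frame is buffered. The sketch below is an assumption about the general technique, not the code from those PRs; `PeekableSource`, `FullFrameReader`, and `SketchTimeout` are hypothetical names, and the source is a stub rather than a real async stream.

```ruby
# Hypothetical stand-in for Async::TimeoutError so the sketch needs no gems.
class SketchTimeout < StandardError; end

# Hypothetical buffered source: bytes arrive in chunks, and a :timeout marker
# simulates a stalled wire.
class PeekableSource
  def initialize(chunks)
    @chunks = chunks
    @buffer = String.new(encoding: Encoding::BINARY)
  end

  # Ensure at least `size` bytes are buffered and return them without consuming.
  def peek(size)
    while @buffer.bytesize < size
      chunk = @chunks.shift
      raise SketchTimeout if chunk == :timeout
      raise EOFError if chunk.nil?
      @buffer << chunk.b
    end
    @buffer.byteslice(0, size)
  end

  # Remove bytes that a previous peek confirmed are present.
  def consume(size)
    @buffer.slice!(0, size)
  end
end

class FullFrameReader
  HEADER_SIZE = 9

  def initialize(source)
    @source = source
  end

  def read_frame
    # Peek the header; nothing is consumed if this times out.
    header = @source.peek(HEADER_SIZE)
    length_high, length_low, type, flags, stream_id = header.unpack("CnCCN")
    length = (length_high << 16) | length_low

    # Wait until header and payload are both buffered; still nothing is consumed.
    @source.peek(HEADER_SIZE + length)

    # Only now take the whole frame off the buffer in a single step.
    frame = @source.consume(HEADER_SIZE + length)
    {
      type: type,
      flags: flags,
      stream_id: stream_id & 0x7FFFFFFF,
      payload: frame.byteslice(HEADER_SIZE, length)
    }
  end
end
```

With this shape, a timeout at any point leaves the buffer starting at a frame boundary, so a later `read_frame` sees a header again rather than leftover payload.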

ioquatix · 2024-01-24T19:22:44Z

Thanks for the clear explanation, it makes sense. I agree, the invariant makes sense.

ioquatix · 2024-06-10T10:09:09Z

Okay, I'm planning to work on this here: #19

ioquatix · 2024-06-10T10:55:26Z

@maruth-stripe are you using async-http on top of protocol-http2 or something else?

fables-tales · 2024-06-10T11:15:39Z

(I work with maruth.) We built our own wrapper around protocol-http2 that uses async for a gRPC client. It has some fairly specific constraints, because the Stripe codebase is 30 million lines of Ruby with about 2000 active Ruby developers.

ioquatix · 2024-06-10T13:06:52Z

Thanks @fables-tales for the clarification.

What I'm trying to understand is in what scenario you are re-entering Framer#read_frame.

In Async::HTTP::Protocol::HTTP2::Connection, we have a single background task reading frames and invoking the correct logic.

Can you help answer a few questions for me?

1. Are you assuming Framer#read_frame is re-entrant or safe to call from multiple tasks?
2. Do you have a single task invoking Framer#read_frame like https://github.com/socketry/async-http/blob/d0894a0e1c9d7af40cbf2fa82716dea8b422c4ba/lib/async/http/protocol/http2/connection.rb#L89-L92 or are you doing something different?
3. Is it only timeouts causing the problem, or are there other issues?
4. Even if we read full frames (I agree, if we can do so efficiently, it's a good idea), does it matter that Headers + Continuation frames may be read in separate operations?

fable-stripe · 2024-06-10T13:33:21Z

@maruth-stripe keep me honest here, but I don't think we are re-entering read_frame; we specifically added code to check against that.

The core of our code looks something like:

c = grpc_connection(some_hostname)
handle = c.some_rpc(some_data)
handle.blocking_response_iterator(timeout) do |decoded_message|
...
end

blocking_response_iterator then starts a task within a reactor that calls read_frame and yields messages. We do not capture the return value of read_frame, but instead use the process_* family of methods to update state, and blocking_response_iterator will yield a message (or time out) if a data frame is read.

ioquatix · 2024-06-10T14:20:03Z

Okay, so for my understanding, you aren't multiplexing requests on a single connection and instead depending on sequential processing of frames for each stream until the stream is done?

fables-tales · 2024-06-10T14:48:58Z

Multiplexing is possible:

c = connect(some_host)
a = c.rpc_1(data)
b = c.rpc_2(data)
a.blocking_response_iterator.each.take(3) do
end
b.blocking_response_iterator.each do
  ...
end
a.blocking_response_iterator.each do
end

is a pattern we support.

ioquatix · 2024-06-10T15:19:15Z

What do you do if you receive a frame for a different stream? Are you multiplexing using a queue for each stream or something like that?

fables-tales · 2024-06-10T15:25:54Z

Yes, if b's blocking response iterator is running and we receive a message for a, we store it in a queue.
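
The thread does not show how the wrapper implements this, but the idea of parking frames for other streams can be sketched roughly as follows. `StreamDemux` and `next_for` are hypothetical names, and the "wire" is a plain array of `[stream_id, payload]` pairs rather than a real framer.

```ruby
# Hypothetical demultiplexer: a single reader with one FIFO queue per stream id.
class StreamDemux
  def initialize(frames)
    # Any object responding to #shift that yields [stream_id, payload] pairs,
    # and nil once nothing is left.
    @frames = frames
    @queues = Hash.new { |hash, stream_id| hash[stream_id] = [] }
  end

  # Return the next payload for `stream_id`, parking frames for other streams.
  def next_for(stream_id)
    loop do
      queue = @queues[stream_id]
      return queue.shift unless queue.empty?

      frame = @frames.shift
      return nil if frame.nil? # nothing left on the wire

      other_id, payload = frame
      @queues[other_id] << payload
    end
  end
end

demux = StreamDemux.new([[1, "a1"], [3, "b1"], [1, "a2"], [3, "b2"]])
demux.next_for(3) # parks "a1" for stream 1, then returns "b1"
demux.next_for(1) # returns the parked "a1" without touching the wire
```

Because every frame passes through the same reader before being queued, this arrangement only stays correct if each read leaves the wire at a frame boundary, which is the invariant discussed earlier in the thread.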

fables-tales changed the title ~~Framer's sequential reads of frame header then payload can leave input buffer in a corrupted state~~ Framer's sequential reads of frame header then payload can leave underlying async stream's @read_buffer in a corrupted state Nov 3, 2023

ioquatix self-assigned this Nov 3, 2023

ioquatix added the bug Something isn't working label Nov 3, 2023

maruth-stripe mentioned this issue Nov 6, 2023

Add functionality to be able to peek n bytes off the wire socketry/async-io#72

Merged

3 tasks

ioquatix closed this as completed in socketry/async-io#72 Nov 9, 2023

maruth-stripe mentioned this issue Nov 9, 2023

Ensure wire always contains a full H/2 frame #15

Merged

3 tasks

ioquatix linked a pull request Jun 10, 2024 that will close this issue

Ensure wire always contains a full H/2 frame (#15) #19

Open

3 tasks

ioquatix reopened this Jun 10, 2024
