Search before asking
- I had searched in the issues and found no similar feature requirement.
Description
Motivation
From a recent performance test led by @mrproliu, the BanyanDB liaison gRPC server is vulnerable to OOM errors when subjected to high-throughput write traffic. If a client sends data faster than the server can process and persist it, the gRPC library's internal buffers will grow indefinitely, consuming all available heap memory. This shows up frequently in heap profiles as memory accumulating under measure.Recv() during heavy load.
To ensure server stability and prevent crashes, we need to introduce mechanisms that:
- Actively shed load when the system is under high memory pressure.
- Intelligently configure gRPC's network buffers to provide backpressure before the heap is exhausted.
Proposed Solution
This proposal outlines a two-pronged approach to control heap usage by integrating the existing protector service with the gRPC server's lifecycle and configuration.
1. Load Shedding via Protector State
We will implement a gRPC Stream Server Interceptor that queries the protector's state before allowing a new stream to be handled.
- Dependency Injection: The liaison/grpc server will need a reference to the protector service, which should be passed in during initialization (an illustrative wiring sketch follows the pseudocode below).
- Interceptor Logic:
  - For each new incoming stream, the interceptor will check the current system state by calling protector.State().
  - If protector.State() returns StateHigh, it indicates that system memory usage has crossed the configured high-water mark.
  - In this StateHigh condition, the interceptor will immediately reject the new stream with a codes.ResourceExhausted gRPC status. This provides clear, immediate backpressure to the client, signaling that the server is temporarily unable to accept new workloads.
  - If the state is StateLow, the stream will be processed as normal.
// pseudocode for the interceptor; the signature matches grpc.StreamServerInterceptor
func (s *server) protectorLoadSheddingInterceptor(
	srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler,
) error {
	if s.protector.State() == protector.StateHigh {
		s.log.Warn().Msg("rejecting new stream due to high memory pressure")
		return status.Errorf(codes.ResourceExhausted, "server is busy, please retry later")
	}
	return handler(srv, ss)
}
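To show how the dependency injection and interceptor registration could fit together, here is a minimal, self-contained sketch. The protectorService interface, the State constants, the stub implementation, and the main() wiring are assumptions made for illustration; the real protector API in BanyanDB may differ.

package main

import (
	"log"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// Assumed minimal view of the protector's state for this sketch.
type State int

const (
	StateLow State = iota
	StateHigh
)

// protectorService is the narrow interface the liaison server needs here.
type protectorService interface {
	State() State
}

// server keeps the injected protector reference (dependency injection).
type server struct {
	protector protectorService
}

// protectorLoadSheddingInterceptor rejects new streams while memory pressure is high.
func (s *server) protectorLoadSheddingInterceptor(
	srv interface{}, ss grpc.ServerStream, info *grpc.StreamServerInfo, handler grpc.StreamHandler,
) error {
	if s.protector.State() == StateHigh {
		return status.Error(codes.ResourceExhausted, "server is busy, please retry later")
	}
	return handler(srv, ss)
}

// stubProtector always reports low pressure; the real protector tracks memory usage.
type stubProtector struct{}

func (stubProtector) State() State { return StateLow }

func main() {
	s := &server{protector: stubProtector{}} // protector injected at initialization
	grpcServer := grpc.NewServer(
		grpc.ChainStreamInterceptor(s.protectorLoadSheddingInterceptor), // runs before service handlers
	)
	lis, err := net.Listen("tcp", ":17912")
	if err != nil {
		log.Fatal(err)
	}
	_ = grpcServer.Serve(lis)
}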
2. Dynamic gRPC Buffer Sizing Based on Available Memory
Instead of using fixed, static buffer sizes, we will dynamically calculate the gRPC HTTP/2 flow-control windows at server startup based on the available system memory reported by the protector.
- Startup Logic: During the Serve() phase of the gRPC server, it will query the system's available memory. This can be done by calling the protector.
- Configuration: Introduce a new configuration flag, e.g., grpc.buffer.memory-ratio (defaulting to 0.10 for 10%). This will determine what fraction of the available system memory should be allocated to gRPC's connection-level buffers.
- Heuristic for Window Calculation (see the sketch after this list):
  - totalBufferSize = availableMemory * memoryRatio
  - InitialConnWindowSize = totalBufferSize * 2 / 3
  - InitialWindowSize = totalBufferSize * 1 / 3
  - This 2:1 ratio ensures the connection-level buffer is larger than any single stream's buffer, which is a common and effective practice.
- Applying the Options: The calculated values will be passed to grpc.NewServer() using the grpc.InitialWindowSize() and grpc.InitialConnWindowSize() server options.
- Override Mechanism: The existing static configuration flags for window sizes (grpc.InitialWindowSize, etc.) should take precedence. If a user sets a specific value, the dynamic calculation will be skipped. This allows for expert manual tuning.
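As a concrete illustration of the heuristic: with 8 GiB of available memory and the default ratio of 0.10, totalBufferSize is about 819 MiB, giving a connection window of roughly 546 MiB and a per-stream window of roughly 273 MiB. Below is a minimal sketch of the startup calculation under these assumptions; the windowOptions helper, its parameter names, and the clamping to int32 (the gRPC window options accept int32 values) are illustrative, not existing BanyanDB code.

package main

import (
	"math"

	"google.golang.org/grpc"
)

// windowOptions derives the HTTP/2 flow-control windows from available memory.
// availableMemory is in bytes; memoryRatio is the proposed grpc.buffer.memory-ratio flag.
// staticStreamWindow/staticConnWindow stand in for the existing static flags.
func windowOptions(availableMemory uint64, memoryRatio float64, staticStreamWindow, staticConnWindow int32) []grpc.ServerOption {
	// Override mechanism: user-provided static values skip the dynamic heuristic.
	if staticStreamWindow > 0 || staticConnWindow > 0 {
		return []grpc.ServerOption{
			grpc.InitialWindowSize(staticStreamWindow),
			grpc.InitialConnWindowSize(staticConnWindow),
		}
	}
	totalBuffer := int64(float64(availableMemory) * memoryRatio)
	// The gRPC options take int32, so clamp very large results.
	clamp := func(v int64) int32 {
		if v > math.MaxInt32 {
			return math.MaxInt32
		}
		return int32(v)
	}
	return []grpc.ServerOption{
		grpc.InitialConnWindowSize(clamp(totalBuffer * 2 / 3)), // connection-level window (2/3)
		grpc.InitialWindowSize(clamp(totalBuffer / 3)),         // per-stream window (1/3)
	}
}

func main() {
	// Example: 8 GiB available, 10% ratio, no static override.
	opts := windowOptions(8<<30, 0.10, 0, 0)
	_ = grpc.NewServer(opts...)
}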
Use case
No response
Related issues
No response
Are you willing to submit a pull request to implement this on your own?
- Yes I am willing to submit a pull request on my own!
Code of Conduct
- I agree to follow this project's Code of Conduct