Description
In the PQ, batches are unbroken sequences of events, and are currently constrained to originate from a single page, in part to facilitate the ACK-ing of the batch's range without needing to ACK each individual event.
This limitation can cause undersized batches. In an extreme example, a pipeline configured with the default page size of 64MiB, processing events that are 500KiB in size, will emit one 125-event batch (~61MiB) and one ~6-event batch (~3MiB; the rest of the page).
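The arithmetic in that example can be checked directly. A quick sketch (the 64MiB page size and 125-event batch size come from the example above; the computation itself is illustrative, not PQ code):

```java
public class BatchMath {
    public static void main(String[] args) {
        long pageSize = 64L * 1024 * 1024;   // default page size: 64MiB
        long eventSize = 500L * 1024;        // example event size: 500KiB
        int batchSize = 125;                 // default batch size from the example

        long eventsPerPage = pageSize / eventSize;  // how many events fit in a page
        long remainder = eventsPerPage - batchSize; // events left after one full batch

        System.out.println("events per page: " + eventsPerPage); // 131
        System.out.println("remainder batch: " + remainder);     // 6
    }
}
```

So a full page yields one full batch plus a 6-event tail batch that cannot be topped up from the next page.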
In theory, a batch could be made to safely span multiple pages, while maintaining the unbroken sequential ordering guarantees of the existing implementation.
Implementation Notes
To do this I would extract an interface from the existing `Batch`, rename the current single-page-origin implementation to something like `BatchSegment`, and add a new implementation that composes multiple `BatchSegment`s in-order and delegates to their methods as necessary.
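The shape of that refactor might look roughly like this. All names and signatures below are hypothetical simplifications for illustration; the real PQ `Batch` API differs:

```java
import java.util.List;

// Hypothetical interface extracted from the existing Batch.
interface Batch {
    int eventCount();
    void ack(); // acknowledge this batch's event range
}

// The current single-page-origin implementation, renamed:
// an unbroken run of events read from one page.
class BatchSegment implements Batch {
    private final int events;
    BatchSegment(int events) { this.events = events; }
    public int eventCount() { return events; }
    public void ack() { /* ACK this segment's contiguous range on its page */ }
}

// New implementation: composes multiple in-order segments
// and delegates to their methods as necessary.
class CompositeBatch implements Batch {
    private final List<BatchSegment> segments;
    CompositeBatch(List<BatchSegment> segments) { this.segments = segments; }
    public int eventCount() {
        return segments.stream().mapToInt(BatchSegment::eventCount).sum();
    }
    public void ack() {
        // Each segment still ACKs a single contiguous range on its own page,
        // preserving the cheap range-ACK property.
        segments.forEach(BatchSegment::ack);
    }
}

public class BatchSketch {
    public static void main(String[] args) {
        Batch b = new CompositeBatch(
            List.of(new BatchSegment(125), new BatchSegment(6)));
        System.out.println(b.eventCount()); // 131
    }
}
```

The key property is that ACK-ing stays per-segment, so each page still sees one contiguous range rather than per-event ACKs.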
The reader would then need to be modified to continue across page boundaries, and to return enough information that the resulting composite batch could track its multiple segments. The existing synchronization that ensures at most one worker contends for the read lock will also ensure that the batch, as composed, remains an unbroken sequence.
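The reader-side change could be sketched as a loop that accumulates one segment per page until the batch is full. This is a self-contained simulation, not the real reader; the page array and helper below are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;

public class CrossPageRead {
    // Simulated pages: number of unread events remaining on each page.
    // Page 0 holds a 6-event tail left over from a previous full batch.
    static int[] pages = {6, 131, 131};
    static int pageIndex = 0;

    // Read up to `max` events from the current page; a single read
    // never crosses a page boundary, matching the existing constraint.
    static int readFromCurrentPage(int max) {
        int n = Math.min(max, pages[pageIndex]);
        pages[pageIndex] -= n;
        if (pages[pageIndex] == 0) pageIndex++; // current page exhausted: advance
        return n;
    }

    public static void main(String[] args) {
        int batchSize = 125;
        List<Integer> segmentSizes = new ArrayList<>();
        int remaining = batchSize;
        // Continue across page boundaries until the batch is full,
        // recording each per-page segment for the composite batch to track.
        while (remaining > 0 && pageIndex < pages.length) {
            int seg = readFromCurrentPage(remaining);
            segmentSizes.add(seg);
            remaining -= seg;
        }
        // The 6-event tail of page 0 plus 119 events from page 1
        // form one full-size, still-unbroken batch.
        System.out.println(segmentSizes); // [6, 119]
    }
}
```

Because at most one worker holds the read position at a time, the segments gathered by this loop are guaranteed to be consecutive, preserving the unbroken-sequence property.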
This may dovetail with the refactoring required for #17821.