Description
Based on the guidance given in twmb/franz-go#440 (comment), we should decide what our AtLeastOnceDelivery guarantees are. Currently, users may expect that a consumer retries the passed Processor when processing fails, but that is not the case. With AtLeastOnceDelivery, redelivery only occurs if a consumer has fetched records from Kafka but the process is terminated, crashes, or loses network connectivity. After the consumer misses enough heartbeats, the Kafka broker considers that instance of the consumer group dead and reassigns its partitions to other active consumers. However, if the configured Processor fails to process even a single record, the commit is skipped; the next Fetch may pull new records and commit the new offset, causing up to MaxPollRecords records to be lost for a partition.
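The failure mode described above can be sketched as follows. This is a hypothetical simplification for illustration, not the consumer's actual code: `record`, `process`, and `consumeFetch` are invented names standing in for the fetched record type, the configured Processor, and the fetch-processing loop.

```go
package main

import (
	"errors"
	"fmt"
)

// record is a simplified stand-in for a fetched Kafka record.
type record struct {
	offset int64
	value  string
}

// process simulates the configured Processor failing on one record.
func process(r record) error {
	if r.value == "bad" {
		return errors.New("unprocessable record")
	}
	return nil
}

// consumeFetch mirrors the problematic flow: processing stops at the first
// error, the commit is skipped, but the next Fetch still advances past these
// offsets, so the failed record and everything after it in the fetch is lost.
func consumeFetch(records []record) (processed, lost []int64) {
	for i, r := range records {
		if err := process(r); err != nil {
			for _, l := range records[i:] {
				lost = append(lost, l.offset)
			}
			return processed, lost
		}
		processed = append(processed, r.offset)
	}
	return processed, lost
}

func main() {
	fetch := []record{{10, "ok"}, {11, "bad"}, {12, "ok"}, {13, "ok"}}
	processed, lost := consumeFetch(fetch)
	fmt.Println("processed:", processed) // processed: [10]
	fmt.Println("lost:", lost)           // lost: [11 12 13]
}
```

One failed record at offset 11 thus costs the partition every later record in the fetch as well.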
Lines 76 to 82 in a482794:

```go
// Processor that will be used to process each event individually.
// It is recommended to keep the synchronous processing fast and below the
// rebalance.timeout.ms setting in Kafka.
//
// The processing time of each processing cycle can be calculated as:
// record.process.time * MaxPollRecords.
Processor model.BatchProcessor
```
Lines 64 to 69 in a482794:

```go
// MaxPollRecords defines an upper bound to the number of records that can
// be polled on a single fetch. If MaxPollRecords <= 0, defaults to 100.
//
// It is best to keep the number of polled records small or the consumer
// risks being forced out of the group if it exceeds rebalance.timeout.ms.
MaxPollRecords int
```
Proposed solution(s)
1. Document the current behavior, add logging
Implemented in #125
The simplest path forward would be to continue processing the records that can be processed, logging those that cannot, instead of stopping partition processing at the failed offset and potentially losing the rest of that fetch. This leads to less data loss for AtMostOnceDelivery. Update the documentation accordingly.
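This continue-and-log behavior can be sketched as below. The names (`record`, `process`, `processFetch`) are hypothetical simplifications, not the consumer's actual API:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// record is a simplified stand-in for a fetched Kafka record.
type record struct {
	offset int64
	value  string
}

// process simulates the configured Processor failing on one record.
func process(r record) error {
	if r.value == "bad" {
		return errors.New("unprocessable record")
	}
	return nil
}

// processFetch keeps going past failed records, logging each failure, so one
// bad record no longer discards the remainder of the fetch. It returns the
// offsets that were processed successfully.
func processFetch(records []record, logf func(format string, args ...any)) []int64 {
	var ok []int64
	for _, r := range records {
		if err := process(r); err != nil {
			logf("skipping record at offset %d: %v", r.offset, err)
			continue
		}
		ok = append(ok, r.offset)
	}
	return ok
}

func main() {
	fetch := []record{{10, "ok"}, {11, "bad"}, {12, "ok"}}
	fmt.Println("processed:", processFetch(fetch, log.Printf)) // processed: [10 12]
}
```

The failed record is still dropped, which is why this option only narrows the loss rather than eliminating it.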
2. Implement some sort of per-partition processing retries
Instead of dropping the events that can't be processed, a bounded per-partition buffer could hold them, to be retried on a timer or whenever a fetch returns no new records for that partition.
This option increases the memory footprint of the consumer and is slightly more complicated to implement well.
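A bounded per-partition retry buffer might look like the following sketch. Everything here (`retryBuffer`, `add`, `drain`, the `record` type) is a hypothetical design, not an existing implementation; the bound is what keeps the added memory footprint predictable:

```go
package main

import "fmt"

// record is a simplified stand-in for a fetched Kafka record.
type record struct {
	partition int32
	offset    int64
	value     string
}

// retryBuffer holds failed records per partition, bounded per partition so
// the consumer's memory footprint stays predictable.
type retryBuffer struct {
	max     int
	pending map[int32][]record
}

func newRetryBuffer(max int) *retryBuffer {
	return &retryBuffer{max: max, pending: make(map[int32][]record)}
}

// add buffers a failed record for later retry; it returns false (and the
// record is dropped) when the partition's buffer is already full.
func (b *retryBuffer) add(r record) bool {
	q := b.pending[r.partition]
	if len(q) >= b.max {
		return false
	}
	b.pending[r.partition] = append(q, r)
	return true
}

// drain retries buffered records for a partition, keeping only those that
// still fail. It would be invoked on a timer, or when a fetch returns no new
// records for that partition.
func (b *retryBuffer) drain(partition int32, process func(record) error) (retried, failed int) {
	var still []record
	for _, r := range b.pending[partition] {
		if err := process(r); err != nil {
			still = append(still, r)
			failed++
			continue
		}
		retried++
	}
	b.pending[partition] = still
	return retried, failed
}

func main() {
	b := newRetryBuffer(2)
	b.add(record{0, 5, "bad"})
	b.add(record{0, 6, "bad"})
	fmt.Println(b.add(record{0, 7, "bad"})) // false: buffer full, record dropped

	// Later, the downstream issue clears and the retries succeed.
	retried, failed := b.drain(0, func(record) error { return nil })
	fmt.Println(retried, failed) // 2 0
}
```

Even with buffering, delivery stays best-effort: once the per-partition bound is hit, further failing records are dropped, so the bound trades memory against how long an outage can be absorbed.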