
kafka(consumer): Document/clarify the consumer behavior, or implement re-processing if processor fails #118

@marclop

Description


Based on the guidance given in twmb/franz-go#440 (comment), we should decide what our AtLeastOnceDelivery guarantees are. Currently, it may be expected that a consumer retries the passed Processor when processing fails, but that is not the case.

AtLeastOnceDelivery only comes into play when a consumer has fetched records from Kafka but the process terminates, crashes, or loses network connectivity before committing. After the consumer misses enough check-ins, the Kafka broker considers that consumer instance dead and reassigns its partitions to the other active members of the consumer group.
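For reference, a minimal sketch of the franz-go options that govern this liveness check, assuming the kgo client (the broker address, group, topic, and timeout values below are placeholders):

package consumer

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// newClient shows the knobs that decide when the broker considers a group
// member dead: a consumer that stops heartbeating within SessionTimeout
// (or exceeds RebalanceTimeout during a rebalance) has its partitions
// reassigned, and any records it fetched but did not commit are
// redelivered to other members.
func newClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		kgo.ConsumerGroup("my-group"),     // placeholder group
		kgo.ConsumeTopics("my-topic"),     // placeholder topic
		kgo.SessionTimeout(45*time.Second),
		kgo.RebalanceTimeout(60*time.Second),
		kgo.DisableAutoCommit(), // commit manually after processing
	)
}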

However, if the configured Processor fails to process even a single record, the commit for that fetch is skipped; the next fetch may pull new records and commit the newer offset, causing up to MaxPollRecords to be lost for a partition.

// Processor that will be used to process each event individually.
// It is recommended to keep the synchronous processing fast and below the
// rebalance.timeout.ms setting in Kafka.
//
// The processing time of each processing cycle can be calculated as:
// record.process.time * MaxPollRecords.
Processor model.BatchProcessor

// MaxPollRecords defines an upper bound to the number of records that can
// be polled on a single fetch. If MaxPollRecords <= 0, defaults to 100.
//
// It is best to keep the number of polled records small or the consumer
// risks being forced out of the group if it exceeds rebalance.timeout.ms.
MaxPollRecords int
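
To make the failure mode concrete, here is a minimal sketch of a consume loop that behaves as described above, assuming the franz-go kgo client; process is a hypothetical stand-in for decoding the fetched records and invoking the configured Processor.

package consumer

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// run polls, processes and commits in a loop. Note that a processing
// failure skips the commit but does not rewind the client's internal
// offsets, so the failed records are never re-fetched in this session.
func run(ctx context.Context, client *kgo.Client, maxPollRecords int,
	process func(context.Context, []*kgo.Record) error) {
	for {
		fetches := client.PollRecords(ctx, maxPollRecords)
		if fetches.IsClientClosed() {
			return
		}
		records := fetches.Records()
		if err := process(ctx, records); err != nil {
			// Commit skipped; the next successful cycle commits the
			// newer offset, losing up to MaxPollRecords per partition.
			continue
		}
		client.CommitRecords(ctx, records...)
	}
}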

Proposed solution(s)

1. Document the current behavior, add logging

Implemented in #125

The simplest path forward would be to continue processing the records that can be processed, logging those that cannot, instead of stopping the partition's processing at the failed offset and potentially losing the rest of the fetch. This will lead to less data loss for AtMostOnceDelivery. Update the documentation accordingly.
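
A sketch of what this could look like, assuming per-record processing and a zap logger; processRecord is a hypothetical helper that decodes and processes a single record:

package consumer

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
	"go.uber.org/zap"
)

// processFetch processes each record individually, logging and skipping
// failures so a single bad record no longer discards the rest of the
// fetch, then commits the whole fetch.
func processFetch(ctx context.Context, client *kgo.Client, fetches kgo.Fetches,
	logger *zap.Logger, processRecord func(context.Context, *kgo.Record) error) error {
	fetches.EachRecord(func(r *kgo.Record) {
		if err := processRecord(ctx, r); err != nil {
			logger.Error("failed to process record, skipping",
				zap.String("topic", r.Topic),
				zap.Int32("partition", r.Partition),
				zap.Int64("offset", r.Offset),
				zap.Error(err),
			)
		}
	})
	// Commit everything that was fetched, processed or not.
	return client.CommitUncommittedOffsets(ctx)
}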

2. Implement some sort of per-partition processing retries

Instead of dropping the events that cannot be processed, a bounded per-partition buffer could be kept and retried on a timer, or whenever there are no active records fetched for the partition.

This option increases the memory footprint of the consumer and is slightly more complicated to implement well.
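
A rough sketch of what such a buffer could look like; every name here is hypothetical and this is not implemented behavior.

package consumer

import (
	"context"
	"sync"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// retryBuffer holds records that failed processing, bounded per partition.
type retryBuffer struct {
	mu      sync.Mutex
	pending map[int32][]*kgo.Record // keyed by partition
	max     int                     // per-partition bound
}

// add buffers a failed record, reporting false when the partition's
// buffer is full and the caller must drop the record (or block).
func (b *retryBuffer) add(r *kgo.Record) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending[r.Partition]) >= b.max {
		return false
	}
	b.pending[r.Partition] = append(b.pending[r.Partition], r)
	return true
}

// retryLoop periodically re-attempts buffered records, keeping only those
// that fail again.
func (b *retryBuffer) retryLoop(ctx context.Context, interval time.Duration,
	process func(context.Context, *kgo.Record) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			b.mu.Lock()
			for p, recs := range b.pending {
				remaining := recs[:0]
				for _, r := range recs {
					if err := process(ctx, r); err != nil {
						remaining = append(remaining, r)
					}
				}
				b.pending[p] = remaining
			}
			b.mu.Unlock()
		}
	}
}

Note that committing offsets around retried records is the hard part: a partition's committed offset cannot safely move past a record that is still pending, which is part of why this option is more complicated to implement well.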
