
kafka(consumer): Document/clarify the consumer behavior, or implement re-processing if processor fails #118

@marclop

Description


Based on the guidance given in twmb/franz-go#440 (comment), we should decide what our AtLeastOnceDelivery guarantees are. Currently, it may be expected that a consumer retries the passed Processor when processing fails, but that is not the case.

AtLeastOnceDelivery only comes into play when a consumer has fetched records from Kafka but the process terminates, crashes, or loses network connectivity before committing. After the consumer misses enough check-ins, the Kafka broker considers that consumer instance dead and reassigns its partitions to the other active members of the consumer group.
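For reference, a minimal sketch of the franz-go options that govern this liveness check, assuming the kgo client (the broker address, group, topic, and timeout values below are placeholders):

package consumer

import (
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// newClient shows the knobs that decide when the broker considers a group
// member dead: a consumer that stops heartbeating within SessionTimeout
// (or exceeds RebalanceTimeout during a rebalance) has its partitions
// reassigned, and any records it fetched but did not commit are
// redelivered to other members.
func newClient() (*kgo.Client, error) {
	return kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"), // placeholder broker
		kgo.ConsumerGroup("my-group"),     // placeholder group
		kgo.ConsumeTopics("my-topic"),     // placeholder topic
		kgo.SessionTimeout(45*time.Second),
		kgo.RebalanceTimeout(60*time.Second),
		kgo.DisableAutoCommit(), // commit manually after processing
	)
}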

However, if the configured Processor fails to process even a single record, the commit for that fetch is skipped; the next fetch may pull new records and commit the newer offset, causing up to MaxPollRecords to be lost for a partition.

// Processor that will be used to process each event individually.
// It is recommended to keep the synchronous processing fast and below the
// rebalance.timeout.ms setting in Kafka.
//
// The processing time of each processing cycle can be calculated as:
// record.process.time * MaxPollRecords.
Processor model.BatchProcessor

// MaxPollRecords defines an upper bound to the number of records that can
// be polled on a single fetch. If MaxPollRecords <= 0, defaults to 100.
//
// It is best to keep the number of polled records small or the consumer
// risks being forced out of the group if it exceeds rebalance.timeout.ms.
MaxPollRecords int
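
To make the failure mode concrete, here is a minimal sketch of a consume loop that behaves as described above, assuming the franz-go kgo client; process is a hypothetical stand-in for decoding the fetched records and invoking the configured Processor.

package consumer

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
)

// run polls, processes and commits in a loop. Note that a processing
// failure skips the commit but does not rewind the client's internal
// offsets, so the failed records are never re-fetched in this session.
func run(ctx context.Context, client *kgo.Client, maxPollRecords int,
	process func(context.Context, []*kgo.Record) error) {
	for {
		fetches := client.PollRecords(ctx, maxPollRecords)
		if fetches.IsClientClosed() {
			return
		}
		records := fetches.Records()
		if err := process(ctx, records); err != nil {
			// Commit skipped; the next successful cycle commits the
			// newer offset, losing up to MaxPollRecords per partition.
			continue
		}
		client.CommitRecords(ctx, records...)
	}
}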

Proposed solution(s)

1. Document the current behavior, add logging

Implemented in #125

The simplest path forward would be to continue processing the records that can be processed, logging those that cannot, instead of stopping the partition's processing at the failed offset and potentially losing the rest of the fetch. This will lead to less data loss for AtMostOnceDelivery. Update the documentation accordingly.
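
A sketch of what this could look like, assuming per-record processing and a zap logger; processRecord is a hypothetical helper that decodes and processes a single record:

package consumer

import (
	"context"

	"github.com/twmb/franz-go/pkg/kgo"
	"go.uber.org/zap"
)

// processFetch processes each record individually, logging and skipping
// failures so a single bad record no longer discards the rest of the
// fetch, then commits the whole fetch.
func processFetch(ctx context.Context, client *kgo.Client, fetches kgo.Fetches,
	logger *zap.Logger, processRecord func(context.Context, *kgo.Record) error) error {
	fetches.EachRecord(func(r *kgo.Record) {
		if err := processRecord(ctx, r); err != nil {
			logger.Error("failed to process record, skipping",
				zap.String("topic", r.Topic),
				zap.Int32("partition", r.Partition),
				zap.Int64("offset", r.Offset),
				zap.Error(err),
			)
		}
	})
	// Commit everything that was fetched, processed or not.
	return client.CommitUncommittedOffsets(ctx)
}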

2. Implement some sort of per-partition processing retries

Instead of dropping the events that cannot be processed, a bounded per-partition buffer could be kept and retried on a timer, or whenever there are no active records fetched for the partition.

This option increases the memory footprint of the consumer and is slightly more complicated to implement well.
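
A rough sketch of what such a buffer could look like; every name here is hypothetical and this is not implemented behavior.

package consumer

import (
	"context"
	"sync"
	"time"

	"github.com/twmb/franz-go/pkg/kgo"
)

// retryBuffer holds records that failed processing, bounded per partition.
type retryBuffer struct {
	mu      sync.Mutex
	pending map[int32][]*kgo.Record // keyed by partition
	max     int                     // per-partition bound
}

// add buffers a failed record, reporting false when the partition's
// buffer is full and the caller must drop the record (or block).
func (b *retryBuffer) add(r *kgo.Record) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if len(b.pending[r.Partition]) >= b.max {
		return false
	}
	b.pending[r.Partition] = append(b.pending[r.Partition], r)
	return true
}

// retryLoop periodically re-attempts buffered records, keeping only those
// that fail again.
func (b *retryBuffer) retryLoop(ctx context.Context, interval time.Duration,
	process func(context.Context, *kgo.Record) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			b.mu.Lock()
			for p, recs := range b.pending {
				remaining := recs[:0]
				for _, r := range recs {
					if err := process(ctx, r); err != nil {
						remaining = append(remaining, r)
					}
				}
				b.pending[p] = remaining
			}
			b.mu.Unlock()
		}
	}
}

Note that committing offsets around retried records is the hard part: a partition's committed offset cannot safely move past a record that is still pending, which is part of why this option is more complicated to implement well.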
