Hey.
We have a topic with 50 partitions, with consumers scaled out to a relatively small number of hosts (3). Sporadically, we get a situation where one particular partition gets stuck and does not commit any new offsets. As soon as we restart the hosts, it kicks back into life and immediately catches up.
We see this in the logs very frequently after it becomes stuck:
Skipping fetching records for assigned partition <mypartition> because it is paused
The last couple of FS2 Kafka logs mentioning the stuck partition, before it was stuck, say this:
Completed fetches with records for partitions [ <mypartition> -> { first: 45677489, last: 45677495 }, .... ]
Followed by logs that never mention the partition again:
Current state [State(fetches = Map(... a number of other partitions, but not <mypartition> ... ), ...)]
In trying to figure out what's going on, I ended up looking at this bit of the code:
def pollConsumer(state: State[F, K, V]): F[ConsumerRecords] =
  withConsumer
    .blocking { consumer =>
      val assigned = consumer.assignment.toSet
      val requested = state.fetches.keySetStrict
      val available = state.records.keySetStrict
      val resume = (requested intersect assigned) diff available
      val pause = assigned diff resume
      if (pause.nonEmpty)
        consumer.pause(pause.asJava)
      if (resume.nonEmpty)
        consumer.resume(resume.asJava)
      consumer.poll(pollTimeout)
    }
    .flatMap(records)
Filling in all the possible combinations, I think we get these possible outcomes:

I've highlighted the state that I think we're in. Given that the Java Kafka client says it will skip fetching for paused partitions, I'm a little confused about how the partition is then expected to resume.
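For reference, the combinations can be reproduced mechanically from the set expressions in pollConsumer. This is only a minimal sketch on plain Scala sets, with no FS2 Kafka dependency: Partition, decide and outcome are hypothetical names I made up for illustration, not library API.

```scala
object PauseResumeTable {
  // Hypothetical stand-in for Kafka's TopicPartition.
  final case class Partition(name: String)

  // Same set arithmetic as pollConsumer, applied to plain Scala sets.
  def decide(
    assigned: Set[Partition],
    requested: Set[Partition],
    available: Set[Partition]
  ): (Set[Partition], Set[Partition]) = {
    val resume = (requested intersect assigned) diff available
    val pause  = assigned diff resume
    (resume, pause)
  }

  // Outcome for a single partition, given its membership in each of the three sets.
  def outcome(isAssigned: Boolean, isRequested: Boolean, isAvailable: Boolean): String = {
    val p = Partition("p")
    def only(flag: Boolean): Set[Partition] = if (flag) Set(p) else Set.empty
    val (resume, pause) = decide(only(isAssigned), only(isRequested), only(isAvailable))
    if (resume.contains(p)) "resume"
    else if (pause.contains(p)) "pause"
    else "untouched"
  }

  // Print one line per combination of the three flags.
  def main(args: Array[String]): Unit =
    for {
      a <- Seq(true, false)
      r <- Seq(true, false)
      v <- Seq(true, false)
    } println(s"assigned=$a requested=$r available=$v -> ${outcome(a, r, v)}")
}
```

Under these assumptions, the only combination that resumes is assigned, requested and not available; every other assigned combination ends up paused. Our logs show the partition missing from state.fetches, which would put it in one of the assigned but not requested rows.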
If I pull down the FS2 Kafka codebase and replace the intersect with union, the KafkaConsumerSpec tests still pass, and it would cause our highlighted scenario to be resumed (I think), but I don't know what the consequences of that would be!
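To make the difference concrete at the set level only, here is a rough sketch comparing the two expressions on the same inputs. It says nothing about how the rest of FS2 Kafka would react; Decision, withIntersect and withUnion are hypothetical names, and partitions are modelled as plain strings.

```scala
object IntersectVsUnion {
  // Hypothetical stand-in for Kafka's TopicPartition.
  type Partition = String

  final case class Decision(resume: Set[Partition], pause: Set[Partition])

  // The expression as it currently appears in pollConsumer.
  def withIntersect(
    assigned: Set[Partition],
    requested: Set[Partition],
    available: Set[Partition]
  ): Decision = {
    val resume = (requested intersect assigned) diff available
    Decision(resume, assigned diff resume)
  }

  // The experimental variant with intersect swapped for union.
  def withUnion(
    assigned: Set[Partition],
    requested: Set[Partition],
    available: Set[Partition]
  ): Decision = {
    val resume = (requested union assigned) diff available
    Decision(resume, assigned diff resume)
  }

  def main(args: Array[String]): Unit = {
    val assigned  = Set("p0", "p1")
    val requested = Set("p1", "p2") // p2 is requested but not assigned
    val available = Set.empty[Partition]

    println(withIntersect(assigned, requested, available))
    println(withUnion(assigned, requested, available))
  }
}
```

With these inputs, the intersect version resumes only p1 and pauses p0, while the union version resumes p0, p1 and p2 and pauses nothing. So the swap would resume assigned partitions that nothing has requested, and it could also pass unassigned partitions such as p2 to consumer.resume, which, as far as I know, the Java client rejects with an IllegalStateException.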
It feels like this should be a more widespread issue given how central this code is, and how long it's been like this for, so I bet I'm missing something.
Thanks