
Stuck partitions - question about the pause/resume logic #1230

@simonpetty

Hey.

We have a topic with 50 partitions, with consumers scaled out to a relatively small number of hosts (3). Sporadically, we get a situation where one particular partition gets stuck and does not commit any new offsets. As soon as we restart the hosts, it kicks back into life and immediately catches up.

We see this in the logs very frequently after it becomes stuck:

Skipping fetching records for assigned partition <mypartition> because it is paused

The last couple of FS2 Kafka log lines mentioning the stuck partition, before it became stuck, say this:

Completed fetches with records for partitions [ <mypartition> -> { first: 45677489, last: 45677495 }, .... ]

These are followed by logs that never mention the partition again:

Current state [State(fetches = Map(... a number of other partitions, but not <mypartition> ... ), ...)]

In trying to figure out what's going on, I ended up looking at this bit of the code:

def pollConsumer(state: State[F, K, V]): F[ConsumerRecords] =
  withConsumer
    .blocking { consumer =>
      // Partitions currently assigned to this consumer.
      val assigned = consumer.assignment.toSet
      // Partitions with outstanding fetch requests from downstream streams.
      val requested = state.fetches.keySetStrict
      // Partitions that already have records buffered in the state.
      val available = state.records.keySetStrict

      // Resume assigned partitions that are requested and have nothing buffered;
      // pause every other assigned partition.
      val resume = (requested intersect assigned) diff available
      val pause = assigned diff resume

      if (pause.nonEmpty)
        consumer.pause(pause.asJava)

      if (resume.nonEmpty)
        consumer.resume(resume.asJava)

      consumer.poll(pollTimeout)
    }
    .flatMap(records)

Filling in all the possible combinations, I think we get these possible outcomes:

[Screenshot: table of the possible requested/available/assigned combinations and the resulting pause/resume outcome, with one row highlighted]

I've highlighted the state that I think we're in, and, given that the Java Kafka client says it will skip fetching for paused partitions, I'm a little confused about how the partition is then expected to resume.
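To make that concrete, here's a small standalone sketch (plain Scala with hypothetical partition names, assuming only kafka-clients on the classpath; it is not fs2-kafka code) that replays the same set arithmetic with one assigned partition per requested/available combination:

import org.apache.kafka.common.TopicPartition

// Hypothetical partitions, one per requested/available combination.
val p0 = new TopicPartition("mytopic", 0) // requested, nothing buffered
val p1 = new TopicPartition("mytopic", 1) // requested, records buffered
val p2 = new TopicPartition("mytopic", 2) // not requested, records buffered
val p3 = new TopicPartition("mytopic", 3) // not requested, nothing buffered

val assigned  = Set(p0, p1, p2, p3) // consumer.assignment.toSet
val requested = Set(p0, p1)         // state.fetches.keySetStrict
val available = Set(p1, p2)         // state.records.keySetStrict

val resume = (requested intersect assigned) diff available
val pause  = assigned diff resume

println(resume) // only mytopic-0
println(pause)  // mytopic-1, mytopic-2 and mytopic-3

As far as I can tell from this method alone, a paused partition only flips into resume on a later pass once it shows up in state.fetches again (and has no buffered records), so everything hinges on a new fetch request being registered for it.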

If I pull down the FS2 Kafka codebase and replace the intersect with union, the KafkaConsumerSpec tests still pass, and it would cause our highlighted scenario to be resumed (I think), but I don't know what the consequences of that would be!
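For reference, this is the one-line local change I experimented with (my own modification, not what the library ships), pulled out as a pure function so it can be compared against the original:

import org.apache.kafka.common.TopicPartition

// Local experiment only: the same computation with intersect swapped for union.
def pauseAndResumeWithUnion(
    assigned: Set[TopicPartition],
    requested: Set[TopicPartition],
    available: Set[TopicPartition]
): (Set[TopicPartition], Set[TopicPartition]) = {
  val resume = (requested union assigned) diff available
  val pause  = assigned diff resume
  (pause, resume)
}

// With the example sets from the sketch above, resume becomes
// {mytopic-0, mytopic-3} and pause shrinks to {mytopic-1, mytopic-2},
// i.e. the never-requested partition would now be un-paused.

One consequence I can see: requested union assigned can also contain partitions that are requested but no longer assigned (say, right after a rebalance), and KafkaConsumer.resume throws IllegalStateException for partitions the consumer doesn't currently own, so the result would presumably need to be intersected with assigned again before calling resume.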

It feels like this should be a more widespread issue, given how central this code is and how long it's been this way, so I bet I'm missing something.

Thanks
