Description
Bug Description
Hi Strimzi experts,
I'm running a Strimzi-managed Kafka Connect cluster in a single Kubernetes namespace.
Alongside it, I have hundreds of KafkaConnector resources in a different namespace, each listening to its own MongoDB instance.
The Strimzi operator is deployed via Helm chart version 0.45.0 from oci://quay.io/strimzi-helm/strimzi-kafka-operator.
I'm facing an issue where a single stuck KafkaConnector CR with connectivity issues to MongoDB blocks the entire reconciliation loop of the Strimzi Kafka Connect operator.
This blocks the operator from processing other connectors and even the cluster's resources like Kafka Connect pods or brokers. I can't restart resources via annotations, and other connectors aren't being registered in or deleted from Kafka Connect. Attempts to restart them or use "rollingUpdate: true" don't work.
Here's a log excerpt from the operator for that single KafkaConnector:
2025-09-30 10:27:12 WARN KafkaConnectAssemblyOperator:584 - Reconciliation #28129(connector-watch) KafkaConnect (data-sources/data-sources-connect-cluster):
Error reconciling connector connector-xxxxx io.strimzi.operator.cluster.operator.assembly.ConnectRestException: PUT /connectors/connector-xxxxx/config returned 500 (Internal Server Error): Request timed out.
The worker is currently performing multi-property validation for the connector, which began at 2025-09-30T10:25:42.723Z.
The only workaround that unblocked the reconciliation loop was deleting the faulty connector, which then freed up the entire cluster.
To reproduce, I simply stop the MongoDB server, which produces the validation timeout error shown above.
This feels like a critical blocker in production, as a single faulty connector can take down the whole setup by preventing hundreds of other connectors from reconciling.
With the help of @scholzj on Slack, we were able to pinpoint the exact problem.
The reconciliation is running in an infinite loop. That is likely because the error message provided by Kafka Connect is different every time due to it including the timestamp.
The DEBUG logs of the operator confirm this (note the timestamp at the end of each message):
2025-10-09 14:01:47 DEBUG StatusDiff:41 - Ignoring Status diff {"op":"replace","path":"/conditions/0/lastTransitionTime","value":"2025-10-09T14:01:47.930202242Z"}
2025-10-09 14:01:47 DEBUG StatusDiff:46 - Status differs: {"op":"replace","path":"/conditions/0/message","value":"PUT /connectors/connector-xxxxx-mongodb-event/config returned 500 (Internal Server Error): Request timed out.
The worker is currently performing multi-property validation for the connector, which began at 2025-10-09T14:00:18Z."}
2025-10-09 14:01:47 DEBUG StatusDiff:47 - Current Status path /conditions/0/message has value "PUT /connectors/connector-xxxxx-mongodb-event/config returned 500 (Internal Server Error): Request timed out.
The worker is currently performing multi-property validation for the connector, which began at 2025-10-09T13:58:47.839Z."
2025-10-09 14:01:47 DEBUG StatusDiff:48 - Desired Status path /conditions/0/message has value "PUT /connectors/connector-xxxxx-mongodb-event/config returned 500 (Internal Server Error): Request timed out.
The worker is currently performing multi-property validation for the connector, which began at 2025-10-09T14:00:18Z."
2025-10-09 14:01:47 DEBUG CustomResource:195 - Calling CustomResource#setKind doesn't do anything because the Kind is computed and shouldn't be changed
2025-10-09 14:01:47 INFO CrdOperator:123 - Reconciliation #126(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): Status of KafkaConnector connector-xxxxx-mongodb-event in namespace data-sources has been updated
2025-10-09 14:01:47 DEBUG AbstractConnectOperator:1141 - Reconciliation #126(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): Completed status update
2025-10-09 14:01:47 INFO KafkaConnectAssemblyOperator:563 - Reconciliation #126(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): reconciled
2025-10-09 14:01:47 DEBUG AbstractOperator:467 - Reconciliation #126(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): Lock lock::data-sources::KafkaConnect::data-sources-connect-cluster released
2025-10-09 14:01:48 DEBUG CustomResource:184 - Calling CustomResource#setApiVersion doesn't do anything because the API version is computed and shouldn't be changed
2025-10-09 14:01:48 DEBUG CustomResource:195 - Calling CustomResource#setKind doesn't do anything because the Kind is computed and shouldn't be changed
2025-10-09 14:01:48 DEBUG CustomResource:184 - Calling CustomResource#setApiVersion doesn't do anything because the API version is computed and shouldn't be changed
2025-10-09 14:01:48 DEBUG CustomResource:195 - Calling CustomResource#setKind doesn't do anything because the Kind is computed and shouldn't be changed
2025-10-09 14:01:48 INFO KafkaConnectAssemblyOperator:555 - Reconciliation #130(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): KafkaConnector connector-xxxxx-mongodb-event in namespace data-sources was MODIFIED
2025-10-09 14:01:48 DEBUG AbstractOperator:395 - Reconciliation #130(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): Try to acquire lock lock::data-sources::KafkaConnect::data-sources-connect-cluster
2025-10-09 14:01:48 DEBUG AbstractOperator:398 - Reconciliation #130(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): Lock lock::data-sources::KafkaConnect::data-sources-connect-cluster acquired
2025-10-09 14:01:48 INFO KafkaConnectAssemblyOperator:487 - Reconciliation #130(connector-watch) KafkaConnect(data-sources/data-sources-connect-cluster): creating/updating connector: connector-xxxxx-mongodb-event
He explained it like this:
- One of the reconciliations runs into this error due to a Connect / connector issue (in my case, me killing MongoDB on purpose) and updates the status of the connector CR with the error message
- The update of the error message in the connector status is a modification of the resource, which immediately triggers another reconciliation
- The reconciliation waits 90 seconds and runs into the same error again. This is where it should normally stop, because the error is already in the status. But because of the timestamp in the error, it is treated as a new error and the connector status is updated again
- A new reconciliation is immediately triggered by this update
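To make the feedback loop above easier to follow, here is a small toy simulation. It is only a sketch of the behaviour described, not Strimzi code: `connect_error` and `simulate` are hypothetical helpers, and the 90 second wait is left out. It counts how many status updates happen when the error text carries a changing timestamp versus a constant one.

```python
import itertools


def connect_error(started_at: str) -> str:
    """Build an error message shaped like the one in the logs above."""
    return (
        "PUT /connectors/connector-xxxxx/config returned 500 "
        "(Internal Server Error): Request timed out. The worker is currently "
        "performing multi-property validation for the connector, "
        f"which began at {started_at}."
    )


def simulate(timestamps, max_rounds=5):
    """Return the number of status updates within max_rounds reconciliations.

    Every status update stands for a MODIFIED event on the resource,
    which in turn would trigger the next reconciliation.
    """
    status_message = None
    updates = 0
    for started_at in itertools.islice(timestamps, max_rounds):
        desired = connect_error(started_at)
        if desired == status_message:
            break  # status unchanged: no update, no new event, loop ends
        status_message = desired
        updates += 1
    return updates


# A validation start time that changes on every attempt (assumed values).
changing = (f"2025-10-09T14:{m:02d}:18Z" for m in itertools.count(0, 2))
# The same start time on every attempt.
constant = itertools.repeat("2025-10-09T14:00:18Z")

print(simulate(changing))  # 5: every round updates the status again
print(simulate(constant))  # 1: the second round sees no change and stops
```

With a changing timestamp the simulation only stops because of `max_rounds`; nothing in the loop itself ever ends it.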
Steps to reproduce
- Have a KafkaConnector CR that should listen to a MongoDB instance
- Stop the MongoDB instance
- Watch the operator get stuck in an infinite loop of reconciliation
Expected behavior
The operator should be resilient and skip the failed connector.
It should continue to process other resources.
I understand the fix could be on the Kafka Connect side (by removing the timestamp from the message), but I don't know the possible side effects of such a change.
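Purely as an illustration of an operator-side alternative, and not a description of how Strimzi compares statuses today, the comparison could mask timestamps before deciding whether the message changed. The names `TIMESTAMP`, `normalize`, and `status_differs` are made up for this sketch, and the regular expression only assumes the ISO-8601 UTC format seen in the logs above.

```python
import re

# Matches timestamps such as 2025-10-09T13:58:47.839Z or 2025-10-09T14:00:18Z.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z")


def normalize(message: str) -> str:
    """Replace every timestamp in the message with a fixed placeholder."""
    return TIMESTAMP.sub("<timestamp>", message)


def status_differs(current: str, desired: str) -> bool:
    """Compare two condition messages while ignoring embedded timestamps."""
    return normalize(current) != normalize(desired)


prefix = (
    "PUT /connectors/connector-xxxxx-mongodb-event/config returned 500 "
    "(Internal Server Error): Request timed out. The worker is currently "
    "performing multi-property validation for the connector, which began at "
)
current = prefix + "2025-10-09T13:58:47.839Z."
desired = prefix + "2025-10-09T14:00:18Z."

print(current != desired)                # True: a raw comparison sees a change
print(status_differs(current, desired))  # False: only the timestamp differs
```

A genuinely different error would still be detected, because only the timestamp portion is masked.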
Strimzi version
0.45.0
Kubernetes version
v1.31.10
Installation method
Helm chart
Infrastructure
OnPremise
Configuration files and logs
No response
Additional context
No response