Under certain circumstances, e.g. network errors, a service can stop due to an error, but no attempt is made to restart it.
The regular health check notices that the service is not running, and therefore that the app is not fully running, but no remedial action is taken:
ERROR Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService will stop due to error. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
java.util.concurrent.ExecutionException: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
at org.apache.kafka.common.internals.KafkaFutureImpl.wrapAndThrow(KafkaFutureImpl.java:45) ~[kafka-clients-2.4.0.jar:?]
at org.apache.kafka.common.internals.KafkaFutureImpl.access$000(KafkaFutureImpl.java:32) ~[kafka-clients-2.4.0.jar:?]
at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:89) ~[kafka-clients-2.4.0.jar:?]
at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:260) ~[kafka-clients-2.4.0.jar:?]
at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.minPartitionNum(MultiClusterTopicManagementService.java:324) ~[kafka-monitor-2.5.12.jar:2.5.12]
at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementHelper.maybeCreateTopic(MultiClusterTopicManagementService.java:313) ~[kafka-monitor-2.5.12.jar:2.5.12]
at com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService$TopicManagementRunnable.run(MultiClusterTopicManagementService.java:179) [kafka-monitor-2.5.12.jar:2.5.12]
at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source) [?:?]
at java.util.concurrent.FutureTask.runAndReset(Unknown Source) [?:?]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.kafka.common.errors.TimeoutException: Aborted due to timeout.
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService stopped. (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO Topic-management-service-for-single-cluster-monitor/MultiClusterTopicManagementService shutdown completed (com.linkedin.xinfra.monitor.services.MultiClusterTopicManagementService)
INFO TopicManagementService is not running. (com.linkedin.xinfra.monitor.apps.SingleClusterMonitor)
ERROR App single-cluster-monitor is not fully running. (com.linkedin.xinfra.monitor.XinfraMonitor)
Ideally, failing services should be restarted automatically.
If, however, dealing with the complexities of that (backoff, retry limits, etc.) is not desirable within the scope of this project, it should at the very least provide an option to shut down the monitor when an app is not fully running, so the situation can be handled by the scheduler that runs xinfra-monitor.
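To make the first suggestion concrete, here is a minimal sketch of a restart-with-backoff supervisor. The `Service` interface and all class, method, and parameter names are hypothetical and only mirror the general start/isRunning shape of the services above; this is not the project's actual API, just an illustration of the behaviour being asked for.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Sketch only: periodically checks a service and restarts it with exponential
 * backoff, giving up (and exiting) after a configurable number of consecutive
 * failures so an external scheduler can take over.
 */
public class ServiceSupervisor {

  public interface Service {
    void start();
    boolean isRunning();
  }

  private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
  private final Service service;
  private final int maxRestarts;      // give up after this many consecutive restart attempts
  private final long baseBackoffMs;   // initial backoff, doubled on each consecutive failure
  private int consecutiveFailures = 0;

  public ServiceSupervisor(Service service, int maxRestarts, long baseBackoffMs) {
    this.service = service;
    this.maxRestarts = maxRestarts;
    this.baseBackoffMs = baseBackoffMs;
  }

  /** Schedule a recurring health check that restarts the service if it has stopped. */
  public void supervise(long checkIntervalMs) {
    scheduler.scheduleAtFixedRate(() -> {
      if (service.isRunning()) {
        consecutiveFailures = 0;       // healthy again, reset the backoff
        return;
      }
      if (consecutiveFailures >= maxRestarts) {
        // Out of retries: fail loudly so whatever runs the process can restart it.
        System.err.println("Service did not recover after " + maxRestarts + " restarts; exiting.");
        System.exit(1);
      }
      long backoff = baseBackoffMs << consecutiveFailures;   // exponential backoff
      consecutiveFailures++;
      scheduler.schedule(service::start, backoff, TimeUnit.MILLISECONDS);
    }, checkIntervalMs, checkIntervalMs, TimeUnit.MILLISECONDS);
  }
}
```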
At the moment it just logs an error:
https://github.com/linkedin/kafka-monitor/blob/master/src/main/java/com/linkedin/xinfra/monitor/XinfraMonitor.java#L143-L145
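For the simpler fallback, the check could optionally terminate the JVM instead of only logging, so that whatever scheduler runs xinfra-monitor (systemd, Kubernetes, etc.) restarts the whole process. The sketch below assumes a hypothetical `exit.when.not.fully.running` property and a health-check supplier; neither is part of the actual XinfraMonitor code, they just show where such a switch could hook in.

```java
import java.util.Properties;
import java.util.function.BooleanSupplier;

/**
 * Sketch only: exit with a non-zero status when the app is not fully running,
 * gated behind a config flag so the current log-only behaviour stays the default.
 */
public class ExitOnUnhealthy {

  public static void checkAndMaybeExit(Properties props, BooleanSupplier appFullyRunning) {
    boolean exitWhenUnhealthy =
        Boolean.parseBoolean(props.getProperty("exit.when.not.fully.running", "false"));

    if (!appFullyRunning.getAsBoolean()) {
      System.err.println("App is not fully running.");
      if (exitWhenUnhealthy) {
        // Non-zero exit code signals the external scheduler to restart the monitor.
        System.exit(1);
      }
    }
  }
}
```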