Labels: kind/bug (Something isn't working)
Description
What happened?
I am using a GKE Kubernetes cluster. Every week automatic maintenance kicks in and upgrades the nodes, and whenever that happens my streaming applications are not able to restart themselves. Here is the restart policy I have:
restartPolicy:
  type: Always
  onFailureRetries: 3
  onFailureRetryInterval: 30
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 60
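For context, here is roughly where that policy sits in my SparkApplication spec (a minimal sketch reconstructed from the spark-submit arguments in the logs below, not my full manifest; surrounding fields are abbreviated):

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: cdc-to-iceberg-spark-job
  namespace: default
spec:
  type: Scala
  mode: cluster
  # image, class, and jar taken from the operator's spark-submit arguments
  image: tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5
  mainClass: com.tassat.falcon.CDCToIceberg
  mainApplicationFile: local:///opt/spark/jars/falcon_2.12-1.0.2.jar
  sparkVersion: "3.5.3"
  restartPolicy:
    type: Always
    onFailureRetries: 3
    onFailureRetryInterval: 30
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 60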
Here are the operator logs:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:249
2025-09-20T18:56:01.608Z INFO sparkapplication/controller.go:190 Finished reconciling SparkApplication {"name": "cdc-to-iceberg-spark-job", "namespace": "default"}
2025-09-20T18:56:01.608Z ERROR controller/controller.go:341 Reconciler error {"controller": "spark-application-controller", "namespace": "default", "name": "cdc-to-iceberg-spark-job", "reconcileID": "17378a5f-3cd2-43b2-8fea-675942b9c8be", "error": "resources associated with SparkApplication name: cdc-to-iceberg-spark-job namespace: default, needed to be deleted"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:341
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:288
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:249
2025-09-20T18:56:01.697Z INFO sparkapplication/controller.go:648 Submitting SparkApplication {"name": "falcon-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:01.749Z INFO sparkapplication/controller.go:696 Created web UI service for SparkApplication {"name": "falcon-spark-job", "namespace": "default"}
2025-09-20T18:56:01.749Z INFO sparkapplication/controller.go:756 Running spark-submit for SparkApplication {"name": "falcon-spark-job", "namespace": "default", "arguments": ["--master", "k8s://https://10.18.253.1:443", "--deploy-mode", "cluster", "--class", "com.tassat.falcon.KafkaToGCSLoader", "--name", "falcon-spark-job", "--conf", "spark.kubernetes.namespace=default", "--conf", "spark.kubernetes.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.kubernetes.submission.waitAppCompletion=false", "--conf", "spark.metrics.conf.master.sink.prometheusServlet.path=/metrics/master/prometheus", "--conf", "spark.sql.streaming.metricsEnabled=true", "--conf", "spark.ui.proxyBase=/default/falcon-spark-job", "--conf", "spark.ui.proxyRedirectUri=/", "--conf", "spark.kubernetes.submission.waitAppCompletion=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus", "--conf", "spark.metrics.conf.applications.sink.prometheusServlet.path=/metrics/applications/prometheus", "--conf", "spark.kubernetes.driver.pod.name=falcon-spark-job-driver", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=falcon-spark-job", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=9ce54c24-90ec-44c0-8869-dc820926280b", "--conf", "spark.kubernetes.driver.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.driver.cores=1", "--conf", "spark.kubernetes.driver.request.cores=0.1", "--conf", "spark.kubernetes.driver.limit.cores=2000m", "--conf", "spark.driver.memory=2g", "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=falcon-spark-job", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=9ce54c24-90ec-44c0-8869-dc820926280b", "--conf", "spark.executor.instances=1", "--conf", "spark.kubernetes.executor.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.executor.cores=1", "--conf", "spark.kubernetes.executor.request.cores=0.2", "--conf", "spark.kubernetes.executor.limit.cores=2000m", "--conf", "spark.executor.memory=2g", "--conf", "spark.kubernetes.authenticate.executor.serviceAccountName=spark", "local:///opt/spark/jars/falcon_2.12-1.0.0.jar", "--kafkaBootstrapServers", "twp-kafka-001:9093", "--kafkaTopicPattern", "^dwh-.*-platformdb\\..*", "--maxOffsetsPerTrigger", "10000", "--gcsWarehouse", "gs://twp-spark-workload/lighthouse", "--checkpointPath", "gs://twp-spark-workload/lighthouse-checkpoint", "--googleCloudKeyfile", "/etc/secrets/gcp/tassat-dwh-prod-38271034e7ca.json", "--kafkaSecurityProtocol", "SSL", "--kafkaSslTruststoreLocation", "/etc/secrets/kafka/dwh-kafka-client-truststore.p12", "--kafkaSslKeystoreLocation", "/etc/secrets/kafka/dwh-kafka-client-keystore.p12"]}
2025-09-20T18:56:01.806Z INFO sparkapplication/event_handler.go:99 Spark pod deleted {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "phase": "Failed"}
2025-09-20T18:56:02.889Z INFO sparkapplication/controller.go:177 Reconciling SparkApplication {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:02.889Z INFO sparkapplication/controller.go:648 Submitting SparkApplication {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:02.937Z INFO sparkapplication/controller.go:696 Created web UI service for SparkApplication {"name": "cdc-to-iceberg-spark-job", "namespace": "default"}
2025-09-20T18:56:02.938Z INFO sparkapplication/controller.go:756 Running spark-submit for SparkApplication {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "arguments": ["--master", "k8s://https://10.18.253.1:443", "--deploy-mode", "cluster", "--class", "com.tassat.falcon.CDCToIceberg", "--name", "cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.namespace=default", "--conf", "spark.kubernetes.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.kubernetes.submission.waitAppCompletion=false", "--conf", "spark.sql.streaming.metricsEnabled=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus", "--conf", "spark.metrics.conf.master.sink.prometheusServlet.path=/metrics/master/prometheus", "--conf", "spark.ui.proxyBase=/default/cdc-to-iceberg-spark-job", "--conf", "spark.ui.proxyRedirectUri=/", "--conf", "spark.metrics.conf.applications.sink.prometheusServlet.path=/metrics/applications/prometheus", "--conf", "spark.executor.memoryOverhead=1024m", "--conf", "spark.kubernetes.submission.waitAppCompletion=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet", "--conf", "spark.kubernetes.driver.pod.name=cdc-to-iceberg-spark-job-driver", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=41d7297e-f3c7-4563-99e0-1975ebf00780", "--conf", "spark.kubernetes.driver.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.driver.cores=1", "--conf", "spark.kubernetes.driver.request.cores=0.1", "--conf", "spark.kubernetes.driver.limit.cores=2000m", "--conf", "spark.driver.memory=4g", "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=41d7297e-f3c7-4563-99e0-1975ebf00780", "--conf", "spark.executor.instances=2", "--conf", "spark.kubernetes.executor.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.executor.cores=2", "--conf", "spark.kubernetes.executor.request.cores=0.2", "--conf", "spark.kubernetes.executor.limit.cores=2000m", "--conf", "spark.executor.memory=8g", "--conf", "spark.executor.memoryOverhead=1g", "--conf", "spark.kubernetes.authenticate.executor.serviceAccountName=spark", "local:///opt/spark/jars/falcon_2.12-1.0.2.jar", "--kafkaBootstrapServers", "twp-kafka-001:9093", "--kafkaTopicPattern", "^dwh-.*-platformdb\\..*", "--maxOffsetsPerTrigger", "10000", "--gcsWarehouse", "gs://twp-spark-workload/iceberg-warehouse", "--checkpointPath", "gs://twp-spark-workload/cdc-to-iceberg-checkpoint", "--googleCloudKeyfile", "/etc/secrets/gcp/tassat-dwh-prod-38271034e7ca.json", "--kafkaSecurityProtocol", "SSL", "--kafkaSslTruststoreLocation", "/etc/secrets/kafka/dwh-kafka-client-truststore.p12", "--kafkaSslKeystoreLocation", "/etc/secrets/kafka/dwh-kafka-client-keystore.p12"]}
2025-09-20T18:56:06.961Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "falcon-spark-job-driver", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:08.771Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:10.741Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:12.011Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "falcon-spark-job-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:14.814Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:14.923Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "cdctoiceberg-1d847299687be182-exec-2", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:15.803Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:16.023Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "cdctoiceberg-1d847299687be182-exec-2", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:16.225Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:17.323Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:57:16.845Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "oldPhase": "Running", "newPhase": "Failed"}
2025-09-20T18:57:17.192Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "cdctoiceberg-1d847299687be182-exec-3", "namespace": "default", "phase": "Pending"}
2025-09-20T18:57:17.276Z INFO sparkapplication/event_handler.go:99 Spark pod deleted {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "phase": "Failed"}
2025-09-20T18:57:41.521Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "cdctoiceberg-1d847299687be182-exec-3", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T19:04:50.900Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "oldPhase": "Running", "newPhase": "Failed"}
2025-09-20T19:04:51.151Z INFO sparkapplication/event_handler.go:99 Spark pod deleted {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "phase": "Failed"}
2025-09-20T19:04:51.635Z INFO sparkapplication/event_handler.go:60 Spark pod created {"name": "kafkatogcsloader-a8841499687be68f-exec-2", "namespace": "default", "phase": "Pending"}
2025-09-20T19:06:06.433Z INFO sparkapplication/event_handler.go:84 Spark pod updated {"name": "kafkatogcsloader-a8841499687be68f-exec-2", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-21T13:12:38.198Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "oldState": "SUBMISSION_FAILED", "newState": "SUBMISSION_FAILED"}
2025-09-21T13:12:38.198Z INFO sparkapplication/event_handler.go:188 SparkApplication updated {"name": "falcon-spark-job", "namespace": "default", "oldState": "SUBMISSION_FAILED", "newState": "SUBMISSION_FAILED"}
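For what it's worth, the reconciler error above ("resources associated with SparkApplication ... needed to be deleted") suggests the operator was blocked on leftover resources from the pre-upgrade run. A quick way I check for leftovers after a maintenance window (assuming the default operator labels visible in the spark-submit arguments above; the app name is just my job as an example):

# List driver/executor pods and the web UI service the operator created for the app
kubectl get pods,services -n default -l sparkoperator.k8s.io/app-name=cdc-to-iceberg-spark-job

# Inspect the application's recorded state and recent events
kubectl describe sparkapplication cdc-to-iceberg-spark-job -n default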
Reproduction Code
No response
Expected behavior
No response
Actual behavior
No response
Environment & Versions
- Kubernetes Version: 1.33.4-gke
- Spark Operator Version: 2.1.0
- Apache Spark Version: 3.5.3
Additional context
No response
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.