
Spark operator unable to restart applications after automatic node upgrades/restarts #2658


Description

@kant777

What happened?

I am using a GKE Kubernetes cluster, and every week automatic maintenance kicks in and upgrades the nodes. Whenever that happens, my streaming applications are not able to restart themselves. Here is the restart policy I have:

```yaml
restartPolicy:
  type: Always
  onFailureRetries: 3
  onFailureRetryInterval: 30
  onSubmissionFailureRetries: 5
  onSubmissionFailureRetryInterval: 60
```
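For context, this is where that `restartPolicy` sits in the SparkApplication spec. A minimal sketch only: the `restartPolicy` values are the ones above, but the image, main class, and jar path are placeholders rather than my real job's values.

```yaml
# Minimal SparkApplication (v1beta2 CRD) showing restartPolicy placement.
# image, mainClass, and mainApplicationFile below are placeholders.
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: example-streaming-job
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: example-registry/spark:3.5.3
  mainClass: com.example.StreamingJob
  mainApplicationFile: local:///opt/spark/jars/example.jar
  sparkVersion: "3.5.3"
  restartPolicy:
    type: Always
    onFailureRetries: 3
    onFailureRetryInterval: 30
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 60
  driver:
    serviceAccount: spark
  executor:
    instances: 1
```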

And here are the operator logs:

```
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:249
2025-09-20T18:56:01.608Z  INFO   sparkapplication/controller.go:190  Finished reconciling SparkApplication  {"name": "cdc-to-iceberg-spark-job", "namespace": "default"}
2025-09-20T18:56:01.608Z  ERROR  controller/controller.go:341  Reconciler error  {"controller": "spark-application-controller", "namespace": "default", "name": "cdc-to-iceberg-spark-job", "reconcileID": "17378a5f-3cd2-43b2-8fea-675942b9c8be", "error": "resources associated with SparkApplication name: cdc-to-iceberg-spark-job namespace: default, needed to be deleted"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler
    /go/pkg/mod/sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:341
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem
    /go/pkg/mod/sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:288
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func2.2
    /go/pkg/mod/sigs.k8s.io/controller-runtime@…/pkg/internal/controller/controller.go:249
2025-09-20T18:56:01.697Z  INFO   sparkapplication/controller.go:648  Submitting SparkApplication  {"name": "falcon-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:01.749Z  INFO   sparkapplication/controller.go:696  Created web UI service for SparkApplication  {"name": "falcon-spark-job", "namespace": "default"}
2025-09-20T18:56:01.749Z  INFO   sparkapplication/controller.go:756  Running spark-submit for SparkApplication  {"name": "falcon-spark-job", "namespace": "default", "arguments": ["--master", "k8s://https://10.18.253.1:443", "--deploy-mode", "cluster", "--class", "com.tassat.falcon.KafkaToGCSLoader", "--name", "falcon-spark-job", "--conf", "spark.kubernetes.namespace=default", "--conf", "spark.kubernetes.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.kubernetes.submission.waitAppCompletion=false", "--conf", "spark.metrics.conf.master.sink.prometheusServlet.path=/metrics/master/prometheus", "--conf", "spark.sql.streaming.metricsEnabled=true", "--conf", "spark.ui.proxyBase=/default/falcon-spark-job", "--conf", "spark.ui.proxyRedirectUri=/", "--conf", "spark.kubernetes.submission.waitAppCompletion=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus", "--conf", "spark.metrics.conf.applications.sink.prometheusServlet.path=/metrics/applications/prometheus", "--conf", "spark.kubernetes.driver.pod.name=falcon-spark-job-driver", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=falcon-spark-job", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=9ce54c24-90ec-44c0-8869-dc820926280b", "--conf", "spark.kubernetes.driver.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.driver.cores=1", "--conf", "spark.kubernetes.driver.request.cores=0.1", "--conf", "spark.kubernetes.driver.limit.cores=2000m", "--conf", "spark.driver.memory=2g", "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=falcon-spark-job", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=9ce54c24-90ec-44c0-8869-dc820926280b", "--conf", "spark.executor.instances=1", "--conf", "spark.kubernetes.executor.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.0", "--conf", "spark.executor.cores=1", "--conf", "spark.kubernetes.executor.request.cores=0.2", "--conf", "spark.kubernetes.executor.limit.cores=2000m", "--conf", "spark.executor.memory=2g", "--conf", "spark.kubernetes.authenticate.executor.serviceAccountName=spark", "local:///opt/spark/jars/falcon_2.12-1.0.0.jar", "--kafkaBootstrapServers", "twp-kafka-001:9093", "--kafkaTopicPattern", "^dwh-.*-platformdb\\..*", "--maxOffsetsPerTrigger", "10000", "--gcsWarehouse", "gs://twp-spark-workload/lighthouse", "--checkpointPath", "gs://twp-spark-workload/lighthouse-checkpoint", "--googleCloudKeyfile", "/etc/secrets/gcp/tassat-dwh-prod-38271034e7ca.json", "--kafkaSecurityProtocol", "SSL", "--kafkaSslTruststoreLocation", "/etc/secrets/kafka/dwh-kafka-client-truststore.p12", "--kafkaSslKeystoreLocation", "/etc/secrets/kafka/dwh-kafka-client-keystore.p12"]}
2025-09-20T18:56:01.806Z  INFO   sparkapplication/event_handler.go:99  Spark pod deleted  {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "phase": "Failed"}
2025-09-20T18:56:02.889Z  INFO   sparkapplication/controller.go:177  Reconciling SparkApplication  {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:02.889Z  INFO   sparkapplication/controller.go:648  Submitting SparkApplication  {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "state": "SUBMISSION_FAILED"}
2025-09-20T18:56:02.937Z  INFO   sparkapplication/controller.go:696  Created web UI service for SparkApplication  {"name": "cdc-to-iceberg-spark-job", "namespace": "default"}
2025-09-20T18:56:02.938Z  INFO   sparkapplication/controller.go:756  Running spark-submit for SparkApplication  {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "arguments": ["--master", "k8s://https://10.18.253.1:443", "--deploy-mode", "cluster", "--class", "com.tassat.falcon.CDCToIceberg", "--name", "cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.namespace=default", "--conf", "spark.kubernetes.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.kubernetes.submission.waitAppCompletion=false", "--conf", "spark.sql.streaming.metricsEnabled=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus", "--conf", "spark.metrics.conf.master.sink.prometheusServlet.path=/metrics/master/prometheus", "--conf", "spark.ui.proxyBase=/default/cdc-to-iceberg-spark-job", "--conf", "spark.ui.proxyRedirectUri=/", "--conf", "spark.metrics.conf.applications.sink.prometheusServlet.path=/metrics/applications/prometheus", "--conf", "spark.executor.memoryOverhead=1024m", "--conf", "spark.kubernetes.submission.waitAppCompletion=true", "--conf", "spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet", "--conf", "spark.kubernetes.driver.pod.name=cdc-to-iceberg-spark-job-driver", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/app-name=cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.driver.label.sparkoperator.k8s.io/submission-id=41d7297e-f3c7-4563-99e0-1975ebf00780", "--conf", "spark.kubernetes.driver.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.driver.cores=1", "--conf", "spark.kubernetes.driver.request.cores=0.1", "--conf", "spark.kubernetes.driver.limit.cores=2000m", "--conf", "spark.driver.memory=4g", "--conf", "spark.kubernetes.authenticate.driver.serviceAccountName=spark", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/app-name=cdc-to-iceberg-spark-job", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/launched-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/mutated-by-spark-operator=true", "--conf", "spark.kubernetes.executor.label.sparkoperator.k8s.io/submission-id=41d7297e-f3c7-4563-99e0-1975ebf00780", "--conf", "spark.executor.instances=2", "--conf", "spark.kubernetes.executor.container.image=tep-artifactory-app-001.in.tassat.com/docker/spark-with-deps:1.0.5", "--conf", "spark.executor.cores=2", "--conf", "spark.kubernetes.executor.request.cores=0.2", "--conf", "spark.kubernetes.executor.limit.cores=2000m", "--conf", "spark.executor.memory=8g", "--conf", "spark.executor.memoryOverhead=1g", "--conf", "spark.kubernetes.authenticate.executor.serviceAccountName=spark", "local:///opt/spark/jars/falcon_2.12-1.0.2.jar", "--kafkaBootstrapServers", "twp-kafka-001:9093", "--kafkaTopicPattern", "^dwh-.*-platformdb\\..*", "--maxOffsetsPerTrigger", "10000", "--gcsWarehouse", "gs://twp-spark-workload/iceberg-warehouse", "--checkpointPath", "gs://twp-spark-workload/cdc-to-iceberg-checkpoint", "--googleCloudKeyfile", "/etc/secrets/gcp/tassat-dwh-prod-38271034e7ca.json", "--kafkaSecurityProtocol", "SSL", "--kafkaSslTruststoreLocation", "/etc/secrets/kafka/dwh-kafka-client-truststore.p12", "--kafkaSslKeystoreLocation", "/etc/secrets/kafka/dwh-kafka-client-keystore.p12"]}
2025-09-20T18:56:06.961Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "falcon-spark-job-driver", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:08.771Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:10.741Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "cdc-to-iceberg-spark-job-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:12.011Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "falcon-spark-job-driver", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:14.814Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:14.923Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "cdctoiceberg-1d847299687be182-exec-2", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:15.803Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:16.023Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "cdctoiceberg-1d847299687be182-exec-2", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:56:16.225Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "phase": "Pending"}
2025-09-20T18:56:17.323Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T18:57:16.845Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "oldPhase": "Running", "newPhase": "Failed"}
2025-09-20T18:57:17.192Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "cdctoiceberg-1d847299687be182-exec-3", "namespace": "default", "phase": "Pending"}
2025-09-20T18:57:17.276Z  INFO   sparkapplication/event_handler.go:99  Spark pod deleted  {"name": "cdctoiceberg-1d847299687be182-exec-1", "namespace": "default", "phase": "Failed"}
2025-09-20T18:57:41.521Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "cdctoiceberg-1d847299687be182-exec-3", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-20T19:04:50.900Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "oldPhase": "Running", "newPhase": "Failed"}
2025-09-20T19:04:51.151Z  INFO   sparkapplication/event_handler.go:99  Spark pod deleted  {"name": "kafkatogcsloader-a8841499687be68f-exec-1", "namespace": "default", "phase": "Failed"}
2025-09-20T19:04:51.635Z  INFO   sparkapplication/event_handler.go:60  Spark pod created  {"name": "kafkatogcsloader-a8841499687be68f-exec-2", "namespace": "default", "phase": "Pending"}
2025-09-20T19:06:06.433Z  INFO   sparkapplication/event_handler.go:84  Spark pod updated  {"name": "kafkatogcsloader-a8841499687be68f-exec-2", "namespace": "default", "oldPhase": "Pending", "newPhase": "Running"}
2025-09-21T13:12:38.198Z  INFO   sparkapplication/event_handler.go:188  SparkApplication updated  {"name": "cdc-to-iceberg-spark-job", "namespace": "default", "oldState": "SUBMISSION_FAILED", "newState": "SUBMISSION_FAILED"}
2025-09-21T13:12:38.198Z  INFO   sparkapplication/event_handler.go:188  SparkApplication updated  {"name": "falcon-spark-job", "namespace": "default", "oldState": "SUBMISSION_FAILED", "newState": "SUBMISSION_FAILED"}
```

Reproduction Code

No response

Expected behavior

No response

Actual behavior

No response

Environment & Versions

  • Kubernetes Version: 1.33.4-gke
  • Spark Operator Version: 2.1.0
  • Apache Spark Version: 3.5.3

Additional context

No response

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

Labels: kind/bug (Something isn't working)