layout | title |
---|---|
global | Spark on Kubernetes Development |
Kubernetes is a framework for easily deploying, scaling, and managing containerized applications. It would be useful for a user to run their Spark jobs on a Kubernetes cluster alongside their other Kubernetes-managed applications. For more about the motivations for adding this feature, see the umbrella JIRA ticket that tracks this project: SPARK-18278.
This submodule is an initial implementation of Kubernetes as a supported cluster manager for Spark, alongside Mesos, Hadoop YARN, and Standalone. This document summarizes the important points to keep in mind when developing this feature.
To build Spark with Kubernetes support, use the kubernetes profile when invoking Maven. For example, to compile only the Kubernetes core implementation module along with its dependencies:
build/mvn compile -Pkubernetes -pl resource-managers/kubernetes/core -am -DskipTests
To build a distribution of Spark with Kubernetes support, use the dev/make-distribution.sh script and add the kubernetes profile to the build arguments. Any other build arguments can be specified just as they are when building Spark normally. For example, to build Spark against Hadoop 2.7 and Kubernetes:
dev/make-distribution.sh --tgz -Phadoop-2.7 -Pkubernetes
Below is a list of the submodules for this cluster manager and what they do.
core
: Implementation of the Kubernetes cluster manager support.

integration-tests
: Integration tests for the project.

docker-minimal-bundle
: Base Dockerfiles for the driver and the executors. The Dockerfiles are used for integration tests and are also provided in packaged distributions of Spark.

integration-tests-spark-jobs
: Spark jobs that are only used in integration tests.

integration-tests-spark-jobs-helpers
: Dependencies for the Spark jobs used in integration tests. These dependencies are separated out to facilitate testing the shipping of jars to drivers running on Kubernetes clusters.
Note that the integration test framework is currently being heavily revised and is subject to change.
Note that the integration tests currently run only with Java 8.
Running any of the integration tests requires including the kubernetes-integration-tests profile in the build command. To prepare the environment for running the integration tests, the pre-integration-test phase must be run in Maven on the resource-managers/kubernetes/integration-tests module:
build/mvn pre-integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am -DskipTests
Afterwards, the integration tests can be executed with Maven or your IDE. Note that when running tests from an IDE, the pre-integration-test phase must be run every time the Spark main code changes. When running tests from the command line, the pre-integration-test phase should automatically be invoked if the integration-test phase is run.
After the above step, the integration tests can be run using the following command:
build/mvn integration-test \
-Pkubernetes -Pkubernetes-integration-tests \
-pl resource-managers/kubernetes/integration-tests -am
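The two Maven invocations above share everything except the lifecycle phase and any trailing arguments. As a rough sketch of how they could be kept in sync, the helper below prints the command for a given phase instead of running it. The function name `k8s_it_command` is hypothetical and not part of Spark; the profiles, module path, and flags are the ones used in this document.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): print the Maven command for one
# lifecycle phase of the Kubernetes integration-tests module.
# Usage: k8s_it_command <phase> [extra maven args...]
k8s_it_command() {
  phase="$1"
  shift
  printf 'build/mvn %s -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am' "$phase"
  # Append any extra arguments, each separated by a single space.
  for extra in "$@"; do
    printf ' %s' "$extra"
  done
  printf '\n'
}

# Prepare the environment, then run the tests (printed only, not executed).
k8s_it_command pre-integration-test -DskipTests
k8s_it_command integration-test
```

Because the helper only prints, the output can be reviewed before being run from the root of the Spark source tree.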
To run the integration tests against a cluster of your choice, specify the master URL and the driver and executor images:
build/mvn integration-test \
-Pkubernetes -Pkubernetes-integration-tests \
-pl resource-managers/kubernetes/integration-tests -am \
-DextraScalaTestArgs="-Dspark.kubernetes.test.master=k8s://https:// -Dspark.docker.test.driverImage= -Dspark.docker.test.executorImage="
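The three test properties in the command above are passed as a single space-separated, quoted value, which is easy to get wrong by hand. The following is a minimal sketch of assembling that value from shell arguments. The function names, the example host, and the image names are all hypothetical placeholders; only the property names and the `k8s://` prefix come from the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): build the value passed to
# -DextraScalaTestArgs when targeting an existing cluster.
# Arguments: <master URL without the k8s:// prefix> <driver image> <executor image>
k8s_extra_test_args() {
  master_url="$1"
  driver_image="$2"
  executor_image="$3"
  printf '%s' "-Dspark.kubernetes.test.master=k8s://${master_url} -Dspark.docker.test.driverImage=${driver_image} -Dspark.docker.test.executorImage=${executor_image}"
}

# Print (rather than run) the full Maven invocation so it can be checked first.
print_integration_test_command() {
  printf 'build/mvn integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am "-DextraScalaTestArgs=%s"\n' "$(k8s_extra_test_args "$@")"
}

# Placeholder values for illustration only; substitute your own cluster and images.
print_integration_test_command \
  "https://cluster.example.invalid:6443" \
  "example/spark-driver:dev" \
  "example/spark-executor:dev"
```

Quoting the whole `-DextraScalaTestArgs=...` argument keeps the three properties together as one Maven argument.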
The integration tests make use of Minikube, which starts a virtual machine and sets up a single-node Kubernetes cluster within it. By default, the VM is destroyed after the tests finish. If you want to preserve the VM, e.g. to reduce the running time of tests during development, you can pass the property spark.docker.test.persistMinikube to the test process:
build/mvn integration-test \
-Pkubernetes -Pkubernetes-integration-tests \
-pl resource-managers/kubernetes/integration-tests -am \
-DextraScalaTestArgs=-Dspark.docker.test.persistMinikube=true
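During development it may be convenient to switch VM persistence on and off without retyping the command. The sketch below shows one possible way to do that. The function names and the `true`/`false` argument convention are assumptions made for illustration; the property name is the one given above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): emit the extra Maven argument that
# keeps the Minikube VM alive, or nothing when persistence is not requested.
# Usage: minikube_extra_args true|false
minikube_extra_args() {
  persist="${1:-false}"
  if [ "$persist" = "true" ]; then
    printf '%s' "-DextraScalaTestArgs=-Dspark.docker.test.persistMinikube=true"
  fi
}

# Print the integration-test command, with or without the persistence flag.
print_minikube_test_command() {
  extra="$(minikube_extra_args "${1:-false}")"
  # ${extra:+ $extra} adds a leading space only when extra is non-empty.
  printf 'build/mvn integration-test -Pkubernetes -Pkubernetes-integration-tests -pl resource-managers/kubernetes/integration-tests -am%s\n' "${extra:+ $extra}"
}

print_minikube_test_command true
print_minikube_test_command false
```

With `false` (or no argument), the printed command is the plain integration-test invocation, so the default behavior of destroying the VM is unchanged.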
See the usage guide for more information.