Home
We are hoping to build a Jupyter extension that allows one to create and customize Spark clusters through a simple UI, bypassing the cumbersome syntax of configuring and connecting to a Spark cluster programmatically. For example, creating a local Spark cluster with Python from within a Jupyter Notebook looks something like this:
```python
import pyspark

# Spark Session
spark = (
    pyspark
    .sql
    .SparkSession
    .builder
    .config("spark.master", "local[*]")
    .getOrCreate()
)

# Spark Context
sc = spark.sparkContext
```
Once one wishes to customize their Spark cluster (with several calls to `config()`), this code gets cumbersome and hard to replicate. For example, if we wished to customize the driver and were creating and customizing separate executors (using a scheduler like YARN or Kubernetes), we'd write something like:
```python
import pyspark

# Spark Session
spark = (
    pyspark
    .sql
    .SparkSession
    .builder
    .config("spark.driver.memory", "12g")
    .config("spark.driver.cores", "4")
    .config("spark.executor.instances", "24")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "57500m")
    .getOrCreate()
)

# Spark Context
sc = spark.sparkContext
```
One can imagine a well-designed user interface replacing the programmatic creation of (or connection to) a Spark cluster with a few simple UI elements. For example, sliders could configure the number of executors, the number of cores, and the amount of memory in an intuitive way, and a drop-down menu could list the available Spark clusters to which one can connect a Spark shell (the backend for interacting with Spark from Jupyter).
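As a rough sketch of what that might look like, here is one way to wire notebook widgets to a session builder using ipywidgets. The widget layout, the variable names, and the `build_session()` helper are our own assumptions, not part of any existing extension, and the Kubernetes master URL is a placeholder:

```python
import ipywidgets as widgets
import pyspark

# Sliders for the executor settings discussed above.
n_executors = widgets.IntSlider(value=4, min=1, max=64, description="Executors")
n_cores = widgets.IntSlider(value=2, min=1, max=16, description="Cores")
memory_gb = widgets.IntSlider(value=8, min=1, max=128, description="Memory (GB)")

# A drop-down listing clusters a Spark shell could connect to;
# the Kubernetes URL is a placeholder, not a real endpoint.
master = widgets.Dropdown(
    options=["local[*]", "yarn", "k8s://https://<api-server>:6443"],
    description="Master",
)

def build_session():
    """Translate the current widget state into a SparkSession."""
    return (
        pyspark.sql.SparkSession.builder
        .master(master.value)
        .config("spark.executor.instances", str(n_executors.value))
        .config("spark.executor.cores", str(n_cores.value))
        .config("spark.executor.memory", f"{memory_gb.value}g")
        .getOrCreate()
    )

# Displaying the VBox renders the controls inline in the notebook.
widgets.VBox([n_executors, n_cores, memory_gb, master])
```

Calling `build_session()` then turns whatever the user has selected into the same builder chain shown above.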
Furthermore, if we are scheduling a job on Kubernetes, we'd have to add more repetitive Kubernetes-specific options such as:
```python
    ...
    .config("spark.kubernetes.node.selector.dirac.washington.edu/instance-name", "r5-2xlarge")
    .config("spark.kubernetes.executor.request.cores", "7.5")
    .config("spark.kubernetes.executor.limit.cores", "8.0")
    ...
```
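One way the extension's backend might tame this repetition, sketched under our own assumptions (`session_from_conf()` is a hypothetical helper, and the dict simply repeats the example values above), is to collect the options in a plain dict, assembled from the UI rather than hard-coded, and apply them in a loop:

```python
import pyspark

# Example Kubernetes options, repeated from the snippet above; a UI
# could assemble this dict from its widgets instead of hard-coding it.
kubernetes_conf = {
    "spark.kubernetes.node.selector.dirac.washington.edu/instance-name": "r5-2xlarge",
    "spark.kubernetes.executor.request.cores": "7.5",
    "spark.kubernetes.executor.limit.cores": "8.0",
}

def session_from_conf(conf):
    """Apply every (key, value) pair to the builder and start the session."""
    builder = pyspark.sql.SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```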
We also want to extend the cluster creation UI to work with different schedulers, so that the same UI can be used to create or connect to a local Spark cluster, a remote YARN or Kubernetes cluster, or a standalone cluster.
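Concretely, switching schedulers mostly amounts to changing the master URL handed to the builder; the mapping below is illustrative, with placeholder hosts left as placeholders:

```python
# Illustrative mapping from a scheduler choice in the UI to the master
# URL the builder needs; <api-server> and <host> are placeholders.
MASTERS = {
    "Local": "local[*]",
    "Standalone": "spark://<host>:7077",
    "YARN": "yarn",
    "Kubernetes": "k8s://https://<api-server>:6443",
}
```

A drop-down value from the UI could be looked up here and passed to `.master()`, or combined with the `session_from_conf()` sketch above.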
This extension would be in the spirit of the Dask JupyterLab Extension (Dask being the preferred distributed computation engine within the Python ecosystem), which exposes a simple interface for creating and scaling distributed clusters on which Python code can easily be executed. There is also a GSoC project from last year that implements something in the same vein as what we are proposing, which we can build on.