
Project Motivation

We are hoping to build a Jupyter extension that allows one to create and customize Spark clusters through a simple UI, bypassing the cumbersome syntax required to configure and connect to a Spark cluster programmatically. For example, creating a local Spark cluster with Python from within a Jupyter Notebook looks something like this:

from pyspark.sql import SparkSession

# Spark Session
spark = (
    SparkSession
    .builder
    .config("spark.master", "local[*]")
    .getOrCreate()
)
# Spark Context
sc = spark.sparkContext

Once one wishes to customize their Spark cluster (with several calls to config()), this code becomes cumbersome and hard to replicate. For example, if we wished to customize the driver and to create and customize separate executors (using a scheduler like YARN or Kubernetes), we'd write something like:

from pyspark.sql import SparkSession

# Spark Session
spark = (
    SparkSession
    .builder
    .config("spark.driver.memory", "12g")
    .config("spark.driver.cores", "4")
    .config("spark.executor.instances", "24")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "57500m")
    .getOrCreate()
)
# Spark Context
sc = spark.sparkContext

One can imagine that a well-designed user interface could replace the programmatic creation of (or connection to) a Spark cluster with interactions with simple UI elements. For example, sliders could configure the number of executors, the number of cores per executor, and the amount of memory in an intuitive way, and a drop-down menu could list the available Spark clusters to which one can connect a Spark shell (the backend for interacting with Spark from Jupyter); a rough sketch of this idea follows the Kubernetes example below.

Additionally, if we are scheduling a job on Kubernetes, we'd have to add even more repetitive Kubernetes-specific options, such as:

    ...
    .config("spark.kubernetes.node.selector.dirac.washington.edu/instance-name", "r5-2xlarge")
    .config("spark.kubernetes.executor.request.cores", "7.5")
    .config("spark.kubernetes.executor.limit.cores", "8.0")
    ...
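
To make the UI idea above concrete, here is a rough, purely illustrative sketch (not part of any existing implementation) that uses ipywidgets sliders to collect executor settings and translates them into the same builder config() calls shown earlier; the widget names, value ranges, and defaults are assumptions.

from ipywidgets import IntSlider
from pyspark.sql import SparkSession

# Hypothetical UI elements; a real extension would render its own widgets in JupyterLab.
executors = IntSlider(value=4, min=1, max=64, description="Executors")
cores = IntSlider(value=2, min=1, max=16, description="Cores")
memory_gb = IntSlider(value=8, min=1, max=128, description="Memory (GB)")

def create_session():
    # Translate the current widget values into the usual builder config() calls.
    return (
        SparkSession
        .builder
        .config("spark.executor.instances", str(executors.value))
        .config("spark.executor.cores", str(cores.value))
        .config("spark.executor.memory", f"{memory_gb.value}g")
        .getOrCreate()
    )

spark = create_session()
sc = spark.sparkContext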

We also want to extend the cluster creation UI to work with different schedulers, so that the same UI can be used to create or connect to a local Spark cluster, a remote YARN or Kubernetes cluster, or a standalone Spark cluster as well; one way this could be organized is sketched below.
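
As a purely illustrative sketch (not an existing implementation), the UI could keep a small table of scheduler-specific defaults that it applies underneath the user's choices. The config keys below are real Spark options, but the preset names and default values (master URLs, container image) are placeholders.

from pyspark.sql import SparkSession

# Illustrative scheduler presets; the default values here are placeholders.
SCHEDULER_PRESETS = {
    "local": {"spark.master": "local[*]"},
    "yarn": {"spark.master": "yarn"},
    "kubernetes": {
        "spark.master": "k8s://https://kubernetes.default.svc",
        "spark.kubernetes.container.image": "spark:latest",
    },
}

def build_session(scheduler, user_config):
    # Apply the scheduler defaults first, then the settings chosen in the UI.
    builder = SparkSession.builder
    for key, value in {**SCHEDULER_PRESETS[scheduler], **user_config}.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()

# Example: a local cluster with a customized driver.
spark = build_session("local", {"spark.driver.memory": "4g"})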

Prior Work

This extension would be in the spirit of the Dask JupyterLab Extension (Dask is a popular distributed computation engine in the Python ecosystem), which exposes a simple interface to create and scale distributed clusters on which Python code can be easily executed. There is also a GSoC project from last year that we can build on, which implements something in the same vein as what we are proposing.
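
For context, a minimal sketch of the programmatic Dask workflow that the Dask JupyterLab Extension wraps in a UI looks roughly like the following (the cluster sizes are arbitrary example values):

from dask.distributed import LocalCluster, Client

# Create a local Dask cluster and connect a client to it; the extension's UI
# replaces these calls with a cluster list, buttons, and a scaling widget.
cluster = LocalCluster(n_workers=4, threads_per_worker=2)
cluster.scale(8)  # resize the cluster to eight workers
client = Client(cluster)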
