This directory contains shell scripts for running larger scale benchmarks on Databricks AWS hosted Spark service using the Databricks CLI. You will need a Databricks AWS account to run them. The benchmarks use datasets synthetically generated using gen_data.py. For convenience, these have been precomputed and currently stored in the public S3 bucket spark-rapids-ml-bm-datasets-public
. The benchmark scripts are currently configured to read the data from there.
-
Install latest databricks-cli on your local workstation. Note that Databricks has deprecated the legacy python based cli in favor of a self contained executable. Make sure the new version is first on the executables PATH.
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
-
Generate an access token for your Databricks workspace in the
User Settings
section of the workspace UI. -
Configure the access token for the Databricks CLI. If you have multiple workspaces, you should use a distinct
profile
name for this one, e.g.aws
, or else it will overwrite your currentDEFAULT
profile. This profile needs to be supplied on all invocations of thedatabricks
cli via the--profile
option. For safety, the instructions below assume you are using a newaws
profile.export DB_PROFILE=aws databricks configure --token --profile $DB_PROFILE # Host: <copy-and-paste databricks workspace url> # Token: <copy-and-paste access token from UI>
-
Next, in this directory, run the following to upload the files required to run the benchmarks:
# change below to desired dbfs location WITHOUT DBFS URI for uploading benchmarking related files export BENCHMARK_HOME=/path/to/benchmark/files/in/dbfs # need separate directory for cluster init script as databricks requires these to be stored in the workspace and not dbfs # ex: /Users/<databricks-user-name>/benchmark export WS_BENCHMARK_HOME=/path/to/benchmark/files/in/workspace ./setup.sh
This will create and copy the files into a DBFS directory at the path specified by
BENCHMARK_HOME
and a cluster init script to the workspace directory specified byWS_BENCHMARK_HOME
. The script will not overwrite existing files and instead simply print the error message returned from databricks. If overwrite is desired, first deleted the files and/or directories usingdatabricks fs rm [-r] <dbfs path>
for the dbfs files anddatabricks workspace delete [--recursive] <workspace path>
for the workspace files. Note: ExportBENCHMARK_HOME
,WS_BENCHMARK_HOME
andDB_PROFILE
in any new/different shell in which subsequent steps may be run.
-
The running time of each individual benchmark run can be limited by the
TIME_LIMIT
environment variable. The cpu kmeans benchmark takes over 9000 seconds (ie., > 2 hours) to complete. If not set, the default is3600
seconds.export TIME_LIMIT=3600
-
The benchmarks can be run as
./run_benchmark.sh [cpu|gpu|gpu_etl] [[12.2|13.3|14.3]] >> benchmark_log
The script creates a cpu or gpu cluster, respectively using the cluster specs in cpu_cluster_spec, gpu_cluster_spec, gpu_etl_cluster_spec, depending on the supplied argument. In gpu and gpu_etl mode each algorithm benchmark is run 3 times, and similarly in cpu mode, except for kmeans and random forest classifier and regressor which are each run 1 time due to their long running times. gpu_etl mode also uses the spark-rapids gpu accelerated plugin.
An optional databricks runtime version can be supplied as a second argument, with 13.3 being the default if not specified. Runtimes higher than 13.3 are only compatible with cpu and gpu modes (i.e. not gpu_etl) as they are not yet supported by the spark-rapids plugin.
-
The file
benchmark_log
will have the fit/train/transform running times and accuracy scores. A simple convenience script has been provided to extract timing information for each run:./process_bm_log.sh benchmark_log
-
Cancelling a run: Hit
Ctrl-C
and then cancel the run with the last printedrunid
(check usingtail benchmark_log
) by executing:
databricks jobs cancel-run <runid> --profile $DB_PROFILE
-
The created clusters are configured to terminate after 30 min, but can be manually terminated or deleted via the Databricks UI.
-
Monitor progress periodically in case of a possible hang, to avoid incurring cloud costs in such cases.