
[Doc]update getting started doc for EMR and databricks[skip ci] #7413

Merged · 4 commits · Jan 3, 2023
7 changes: 4 additions & 3 deletions docs/get-started/getting-started-aws-emr.md
@@ -14,6 +14,7 @@ Different versions of EMR ship with different versions of Spark, RAPIDS Accelerator,

| EMR | Spark | RAPIDS Accelerator jar | cuDF jar | xgboost4j-spark jar |
| --- | --- | --- | --- | --- |
+| 6.9 | 3.3.0 | rapids-4-spark_2.12-22.08.0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar |
| 6.8 | 3.3.0 | rapids-4-spark_2.12-22.06.0.jar | Bundled with rapids-4-spark | xgboost4j-spark_3.0-1.4.2-0.3.0.jar |
| 6.7 | 3.2.1 | rapids-4-spark_2.12-22.02.0.jar | cudf-22.02.0-cuda11.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar |
| 6.6 | 3.2.0 | rapids-4-spark_2.12-22.02.0.jar | cudf-22.02.0-cuda11.jar | xgboost4j-spark_3.0-1.2.0-0.1.0.jar |
@@ -40,7 +41,7 @@ g4dn.2xlarge nodes:

```
aws emr create-cluster \
---release-label emr-6.7.0 \
+--release-label emr-6.9.0 \
--applications Name=Hadoop Name=Spark Name=Livy Name=JupyterEnterpriseGateway \
--service-role EMR_DefaultRole \
--ec2-attributes KeyName=my-key-pair,InstanceProfile=EMR_EC2_DefaultRole \
@@ -80,8 +81,8 @@ detailed cluster configuration page.

#### Step 1: Software Configuration and Steps

-Select **emr-6.8.0** for the release, uncheck all the software options, and then check **Hadoop
-3.2.1**, **Spark 3.3.0**, **Livy 0.7.1** and **JupyterEnterpriseGateway 2.1.0**.
+Select **emr-6.9.0** for the release, uncheck all the software options, and then check **Hadoop
+3.3.3**, **Spark 3.3.0**, **Livy 0.7.1** and **JupyterEnterpriseGateway 2.6.0**.

In the "Edit software settings" field, copy and paste the configuration from the [EMR
document](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-rapids.html). You can also
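As a concrete illustration of what goes into that field, the sketch below writes a configuration file using the EMR `spark` classification that enables the RAPIDS Accelerator. This is illustrative only: the authoritative JSON should be copied from the linked EMR release guide, and the `spark-defaults` properties shown here are example values.

```shell
# Illustrative only -- copy the authoritative JSON from the EMR release
# guide linked above. "enableSparkRapids" is the EMR-level switch for the
# RAPIDS Accelerator; the spark-defaults properties are example values.
cat > emr-gpu-config.json <<'EOF'
[
  {
    "Classification": "spark",
    "Properties": {
      "enableSparkRapids": "true"
    }
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.plugins": "com.nvidia.spark.SQLPlugin",
      "spark.rapids.sql.concurrentGpuTasks": "2"
    }
  }
]
EOF
echo "wrote emr-gpu-config.json"
```

When using the CLI rather than the console, the same file can be passed to `aws emr create-cluster` with `--configurations file://emr-gpu-config.json` instead of pasting JSON into the "Edit software settings" field.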
27 changes: 13 additions & 14 deletions docs/get-started/getting-started-databricks.md
@@ -55,18 +55,17 @@ The number of GPUs per node dictates the number of Spark executors that can run
detected.

## Start a Databricks Cluster
-Create a Databricks cluster by going to Clusters, then clicking `+ Create Cluster`. Ensure the
+Create a Databricks cluster by going to "Compute", then clicking `+ Create compute`. Ensure the
cluster meets the prerequisites above by configuring it as follows:
1. Select the Databricks Runtime Version from one of the supported runtimes specified in the
Prerequisites section.
-2. Under Autopilot Options, disable autoscaling.
-3. Choose the number of workers that matches the number of GPUs you want to use.
-4. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
+2. Choose the number of workers that matches the number of GPUs you want to use.
+3. Select a worker type. On AWS, use nodes with 1 GPU each such as `p3.2xlarge` or `g4dn.xlarge`.
   p2 nodes do not meet the architecture requirements (Pascal or higher) for the Spark worker
   (although they can be used for the driver node). For Azure, choose GPU nodes such as
   Standard_NC6s_v3. For GCP, choose N1 or A2 instance types with GPUs.
-5. Select the driver type. Generally this can be set to be the same as the worker.
-6. Start the cluster.
+4. Select the driver type. Generally this can be set to be the same as the worker.
+5. Start the cluster.

## Advanced Cluster Configuration

@@ -93,25 +92,25 @@ cluster.
2. Once you are in the notebook, click the “Run All” button.
3. Ensure that the newly created init.sh script is present in the output from cell 2 and that the
contents of the script are correct.
4. Go back and edit your cluster to configure it to use the init script. To do this, click the
-   “Clusters” button on the left panel, then select your cluster.
+   “Compute” button on the left panel, then select your cluster.
5. Click the “Edit” button, then navigate down to the “Advanced Options” section. Select the “Init
Scripts” tab in the advanced options section, and paste the initialization script:
`dbfs:/databricks/init_scripts/init.sh`, then click “Add”.

![Init Script](../img/Databricks/initscript.png)
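If you have the legacy Databricks CLI installed and configured for your workspace, you can optionally sanity-check that the script landed at the expected DBFS path before restarting the cluster; the path below is the one created by the notebook above.

```shell
# Optional check, assuming the legacy Databricks CLI is installed and
# configured for your workspace; prints a hint otherwise.
check_init_script() {
  if command -v databricks >/dev/null 2>&1; then
    databricks fs ls dbfs:/databricks/init_scripts/
  else
    echo "databricks CLI not found; verify the path in the workspace UI instead"
  fi
}
check_init_script
```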

6. Now select the “Spark” tab, and paste the following config options into the Spark Config section.
Change the config values based on the workers you choose. See Apache Spark
[configuration](https://spark.apache.org/docs/latest/configuration.html) and RAPIDS Accelerator
for Apache Spark [descriptions](../configs.md) for each config.

The
[`spark.task.resource.gpu.amount`](https://spark.apache.org/docs/latest/configuration.html#scheduling)
configuration is defaulted to 1 by Databricks. That means that only 1 task can run on an
executor with 1 GPU, which is limiting, especially on the reads and writes from Parquet. Set
this to 1/(number of cores per executor) which will allow multiple tasks to run in parallel just
like the CPU side. Having the value smaller is fine as well.
Note: Please remove the `spark.task.resource.gpu.amount` config for a single-node Databricks
cluster because Spark local mode does not support GPU scheduling.
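As a sketch (assuming workers with 16 cores each, such as `g4dn.4xlarge`; adjust the fraction to 1/(cores per executor) for your instance type, and see the linked configs page for the full set of options), the Spark Config section might contain lines such as:

```
spark.plugins com.nvidia.spark.SQLPlugin
spark.task.resource.gpu.amount 0.0625
spark.rapids.sql.concurrentGpuTasks 2
```

Here 0.0625 is 1/16, which lets 16 tasks on a 16-core executor share its single GPU in parallel.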

@@ -184,7 +183,7 @@ output_path='dbfs:///FileStore/tables/mortgage_parquet_gpu/output/'
Run the notebook by clicking “Run All”.

## Hints
Spark logs in Databricks are removed upon cluster shutdown. It is possible to save logs in a cloud
storage location using Databricks [cluster log
delivery](https://docs.databricks.com/clusters/configure.html#cluster-log-delivery-1). Enable this
option before starting the cluster to capture the logs.
Binary file modified docs/img/AWS-EMR/RAPIDS_EMR_GUI_1.png
Binary file modified docs/img/AWS-EMR/RAPIDS_EMR_GUI_5.png