
EMR Serverless Custom Images

EMR Serverless now supports custom images, so you can use containers to create reproducible data pipelines.

In this example, we add the seaborn library to a custom image and use it to build a weather visualization.

Prerequisites

Important

This example is intended to be run in the us-east-1 region, as it reads data from the NOAA Global Surface Summary of Day dataset in the Registry of Open Data.

To use custom images in EMR Serverless, you'll need:

a local installation of Docker to build your image
an ECR repository to host the resulting image

We'll assume your AWS user has permission to create and update ECR repositories and to create EMR Serverless applications, and that it can use the AWS CLI.

Set up some variables to be used throughout.

AWS_REGION=us-east-1
S3_BUCKET=<your-bucket-name>
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
JOB_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/<your-emr-serverless-job-role>
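Every later command depends on these variables, so it can help to check them before going further. The following is a minimal sketch, not part of the original example: `require_vars` is a hypothetical helper that fails if a variable is empty or still contains a `<...>` placeholder.

```shell
# Hypothetical helper: fail early if a variable is empty or still a <placeholder>.
require_vars() {
    missing=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name (POSIX-compatible).
        eval "value=\${$name:-}"
        case "$value" in
            ""|*"<"*">"*)
                echo "error: $name is unset or still contains a placeholder" >&2
                missing=1
                ;;
        esac
    done
    return "$missing"
}

# Usage:
#   require_vars AWS_REGION S3_BUCKET ACCOUNT_ID JOB_ROLE_ARN || exit 1
```

This catches the common mistake of leaving `<your-bucket-name>` in place, which would otherwise surface much later as an S3 error.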

Build and publish

We'll follow the docs for customizing an image for EMR Serverless.

Create an ECR repository to publish to

aws ecr create-repository \
    --repository-name spark-seaborn

Allow any EMR Serverless application to access the custom image

aws ecr set-repository-policy \
    --repository-name spark-seaborn \
    --policy-text '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Emr Serverless Custom Image Support",
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": [
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages",
                    "ecr:GetDownloadUrlForLayer"
                ],
                "Condition":{
                    "StringLike": {
                        "aws:SourceArn": "arn:aws:emr-serverless:'${AWS_REGION}':'${ACCOUNT_ID}':/applications/*"
                    }
                }
            }
        ]
    }'
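The inline policy above relies on closing and reopening single quotes around each variable, which is easy to get wrong. One alternative, sketched here under the assumption that you are happy to write the policy to a local file, is to render it with a here-document. `render_ecr_policy` and `policy.json` are hypothetical names; the policy content is the same as above.

```shell
# Hypothetical helper: print the repository policy for a region and account.
render_ecr_policy() {
    region="$1"
    account_id="$2"
    cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Emr Serverless Custom Image Support",
            "Effect": "Allow",
            "Principal": { "Service": "emr-serverless.amazonaws.com" },
            "Action": [
                "ecr:BatchGetImage",
                "ecr:DescribeImages",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Condition": {
                "StringLike": {
                    "aws:SourceArn": "arn:aws:emr-serverless:${region}:${account_id}:/applications/*"
                }
            }
        }
    ]
}
EOF
}

# Usage (policy.json is an arbitrary local file name):
#   render_ecr_policy "$AWS_REGION" "$ACCOUNT_ID" > policy.json
#   aws ecr set-repository-policy \
#       --repository-name spark-seaborn \
#       --policy-text file://policy.json
```

Writing the policy to a file also makes it easy to inspect the expanded ARN before applying it.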

Build the image

docker build . -t $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest

# login to ECR repo
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# push the docker image
docker push $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest
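The same image URI appears in the build, push, and create-application steps. As a small sketch, a hypothetical `image_uri` helper can assemble it once so the three steps cannot drift apart; the tag defaults to `latest` to match the commands above.

```shell
# Hypothetical helper: print the ECR image URI for an account, region, repository, and tag.
image_uri() {
    printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$1" "$2" "$3" "${4:-latest}"
}

# Usage:
#   IMAGE_URI=$(image_uri "$ACCOUNT_ID" "$AWS_REGION" spark-seaborn)
#   docker build . -t "$IMAGE_URI"
#   docker push "$IMAGE_URI"
```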

Now create an EMR Serverless application with that image

Note that you can specify different images for your driver and executor using the --worker-type-specifications parameter; see the CLI instructions in the docs.

aws emr-serverless create-application \
    --name spark-seaborn \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{
        "imageUri": "'${ACCOUNT_ID}'.dkr.ecr.'${AWS_REGION}'.amazonaws.com/spark-seaborn:latest"
    }' \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "16GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 3,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "16GB"
            }
        }
    }'

Start the resulting application, replacing the application ID below with the one returned by create-application

APPLICATION_ID=00f7hdef7siki109
aws emr-serverless start-application --application-id ${APPLICATION_ID}
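The application ID above is only an example value. If you did not note the ID that create-application printed, it can be looked up by name. The sketch below assumes the application name is unique in your account and region; `find_application_id` is a hypothetical helper built on `list-applications` and a JMESPath `--query` filter.

```shell
# Hypothetical helper: print the ID of the first application with the given name.
# Prints "None" if no application matches.
find_application_id() {
    aws emr-serverless list-applications \
        --query "applications[?name=='$1'].id | [0]" \
        --output text
}

# Usage:
#   APPLICATION_ID=$(find_application_id spark-seaborn)
#   aws emr-serverless start-application --application-id "$APPLICATION_ID"
```

Alternatively, adding `--query applicationId --output text` to the create-application call prints only the new ID, which can be captured directly into `APPLICATION_ID`.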

Now let's upload our PySpark script and start a job!

aws s3 cp noaa_slugplot.py s3://${S3_BUCKET}/code/pyspark/

aws emr-serverless start-job-run \
    --name noaa-slugplot-seattle \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/noaa_slugplot.py",
            "entryPointArguments": [ "72793524234", "2022", "'${S3_BUCKET}'", "tmp/slugplots/seattle-2022.png" ]
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'

Check the job status until it finishes, replacing the job run ID below with the one returned by start-job-run

JOB_RUN_ID=00f7hqip55r36l09

aws emr-serverless get-job-run \
    --application-id $APPLICATION_ID \
    --job-run-id $JOB_RUN_ID
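A single get-job-run call only reports the current state, so waiting means calling it repeatedly. The following is a rough sketch of a polling loop, not part of the original example: `wait_for_job_run` is a hypothetical helper, the 30-second default interval is arbitrary, and the loop treats SUCCESS, FAILED, and CANCELLED as the terminal states.

```shell
# Hypothetical helper: poll a job run until it reaches a terminal state.
# Returns 0 on SUCCESS, 1 on FAILED or CANCELLED, 2 if the CLI call itself fails.
wait_for_job_run() {
    app_id="$1"
    run_id="$2"
    delay="${3:-30}"   # seconds between polls (arbitrary default)
    while :; do
        state=$(aws emr-serverless get-job-run \
            --application-id "$app_id" \
            --job-run-id "$run_id" \
            --query 'jobRun.state' \
            --output text) || return 2
        echo "job run $run_id: $state"
        case "$state" in
            SUCCESS) return 0 ;;
            FAILED|CANCELLED) return 1 ;;
        esac
        sleep "$delay"
    done
}

# Usage:
#   wait_for_job_run "$APPLICATION_ID" "$JOB_RUN_ID"
```

The job run ID can be captured when the job is submitted by adding `--query jobRunId --output text` to the start-job-run call. The loop has no overall timeout, so you may want to add one for unattended use.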

Then copy down the resulting file!

aws s3 cp s3://${S3_BUCKET}/tmp/slugplots/seattle-2022.png .

Cleanup

Stop and delete the application and ECR repository

aws emr-serverless stop-application --application-id ${APPLICATION_ID}
aws emr-serverless delete-application --application-id ${APPLICATION_ID}
aws ecr delete-repository --repository-name spark-seaborn --force
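An application has to be in a stopped or created state before it can be deleted, and stop-application returns before the application has finished stopping, so running the two commands back to back can fail. The sketch below polls get-application until the target state is reached; `wait_for_application_state` is a hypothetical helper and its retry count and interval are arbitrary defaults.

```shell
# Hypothetical helper: poll an application until it reaches the target state.
# Returns 0 when reached, 1 on timeout, 2 if the CLI call itself fails.
wait_for_application_state() {
    app_id="$1"
    target="$2"
    tries="${3:-30}"   # maximum number of polls (arbitrary default)
    delay="${4:-10}"   # seconds between polls (arbitrary default)
    i=0
    while [ "$i" -lt "$tries" ]; do
        state=$(aws emr-serverless get-application \
            --application-id "$app_id" \
            --query 'application.state' \
            --output text) || return 2
        [ "$state" = "$target" ] && return 0
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $app_id to reach $target (last state: $state)" >&2
    return 1
}

# Usage:
#   aws emr-serverless stop-application --application-id "$APPLICATION_ID"
#   wait_for_application_state "$APPLICATION_ID" STOPPED &&
#       aws emr-serverless delete-application --application-id "$APPLICATION_ID"
```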
