EMR Serverless Custom Images

EMR Serverless supports custom images, so you can package your dependencies in a container and build reproducible data pipelines.

This example adds the seaborn library to a custom image and uses it to build a weather visualization.

Prerequisites

Important

This example is intended to be run in the us-east-1 region because it reads from the NOAA Global Surface Summary of Day dataset in the Registry of Open Data.

To use custom images with EMR Serverless, you'll need:

  • a local installation of Docker to build your image
  • an ECR repository to host the resulting image

We'll assume your user can create and update ECR repositories, create EMR Serverless applications, and run the AWS CLI.

Set up some variables to be used throughout.

AWS_REGION=us-east-1
S3_BUCKET=<your-bucket-name>
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
JOB_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/<your-emr-serverless-job-role>
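The commands that follow splice these variables into JSON arguments, so an empty value tends to show up later as a confusing API error. A minimal sketch of an up-front check; `require_vars` is a hypothetical helper, not part of the AWS CLI:

```shell
# Hypothetical helper: fail early if any named variable is unset or empty.
require_vars() {
    local name value missing=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name
        eval "value=\${${name}:-}"
        if [ -z "$value" ]; then
            echo "Missing required variable: ${name}" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

It could be called as `require_vars AWS_REGION S3_BUCKET ACCOUNT_ID JOB_ROLE_ARN || exit 1`. It only detects empty values, so the `<your-bucket-name>` style placeholders still need to be replaced by hand.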

Build and publish

We'll follow the docs for customizing an image for EMR Serverless.

  • Create an ECR repository to publish to
aws ecr create-repository \
    --repository-name spark-seaborn
  • Allow any EMR Serverless application to access the custom image
aws ecr set-repository-policy \
    --repository-name spark-seaborn \
    --policy-text '{
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "Emr Serverless Custom Image Support",
                "Effect": "Allow",
                "Principal": {
                    "Service": "emr-serverless.amazonaws.com"
                },
                "Action": [
                    "ecr:BatchGetImage",
                    "ecr:DescribeImages",
                    "ecr:GetDownloadUrlForLayer"
                ],
                "Condition":{
                    "StringLike": {
                        "aws:SourceArn": "arn:aws:emr-serverless:'${AWS_REGION}':'${ACCOUNT_ID}':/applications/*"
                    }
                }
            }
        ]
    }'
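In the policy above, each `'${AWS_REGION}'` closes the single-quoted string, lets the shell expand the variable, and reopens the quotes; that is how the region and account ID end up inside the ARN. A sketch of the same document generated from a heredoc, which may be easier to read and inspect before sending; `ecr_policy_json` is a hypothetical helper name:

```shell
# Hypothetical helper: print the repository policy for a given region and account.
# The unquoted heredoc delimiter (EOF) lets the shell expand ${region} and ${account}.
ecr_policy_json() {
    local region="$1" account="$2"
    cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Emr Serverless Custom Image Support",
            "Effect": "Allow",
            "Principal": { "Service": "emr-serverless.amazonaws.com" },
            "Action": [
                "ecr:BatchGetImage",
                "ecr:DescribeImages",
                "ecr:GetDownloadUrlForLayer"
            ],
            "Condition": {
                "StringLike": {
                    "aws:SourceArn": "arn:aws:emr-serverless:${region}:${account}:/applications/*"
                }
            }
        }
    ]
}
EOF
}
```

Printing `ecr_policy_json "$AWS_REGION" "$ACCOUNT_ID"` shows the exact policy; the same output can be passed as `--policy-text "$(ecr_policy_json "$AWS_REGION" "$ACCOUNT_ID")"`.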
  • Build the image
docker build . -t $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest
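`docker build .` expects a Dockerfile in the current directory, which is not reproduced in this README. A minimal sketch of what such a file might look like, assuming it follows the pattern in the EMR Serverless custom image docs: start from the public EMR Serverless Spark base image for the same release, install packages as root, then switch back to the hadoop user. The base image tag and the helper name `write_example_dockerfile` are assumptions, and the helper writes to `Dockerfile.example` by default so it does not overwrite an existing Dockerfile:

```shell
# Hypothetical helper: write an example Dockerfile to the given path.
write_example_dockerfile() {
    local target="${1:-Dockerfile.example}"
    # Quoted delimiter ('EOF') keeps the content literal, with no shell expansion
    cat > "$target" <<'EOF'
# Assumed base image; the release should match the application's --release-label
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

# Packages are installed as root
USER root
RUN pip3 install seaborn

# EMR Serverless runs the image as the hadoop user
USER hadoop:hadoop
EOF
}
```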
  • Log in to ECR and push
# log in to the ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com

# push the docker image
docker push $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest
  • Now create an EMR Serverless application with that image

Note that you can use different images for your driver and executors with the --worker-type-specifications parameter; see the CLI instructions in the docs.

aws emr-serverless create-application \
    --name spark-seaborn \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{
        "imageUri": "'${ACCOUNT_ID}'.dkr.ecr.'${AWS_REGION}'.amazonaws.com/spark-seaborn:latest"
    }' \
    --initial-capacity '{
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "16GB"
            }
        },
        "EXECUTOR": {
            "workerCount": 3,
            "workerConfiguration": {
                "cpu": "4vCPU",
                "memory": "16GB"
            }
        }
    }'
  • And start the resulting application
# Use the applicationId returned by create-application
APPLICATION_ID=00f7hdef7siki109
aws emr-serverless start-application --application-id ${APPLICATION_ID}
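The application ID above is a literal copied from the `create-application` response. The AWS CLI's global `--query` and `--output text` options can extract a single field so that it can be assigned directly. A minimal sketch, where `aws_field` is a hypothetical wrapper:

```shell
# Hypothetical wrapper: run an AWS CLI command and print one response field as text.
# Usage: aws_field <JMESPath expression> <service> <operation> [options...]
aws_field() {
    local field="$1"
    shift
    aws "$@" --query "$field" --output text
}
```

For example, `APPLICATION_ID=$(aws_field applicationId emr-serverless create-application ...)` with the same options as the `create-application` call above. The same approach should work for the `jobRunId` field returned by `start-job-run`.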
  • Now let's upload our PySpark script and start a job!
aws s3 cp noaa_slugplot.py s3://${S3_BUCKET}/code/pyspark/
aws emr-serverless start-job-run \
    --name noaa-slugplot-seattle \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/noaa_slugplot.py",
            "entryPointArguments": [ "72793524234", "2022", "'${S3_BUCKET}'", "tmp/slugplots/seattle-2022.png" ]
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "s3MonitoringConfiguration": {
                "logUri": "s3://'${S3_BUCKET}'/logs/"
            }
        }
    }'
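  • Wait for the job to finish, re-running this command until the reported state is SUCCESS
# Use the jobRunId returned by start-job-run
JOB_RUN_ID=00f7hqip55r36l09

aws emr-serverless get-job-run \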
    --application-id $APPLICATION_ID \
    --job-run-id $JOB_RUN_ID
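`get-job-run` reports the state once, so waiting means calling it repeatedly. A sketch of a polling loop, assuming SUCCESS, FAILED, and CANCELLED are the terminal job run states; `wait_for_job_run` is a hypothetical helper and the 30-second default interval is arbitrary:

```shell
# Hypothetical helper: poll a job run until it reaches a terminal state.
# Returns 0 on SUCCESS, 1 on FAILED or CANCELLED, 2 if the CLI call itself fails.
wait_for_job_run() {
    local app_id="$1" run_id="$2" interval="${3:-30}" state
    while true; do
        state=$(aws emr-serverless get-job-run \
            --application-id "$app_id" \
            --job-run-id "$run_id" \
            --query 'jobRun.state' --output text) || return 2
        echo "Job run ${run_id}: ${state}" >&2
        case "$state" in
            SUCCESS) return 0 ;;
            FAILED|CANCELLED) return 1 ;;
        esac
        sleep "$interval"
    done
}
```

Called as `wait_for_job_run "$APPLICATION_ID" "$JOB_RUN_ID"`, it prints each observed state to stderr and returns once the run has finished.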
  • Then copy down the resulting file!
aws s3 cp s3://${S3_BUCKET}/tmp/slugplots/seattle-2022.png .
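Because the job sets `logUri`, EMR Serverless writes driver and executor logs under that prefix, which is a good first place to look if the output file is missing. A sketch that builds the driver log location, assuming the layout `applications/<application-id>/jobs/<job-run-id>/SPARK_DRIVER/` beneath the `logs/` prefix used above; `driver_log_uri` is a hypothetical helper:

```shell
# Hypothetical helper: print the S3 URI of a driver log stream (stdout or stderr).
driver_log_uri() {
    local bucket="$1" app_id="$2" run_id="$3" stream="${4:-stdout}"
    echo "s3://${bucket}/logs/applications/${app_id}/jobs/${run_id}/SPARK_DRIVER/${stream}.gz"
}
```

The log can then be streamed with `aws s3 cp "$(driver_log_uri "$S3_BUCKET" "$APPLICATION_ID" "$JOB_RUN_ID" stderr)" - | gunzip`.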

Cleanup

  • Stop and delete the application and ECR repository
aws emr-serverless stop-application --application-id ${APPLICATION_ID}
aws emr-serverless delete-application --application-id ${APPLICATION_ID}
aws ecr delete-repository --repository-name spark-seaborn --force
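`stop-application` returns before the application has actually stopped, and `delete-application` is only accepted for an application in the CREATED or STOPPED state, so running the two back to back can fail. A sketch of a wait loop to place between them; `wait_for_application_state` is a hypothetical helper, and the interval and attempt defaults are arbitrary:

```shell
# Hypothetical helper: poll an application until it reaches the target state.
# Returns 0 when reached, 1 on timeout, 2 if the CLI call itself fails.
wait_for_application_state() {
    local app_id="$1" target="$2" interval="${3:-10}" attempts="${4:-60}" state i=0
    while [ "$i" -lt "$attempts" ]; do
        state=$(aws emr-serverless get-application \
            --application-id "$app_id" \
            --query 'application.state' --output text) || return 2
        [ "$state" = "$target" ] && return 0
        i=$((i + 1))
        sleep "$interval"
    done
    echo "Timed out waiting for ${app_id} to reach ${target} (last state: ${state})" >&2
    return 1
}
```

For cleanup this would be `wait_for_application_state "$APPLICATION_ID" STOPPED` after the stop call and before the delete call.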