Custom images are now supported in EMR Serverless, allowing you to use containers to create reproducible data pipelines.
In this example, we add the seaborn library to a custom image and use it to build a weather visualization.
Important
This example is intended to be run in the us-east-1 region because it reads data from the NOAA Global Surface Summary of Day dataset in the Registry of Open Data.
To use custom images in EMR Serverless, you'll need:
- a local installation of Docker to build your image
- an ECR repository to host the resulting image
We'll assume the user you're using can create and update ECR repositories, can create EMR Serverless applications, and has access to the AWS CLI.
Set up some variables to be used throughout, replacing the placeholders in angle brackets with your own values.
AWS_REGION=us-east-1
S3_BUCKET=<your-bucket-name>
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
JOB_ROLE_ARN=arn:aws:iam::${ACCOUNT_ID}:role/<your-emr-serverless-job-role>
We'll follow the docs for customizing an image for EMR Serverless.
- Create an ECR repository to publish to
aws ecr create-repository \
    --repository-name spark-seaborn
- Allow any EMR Serverless application in this account and region to pull the custom image
aws ecr set-repository-policy \
    --repository-name spark-seaborn \
    --policy-text '{
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Emr Serverless Custom Image Support",
          "Effect": "Allow",
          "Principal": {
            "Service": "emr-serverless.amazonaws.com"
          },
          "Action": [
            "ecr:BatchGetImage",
            "ecr:DescribeImages",
            "ecr:GetDownloadUrlForLayer"
          ],
          "Condition": {
            "StringLike": {
              "aws:SourceArn": "arn:aws:emr-serverless:'${AWS_REGION}':'${ACCOUNT_ID}':/applications/*"
            }
          }
        }
      ]
    }'
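The quoting in the policy above closes and reopens the single-quoted string around each variable, which is easy to get wrong when editing. As a sketch of one alternative, the same policy can be rendered from a heredoc. The helper name `ecr_policy_json` is hypothetical and not part of the AWS CLI; the policy content is the one shown above.

```shell
# Sketch only: ecr_policy_json is a hypothetical helper name.
# It prints the same repository policy as above, with the region and
# account ID substituted by the heredoc instead of by quote splicing.
ecr_policy_json() {
  # $1 = AWS region, $2 = AWS account ID
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Emr Serverless Custom Image Support",
      "Effect": "Allow",
      "Principal": { "Service": "emr-serverless.amazonaws.com" },
      "Action": [
        "ecr:BatchGetImage",
        "ecr:DescribeImages",
        "ecr:GetDownloadUrlForLayer"
      ],
      "Condition": {
        "StringLike": {
          "aws:SourceArn": "arn:aws:emr-serverless:$1:$2:/applications/*"
        }
      }
    }
  ]
}
EOF
}
```

With that in place, the policy could be applied with `aws ecr set-repository-policy --repository-name spark-seaborn --policy-text "$(ecr_policy_json "$AWS_REGION" "$ACCOUNT_ID")"`.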
- Build the image
docker build . -t $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest
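The build command expects a Dockerfile in the current directory, which isn't shown here. A minimal sketch of one follows, written from the shell so it can sit alongside the other commands. The base image URI and the `pip3` step are assumptions based on the pattern in the EMR Serverless custom image docs, and `write_dockerfile` is a hypothetical helper name; adjust the release label to match your application.

```shell
# Sketch only: write_dockerfile is a hypothetical helper name.
# The base image URI below is an assumption; it should match the
# release label you pass to create-application (emr-6.9.0 here).
write_dockerfile() {
  # $1 = output path (defaults to ./Dockerfile)
  cat > "${1:-Dockerfile}" <<'EOF'
FROM public.ecr.aws/emr-serverless/spark/emr-6.9.0:latest

USER root

# Add the extra Python library used by the job
RUN pip3 install seaborn

# EMR Serverless runs the image as the hadoop user
USER hadoop:hadoop
EOF
}
```

Calling `write_dockerfile` before `docker build` creates the file in the current directory. If your script imports other packages, they would go on the same `pip3 install` line.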
- Log in to ECR and push
# log in to the ECR registry
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com
# push the docker image
docker push $ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/spark-seaborn:latest
- Now create an EMR Serverless application with that image
Note that you can use different images for your driver and executors with the worker-type-specifications parameter; see the CLI instructions in the docs.
aws emr-serverless create-application \
    --name spark-seaborn \
    --release-label emr-6.9.0 \
    --type SPARK \
    --image-configuration '{
      "imageUri": "'${ACCOUNT_ID}'.dkr.ecr.'${AWS_REGION}'.amazonaws.com/spark-seaborn:latest"
    }' \
    --initial-capacity '{
      "DRIVER": {
        "workerCount": 1,
        "workerConfiguration": {
          "cpu": "4vCPU",
          "memory": "16GB"
        }
      },
      "EXECUTOR": {
        "workerCount": 3,
        "workerConfiguration": {
          "cpu": "4vCPU",
          "memory": "16GB"
        }
      }
    }'
- And start the resulting application
# Replace with the applicationId returned by create-application
APPLICATION_ID=00f7hdef7siki109
aws emr-serverless start-application --application-id ${APPLICATION_ID}
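Rather than pasting the application ID by hand, you could look it up by name and then wait for the application to finish starting. The sketch below assumes the application name is unique in the account and region. `app_id_by_name` and `wait_for_app_state` are hypothetical helper names, and the loop has no timeout, so treat it as a starting point rather than a finished script.

```shell
# Sketch only: both function names are hypothetical helpers.

# Print the ID of the first application whose name matches $1.
# Prints "None" if no application has that name.
app_id_by_name() {
  aws emr-serverless list-applications \
    --query "applications[?name=='$1'].id | [0]" \
    --output text
}

# Poll until application $1 reaches state $2 (for example STARTED).
# $3 = delay between polls in seconds (defaults to 10).
wait_for_app_state() {
  while :; do
    state=$(aws emr-serverless get-application \
      --application-id "$1" \
      --query application.state --output text) || return 1
    if [ "$state" = "$2" ]; then
      return 0
    fi
    # TERMINATED is final, so give up instead of looping forever.
    if [ "$state" = "TERMINATED" ]; then
      return 1
    fi
    sleep "${3:-10}"
  done
}
```

For example, `APPLICATION_ID=$(app_id_by_name spark-seaborn)` followed by `wait_for_app_state "$APPLICATION_ID" STARTED` would block until the application is ready for jobs.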
- Now let's upload our PySpark script and start a job!
aws s3 cp noaa_slugplot.py s3://${S3_BUCKET}/code/pyspark/
aws emr-serverless start-job-run \
    --name noaa-slugplot-seattle \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
      "sparkSubmit": {
        "entryPoint": "s3://'${S3_BUCKET}'/code/pyspark/noaa_slugplot.py",
        "entryPointArguments": [ "72793524234", "2022", "'${S3_BUCKET}'", "tmp/slugplots/seattle-2022.png" ]
      }
    }' \
    --configuration-overrides '{
      "monitoringConfiguration": {
        "s3MonitoringConfiguration": {
          "logUri": "s3://'${S3_BUCKET}'/logs/"
        }
      }
    }'
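- Wait for the job to finish. get-job-run only reports the current state, so rerun it until the state is SUCCESS
# Replace with the jobRunId returned by start-job-run
JOB_RUN_ID=00f7hqip55r36l09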
aws emr-serverless get-job-run \
    --application-id $APPLICATION_ID \
    --job-run-id $JOB_RUN_ID
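Since get-job-run returns immediately, a small polling loop saves rerunning it by hand. This is a minimal sketch: `wait_for_job_run` is a hypothetical helper name, it only treats SUCCESS, FAILED, and CANCELLED as final states, and it has no overall timeout.

```shell
# Sketch only: wait_for_job_run is a hypothetical helper name.
# Polls the job run until it reaches a final state, prints that state,
# and returns 0 for SUCCESS, 1 for FAILED or CANCELLED, 2 if the
# AWS CLI call itself fails.
wait_for_job_run() {
  # $1 = application ID, $2 = job run ID
  # $3 = delay between polls in seconds (defaults to 30)
  while :; do
    state=$(aws emr-serverless get-job-run \
      --application-id "$1" \
      --job-run-id "$2" \
      --query jobRun.state --output text) || return 2
    case "$state" in
      SUCCESS) echo "$state"; return 0 ;;
      FAILED|CANCELLED) echo "$state"; return 1 ;;
    esac
    sleep "${3:-30}"
  done
}
```

Running `wait_for_job_run "$APPLICATION_ID" "$JOB_RUN_ID"` before the download step would make the copy run only once the job has finished.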
- Then copy down the resulting file!
aws s3 cp s3://${S3_BUCKET}/tmp/slugplots/seattle-2022.png .
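- Stop and delete the application and ECR repository. The application has to reach the STOPPED state before it can be deleted, so the delete may fail if you run it immediately after the stop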
aws emr-serverless stop-application --application-id ${APPLICATION_ID}
aws emr-serverless delete-application --application-id ${APPLICATION_ID}
aws ecr delete-repository --repository-name spark-seaborn --force