AWS Big Data Blog

Simplify your Spark dependency management with Docker in EMR 6.0.0

Apache Spark is a powerful data processing engine that gives data analyst and engineering teams easy-to-use APIs and tools to analyze their data, but it can be challenging for teams to manage their Python and R library dependencies. Installing every dependency that a job may need before it runs and dealing with library version conflicts is time-consuming and complicated. Amazon EMR 6.0.0 simplifies this by allowing you to use Docker images from Docker Hub and Amazon ECR to package your dependencies. This lets you package and manage dependencies for individual Spark jobs or notebooks, without having to maintain a spiderweb of dependencies across your cluster.

This post shows you how to use Docker to manage notebook dependencies with Amazon EMR 6.0.0 and EMR Notebooks. You will launch an EMR 6.0.0 cluster and use notebook-specific Docker images from Amazon ECR with your EMR Notebook.

Creating a Docker image

The first step is to create a Docker image that contains Python 3 and the latest version of the numpy Python package. You create Docker images by using a Dockerfile, which defines the packages and configuration to include in the image. Docker images used with Amazon EMR 6.0.0 must contain a Java Development Kit (JDK); the following Dockerfile uses Amazon Linux 2 and the Amazon Corretto JDK 8:

FROM amazoncorretto:8

RUN yum -y update
RUN yum -y install yum-utils
RUN yum -y groupinstall development

RUN yum list python3*
RUN yum -y install python3 python3-dev python3-pip python3-virtualenv

RUN python -V
RUN python3 -V

ENV PYSPARK_DRIVER_PYTHON python3
ENV PYSPARK_PYTHON python3

RUN pip3 install --upgrade pip
RUN pip3 install numpy 

RUN python3 -c "import numpy as np"

You will use this Dockerfile to create a Docker image, and then tag and upload it to Amazon ECR. After you upload it, you will launch an EMR 6.0.0 cluster that is configured to use this Docker image as the default image for Spark jobs.
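
These steps assume that Docker is installed and running on your local machine and that the AWS CLI is configured with credentials for your account; a quick, optional check for both tools:

    $ docker --version
    $ aws --version

Complete the following steps to build, tag, and upload your Docker image: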

  1. Create a directory and a new file named Dockerfile using the following commands:
    $ mkdir pyspark-latest
    $ vi pyspark-latest/Dockerfile
  2. Copy and paste the contents of the Dockerfile into pyspark-latest/Dockerfile, save it, and run the following command to build a Docker image:
    $ docker build -t 'local/pyspark-latest' pyspark-latest/
  3. Create the emr-docker-examples Amazon ECR repository for this walkthrough using the following command:
    $ aws ecr create-repository --repository-name emr-docker-examples
  4. Tag the locally built image using the following command (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint):
    $ docker tag local/pyspark-latest 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest

    Before you can push the Docker image to Amazon ECR, you need to log in.

  5. To get the login line for your Amazon ECR account, use the following command:
    $ aws ecr get-login --region us-east-1 --no-include-email
  6. Enter and run the output from the get-login command:
    $ docker login -u AWS -p <password> https://123456789123.dkr.ecr.us-east-1.amazonaws.com 
  7. Upload the locally built image to Amazon ECR using the following command (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint):
    $ docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest

After this push is complete, the Docker image is available to use with your EMR cluster.
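
If you want to confirm that the push succeeded, you can list the image tags in the repository with the AWS CLI; this is a quick, optional sanity check:

    $ aws ecr describe-images --repository-name emr-docker-examples --query 'imageDetails[].imageTags'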

Launching an EMR 6.0.0 cluster with Docker enabled

To use Docker with Amazon EMR, you must launch your EMR cluster with Docker runtime support enabled and have the right configuration in place to connect to your Amazon ECR account. To allow your cluster to download images from Amazon ECR, make sure the instance profile for your cluster has the permissions from the AmazonEC2ContainerRegistryReadOnly managed policy associated with it. The configuration in the first step below configures your EMR 6.0.0 cluster to use Amazon ECR to download Docker images, and configures Apache Livy and Apache Spark to use the pyspark-latest Docker image as the default Docker image for all Spark jobs.
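
If you launch the cluster with the default EMR roles (as in the create-cluster command below), a minimal sketch of attaching that managed policy with the AWS CLI looks like the following; EMR_EC2_DefaultRole is the default instance profile role name, so substitute your own role if you use a custom instance profile:

    $ aws iam attach-role-policy \
        --role-name EMR_EC2_DefaultRole \
        --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly

Complete the following steps to launch your cluster: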

  1. Create a file named emr-configuration.json in the local directory with the following configuration (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint):
    [
       {
          "Classification":"container-executor",
          "Properties":{
    
          },
          "Configurations":[
             {
                "Classification":"docker",
                "Properties":{
                   "docker.privileged-containers.registries":"local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com",
                   "docker.trusted.registries":"local,centos,123456789123.dkr.ecr.us-east-1.amazonaws.com"
               }
             }
          ]
       },
       {
          "Classification":"livy-conf",
          "Properties":{
             "livy.spark.master":"yarn",
             "livy.spark.deploy-mode":"cluster",
             "livy.server.session.timeout":"16h"
          }
       },
       {
          "Classification":"hive-site",
          "Properties":{
             "hive.execution.mode":"container"
          }
       },
       {
          "Classification":"spark-defaults",
          "Properties":{
             "spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE":"docker",
             "spark.yarn.am.waitTime":"300s",
             "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE":"docker",
             "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
             "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest",
             "spark.executor.instances":"2",
             "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_CLIENT_CONFIG":"hdfs:///user/hadoop/config.json",
             "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE":"123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:pyspark-latest"
          }
       }
    ]

    You will use that configuration to launch your EMR 6.0.0 cluster using the AWS CLI.

  2. Enter the following commands (replace myKey with the name of the EC2 key pair you use to access the cluster using SSH, and subnet-1234567 with the subnet ID the cluster should be launched in):
    aws emr create-cluster \
    --name 'EMR 6.0.0 with Docker' \
    --release-label emr-6.0.0 \
    --applications Name=Livy Name=Spark \
    --ec2-attributes "KeyName=myKey,SubnetId=subnet-1234567" \
    --instance-type m5.xlarge --instance-count 3 \
    --use-default-roles \
    --configurations file://./emr-configuration.json

    After the cluster launches and is in the Waiting state, make sure that the cluster hosts can authenticate themselves to Amazon ECR and download Docker images.

  3. Use your EC2 key pair to SSH into one of the core nodes of the cluster.
  4. To generate the Docker CLI command to create the credentials (valid for 12 hours) the cluster uses to download Docker images from Amazon ECR, enter the following command:
    $ aws ecr get-login --region us-east-1 --no-include-email
  5. Enter and run the output from the get-login command:
    $ sudo docker login -u AWS -p <password> https://123456789123.dkr.ecr.us-east-1.amazonaws.com 

    This command generates a config.json file in the /root/.docker folder.

  6. Place the generated config.json file in the HDFS location /user/hadoop/ using the following command:
    $ sudo hdfs dfs -put /root/.docker/config.json /user/hadoop/
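
    To confirm the file landed where the Spark configuration expects it (the spark-defaults classification above points to hdfs:///user/hadoop/config.json), you can run a quick, optional check on the same node:
    $ hdfs dfs -ls /user/hadoop/config.json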

Now that you have an EMR cluster that is configured with Docker and an image in Amazon ECR, you can use EMR Notebooks to create and run your notebook.

Creating an EMR Notebook

EMR Notebooks are serverless Jupyter notebooks available directly through the Amazon EMR console. They allow you to separate your notebook environment from your underlying cluster infrastructure, and access your notebook without spending time setting up SSH access or configuring your browser for port-forwarding. You can find EMR Notebooks in the left-hand navigation of the EMR console.

To create your notebook, complete the following steps:

  1. Click Notebooks in the EMR console.
  2. Choose a name for your notebook.
  3. Click Choose an existing cluster and select the cluster you just created.
  4. Click Create notebook.
  5. Once your notebook is in a Ready status, click the Open in JupyterLab button to open it in a new browser tab. A default notebook with the name of your EMR Notebook is created automatically. When you click on that notebook, you’ll be asked to choose a kernel. Choose PySpark.
  6. Enter the following configuration into the first cell in your notebook and click ▸(Run):
    %%configure -f
    {"conf": { "spark.pyspark.virtualenv.enabled": "false" }}
  7. Enter the following PySpark code into your notebook and click ▸(Run):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
    sc = spark.sparkContext
    
    import numpy as np
    a = np.arange(15).reshape(3, 5)
    print(a)
    print(np.__version__)

The output shows that the numpy version in use is the latest (at the time of this writing, 1.18.2).

This PySpark code was run on your EMR 6.0.0 cluster using YARN, Docker, and the pyspark-latest image that you created. EMR Notebooks connect to EMR clusters using Apache Livy. The configuration specified in emr-configuration.json configured your EMR cluster’s Spark and Livy instances to use Docker and the pyspark-latest Docker image as the default Docker image for all Spark jobs submitted to this cluster. This allows you to use numpy without having to install it on any cluster nodes.
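
These defaults are not limited to notebooks: because the spark-defaults classification in emr-configuration.json sets the Docker runtime properties cluster-wide, a Spark job submitted from the master node would also run its driver and executors in pyspark-latest containers. As a minimal sketch (test.py is a hypothetical PySpark script you have copied to the master node):

    $ spark-submit --master yarn --deploy-mode cluster test.py

The following section looks at how you can create and use different Docker images for specific notebooks.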

Using a custom Docker image for a specific notebook

Individual workloads often require specific versions of library dependencies. To allow individual notebooks to use their own Docker images, you first create a new Docker image and push it to Amazon ECR. You then configure your notebook to use this Docker image instead of the default pyspark-latest image.

Complete the following steps:

  1. Create a new Dockerfile with a specific version of numpy (1.17.5):
    FROM amazoncorretto:8
    
    RUN yum -y update
    RUN yum -y install yum-utils
    RUN yum -y groupinstall development
    
    RUN yum list python3*
    RUN yum -y install python3 python3-dev python3-pip python3-virtualenv
    
    RUN python -V
    RUN python3 -V
    
    ENV PYSPARK_DRIVER_PYTHON python3
    ENV PYSPARK_PYTHON python3
    
    RUN pip3 install --upgrade pip
    RUN pip3 install 'numpy==1.17.5'
    
    RUN python3 -c "import numpy as np"
  2. Create a directory and a new file named Dockerfile using the following commands:
    $ mkdir numpy-1-17
    $ vi numpy-1-17/Dockerfile
  3. Copy the contents of your new Dockerfile into numpy-1-17/Dockerfile, then run the following command to build the Docker image:
    $ docker build -t 'local/numpy-1-17' numpy-1-17/
  4. Tag and upload the locally built image to Amazon ECR using the following commands (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint):
    $ docker tag local/numpy-1-17 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17 
    $ docker push 123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17

    Now that the numpy-1-17 Docker image is available in Amazon ECR, you can use it with a new notebook.

  5. Create a new notebook by returning to your EMR Notebook, clicking File, New, Notebook, and choosing the PySpark kernel.

    To tell your EMR Notebook to use this specific Docker image instead of the default, you need to use the following configuration parameters.

  6. Enter the following code in your notebook (replace 123456789123.dkr.ecr.us-east-1.amazonaws.com with your Amazon ECR endpoint) and choose ▸(Run):
    %%configure -f
    {"conf": {
      "spark.pyspark.virtualenv.enabled": "false",
      "spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17",
      "spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE": "123456789123.dkr.ecr.us-east-1.amazonaws.com/emr-docker-examples:numpy-1-17"
    }}

    To check if your PySpark code is using version 1.17.5, enter the same PySpark code as before to use numpy and output the version.

  7. Enter the following code into your notebook and choose ▸(Run):
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
    sc = spark.sparkContext
    
    import numpy as np
    a = np.arange(15).reshape(3, 5)
    print(a)
    print(np.__version__)

The output shows that the numpy version in use is the version you installed in your numpy-1-17 Docker image: 1.17.5.

Summary

This post showed you how to simplify your Spark dependency management using Amazon EMR 6.0.0 and Docker. You created a Docker image to package your Python dependencies, created a cluster configured to use Docker, and used that Docker image with an EMR Notebook to run PySpark jobs. To find out more about using Docker images with EMR, refer to the EMR documentation on how to Run Spark Applications with Docker Using Amazon EMR 6.0.0. Stay tuned for additional updates on new features and further improvements with Apache Spark on Amazon EMR.

About the Authors

Paul Codding is a senior product manager for EMR at Amazon Web Services.

Suthan Phillips is a big data architect at AWS. He works with customers to provide them architectural guidance and helps them achieve performance enhancements for complex applications on Amazon EMR. In his spare time, he enjoys hiking and exploring the Pacific Northwest.