Using Docker for machine learning projects and Jupyter notebooks.
Docker has become a critical part of modern software development: it lets you separate and isolate applications as you develop them. As a data scientist, it is common to rely on virtual environment tools like virtualenv, but Docker lets you go beyond prototyping and build production-grade applications. While virtual environments are excellent for quick development work, Docker makes it much easier to collaborate with colleagues and to deploy data science applications to the cloud. In this blog we will cover the Docker fundamentals you need for data science and machine learning development.
There are many resources for learning Docker, but when I was first starting out, I wanted a small tutorial that got me working with a Dockerfile quickly rather than going through a lot of theory. My hope is that this blog can be that tutorial. Anyway, let's get started.
Here are a few reasons why Docker is great for data science applications:
- Isolate applications: You can use both conda and pip to install libraries. This gives you a lot of flexibility in which libraries you use in a particular project. For example, I recently wanted to use Facebook's Prophet package for time series forecasting; it was an easy install via conda, but for ML libraries such as TensorFlow I prefer pip, and with Docker it was really easy to use both (see the sketch after this list). Docker also lets you use any operating system you like: you can build on different versions of Ubuntu or Alpine, and even Windows. With virtual environments, you have to use the host OS.
- Collaborate better: Usually in Python projects, you need a different virtual environment for each project: you install from requirements.txt and then follow project-specific steps if needed. With Docker, the installation steps are always similar. Using docker-compose lets your users just run docker-compose up to install all the requirements and set everything up for you, and this step stays the same irrespective of machine or OS. That makes Docker great for collaborating with colleagues and for open source projects!
- Deployment to cloud services like AWS becomes easier when everything is encapsulated in a Dockerfile. You can push the image to AWS Elastic Container Registry (ECR) and use the container in multiple places. Say you want to deploy the model to AWS SageMaker: you can do so using the same Dockerfile, and AWS deploys it for you. If you are deploying a Flask web app, you can use a Dockerfile to deploy it on AWS as well. With virtual environments, you would need quite a few extra steps to make sure the environments match.
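To make the conda-plus-pip point concrete, here is a minimal sketch of a Dockerfile that uses both. The base image and package names are illustrative (for instance, conda-forge has shipped Prophet as both fbprophet and prophet depending on the version):
# Sketch: a base image with conda lets you mix conda and pip installs
FROM continuumio/miniconda3
RUN conda install -y -c conda-forge prophet
RUN pip install tensorflow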
To get started, download Docker from here for Windows, Linux or Mac. Also, create an account at Docker Hub: hub.docker.com. Docker Hub is like GitHub for images: you can publish your own and use other people's. For example, you can see a bunch of official Python images here. Usually, you begin with an image from Docker Hub and build on top of it for your own specific requirements.
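Once it is installed, you can verify that the Docker CLI and daemon are running from your terminal:
# Verify the installation
docker --version
docker info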
As this is a hands-on guide, let's start by creating a Dockerfile. You can either work with an existing project or create a new project folder. We will go through a workflow in scripting mode; if you prefer working with Jupyter notebooks, we will cover them in a separate section at the end.
A Dockerfile is a text file that contains all the commands needed to build a Docker image. A Docker image is built from a series of layers, and each layer represents an instruction in the image's Dockerfile. Whenever an instruction changes, rebuilding the image re-builds that layer and every layer after it. This is why frequently changing instructions, like copying your source code, usually come last: the expensive install layers can then be reused from cache.
Let's write our Dockerfile! Create a new Dockerfile using vim Dockerfile with the following contents (pin whichever library versions your project needs):
FROM python:3.8-slim-buster
RUN mkdir /app
WORKDIR /app
RUN pip install numpy==1.19.4 pandas scikit-learn tensorflow seaborn
COPY . .
Here we are doing the following steps:
1. Building from Docker Hub's official Python 3.8 Slim Buster image. There are many official Python images available; Slim Buster is a good trade-off between size, performance and features for most cases. Read more here.
2. Creating a new app folder inside the Docker container.
3. Making app our work folder; by default this will be the location of all our commands from this point on.
4. Using pip install to download the libraries we need for our project. These are common libraries for a machine learning project: Numpy and Pandas for data exploration, Scikit-Learn for data scaling, shallow modeling, feature selection and metrics, TensorFlow for deep learning models, and finally Seaborn for data visualizations.
5. Copying our local files from the current directory into the app folder inside the container.
A Docker image is a read-only template that contains the source code, libraries, dependencies, tools, and other files needed for an application to run. It is built as a set of layers.
The basic idea of a docker command from the Command Line is:
docker <management command> <command>
Important commands to know for Docker images are:
# List all docker images
docker image ls
# Build an image named "hello" from the Dockerfile in this directory
docker image build -t hello .
# Delete the image named "hello"
docker image rm hello
As per the above, let us now run docker image build -t hello . inside the folder with the Dockerfile. It will go through several steps, installing all the libraries and their dependencies. Finally, it will show:
Successfully built <IMAGE_ID>
Successfully tagged hello:latest
If we only wanted the Python 3.8 image without the additional commands, we could run:
docker image pull python:3.8-slim-buster
Docker containers are running instances of Docker images: containers run the actual applications. A container includes an application and all of its dependencies. It shares the kernel with other containers and runs as an isolated process on the host OS. Images can exist without containers, whereas a container needs an image to exist. You can have many running containers from the same image.
Now let's run the Docker image you have built:
- Let's run the image and get into a bash shell inside the Docker container: docker container run -it hello bash. You need to specify -it to interact with the container through a terminal shell.
- The above command does not keep our local files in sync with the Docker container. This is really important: we want to write code in a text editor like Atom or Sublime Text and have that code show up inside the Docker container without rebuilding the container all the time!
To do this, you need to use docker container run -it -v $PWD:/app hello bash. This will start a bash shell inside Docker and mount your current directory at /app in the container, so files stay in sync between your local machine and the Docker container (see the quick test after this list).
- We will use docker-compose in a later step so that you don't have to remember this long command!
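You can convince yourself that the mount works with a quick test: create a file from inside the container and it appears in your local project folder (the filename here is just an example).
# Inside the container's bash shell:
echo "hello from docker" > test_from_container.txt
# Back on your machine, test_from_container.txt now exists in your project folder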
Important commands to know:
# Run the container from the previously built "hello" image.
# It will take you to a bash shell inside the docker container
# where you can run commands and scripts.
docker container run -it -v $PWD:/app hello bash
# List all running docker containers
docker container ls
# List all docker containers, even stopped ones
docker container ls -a
If we only wanted to run a few bash commands inside a Python 3.8 container, we could run:
docker container run -it python:3.8-slim-buster bash
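For example, once inside the shell you can sanity-check the interpreter before exiting:
# Inside the container:
python --version
exit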
The following image helps illustrate the difference between a Dockerfile, a Docker image and a Docker container:
Docker Compose is a tool for defining and running multi-container Docker applications. Even if you are running a single application, it is useful so that you don't have to remember all the different volume arguments, port settings, etc.
To create a docker-compose file, run vim docker-compose.yml inside the same folder as the Dockerfile.
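A minimal version looks something like this (the hello service name and the /app volume mount match the commands we use below):
version: "3"
services:
  hello:
    build: .
    volumes:
      - ".:/app"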
This docker-compose.yml is a YAML file that tells Docker how to build various services from different Dockerfiles and how to configure the image names, environment variables, volumes and ports.
To run this file, do docker-compose build hello; this will build the hello image like before. Then docker-compose run hello bash will run the container and set up the volumes etc. This command replaces the entire docker container run -it -v $PWD:/app hello bash command from the previous section.
Important commands to know:
# Build the docker service "hello"
docker-compose build hello
# Run a container from the "hello" image and open a bash shell
docker-compose run hello bash
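You can also run one-off commands through compose without opening a shell. For example, with a hypothetical train.py script in your project:
# Run a single script inside the container and exit
docker-compose run hello python train.py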
Now you can code away and analyze and run data science code inside a Docker container!
As you build more Docker images and containers, you need ways to manage all of these on your system. These commands will help you navigate the Docker world. Let's learn about some docker system commands:
- docker system info gives you all the information about your Docker installation: the number of containers, images, etc.
- docker system df shows how much space you can reclaim by deleting unused containers, images, etc.
- docker system prune, followed by 'Y', cleans up all stopped containers and all dangling images, which is really useful. If you add -a to the command, it cleans up all unused images, not just dangling ones.
If you prefer working with Jupyter notebooks instead, it is a little more tricky, but we can build on our knowledge from the previous sections. You will have to modify your Dockerfile and your docker-compose file as well.
# Dockerfile for working with Jupyter notebooks
FROM python:3.8-slim-buster
# This is different from usual
WORKDIR /home/notebooks
RUN pip install numpy==1.19.4 notebook
COPY . .
EXPOSE 8888
ENTRYPOINT ["jupyter", "notebook", "--ip=0.0.0.0", "--allow-root", "--no-browser"]
Create the above Dockerfile in your existing project and then change your docker-compose file to mount the current directory at /home/notebooks and publish Jupyter's port (the surrounding keys mirror the compose file from before):
services:
  hello:
    build: .
    volumes:
      - ".:/home/notebooks" # This is different from usual
    ports:
      - "8888:8888"
Here are the commands to run:
# Build the docker service "hello"
docker-compose build hello
# Create a Docker container from the image "hello".
# It will use the ENTRYPOINT in the Dockerfile to start a Jupyter notebook server.
docker-compose up hello
As before, you run the build command. But instead of invoking bash, you just run up, and using the ENTRYPOINT above, Docker will start a Jupyter notebook server that you can access locally at http://127.0.0.1:8888/?token=<token>. Now you can Jupyter away in a Docker container using the browser on your computer.
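If you miss the tokenized URL in the startup output, you can find it again in the service logs:
# Jupyter prints the URL with the token in its startup logs
docker-compose logs hello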
Congratulations! The above workflow will hopefully be very useful in your data science development. Let's quickly recap what is possible when you combine Docker and data science projects:
Docker use cases for data science/ML development:
1. You can share your docker-compose files with other data scientists for ease of working together.
2. You can push this image to AWS ECR and deploy it as a SageMaker endpoint. This allows you to serve your model as an endpoint used by, say, the users of your website. I have written a blog detailing these steps here (a rough sketch of the ECR push is shown after this list).
3. You can run these containers on AWS ECS Fargate to execute a machine learning model periodically. If you do not want to expose a model to end users but rather run it periodically over, say, all the new users that came in this month, you can use this approach.
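As a rough sketch of use case 2, pushing the image to ECR looks something like this with AWS CLI v2 (the region, <account_id> and repository name are placeholders, and the ECR repository must already exist):
# Authenticate the Docker CLI against your ECR registry
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <account_id>.dkr.ecr.us-east-1.amazonaws.com
# Tag the local image with the ECR repository URI and push it
docker tag hello:latest <account_id>.dkr.ecr.us-east-1.amazonaws.com/hello:latest
docker push <account_id>.dkr.ecr.us-east-1.amazonaws.com/hello:latest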
This concludes this hands-on guide on using Docker for data science.
Check out my PODCAST! I have started a podcast called "The Data Life Podcast". If you already deal with data in your job, or want to get better at machine learning or data science, this is the podcast for you. The unique thing here is that I cover the life aspects as well, with a recent episode on Women in Tech. You can listen to it on Apple Podcasts, Spotify, or Overcast.
If you have any questions, drop me a note at my LinkedIn profile!