Let’s get straight to the point: the PyTorch Operator. The PyTorch Operator is the implementation of the PyTorchJob
custom resource definition. Using this custom resource, users can create and manage PyTorch jobs like other built-in resources in Kubernetes (e.g., Deployments and Pods).
We will pick it up from where we left off last time and transform our solution to the RANZCR CLiP Kaggle challenge to run as a PyTorchJob.
As you will see, it makes things a lot easier! You can follow along with the code in this repo; use the feature-pytorch-operator
branch.
The code
So, without further ado, let’s start with the main function.
Now, this function runs as is in every process. The PyTorch Operator is responsible for distributing the code to the different pods. It is also responsible for coordinating the processes through a master process.
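The gist embedded in the original post is not reproduced here, but a minimal sketch of such a main function could look like the following. The model is a stand-in, not the repo's actual code; the key point is that the operator injects MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE into every pod, so the default environment-based initialization just works:

```python
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # The PyTorch Operator sets MASTER_ADDR, MASTER_PORT, RANK, and
    # WORLD_SIZE in every pod, so the default "env://" init method works.
    dist.init_process_group(backend="gloo")  # use "nccl" for GPU training

    model = nn.Linear(10, 2)  # stand-in for the real RANZCR CLiP model
    model = DDP(model)        # gradients are now synchronized across replicas

    # ... build the dataloaders and run the usual train() loop here ...
    dist.destroy_process_group()
    return model
```

Running the exact same script in every pod, with only the injected environment variables differing, is what makes the operator approach so convenient.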
Indeed, all you need to do differently is initialize the process group on line 50 and wrap your model within a DistributedDataParallel class on line 65. Then, on line 70, you start your familiar training loop. So, let’s see what’s inside the train
function.
If you’ve ever written PyTorch code before, you won’t find anything done differently here. It’s the same old training procedure. That’s all! However, you should still pay attention to how you split the dataset, using the DistributedSampler
, as we saw in the last article.
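As a sketch of that idea (the dataset, model, and hyperparameters below are illustrative stand-ins, not the repo's actual code), the DistributedSampler hands each process a disjoint shard of the data, and the training loop itself stays entirely ordinary:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def make_loader(dataset, rank, world_size, batch_size=32):
    # Each process sees a disjoint shard of the dataset, so no sample
    # is processed twice within one epoch.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def train(model, loader, optimizer, criterion, epoch):
    # Reshuffle the shards differently on every epoch.
    loader.sampler.set_epoch(epoch)
    model.train()
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
```

Note the `set_epoch` call: without it, every epoch would iterate the shards in the same order.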
I will refer you to GitHub to see what every utility function does, but we have the basics covered here!
The configuration
The last part is the configuration. To start our training process on Kubernetes, we need to containerize our application and write a few lines of YAML.
To turn our code into a container image, we need a Dockerfile. As it turns out, this is pretty straightforward: copy your code and run the file that contains the main
function.
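The Dockerfile from the repo is not embedded here, but a minimal sketch could look like this — the base image tag and the `train.py` / `requirements.txt` file names are assumptions for illustration:

```dockerfile
# Base image with PyTorch preinstalled (tag is an assumption)
FROM pytorch/pytorch:latest

WORKDIR /app

# Install dependencies, then copy the training code
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .

# Run the file that contains the main function (hypothetical name)
ENTRYPOINT ["python", "train.py"]
```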
Similarly, the YAML configuration is fairly simple as well.
We specify that we want to create a PyTorchJob
custom resource with one master and one worker. As you can see, in this example, we have two GPUs available: we allocate one to the master and one to the worker. If you have more GPUs, you can bump up the number of worker replicas. Finally, you need to specify the container image that holds your code and any arguments you would like to pass.
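The original manifest is not shown here, but a PyTorchJob spec along these lines would match the description above; the job name and image are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ranzcr-clip          # placeholder job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch   # the operator expects this container name
              image: <your-registry>/ranzcr-clip:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 1             # bump this up if you have more GPUs
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-registry>/ranzcr-clip:latest
              resources:
                limits:
                  nvidia.com/gpu: 1
```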
That’s it! All you need to do now is apply the configuration file:
kubectl apply -f torch_operator.yaml
To get started with the PyTorch Operator, we need Kubeflow, an open-source project dedicated to making deployments of ML workflows simple, portable, and scalable. From the documentation:
The Kubeflow project is dedicated to making deployments of machine learning (ML) workflows on Kubernetes simple, portable and scalable. Our goal is not to recreate other services, but to provide a straightforward way to deploy best-of-breed open-source systems for ML to diverse infrastructures. Anywhere you are running Kubernetes, you should be able to run Kubeflow.
But how do we start with Kubeflow? Do we need a Kubernetes cluster? Should we deploy the whole thing ourselves? I mean, have you looked at Kubeflow’s manifest repo?
Don’t panic; in the end, all we need to experiment with Kubeflow is a GCP or AWS account! We’re going to use MiniKF. MiniKF is a single-node Kubeflow instance that comes with many great features pre-installed. Specifically:
- Kale: An orchestration and workflow tool for Kubeflow that enables you to run complete data science workflows starting from a notebook.
- Arrikto Rok: A data versioning system that supports reproducibility, caching, model lineage, and much more.
So, to install Kubeflow on GCP, follow the guide I provide below:
Or, if you prefer AWS:
Deep Neural Networks (DNNs) have been the main force behind most of the recent advances in Machine Learning. Breakthroughs like these are largely due to the amount of data at our disposal, which increases the need to scale out the training process to more computational resources.
At the same time, the field of DevOps is gaining traction. Kubernetes is ubiquitous; monolithic legacy systems are breaking up into smaller microservices that are easier to maintain.
How can we bring the two worlds together? In this story, we examined how to solve a real-world use case using the PyTorch Operator running on MiniKF.
My name is Dimitris Poulopoulos, and I’m a machine learning engineer working for Arrikto. I have designed and implemented AI and software solutions for major clients such as the European Commission, Eurostat, IMF, the European Central Bank, OECD, and IKEA.
If you are interested in reading more posts about Machine Learning, Deep Learning, Data Science, and DataOps, follow me on Medium, LinkedIn, or @james2pl on Twitter. Also, visit the resources page on my website, a place for great books and top-rated courses, to start building your own Data Science curriculum!