Introduction
Provectus is excited to introduce Swiss Army Kube for Kubeflow (SAKK) — a new open-source tool for Kubeflow deployment on Amazon EKS. SAKK makes it easier for ML engineers using Amazon EKS to bring Kubeflow to Kubernetes and to configure, serve, and manage ML-ready clusters at scale. It allows for the setup and deployment of infrastructure in a declarative, modular way, using Terraform and the Infrastructure as Code (IaC) approach, and for automated cluster management with GitOps CI/CD. For more information, check out the SAKK documentation on GitHub.
Why Deploy Amazon EKS with Kubeflow at Scale?
Many companies that work on ML/AI projects (whether client-facing or in-house) often have to deploy multiple similar Kubeflow clusters, and then manage and augment them with new resources as their training and inference workloads grow. When several teams work on a project, they may also need isolated access to Kubeflow pipelines for efficiency and security reasons.
While you can have several pipelines in a single Kubeflow instance, there is currently no way in Kubeflow to grant isolated, secure access to teams working on a single Kubeflow cluster. Also, each Kubeflow instance requires a separate Amazon EKS cluster. The existing ways to set up and deploy a Kubeflow EKS cluster require working with several native tools for Kubeflow and Kubernetes that don’t allow you to add non-native resources to the deployed clusters later.
For example, if your company works on a number of ML/AI projects, it needs several clusters with Kubeflow instances (dev, test, and production environments, client demo stands, etc.) for each project, and often for each project team. All these deployments require you to install several CLI tools, configure them separately to describe all the Kubernetes and Kubeflow resources you want to deploy, and repeat the process for each of your clusters.
This means you will likely have to allocate highly qualified engineers to handle ML infrastructure deployment and maintenance. They will set up multiple cluster configurations either manually or by writing a bunch of bash scripts. The same goes for further cluster support and upgrades.
Swiss Army Kube for Kubeflow
Swiss Army Kube for Kubeflow bridges the gap by offering a straightforward blueprint based on the best DevOps practices. It spares you the effort required to separately deploy and manage Kubeflow, Amazon EKS, and other resources you may need to bring to your cluster for your desired ML workflow.
The example folder of the SAKK repository serves as a ready-made project template. All you need to do is check out the branch and set it up each time you want to deploy Amazon EKS with Kubeflow. Your ML team can configure the cluster with everything required at once by setting variables in a single main.tf file at the root of the repository, and deploy it with a couple of Terraform commands. You can repeat this simple process as many times as you want to get multiple ML clusters up and running in minutes, using Terraform as a single entry point.
After deployment, you can manage clusters with the ArgoCD CLI or UI and store their state in a GitHub repository with an organized CI/CD pipeline. With SAKK, you can enjoy all the portability and standardization benefits of Kubernetes and Amazon EKS, but also add resources to your clusters that go beyond the restrictions of native Kubeflow, Kubernetes, or Amazon EKS CLI tools, without writing custom code.
SAKK’s approach to cluster configuration and deployment does not require extensive knowledge of DevOps tools, freeing your DevOps engineers for more important tasks. It significantly reduces the time, resources, and costs associated with deploying and maintaining ML infrastructure, and with training people to manage it.
Cluster Automation with ArgoCD
SAKK uses ArgoCD to automate application states and enable the GitOps approach: the state of your cluster is described as code and stored in a GitHub repository and an S3 bucket.
Advantage of Terraform
SAKK uses Terraform to unify and standardize cloud infrastructure deployment with IaC, a go-to standard and one of the current best practices for ML DevOps (MLOps). Terraform is the leading tool in this space. The HCL (HashiCorp Configuration Language) syntax of Terraform configurations is easy to learn and is a better alternative to ad-hoc scripts for managing clusters. Moreover, Terraform has great documentation and a vibrant community.
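To give a taste of HCL’s declarative style, here is a minimal, purely illustrative snippet (the bucket name and tags are placeholders, not part of SAKK) that declares an S3 bucket of the kind that could hold Terraform state:

```hcl
# Illustrative example of HCL's declarative syntax:
# an S3 bucket that could store Terraform state files.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "my-terraform-states" # placeholder name

  tags = {
    Purpose = "terraform-state"
  }
}
```

Terraform reads such declarations, compares them with the real infrastructure, and computes a plan to reconcile the two.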
Built-In Identity Management with Amazon Cognito
SAKK uses Amazon Cognito user pools for identity management on AWS. In the tutorial below, Cognito is used to create a secure environment with all access permissions managed in one place. However, SAKK doesn’t vendor-lock you: it can work with any other identity provider.
Deploying Kubeflow on Amazon EKS using Swiss Army Kube is very straightforward. Aside from the prerequisites, it takes just two steps:
- Configure your cluster deployment (set up ~5 variables in one file)
- Deploy your cluster with two Terraform commands (init and apply)
After that, you will get a cluster ready for access and further management.
Prerequisites
1. For this short tutorial, you need an AWS account with an IAM user and the AWS CLI installed. If you don’t have them yet, please use these official guides from AWS:
2. Next, install Terraform using this official guide:
3. Fork and clone the Swiss Army Kube for Kubeflow official repository:
That’s it! Now let’s configure and deploy the Amazon EKS Kubeflow cluster.
1. Configure Cluster Deployment
You set up your cluster in a single Terraform file: main.tf. The minimal set of variables to configure is the following:
- cluster_name (name of your cluster)
- mainzoneid (main Route53 zone id)
- domains (names of endpoint domains)
- admin_arns (ARNs of users who will have admin permissions)
- cert_manager_email (email for LetsEncrypt notifications)
- cognito_users (list of users for the Cognito Pool)
Example configuration of main.tf:
terraform {
  backend "s3" {}
}

module "sak_kubeflow" {
  source = "git::https://github.com/provectus/sak-kubeflow.git?ref=init"

  cluster_name = "simple"

  owner      = "github-repo-owner"
  repository = "github-repo-name"
  branch     = "branch-name"

  # Main Route53 zone id, if it exists (change it)
  mainzoneid = "id-of-route53-zone"

  # Names of domains aimed for endpoints
  domains = ["sandbox.some.domain.local"]

  # ARNs of users who will have admin permissions
  admin_arns = [
    {
      userarn  = "arn:aws:iam::<aws-account-id>:user/<username>"
      username = "<username>"
      groups   = ["system:masters"]
    }
  ]

  # Email that will be used for LetsEncrypt notifications
  cert_manager_email = "info@some.domain.local"

  # An optional list of users for the Cognito Pool
  cognito_users = [
    {
      email    = "qa@some.domain.local"
      username = "qa"
      group    = "masters"
    },
    {
      email    = "developer@some.domain.local"
      username = "developer"
    }
  ]

  argo_path_prefix = "examples/simple/"
  argo_apps_dir    = "argocd-applications"
}
In most cases, you’ll also need to override the variables related to your GitHub repository (repository, branch, owner) in main.tf.

Next, you might want to configure backend.hcl, which defines where the Terraform state is stored. Example configuration of backend.hcl:
bucket = "bucket-with-terraform-states"
key = "some-key/kubeflow-sandbox"
region = "region-where-bucket-placed"
dynamodb_table = "dynamodb-table-for-locks"
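Because main.tf declares an empty backend "s3" {} block, Terraform expects the backend settings to be supplied at initialization time. With a partial backend configuration like this, the init step is typically run as follows (shown here as an assumption about the intended workflow):

```shell
# Pass the partial backend configuration explicitly at init time
terraform init -backend-config=backend.hcl
```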
2. Deploy Your AWS EKS Kubeflow Cluster
Deploy the cluster you’ve just configured with the following Terraform commands:
terraform init
terraform apply
aws --region <region> eks update-kubeconfig --name <cluster-name>
These commands let you:
- Initialize Terraform and download all remote dependencies
- Create a clean EKS cluster with all required AWS resources (IAM roles, ASGs, S3 buckets, etc.)
- Update your local kubeconfig file to access your newly created EKS cluster in the configured context
These Terraform commands will generate a few files in the default apps folder of the repository. You need to commit them in Git and push them to your GitHub repository before you start deploying services to your EKS Kubernetes cluster:
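The commit-and-push step can look like this (the branch name is a placeholder; adjust the folder path if your layout differs):

```shell
git add apps
git commit -m "Add generated ArgoCD application manifests"
git push origin <branch-name>
```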
Note that ArgoCD is pre-configured to track changes of the current repository. When new changes are made to its apps folder, they trigger the synchronization process, and all objects placed in this folder get created.
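For manual inspection or synchronization, the ArgoCD CLI can also be used once you are logged in (the application name below is illustrative):

```shell
argocd app list             # list applications tracked by ArgoCD
argocd app sync <app-name>  # trigger synchronization manually
```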
After that, you can manage your Kubernetes cluster with either the ArgoCD CLI/UI or kubectl. To start using kubectl (the Kubernetes CLI for cluster management), install and configure it following this official guide:
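Once kubectl is installed and your kubeconfig has been updated, you can sanity-check access to the cluster with standard commands:

```shell
kubectl config current-context  # should point at your new EKS cluster
kubectl get nodes               # lists the worker nodes of the cluster
```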
3. Access and Manage Your Amazon EKS Kubeflow Cluster
Now you have your cluster deployed and ready for work. During the deployment process, two service access endpoints were created in accordance with the domains variable settings in your main.tf file:
Check the email addresses you provided in the cognito_users variable for access credentials and use them to log in.
To learn more about Kubeflow and ArgoCD, you can check out their respective official documentation:
Once you have successfully logged into your Amazon EKS cluster via kubectl, accessed the Kubeflow UI, and passed all the configuration screens, you’ll see the Kubeflow dashboard:
In the Pipelines section, Kubeflow offers a few samples to let you try pipelines quickly. To learn more about using Kubeflow on AWS, please check the official Kubeflow documentation.
Alternatively, you can upload your own pipelines using AWS SageMaker and Kubeflow. For instance, let’s upload a demo module with one of the built-in AWS SageMaker algorithms.
1. Create a folder for managing separate Terraform states (with resources related to pipeline executions) and add a main.tf file with this code:
module "kmeans_mnist" {
  source = "path/to/kmeans-mnist-pipeline/folder/at/root/of/the/project"

  cluster_name = "<your-cluster-name>"
  username     = "<your-kubeflow-username>"
}
2. Run Terraform:
terraform init
terraform apply
Terraform will generate a training_pipeline.yaml file and create a Kubernetes service account that matches your Kubeflow username and has all the AWS permissions required to run the pipeline.

3. Upload the training pipeline to Kubeflow through the Pipelines section of the Kubeflow UI:
4. Now that you have your first pipeline and a prepared Kubernetes service account, specify them in the form to start a run:
That’s it! Now you have a pipeline executing in Kubeflow.
SAKK will continue to evolve according to the Roadmap; check it out in the official repository. Upcoming development plans include making more resources configurable via Terraform:
Further AWS Integration
More AWS features will become configurable via Terraform (in main.tf): RDS (Postgres), ElastiCache (Redis), S3 (Minio), etc. will be moved out of Kubernetes and managed by AWS.
Upgrading Product Versions
It will become possible to set product versions (ArgoCD, Kubeflow, Kubeflow Pipelines) via Terraform (in main.tf).
Setting AWS IAM roles for Kubeflow
Setting Kubeflow users’ roles and permissions to enable their work with AWS will move to Terraform. Users will be able to generate Kubeflow profiles and resources that will be stored in the GitHub repository and used as a part of the GitOps process.
Kubeflow Pipelines Management
We’ll make it possible to store the state of Kubeflow Pipelines. Users will be able to deploy Kubeflow with ready pipelines from the outside: preload them from a GitHub repository or upload default AWS pipelines.
Most importantly, please use the product. Fork and clone it, play around with configurations and deployments, and use the deployed clusters. See if you like it and tell us why or why not. Let us know your ideas about how SAKK can be improved and contribute to the roadmap. Don’t hesitate to fork, watch, and star the SAKK repository:
To contribute to PRs (pull requests), you can use this official GitHub guide:
Join SAKK on Slack to discuss any questions or ideas:
Please star and watch SAKK repository and subscribe to maintainers:
Review SAKK Roadmap, comment and contribute:
At Provectus, we believe that any organization or engineer using ML should be able to focus on their ML applications and pipelines without having to worry too much about infrastructure deployment.
We built SAKK to share the best practices gained by our DevOps team over the last 10 years, and we hope the product helps your ML teams to simplify MLOps and start getting enterprise-ready AI/ML clusters for your use cases effortlessly, be it financial modeling, computer vision, natural language understanding, speech translation, or anything else.
Swiss Army Kube for Kubeflow (SAKK) is based on the main Swiss Army Kube (SAK) umbrella repository. SAKK is a SAK modification specifically for the Kubeflow Amazon EKS setup, based on SAK’s collection of modules.
Currently, SAKK is available for Amazon EKS (Elastic Kubernetes Service) only. We plan to expand to other cloud platforms soon.
About Provectus
Provectus is an Artificial Intelligence consultancy and solutions provider, helping companies in Healthcare & Life Sciences, Retail & CPG, Media & Entertainment, Manufacturing, and Internet businesses achieve their objectives through AI. Provectus is headquartered in Palo Alto, CA. For more information, visit provectus.com.