Kubeflow provides a simple, portable, and scalable way of running Machine Learning workloads on Kubernetes.
In this module, we will install Kubeflow on Amazon EKS, run single-node training and inference using TensorFlow, train and deploy a model locally and remotely using Fairing, set up a Kubeflow pipeline, and review how to call AWS managed services such as Amazon SageMaker for training and inference.
We need more resources to complete this chapter of the EKS Workshop. First, we'll increase the size of our cluster to 6 nodes:
export NODEGROUP_NAME=$(eksctl get nodegroups --cluster eksworkshop-eksctl -o json | jq -r '.[0].Name')
eksctl scale nodegroup --cluster eksworkshop-eksctl --name $NODEGROUP_NAME --nodes 6 --nodes-max 6
curl --silent --location "https://github.com/kubeflow/kfctl/releases/download/v1.0.1/kfctl_v1.0.1-0-gf3edb9b_linux.tar.gz" | tar xz -C /tmp
sudo mv -v /tmp/kfctl /usr/local/bin
The next step is to export the environment variables needed for the Kubeflow install:
cat << EoF > kf-install.sh
export AWS_CLUSTER_NAME=eksworkshop-eksctl
export KF_NAME=${AWS_CLUSTER_NAME}
export BASE_DIR=${HOME}/environment
export KF_DIR=${BASE_DIR}/${KF_NAME}
export CONFIG_URI="https://raw.githubusercontent.com/kubeflow/manifests/v1.0-branch/kfdef/kfctl_aws.v1.0.1.yaml"
export CONFIG_FILE=${KF_DIR}/kfctl_aws.yaml
EoF

source kf-install.sh
mkdir -p ${KF_DIR}
cd ${KF_DIR}
wget -O kfctl_aws.yaml $CONFIG_URI
We will use IAM Roles for Service Accounts in our configuration. IAM Roles for Service Accounts offers fine-grained access control so that when Kubeflow interacts with AWS resources (such as ALB creation), it uses roles that are pre-defined by kfctl.
kfctl will set up an OIDC identity provider for your EKS cluster and create two IAM roles (kf-admin-${AWS_CLUSTER_NAME} and kf-user-${AWS_CLUSTER_NAME}) in your account.
kfctl will then build a trust relationship between the OIDC endpoint and the Kubernetes Service Accounts (SA) so that only those SAs can perform the actions defined in the IAM roles.
Because we are using this feature, we will disable the IAM roles defined at the worker nodes. In addition, we will replace the EKS cluster name and AWS region in your ${CONFIG_FILE}.
sed -i '/region: us-west-2/ a enablePodIamPolicy: true' ${CONFIG_FILE}
sed -i -e 's/kubeflow-aws/'"$AWS_CLUSTER_NAME"'/' ${CONFIG_FILE}
sed -i "s@us-west-2@$AWS_REGION@" ${CONFIG_FILE}
sed -i "s@roles:@#roles:@" ${CONFIG_FILE}
sed -i "s@- eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@#- eksctl-eksworkshop-eksctl-nodegroup-ng-a2-NodeInstanceRole-xxxxxxx@" ${CONFIG_FILE}
Until https://github.com/kubeflow/kubeflow/issues/3827 is fixed, install aws-iam-authenticator:
curl -o aws-iam-authenticator https://amazon-eks.s3.us-west-2.amazonaws.com/1.15.10/2020-02-22/bin/linux/amd64/aws-iam-authenticator
chmod +x aws-iam-authenticator
sudo mv aws-iam-authenticator /usr/local/bin
Apply configuration and deploy Kubeflow on your cluster:
cd ${KF_DIR}
kfctl apply -V -f ${CONFIG_FILE}
Run the command below to check the status:
kubectl -n kubeflow get all
Kubeflow Dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
In your Cloud9 environment, click Tools / Preview / Preview Running Application to access dashboard. You can click on Pop out window button to maximize browser into new tab.
Leave the current terminal running; if you kill the process, you will lose access to the dashboard. Open a new terminal to follow the rest of the workshop.
Click on Start Setup
Specify the namespace as eksworkshop
Click on Finish to view the dashboard
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. It is often used for data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and more.
In Kubeflow dashboard, click on Create a new Notebook server:
- Select the namespace created in the previous step; this pre-populates the namespace field on the dashboard.
- Specify myjupyter as the name for the notebook.
- In the Image section, select the latest tensorflow-1.x image whose name ends in cpu (not gpu) from the dropdown box.
- Change the CPU value to 1.0.
- Scroll to the bottom, take all other defaults, and click on LAUNCH.
It takes a few seconds for the Jupyter notebook to come online. Click on CONNECT; this connects to the notebook and opens the notebook interface in a new browser tab.
- Click on New and select Python3.
While a Jupyter notebook is good for interactive model training, you may want to package the training code as a Docker image and run it in an Amazon EKS cluster.
This chapter explains how to build a training model for the Fashion-MNIST dataset using TensorFlow and Keras on Amazon EKS. This dataset contains 70,000 grayscale images in 10 categories and is meant to be a drop-in replacement for MNIST.
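For reference, the 10 Fashion-MNIST categories can be kept in a small lookup table; class 9 (Ankle boot) is the label that shows up in the inference output later in this chapter. A minimal Python sketch:

```python
# The 10 Fashion-MNIST categories; the list index is the numeric
# label (0-9) stored in the dataset.
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def class_name(label: int) -> str:
    """Map a numeric Fashion-MNIST label to its human-readable name."""
    return FASHION_MNIST_CLASSES[label]

print(class_name(9))  # -> Ankle boot
```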
We will use the pre-built Docker image seedjeffwan/mnist_tensorflow_keras:1.13.1 for this exercise. This image uses tensorflow/tensorflow:1.13.1 as the base image. It contains the training code, downloads the training and test datasets, and stores the generated model in an S3 bucket.
Alternatively, you can use the Dockerfile to build the image with the command below. We will skip this step for now:
docker build -t <dockerhub_username>/<repo_name>:<tag_name> .
Create S3 bucket
Create an S3 bucket where the training model will be saved:
export HASH=$(< /dev/urandom tr -dc a-z0-9 | head -c6)
export S3_BUCKET=$HASH-eks-ml-data
aws s3 mb s3://$S3_BUCKET --region $AWS_REGION
This name will be used in the pod specification later. The bucket is also used for serving the model.
If you want to use an existing bucket in a different region, make sure to specify that region as the value of the AWS_REGION environment variable in mnist-training.yaml.
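The shell pipeline above just builds a random, globally unique bucket name. The same idea can be sketched in Python (the `-eks-ml-data` suffix matches the one used above):

```python
import random
import string

def make_bucket_name(suffix: str = "eks-ml-data", length: int = 6) -> str:
    """Generate a random lowercase alphanumeric prefix, like the
    /dev/urandom + tr + head pipeline in the shell command above."""
    chars = string.ascii_lowercase + string.digits
    prefix = "".join(random.choices(chars, k=length))
    return f"{prefix}-{suffix}"

print(make_bucket_name())  # e.g. "x3k9ab-eks-ml-data"
```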
AWS credentials are required to save the model in the S3 bucket. These credentials are stored in the EKS cluster as a Kubernetes secret.
Create an IAM user s3user, attach an S3 access policy, and retrieve its access keys:
aws iam create-user --user-name s3user
aws iam attach-user-policy --user-name s3user --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam create-access-key --user-name s3user > /tmp/create_output.json
Next, record the new user’s credentials into environment variables:
export AWS_ACCESS_KEY_ID_VALUE=$(jq -j .AccessKey.AccessKeyId /tmp/create_output.json | base64)
export AWS_SECRET_ACCESS_KEY_VALUE=$(jq -j .AccessKey.SecretAccessKey /tmp/create_output.json | base64)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: aws-secret
type: Opaque
data:
  AWS_ACCESS_KEY_ID: $AWS_ACCESS_KEY_ID_VALUE
  AWS_SECRET_ACCESS_KEY: $AWS_SECRET_ACCESS_KEY_VALUE
EOF
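Kubernetes stores Secret values under `data` base64-encoded, which is why the access keys were piped through `base64` above. A quick Python illustration of the round trip (the key value here is AWS's documented example placeholder, not a real credential):

```python
import base64

# Placeholder access key id, for illustration only
access_key_id = "AKIAIOSFODNN7EXAMPLE"

# What goes into the Secret's `data` field
encoded = base64.b64encode(access_key_id.encode("ascii")).decode("ascii")

# Kubernetes decodes it back before exposing it to the pod
decoded = base64.b64decode(encoded).decode("ascii")
assert decoded == access_key_id
print(encoded)
```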
Create pod:
curl -LO https://eksworkshop.com/advanced/420_kubeflow/kubeflow.files/mnist-training.yaml
envsubst < mnist-training.yaml | kubectl create -f -
This starts a pod that runs the training and saves the generated model in the S3 bucket. Check the status:
kubectl get pods
After the model is trained and stored in the S3 bucket, the next step is to use that model for inference.
This chapter explains how to use the previously trained model to run inference using TensorFlow and Keras on Amazon EKS.
The trained model was stored in the S3 bucket in the previous section. Make sure the S3_BUCKET and AWS_REGION environment variables are set correctly.
curl -LO https://eksworkshop.com/advanced/420_kubeflow/kubeflow.files/mnist-inference.yaml
envsubst <mnist-inference.yaml | kubectl apply -f -
Wait for the containers to start, then run the next command to check their status:
kubectl get pods -l app=mnist,type=inference
Now, we are going to use Kubernetes port forward for the inference endpoint to do local testing:
kubectl port-forward `kubectl get pods -l=app=mnist,type=inference -o jsonpath='{.items[0].metadata.name}' --field-selector=status.phase=Running` 8500:8500
Leave the current terminal running and open a new terminal to install TensorFlow.
curl -O https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py --user
pip3 install requests tensorflow --user
Use the script inference_client.py to make a prediction request. It randomly picks one image from the test dataset and requests a prediction:
curl -LO https://eksworkshop.com/advanced/420_kubeflow/kubeflow.files/inference_client.py
python inference_client.py --endpoint http://localhost:8500/v1/models/mnist:predict
Data: {"instances": [[[[0.0], [0.0], [0.0], [0.0], [0.0] … 0.0], [0.0]]]], "signature_name": "serving_default"}
The model thought this was an Ankle boot (class 9), and it was actually an Ankle boot (class 9)
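The request body in the output above follows the TensorFlow Serving REST predict format. A minimal Python sketch that builds such a payload, assuming a 28x28x1 all-zero image (Fashion-MNIST's input shape):

```python
import json

# One 28x28 grayscale image, all zeros, shaped [height][width][channels]
image = [[[0.0] for _ in range(28)] for _ in range(28)]

# TensorFlow Serving REST predict body: a batch under "instances"
# plus the serving signature to invoke
payload = json.dumps({
    "instances": [image],
    "signature_name": "serving_default",
})

# The payload would be POSTed to the forwarded endpoint, e.g.
# http://localhost:8500/v1/models/mnist:predict
parsed = json.loads(payload)
print(len(parsed["instances"][0]))  # -> 28
```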
Now that we have seen how to run a training job and inference, let's terminate these pods to free up resources:
kubectl delete -f mnist-training.yaml
kubectl delete -f mnist-inference.yaml
Uninstall Kubeflow
Delete the IAM users, S3 bucket, and Kubernetes secrets:
# delete s3user
aws iam detach-user-policy --user-name s3user --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam delete-access-key --access-key-id `echo $AWS_ACCESS_KEY_ID_VALUE | base64 --decode` --user-name s3user
aws iam delete-user --user-name s3user
# delete sagemakeruser
aws iam detach-user-policy --user-name sagemakeruser --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
aws iam delete-access-key --access-key-id `echo $AWS_ACCESS_KEY_ID_VALUE | base64 --decode` --user-name sagemakeruser
aws iam delete-user --user-name sagemakeruser
# delete S3 bucket
aws s3 rb s3://$S3_BUCKET --force --region $AWS_REGION
# delete aws-secret
kubectl delete secret/aws-secret
kubectl delete secret/aws-secret -n kubeflow
Run these commands to uninstall Kubeflow from your EKS cluster:
cd ${KF_DIR}
kfctl delete -V -f ${CONFIG_FILE}
Scale the cluster back to its previous size:
eksctl scale nodegroup --cluster eksworkshop-eksctl --name $NODEGROUP_NAME --nodes 3