AWS Machine Learning Blog

Simplify Machine Learning Inference on Kubernetes with Amazon SageMaker Operators

Amazon SageMaker Operators for Kubernetes allow you to augment your existing Kubernetes cluster with SageMaker hosted endpoints.

Machine learning inference requires investment to create a reliable and efficient service. For an XGBoost model, developers have to create an application, for example with Flask, that loads the model and serves the endpoint, which requires them to think about queue management, fault-tolerant deployment, and reloading of newly trained models. The serving container then has to be pushed to a Docker repository that Kubernetes is configured to pull from before it can be deployed on the cluster. These steps require your data scientists to work on tasks unrelated to improving model accuracy, or require bringing in a DevOps engineer, which adds to development schedules and slows iteration.

With the SageMaker Operators, developers only need to write a YAML file that specifies the Amazon S3 locations of the saved models, and live predictions become available through a secure endpoint. Reconfiguring the endpoint is as simple as updating the YAML file. On top of being easy to use, the service also offers the following features:

  • Multi-model endpoints – Hosting dozens or more models can be challenging to configure and can leave many machines operating at low utilization. Multi-model endpoints set up one instance that loads model artifacts on the fly for serving
  • Elastic Inference – Run your smaller workloads on fractional GPUs that you can deploy at low cost
  • High Utilization & Dynamic Auto Scaling – Endpoints can run at 100% utilization and add replicas based on custom metrics you define, such as invocations per second. Alternatively, you can configure automatic scaling on predefined metrics to maintain client performance
  • Availability Zone Transfer – If there is an outage, Amazon SageMaker will automatically move your endpoint to another Availability Zone within your VPC
  • A/B Testing – Host multiple models on a single endpoint and direct traffic to each in proportion to the weights that you set, as sketched in the example after this list
  • Security – Endpoints are created with HTTPS and can be configured to be run in a private VPC (no internet egress) and accessed through AWS PrivateLink
  • Compliance Ready – Amazon SageMaker has been certified compliant with HIPAA, PCI DSS, and SOC (1, 2, 3) rules and regulations
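
For example, the A/B testing feature maps directly onto the productionVariants section of the HostingDeployment manifest used later in this post. The following is a minimal sketch, assuming two hypothetical models named xgboost-model-a and xgboost-model-b are defined in the same manifest, that splits traffic 90/10 by weight:

productionVariants:
  - variantName: VariantA
    modelName: xgboost-model-a
    initialInstanceCount: 1
    instanceType: ml.r5.large
    initialVariantWeight: 9
  - variantName: VariantB
    modelName: xgboost-model-b
    initialInstanceCount: 1
    instanceType: ml.r5.large
    initialVariantWeight: 1

Amazon SageMaker routes traffic to each variant in proportion to its weight relative to the sum of all variant weights.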

Packaged together, the features available in Kubernetes through SageMaker Operators shorten the time to launch model serving and reduce the development resources needed to set up and maintain production infrastructure. This can amount to a 90% drop in total cost of ownership compared to running on Amazon EKS or Amazon EC2 alone.

This post demonstrates how to set up Amazon SageMaker Operators for Kubernetes to create and update endpoints for a pre-trained XGBoost model completely from kubectl. The solution contains the following steps:

  • Create an Amazon SageMaker IAM execution role, which gives Amazon SageMaker the permissions needed to serve your model
  • Prepare a YAML file that deploys your model to Amazon SageMaker
  • Deploy your model to Amazon SageMaker
  • Query the endpoint to obtain predictions
  • Perform an eventually consistent update to the deployed model

Prerequisites

This post assumes you have the following prerequisites:

  • A Kubernetes cluster
  • The Amazon SageMaker Operators installed on your cluster
  • An XGBoost model you can deploy

For information about installing the operator onto an Amazon EKS cluster, see Introducing Amazon SageMaker Operators for Kubernetes. You can bring your own XGBoost model, but this tutorial uses the existing model from the previously mentioned post.

Creating an Amazon SageMaker execution role

Amazon SageMaker needs an IAM role that it can assume to serve your model. If you do not have one already, create one with the following bash code:

export assume_role_policy_document='{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "sagemaker.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}'
aws iam create-role --role-name <execution role name> \
    --assume-role-policy-document \
    "$assume_role_policy_document"
aws iam attach-role-policy --role-name <execution role name> \
    --policy-arn \
    arn:aws:iam::aws:policy/AmazonSageMakerFullAccess

Replace <execution role name> with a suitable role name. This creates an IAM role that Amazon SageMaker can assume when serving your model.
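
The HostingDeployment manifest in the next step needs this role's ARN. Assuming the role was created as above, you can retrieve the ARN with the following command:

$ aws iam get-role --role-name <execution role name> \
    --query 'Role.Arn' --output text
arn:aws:iam::123456789012:role/<execution role name>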

Preparing your hosting deployment

The operators provide a Custom Resource Definition (CRD) named HostingDeployment. You use a HostingDeployment to configure your model deployment on Amazon SageMaker Hosting.
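
Before proceeding, you can confirm that the operator's custom resources are available on your cluster by listing the installed CRDs and looking for the sagemaker.aws.amazon.com group (this group name matches the apiVersion in the manifest below):

$ kubectl get crd | grep sagemaker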

To prepare your hosting deployment, create a file called hosting.yaml with the following contents:

apiVersion: sagemaker.aws.amazon.com/v1
kind: HostingDeployment
metadata:
  name: hosting-deployment
spec:
  region: us-east-2
  productionVariants:
    - variantName: AllTraffic
      modelName: xgboost-model
      initialInstanceCount: 1
      instanceType: ml.r5.large
      initialVariantWeight: 1
  models:
    - name: xgboost-model
      executionRoleArn: SAGEMAKER_EXECUTION_ROLE_ARN
      containers:
        - containerHostname: xgboost
          modelDataUrl: s3://BUCKET_NAME/model.tar.gz
          image: 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest

Replace SAGEMAKER_EXECUTION_ROLE_ARN with the ARN of the execution role you created in the previous step. Replace BUCKET_NAME with the bucket that contains your model.

Make sure that the bucket Region, the HostingDeployment Region, and the image's Amazon ECR Region all match.
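
If you are unsure which Region your bucket is in, you can check with the AWS CLI; a LocationConstraint of null indicates us-east-1. See the following code:

$ aws s3api get-bucket-location --bucket BUCKET_NAME
{
    "LocationConstraint": "us-east-2"
}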

Deploying your model to Amazon SageMaker

You can now start the deployment by running kubectl apply -f hosting.yaml. See the following code:

$ kubectl apply -f hosting.yaml
hostingdeployment.sagemaker.aws.amazon.com/hosting-deployment created

You can track deployment status with kubectl get hostingdeployments. See the following code:

$ kubectl get hostingdeployments
NAME                 STATUS     SAGEMAKER-ENDPOINT-NAME
hosting-deployment   Creating   hosting-deployment-38ecac47487611eaa81606fc3390e6ba

Your model endpoint may take up to 15 minutes to deploy. You can use the following command to view the status. The endpoint is ready for queries when it reaches the InService status.

$ kubectl get hostingdeployments
NAME                 STATUS      SAGEMAKER-ENDPOINT-NAME
hosting-deployment   InService   hosting-deployment-38ecac47487611eaa81606fc3390e6ba
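
Instead of polling, you can watch for status changes, or use kubectl describe to inspect the full endpoint configuration and any events the operator has recorded. See the following code:

$ kubectl get hostingdeployments -w
$ kubectl describe hostingdeployment hosting-deployment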

Querying the endpoint

After the endpoint is in service, you can test that it works with the following example code:

$ aws sagemaker-runtime invoke-endpoint \
  --region us-east-2 \
  --endpoint-name SAGEMAKER-ENDPOINT-NAME \
  --body $(seq 784 | xargs echo | sed 's/ /,/g') \
  >(cat) \
  --content-type text/csv > /dev/null

This bash command connects to the HTTPS endpoint using the AWS CLI. The model you created is based on the MNIST digit dataset, and your predictor reads what number is in the image. When you make this call, it sends an inference payload of 784 features in CSV format, which represent the pixels of a 28x28 image; the seq pipeline simply generates the dummy payload 1,2,3,...,784. The response is the number that the model believes is in the payload. See the following code:

$ aws sagemaker-runtime invoke-endpoint \
  --region us-east-2 \
  --endpoint-name hosting-deployment-38ecac47487611eaa81606fc3390e6ba \
  --body $(seq 784 | xargs echo | sed 's/ /,/g') \
  >(cat) \
  --content-type text/csv > /dev/null
8.0

This confirms that your endpoint is up and running.
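
In practice, you would send real pixel values rather than the dummy sequence. The following sketch assumes you have one MNIST image flattened into a single CSV row in a hypothetical file named payload.csv; note that AWS CLI v2 may additionally require --cli-binary-format raw-in-base64-out when passing a raw text body:

$ aws sagemaker-runtime invoke-endpoint \
    --region us-east-2 \
    --endpoint-name hosting-deployment-38ecac47487611eaa81606fc3390e6ba \
    --body "$(cat payload.csv)" \
    --content-type text/csv \
    prediction.txt
$ cat prediction.txt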

Eventually consistent updates

After you deploy a model, you can make changes to the Kubernetes YAML and the operator updates the endpoint. The updates propagate to Amazon SageMaker in an eventually consistent way. This enables you to configure your endpoints declaratively and lets the operator handle the details.

To demonstrate this, you can change the instance type of the model from ml.r5.large to ml.c5.2xlarge. Complete the following steps:

  1. Modify the instance type in hosting.yaml to be ml.c5.2xlarge. See the following code:
    apiVersion: sagemaker.aws.amazon.com/v1
    kind: HostingDeployment
    metadata:
      name: hosting-deployment
    spec:
      region: us-east-2
      productionVariants:
        - variantName: AllTraffic
          modelName: xgboost-model
          initialInstanceCount: 1
          instanceType: ml.c5.2xlarge
          initialVariantWeight: 1
      models:
        - name: xgboost-model
          executionRoleArn: SAGEMAKER_EXECUTION_ROLE_ARN
          containers:
            - containerHostname: xgboost
              modelDataUrl: s3://BUCKET_NAME/model.tar.gz
              image: 825641698319.dkr.ecr.us-east-2.amazonaws.com/xgboost:latest
    
  2. Apply the change to the Kubernetes cluster. See the following code:
    $ kubectl apply -f hosting.yaml
    hostingdeployment.sagemaker.aws.amazon.com/hosting-deployment configured
  3. Get the status of the hosting deployment. It will show as Updating and then change to InService when ready. See the following code:
    $ kubectl get hostingdeployments
    NAME                 STATUS     SAGEMAKER-ENDPOINT-NAME
    hosting-deployment   Updating   hosting-deployment-38ecac47487611eaa81606fc3390e6ba
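
You can also track the update from the Amazon SageMaker side. The following command reports the endpoint status, which moves from Updating back to InService when the new instance type is live (the endpoint name is the one reported by kubectl):

$ aws sagemaker describe-endpoint \
    --region us-east-2 \
    --endpoint-name hosting-deployment-38ecac47487611eaa81606fc3390e6ba \
    --query 'EndpointStatus'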

The endpoint remains live and fully available throughout the update. For more information and additional examples, see the GitHub repo.

Cleaning up

To delete the endpoint and avoid incurring further usage charges, run kubectl delete -f hosting.yaml. See the following code:

$ kubectl delete -f hosting.yaml
hostingdeployment.sagemaker.aws.amazon.com "hosting-deployment" deleted
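
To verify that the endpoint is gone and nothing is still incurring charges, you can list the endpoints remaining in the Region; the Endpoints list should no longer contain your deployment:

$ aws sagemaker list-endpoints --region us-east-2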

Conclusion

This post demonstrated how Amazon SageMaker Operators for Kubernetes support real-time inference. The operators also support training and hyperparameter tuning.

As always, please share your experience and feedback, or submit additional example YAML specs or operator improvements. You can share how you’re using Amazon SageMaker Operators for Kubernetes by posting on the AWS forum for Amazon SageMaker, creating issues in the GitHub repo, or sending feedback through your AWS Support contacts.


About the authors

Cade Daniel is a Software Development Engineer with AWS Deep Learning. He develops products that make training and serving DL/ML models more efficient and easy for customers. Outside of work, he enjoys practicing his Spanish and learning new hobbies.

Alex Chung is a Senior Product Manager with AWS working on enterprise machine learning systems. His role is to make AWS MLOps products more accessible for custom Kubernetes machine learning environments. He’s passionate about accelerating ML adoption for a large body of users to solve global economic and societal problems. Outside machine learning, he is also a board member at Cocatalyst.org, a Silicon Valley nonprofit for donating stock to charity that optimizes donor tax benefits, similar to donor-advised funds.