This sample corresponds to the AWS Blog Post Securing MLflow in AWS: Fine-grained access control with AWS native services
We aim to demonstrate how to build a hybrid architecture that uses different tools to enable end-to-end Machine Learning workflows. Specifically, we look at Amazon SageMaker and MLflow, and at how they can be integrated securely, without having to manage credentials, by using IAM Roles and temporary credentials.
This sample shows how to do the following:
- How to deploy MLflow on a serverless architecture (we build on top of running MLflow on Fargate)
- How to expose an MLflow server via private integrations to an Amazon API Gateway (we build on top of running MLflow on AWS)
- How to add authentication and authorization for programmatic access and browser access to MLflow
- How to access MLflow via SageMaker using SageMaker Execution Roles
Due to its modularity, this sample can be extended in a number of ways, and we will provide guidance on how to do so.
This sample is made of 4 different stacks:
MLflowVPCStack
- deploys an MLflow tracking server on a serverless infrastructure running on ECS and Fargate on a private subnet
- deploys an Aurora Serverless database for the data store and S3 for the artifact store.
RestApiGatewayStack
- exposes the MLflow server via a PrivateLink to a REST API Gateway.
- deploys a Cognito User Pool to manage the users accessing the UI.
- deploys a Lambda Authorizer that verifies the JWT token with the Cognito User Pool ID keys and returns IAM policies to allow or deny a request.
- adds IAM Authorizer. This will be applied to the
AmplifyMLflowStack
- creates an app with CI/CD capability to deploy the MLflow UI
SageMakerStudioUserStack
- deploys a SageMaker Studio domain (if not existing).
- adds three users, each one with a different SageMaker execution role implementing a different access level:
  - mlflow-admin -> admin-like permissions to the MLflow resources
  - mlflow-reader -> read-only access to the MLflow resources
  - mlflow-model-approver -> same permissions as mlflow-reader, plus can register new models from existing runs, and promote existing registered models to new stages in MLflow
Our proposed architecture is shown in Fig. 1.
Fig. 1 - MLflow on AWS architecture diagram
- Access to an AWS account with Admin permissions and credentials correctly set
- Docker
- Python 3.8
Log into the AWS Management Console and search for Cloud9 in the search bar.
Click Cloud9 and create an AWS Cloud9 environment based on Amazon Linux 2.
For the instance type, we tested with a t3.large, but you can very likely use a Free-Tier eligible instance.
Open a new terminal inside AWS Cloud9 IDE and run:
git clone https://github.com/aws-samples/sagemaker-studio-mlflow-integration.git
The CDK script expects the following ENV variables to be set
AWS_REGION=<region-where-you-want-to-deploy>
AWS_ACCOUNT=<AWS-account-where-you-want-to-deploy>
If you would like to use an existing SageMaker Studio domain, please set this ENV variable
DOMAIN_ID=<your-existing-sagemaker-studio-domain-id>
The default region used by the CDK app is us-west-2.
You can change the default region by setting the AWS_REGION environment variable.
When working on Cloud9, you can specify the same region where your Cloud9 environment is running as follows:
sudo yum install jq -y
export AWS_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region')
echo "export AWS_REGION=${AWS_REGION}" | tee -a ~/.bash_profile
export AWS_ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
echo "export AWS_ACCOUNT=${AWS_ACCOUNT}" | tee -a ~/.bash_profile
The CDK script expects to find the ENV variable DOMAIN_ID in order to figure out whether a new SageMaker Studio domain is needed.
export DOMAIN_ID=$(aws sagemaker list-domains | jq -r 'select(.Domains[0] != null) .Domains[0].DomainId | tostring')
echo "export DOMAIN_ID=${DOMAIN_ID}" | tee -a ~/.bash_profile
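The way these three variables interact can be summarised with a short Python sketch. It is only an illustration of the rules described above (default region us-west-2, optional DOMAIN_ID); the function name `resolve_deploy_config` is hypothetical, and raising an error when AWS_ACCOUNT is missing is an assumption of this sketch, not necessarily what the CDK app does.

```python
from typing import Dict, Mapping, Optional


def resolve_deploy_config(env: Mapping[str, str]) -> Dict[str, Optional[str]]:
    """Resolve deployment settings from environment variables (illustrative only)."""
    # Fall back to the documented default region when AWS_REGION is unset or empty.
    region = env.get("AWS_REGION") or "us-west-2"

    # Assumption of this sketch: a missing account is treated as an error.
    account = env.get("AWS_ACCOUNT")
    if not account:
        raise ValueError("AWS_ACCOUNT must be set")

    # An unset or empty DOMAIN_ID means "no existing domain, create a new one".
    domain_id = env.get("DOMAIN_ID") or None

    return {"region": region, "account": account, "domain_id": domain_id}


# Usage: resolve_deploy_config(os.environ) after `import os`.
```

Note that the `jq` filter above yields an empty string when no domain exists, which this sketch treats the same as an unset variable.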
The MLflow UI does not support any login workflow, nor does it provide a mechanism to set the proper headers for authenticated API calls against a backend service.
Amplify provides libraries that can be used to quickly add a login workflow, and to easily manage the lifecycle of the authentication tokens.
We provide a patch to be applied on top of MLflow 2.5.0 that adds Amplify React Components for authentication and adds an Authorization header with a Bearer token to every backend API call.
The patch we provided can be checked here, and it enables a login flow backed by Amazon Cognito as shown in Fig. 3.
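To make the header mechanism concrete, here is a minimal Python sketch of a request that carries a Bearer token in the same way. It only builds the request and does not send it. The helper name `build_mlflow_request` is hypothetical, and the `/ajax-api/` prefix (the path the MLflow UI uses for its backend calls) is assumed to be the route protected by the token-based authorizer in your deployment.

```python
import urllib.request


def build_mlflow_request(base_url: str, token: str, experiment_id: str = "0") -> urllib.request.Request:
    """Build (but do not send) a GET request with an Authorization: Bearer header."""
    # Assumed route: the UI-facing MLflow endpoint for fetching one experiment.
    url = f"{base_url.rstrip('/')}/ajax-api/2.0/mlflow/experiments/get?experiment_id={experiment_id}"
    # The token is the JWT issued by Amazon Cognito after login.
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


# Usage (placeholders, not real values):
#   req = build_mlflow_request("https://<your-api-gateway-url>", "<jwt-from-cognito>")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read())
```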
Note: we also provide patches for previous versions of MLflow:
- 1.30.0. If you want to install that version, you need to ensure mlflow 1.30.0 is installed throughout this sample, and adapt the lab sample to work with that same version, as the SDK for deploying a model to SageMaker has changed.
- 2.2.1. If you want to install that version, you need to ensure mlflow 2.2.1 is installed.
- 2.3.1. If you want to install that version, you need to ensure mlflow 2.3.1 is installed.
cd ~/environment/sagemaker-studio-mlflow-integration/
git clone --depth 1 --branch v2.5.0 https://github.com/mlflow/mlflow.git
cd mlflow
git am ../cognito-mlflow_v2-5-0.patch
Before deploying, since we use a CDK construct to build the container images locally, we need a larger disk size than the one provided by Cloud9 in its default environment configuration (i.e. 20GB, which is not enough). To resize it on the fly without rebooting the instance, you can run the following script, specifying the new desired size.
cd ~/environment/sagemaker-studio-mlflow-integration/
./resize-cloud9.sh 100
Where 100 represents the new desired disk size in GB.
The AWS Cloud Development Kit (AWS CDK) is an open-source software development framework to model and provision your cloud application resources using familiar programming languages. If you would like to familiarize yourself with it, the CDK Workshop is a great place to start.
Using the Cloud9 environment, open a new terminal and use the following commands:
cd ~/environment/sagemaker-studio-mlflow-integration/cdk
npm install -g aws-cdk@2.86.0 --force
cdk --version
Take note of the latest version that you install; at the time of writing this post it is 2.86.0.
Open the package.json file and replace the version "2.86.0" of the following modules with the version that you have installed above.
"aws-cdk-lib": "2.86.0",
"@aws-cdk/aws-amplify-alpha": "2.86.0-alpha.0",
"@aws-cdk/aws-cognito-identitypool-alpha": "2.86.0-alpha.0",
"@aws-cdk/aws-lambda-python-alpha": "2.86.0-alpha.0",
The following commands install all the latest CDK modules under the node_modules directory (npm install) and prepare your AWS account to deploy resources with CDK (cdk bootstrap).
cd ~/environment/sagemaker-studio-mlflow-integration/cdk
npm install
cdk bootstrap
Now we are ready to deploy our full solution.
cdk deploy --all --require-approval never
To run this sample, we recommend deploying all 4 stacks to test out the SageMaker integration.
However, if you are only interested in the MLflow deployment (MLflow server, MLflow UI, and REST API Gateway), you can deploy only the first three stacks, i.e. MLflowVPCStack, RestApiGatewayStack and AmplifyMLflowStack.
We have provided a script that populates the Cognito User Pool with 3 users, each belonging to a different group. To execute the script, run the following command. The script will prompt you to enter your desired password. Please ensure that the password you pick respects the password policy defined for Cognito.
cd ~/environment/sagemaker-studio-mlflow-integration/src/cognito/
python add_users_and_groups.py
You can check the script code here.
After running the script, if you check the Cognito User Pool in the console, you should see the three users created.
Fig. 2 - Cognito users in the Cognito User Pool.
On the REST API Gateway side, the Lambda Authorizer first verifies the signature of the token using the Cognito User Pool key and verifies the claims. Only after that does it extract the Cognito group the user belongs to from the claim in the JWT token (i.e., cognito:groups) and apply the permissions that we have programmed for that group.
For our specific case, we have three groups:
- admins - can see and can edit everything
- readers - can only read everything
- model-approvers - same as readers, plus permissions to register models, create model versions, and update models to different stages.
Depending on the group, the Lambda Authorizer generates different IAM policies. This is just an example of how authorization can be achieved; with a Lambda Authorizer, you can implement any logic you want. If you want to restrict only a subset of actions, you need to be aware of the MLflow REST API definition, which can be found here. The code for the Lambda Authorizer can be explored here.
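The group-to-policy step can be sketched in Python as follows. This is a simplified illustration, not the sample's actual authorizer: it assumes the token has already been verified and the groups extracted from cognito:groups, and the `GROUP_RULES` table and the function name `build_authorizer_response` are hypothetical. In particular, treating "read" as "GET only" is a simplification, since some MLflow read operations (such as searches) are exposed as POST endpoints.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: group -> list of (HTTP verb, resource path pattern).
GROUP_RULES: Dict[str, List[Tuple[str, str]]] = {
    "admins": [("*", "*")],
    "readers": [("GET", "*")],
    "model-approvers": [
        ("GET", "*"),
        ("POST", "*/mlflow/registered-models/create"),
        ("POST", "*/mlflow/model-versions/create"),
        ("POST", "*/mlflow/model-versions/transition-stage"),
    ],
}


def build_authorizer_response(principal_id: str, groups: List[str], api_arn_prefix: str) -> dict:
    """Return an API Gateway Lambda Authorizer response for the first known group.

    api_arn_prefix has the form arn:aws:execute-api:<region>:<account>:<api-id>/<stage>.
    """
    rules = next((GROUP_RULES[g] for g in groups if g in GROUP_RULES), None)
    if rules is None:
        # Unknown or missing group: deny every method on every resource.
        effect, resources = "Deny", [f"{api_arn_prefix}/*/*"]
    else:
        effect = "Allow"
        resources = [f"{api_arn_prefix}/{verb}/{path}" for verb, path in rules]
    return {
        "principalId": principal_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [
                {"Action": "execute-api:Invoke", "Effect": effect, "Resource": resources}
            ],
        },
    }
```

Keeping the rules in a plain table makes it easy to add a group or tighten a path pattern without touching the policy-building logic.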
Fig. 3 - MLflow login flow using AWS Amplify, Amazon Cognito and Lambda Authorizer on the API Gateway
One of the key aspects of this sample is the integration with SageMaker.
Permissions in SageMaker are managed via IAM Roles, which in SageMaker are also called Execution Roles and are associated with the service when in use (both when using SageMaker Studio and when using the SageMaker managed infrastructure).
By allowing the API Gateway to use IAM authentication on <MLFLOW-Tracking-URL>/api/, we can use those same execution roles to access MLflow.
Provisioning a new SageMaker Studio domain will do the following operations:
- Create a new SageMaker Studio domain in the default VPC (unless one already exists).
- Create three new SageMaker Studio users attached to the domain, each with its own execution role. These execution roles have the same permissions that the Lambda Authorizer applies to the different groups:
  - mlflow-admin - has an associated execution role with permissions similar to those of a user in the Cognito group admins
  - mlflow-reader - has an associated execution role with permissions similar to those of a user in the Cognito group readers
  - mlflow-model-approver - has an associated execution role with permissions similar to those of a user in the Cognito group model-approvers
Fig. 3 - Accessing MLflow from SageMaker Studio and SageMaker Training Jobs using IAM Roles
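From the client side, IAM-based access can be sketched as below. The MLflow client signs its requests with SigV4 when the MLFLOW_TRACKING_AWS_SIGV4 environment variable is set to true, which in turn relies on the requests_auth_aws_sigv4 package being installed. The helper name `sigv4_tracking_env` is hypothetical, and setting AWS_DEFAULT_REGION explicitly is an assumption of this sketch; the labs in this sample may configure this differently.

```python
from typing import Dict


def sigv4_tracking_env(tracking_url: str, region: str) -> Dict[str, str]:
    """Environment variables that point the MLflow client at an IAM-protected endpoint."""
    if not tracking_url.startswith("https://"):
        raise ValueError("expected an https:// API Gateway URL")
    return {
        # Base URL of the tracking server; the client appends /api/2.0/mlflow/...
        "MLFLOW_TRACKING_URI": tracking_url,
        # Ask the MLflow client to sign every request with AWS SigV4.
        "MLFLOW_TRACKING_AWS_SIGV4": "true",
        # Assumption: make the signing region explicit.
        "AWS_DEFAULT_REGION": region,
    }


# Usage inside SageMaker Studio or a training job (requires mlflow and
# requests_auth_aws_sigv4 to be installed; the URL is a placeholder):
#   import os, mlflow
#   os.environ.update(sigv4_tracking_env("https://<MLFLOW-Tracking-URL>", "us-west-2"))
#   mlflow.set_experiment("demo")
```

The credentials used for signing come from the execution role attached to the Studio user profile or the training job, so no long-lived secrets need to be stored.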
To deploy an MLflow model to SageMaker, you need to create a serving container that implements what the SageMaker runtime expects to find.
MLflow makes this effort easier by providing a CLI command that builds the image locally and pushes it to your ECR.
Most recent versions of MLflow have dependencies on Python 3.8.
python --version
If running this sample on Cloud9, you need to ensure you have Python 3.8 installed.
You can follow these instructions to do so:
sudo yum install -y amazon-linux-extras
sudo amazon-linux-extras enable python3.8
sudo yum install -y python3.8
If on Cloud9, run the following (after installing Python 3.8):
# install the libraries
pip3.8 install mlflow==2.5.0 boto3 # or pip install mlflow==2.5.0 boto3 if your default pip comes alongside a python version >= 3.8
# build and push the container to ECR into your account
mlflow sagemaker build-and-push-container
Before accessing the MLflow UI, we need to ensure the first build executed successfully.
Navigate to the Amplify console, and select the MLflow-UI app that we have created.
Once the build completes (it might take some time), you can access the MLflow UI from the link provided by Amplify as shown in Fig. 4.
Fig. 4 - Retrieve the URL of the MLflow UI
There might be cases when the first Amplify build fails.
If this is the case, you should manually redeploy the Amplify build by navigating to the failed build.
You first select the main branch
Fig. 5 - Navigate to the Amplify main branch
and then click on "Redeploy this version".
Fig. 6 - Redeploy the same failed build
After a few minutes, you should see the successful build.
In the AWS console, navigate to Amazon SageMaker Studio and open Studio for the mlflow-admin user as shown in the pictures below.
Fig 6 - Navigate to Amazon SageMaker Studio
Fig 7 - Launch Amazon SageMaker Studio for the mlflow-admin
Clone this repository either from the terminal or from the SageMaker Studio UI.
git clone https://github.com/aws-samples/sagemaker-studio-mlflow-integration.git
We provide three labs located in the ./sagemaker-studio-mlflow-integration/lab/ folder.
When running the labs, please make sure the kernel selected is Base Python 2.0 (it should be selected by default).
- 1_mlflow-admin-lab.ipynb
  For this lab, please use the mlflow-admin user profile created for you in SageMaker Studio. In this lab you will test admin permissions. We access MLflow both from SageMaker Studio and from a SageMaker Training Job, using the execution role assigned to the user profile mlflow-admin. Once the training is completed, we further show how to register models, create model versions from the artifact, and download the artifacts locally for testing purposes. Finally, we show how to deploy the model on the SageMaker managed infrastructure. Furthermore, the lab shows how you can enrich MLflow metadata with SageMaker metadata, and vice versa, by storing MLflow specifics in SageMaker via the SageMaker Experiments SDK and visualizing them in the SageMaker Studio UI.
- 2_mlflow-reader-lab.ipynb
  For this lab, please use the mlflow-reader user profile created for you in SageMaker Studio. In this lab you will test read-only permissions. You can see details about every experiment and every run, as well as registered models and model versions; however, you cannot modify or create entities.
- 3_mlflow-model-approver-lab.ipynb
  For this lab, please use the mlflow-model-approver user profile created for you in SageMaker Studio. In this lab you will test the permissions to register new models and new model versions.
SageMaker Studio is based upon Jupyter Lab, and it offers the same flexibility to extend its capabilities thanks to Jupyter extensions. You can build your own extension, or access one of the existing ones via the "Extension Manager" (see Fig. 9).
Fig. 9 - Enable Extension Manager in SageMaker Studio
For our exercise, the jupyterlab-iframe extension provides us the capability to render websites within an iframe.
To install it, you can either follow the instructions in the extension documentation, or install it via the Extension Manager.
Once it is successfully installed, open View -> Activate Command Palette from the SageMaker Studio menu and search for "iframe", as shown in Fig. 10.
Fig. 10 - Open the jupyterlab-iframe dialog
Finally, set the MLflow UI URL generated by Amplify and open the tab. You can now access the MLflow UI without leaving the SageMaker Studio UI, using the same set of credentials you have stored in Amazon Cognito, as shown in Fig. 11.
Fig. 11 - Access MLflow UI from within SageMaker Studio
You can destroy the CDK stack by running the following command:
cd ~/environment/sagemaker-studio-mlflow-integration/cdk
cdk destroy --all
At the prompt, enter y.
There might be cases when the cleanup does not work.
Usually, this is due to the creation of SageMaker Studio KernelApps other than the ones provisioned by the CDK stack.
In this case, you should first delete all KernelApps on all user profiles manually, and then try again to destroy the stack as explained earlier.
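If you prefer to script this manual step, the following Python sketch shows one possible approach using the SageMaker `list_apps` and `delete_app` API calls. The function name `delete_kernel_apps` is hypothetical, and the client is passed in as a parameter so that the logic can be exercised without AWS access. Treat it as a starting point under these assumptions rather than a tested cleanup tool.

```python
from typing import Any, Dict, List


def delete_kernel_apps(sm_client: Any, domain_id: str) -> List[str]:
    """Delete all KernelGateway apps owned by user profiles in a Studio domain.

    sm_client is expected to behave like boto3.client("sagemaker").
    Returns the names of the apps for which deletion was requested.
    """
    deleted: List[str] = []
    kwargs: Dict[str, Any] = {"DomainIdEquals": domain_id}
    while True:
        page = sm_client.list_apps(**kwargs)
        for app in page.get("Apps", []):
            # Only kernel apps that are still around and belong to a user profile.
            if app.get("AppType") != "KernelGateway":
                continue
            if app.get("Status") in ("Deleted", "Deleting"):
                continue
            if "UserProfileName" not in app:
                continue
            sm_client.delete_app(
                DomainId=domain_id,
                UserProfileName=app["UserProfileName"],
                AppType=app["AppType"],
                AppName=app["AppName"],
            )
            deleted.append(app["AppName"])
        token = page.get("NextToken")
        if not token:
            return deleted
        kwargs["NextToken"] = token


# Usage (requires boto3 and valid AWS credentials):
#   import boto3
#   print(delete_kernel_apps(boto3.client("sagemaker"), "<your-domain-id>"))
```

Deletion is asynchronous, so wait until the apps are reported as deleted before running the stack destruction again.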
Cost of just running this sample: less than $10.
The biggest cost drivers in this sample are the 3 KernelGateway apps initialized for the SageMaker Studio domain.
To save costs, you can delete the 3 KernelGateway apps, one for each user profile, each of which spins up an ml.t3.medium instance.
They can be deleted from the console, and they are named instance-mlflow-basepython-2-0-ml-t3-medium.
Alternatively, you could install the sagemaker-studio-auto-shutdown-extension to save on costs.
We have shown how you can add authentication and authorization to a single-tenant MLflow serverless installation with minimal code changes to MLflow. The highlight of this exercise is the authentication to an MLflow tracking server via IAM Roles within SageMaker, leveraging the security that IAM carries with it.