Documentation
1 - Overview
Azure Kubernetes Service (AKS) is a managed Kubernetes platform on Azure. It provides various features that make it easy to get up and running on production-grade Kubernetes clusters. For more information about AKS, check out Introduction to Azure Kubernetes Service.
This project provides various deployment options for running and testing Kubeflow on AKS. To get started, check out our Deployment Options page.
2 - Contribution Guidelines
Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Trademarks
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft’s Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties’ policies.
3 - Deployment Options
Start by checking out the prerequisites page.
If you want to deploy Kubeflow with minimal changes on AKS, then consider the vanilla deployment option. The Kubeflow control plane is installed on Azure Kubernetes Service (AKS), which is a managed container service used to run and scale Kubernetes applications in the cloud.
For a more secure deployment that meets a minimum security baseline, consider the Deploy with TLS deployment option.
3.1 - Prerequisites
Kubeflow on AKS Prerequisites
For all Kubeflow on AKS deployment options, you will need the following:
- An Azure Subscription (e.g. Free or Student account)
⚠️ Warning: In order to complete the deployments, you will need to have either User Access Administrator and Contributor, or Owner, access to the subscription you are deploying into.
- The Azure CLI
- Bash shell (e.g. macOS, Linux, Windows Subsystem for Linux (WSL), Multipass, Azure Cloud Shell, GitHub Codespaces, devcontainers, etc). This repository comes with a .devcontainer folder that allows you to configure your Codespaces or devcontainers environment so that it has all the required Bash tools like kubelogin and the correct version of kustomize
- The following installed in your Bash shell if you are not going with the Codespaces or devcontainers option:
  - Kustomize
    - Install Kustomize:
      curl -s "https://raw.githubusercontent.com/kubernetes-sigs/kustomize/master/hack/install_kustomize.sh" | bash
      sudo mv ./kustomize /usr/local/bin/kustomize
    - Verify the installation:
      kustomize version
  - Kubelogin
    - To install both kubectl and kubelogin, use the Azure CLI:
      az aks install-cli
  - git
  - Bicep
  - Kubectl
  - sed (optional)
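The prerequisites above can be checked in one pass before starting a deployment. The following is a minimal sketch and is not part of this repository: the function names and the REQUIRED_TOOLS variable are illustrative assumptions. Bicep is left out of the list because it is commonly installed through the Azure CLI (az bicep install) rather than as a standalone binary.

```shell
# Print the name of every tool from the argument list that is not on PATH.
missing_tools() {
  for _tool in "$@"; do
    command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
  done
}

# Tools named in the prerequisites list (illustrative; adjust as needed).
REQUIRED_TOOLS="az kubectl kubelogin kustomize git"

# Report missing tools and return non-zero if any are absent.
report_prerequisites() {
  # REQUIRED_TOOLS is intentionally unquoted so it splits into words.
  _missing="$(missing_tools $REQUIRED_TOOLS)"
  if [ -n "$_missing" ]; then
    echo "Missing tools:"
    echo "$_missing"
    return 1
  fi
  echo "All required tools found."
}
```

Calling report_prerequisites prints either the missing tool names or a confirmation, and its exit status can gate the rest of a setup script.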
3.2 - Vanilla Installation
Background
In this lab, you will use the Azure CLI to deploy an Azure Kubernetes Service (AKS) Automatic cluster. AKS Automatic offers a simplified, managed Kubernetes experience with automated node management, scaling, and security configurations. For more details, see the AKS Automatic documentation. Note that AKS Automatic is currently in preview: it provides faster setup and less manual configuration, but it is not recommended for production use. For production workloads, or when advanced features and customization are required, use regular AKS instead. You will then install Kubeflow with the default settings using Kustomize and create a Jupyter notebook server that you can access from your browser.
You can follow these same instructions to deploy Kubeflow on a non-automatic AKS cluster.
Instructions for Basic Deployment without TLS and with Default Password
This deployment option is for testing only. To deploy with TLS, and change default password, please click here: Deploy kubeflow with TLS.
Deploy AKS Automatic
Use the Azure CLI to deploy an AKS Automatic cluster.
💡Note: In order to complete this deployment, you will need the following permissions on the resource group:
- Microsoft.Authorization/policyAssignments/write
- Microsoft.Authorization/policyAssignments/read
For detailed instructions on installing AKS Automatic, please refer to the AKS Automatic installation documentation.
Log in to the Azure CLI.
az login
az account set --subscription <NAME_OR_ID_OF_SUBSCRIPTION>
Set up your environment variables
RGNAME=kubeflow
CLUSTERNAME=kubeflow-aks-automatic
LOCATION=eastus
Create the resource group
az group create -n $RGNAME -l $LOCATION
Add or Update AKS extension
az extension add --name aks-preview
This article requires the aks-preview Azure CLI extension version 9.0.0b4 or later. If the extension is already installed, run az extension update --name aks-preview to update it.
Create an AKS Automatic cluster
az aks create \
--resource-group $RGNAME \
--name $CLUSTERNAME \
--location $LOCATION \
--sku automatic \
--generate-ssh-keys
💡Note: AKS Automatic is in preview and requires the AutomaticSKUPreview feature to be registered in your subscription before you create the cluster.
az feature register --namespace Microsoft.ContainerService --name AutomaticSKUPreview
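Feature registration is asynchronous, so it can help to wait until the feature reports Registered before creating the cluster. The sketch below polls az feature show and then refreshes the resource provider with az provider register. The function name and the default attempt count and delay are assumptions chosen for illustration.

```shell
# Poll until the AutomaticSKUPreview feature reports "Registered", then
# re-register the resource provider so the change propagates.
# Usage: wait_for_feature [ATTEMPTS] [DELAY_SECONDS]
wait_for_feature() {
  _ns="Microsoft.ContainerService"
  _feature="AutomaticSKUPreview"
  _attempts="${1:-30}"   # assumed default: 30 polls
  _delay="${2:-20}"      # assumed default: 20 seconds between polls
  _i=0
  while [ "$_i" -lt "$_attempts" ]; do
    _state="$(az feature show --namespace "$_ns" --name "$_feature" \
      --query properties.state -o tsv)"
    if [ "$_state" = "Registered" ]; then
      az provider register --namespace "$_ns"
      return 0
    fi
    echo "Feature state: ${_state:-unknown}; waiting..."
    sleep "$_delay"
    _i=$((_i + 1))
  done
  return 1
}
```

Running wait_for_feature after az feature register and before az aks create returns non-zero if the feature is still not registered after the last poll.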
Connect to AKS Automatic Cluster
After the cluster is created, you can connect to it using the Azure CLI. The following command retrieves the credentials for your AKS cluster and configures kubectl to use them.
az aks get-credentials --resource-group $RGNAME --name $CLUSTERNAME
Verify connectivity to the cluster. This should return a list of nodes.
kubectl get nodes
Deploy Kubeflow
Clone this repo, which includes the kubeflow/manifests repo as a Git submodule
git clone --recurse-submodules https://github.com/Azure/kubeflow-aks.git
The --recurse-submodules flag fetches the manifests from the Git submodule linked to this repo.
Change directory into the newly cloned directory
cd kubeflow-aks
Run Kubeflow Kustomize deployment
This deployment option is for testing only. To deploy with TLS, and change default password, please click here: Deploy kubeflow with TLS.
From the root of the repo, cd into Kubeflow’s manifests directory and make sure you are on the v1.10-branch.
cd manifests/
git checkout v1.10-branch
cd ..
Install all of the components via a single command
cp -a deployments/vanilla manifests/vanilla
cd manifests/
while ! kustomize build vanilla | kubectl apply --server-side=true -f -; do echo "Retrying to apply resources"; sleep 10; done
The --server-side=true flag helps with large CRDs that may exceed annotation size limits. The retry loop handles dependency ordering issues during installation.
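The loop above retries indefinitely, which can hide a persistent failure such as a wrong kubectl context. A bounded variant is sketched below. The retry and apply_vanilla helper names are hypothetical, and the attempt count and delay are up to the caller.

```shell
# retry MAX DELAY CMD...: run CMD until it succeeds, at most MAX times,
# sleeping DELAY seconds between attempts.
retry() {
  _max="$1"; _delay="$2"; shift 2
  _n=1
  until "$@"; do
    if [ "$_n" -ge "$_max" ]; then
      echo "Giving up after $_n attempts" >&2
      return 1
    fi
    echo "Attempt $_n failed; retrying in ${_delay}s" >&2
    _n=$((_n + 1))
    sleep "$_delay"
  done
}

# The build-and-apply pipeline wrapped in a function so retry can call it
# as a single command. Its status is that of kubectl apply.
apply_vanilla() {
  kustomize build vanilla | kubectl apply --server-side=true -f -
}

# Example (run from the manifests/ directory): retry 20 10 apply_vanilla
```

With this form the install either succeeds within the given number of attempts or exits non-zero, which is easier to handle in automation.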
Once the command has completed, check the pods are ready
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
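Instead of inspecting each namespace by hand, the readiness check can be scripted with kubectl wait. This is a sketch under assumptions: the function name and default timeout are illustrative. Pods that belong to completed Jobs never report Ready, and kubectl wait returns an error for a namespace with no pods, so a reported namespace should be inspected with kubectl get pods rather than treated as a definite failure.

```shell
# Namespaces listed in the step above.
KUBEFLOW_NAMESPACES="cert-manager istio-system auth knative-eventing knative-serving kubeflow kubeflow-user-example-com"

# Wait for all pods in each namespace and report namespaces that did not
# become ready within the timeout.
# Usage: wait_for_kubeflow_pods [TIMEOUT]
wait_for_kubeflow_pods() {
  _timeout="${1:-600s}"   # assumed default timeout per namespace
  _failed=""
  for _ns in $KUBEFLOW_NAMESPACES; do
    if ! kubectl wait --for=condition=Ready pods --all -n "$_ns" \
        --timeout="$_timeout" >/dev/null; then
      _failed="$_failed $_ns"
    fi
  done
  if [ -n "$_failed" ]; then
    echo "Not ready:$_failed"
    return 1
  fi
  echo "All namespaces ready"
}
```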
Access the Kubeflow dashboard
Run kubectl port-forward to access the Kubeflow dashboard
kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80
Finally, open http://localhost:8080 and log in with the default user’s credentials. The default email address is user@example.com and the default password is 12341234.
Testing the deployment with a Notebook server
You can test that the deployments worked by creating a new Notebook server using the GUI.
- Click on “Create a new Notebook”
- Click on “+ New Notebook” in the top right corner of the resulting page
- Enter a name for the server
- Leave the “jupyterlab” option selected
- Feel free to pick one of the images available, in this case we choose the default
- Set Requested CPU to 0.5 and requested memory in Gi to 1
- Under Data Volumes click on “+ Add new volume”
- Expand the resulting section
- Set the name to datavol-1. The default name provided would not work because it has characters that are not allowed
- Set the size in Gi to 1
- Uncheck “Use default class”
- Choose a class from the provided options. In this case I will choose azurefile-premium
- Choose ReadWriteMany as the Access mode. Your data volume config should look like the picture below
- Click on “Launch” at the bottom of the page. A successful deployment should have a green checkmark under status, after 1-2 minutes.
- Click on “Connect” to access your jupyter lab
- Under Notebook, click on Python 3 to access your jupyter notebook and start coding
Next steps
To connect to Kubeflow applications you need to set up HTTPS. The reason is that many of our web applications (e.g., Tensorboard Web Application, Jupyter Web Application, Katib UI) use Secure Cookies, so accessing Kubeflow with HTTP over a non-localhost domain does not work.
To set up HTTPS, see the Deploy with TLS deployment option.
3.3 - Authenticate Kubeflow users with Custom Password or Entra ID
Background
In this lab, you will update the Kubeflow vanilla installation option to configure authentication using either custom users and passwords or Microsoft Entra ID.
Change default password
To change the default password for the Kubeflow dashboard, you need to update the Dex configuration.
- First, generate a password hash by following the steps described in the Kubeflow docs, which use Python to generate a bcrypt hash. Alternatively, for simplicity, you can use an online tool like bcrypt-generator to create a new hash.
pip3 install passlib
python3 -c 'from passlib.hash import bcrypt; import getpass; print(bcrypt.using(rounds=12, ident="2y").hash(getpass.getpass()))'
Password: ***
$2y$12$XXXXXXXXXXXXXXXXXXX
- Delete existing password
kubectl delete secret dex-passwords -n auth
- Create new password secret
kubectl create secret generic dex-passwords --from-literal=DEX_USER_PASSWORD='REPLACE_WITH_HASH' -n auth
- Restart the Dex deployment to pick up the new password secret:
kubectl rollout restart deployment dex -n auth
To add more users
- Update the dex config map deployments/vanilla/dex-config-map.yaml with more entries in the user array:
staticPasswords:
- email: user@example.com
  hashFromEnv: DEX_USER_PASSWORD
  username: user
  userID: "15841185641784"
# Add more users here
- email: user2@example.com
  hashFromEnv: DEX_USER2_PASSWORD
  username: user2
  userID: "15841185641785"
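- Add DEX_USER2_PASSWORD to the dex-passwords secret with the new password hash. The add operation creates the key if it does not exist yet, and the base64 output is stripped of line breaks.
kubectl patch secret dex-passwords -n auth --type='json' -p='[{"op": "add", "path": "/data/DEX_USER2_PASSWORD", "value":"'"$(echo -n 'REPLACE_WITH_HASH' | base64 | tr -d '\n')"'"}]'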
- Apply config map and restart deployment
kubectl apply -f deployments/vanilla/dex-config-map.yaml
kubectl rollout restart deployment dex -n auth
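When several users are added, building the patch document by hand becomes error-prone. The following sketch generates the JSON Patch for one secret key. Both function names are hypothetical. It uses the JSON Patch add operation, which creates the key when it is missing and overwrites it when present, and it strips newlines because some base64 implementations wrap long output.

```shell
# Base64-encode a string onto a single line. A 60-character bcrypt hash
# encodes to 80 characters, which some base64 tools would wrap.
b64_oneline() {
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Print a JSON Patch document that sets one key of a Secret's data.
# Usage: secret_patch KEY VALUE
secret_patch() {
  printf '[{"op": "add", "path": "/data/%s", "value": "%s"}]' \
    "$1" "$(b64_oneline "$2")"
}

# Example:
#   kubectl patch secret dex-passwords -n auth --type=json \
#     -p "$(secret_patch DEX_USER2_PASSWORD 'REPLACE_WITH_HASH')"
```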
Note: If you need to update the default email address, change the params file located at manifests/common/user-namespace/base/params.env before installing Kubeflow.
Entra ID Configuration
3.4 - Deploy Kubeflow with Password, Ingress and TLS
Background
In this lab, you will use the Azure CLI to deploy an Azure Kubernetes Service (AKS) Automatic cluster. AKS Automatic offers a simplified, managed Kubernetes experience with automated node management, scaling, and security configurations. For more details, see the AKS Automatic documentation. Note that AKS Automatic is currently in preview: it provides faster setup and less manual configuration, but it is not recommended for production use. For production workloads, or when advanced features and customization are required, use regular AKS instead. You will then install Kubeflow with a custom password and TLS configuration. This deployment option uses a self-signed certificate and an ingress controller. Replace the self-signed certificate with your own CA certs for production workloads.
You can follow these same instructions to deploy Kubeflow on a non-automatic AKS cluster.
Deploy AKS Automatic
Use the Azure CLI to deploy an AKS Automatic cluster.
💡Note: In order to complete this deployment, you will need the following permissions on the resource group:
- Microsoft.Authorization/policyAssignments/write
- Microsoft.Authorization/policyAssignments/read
For detailed instructions on installing AKS Automatic, please refer to the AKS Automatic installation documentation.
Log in to the Azure CLI.
az login
az account set --subscription <NAME_OR_ID_OF_SUBSCRIPTION>
Set up your environment variables
RGNAME=kubeflow
CLUSTERNAME=kubeflow-aks-automatic
LOCATION=eastus
Create the resource group
az group create -n $RGNAME -l $LOCATION
Add or Update AKS extension
az extension add --name aks-preview
This article requires the aks-preview Azure CLI extension version 9.0.0b4 or later. If the extension is already installed, run az extension update --name aks-preview to update it.
Create an AKS Automatic cluster
az aks create \
--resource-group $RGNAME \
--name $CLUSTERNAME \
--location $LOCATION \
--sku automatic \
--generate-ssh-keys
💡Note: AKS Automatic is in preview and requires the AutomaticSKUPreview feature to be registered in your subscription before you create the cluster.
az feature register --namespace Microsoft.ContainerService --name AutomaticSKUPreview
Connect to AKS Automatic Cluster
After the cluster is created, you can connect to it using the Azure CLI. The following command retrieves the credentials for your AKS cluster and configures kubectl to use them.
az aks get-credentials --resource-group $RGNAME --name $CLUSTERNAME
Verify connectivity to the cluster. This should return a list of nodes.
kubectl get nodes
Deploy Kubeflow with Password, Ingress and TLS
Clone this repo, which includes the kubeflow/manifests repo as a Git submodule
git clone --recurse-submodules https://github.com/Azure/kubeflow-aks.git
The --recurse-submodules flag fetches the manifests from the Git submodule linked to this repo.
Change directory into the newly cloned directory
cd kubeflow-aks
From the root of the repo, ensure you’re using the v1.10-branch:
cd manifests/
git checkout v1.10-branch
cd ..
- Copy the TLS deployment files:
cp -a deployments/tls manifests/tls
Configure Custom password
In the next steps, generate a password hash for your custom password and insert it into the dex-passwords.yaml file.
First, generate the password hash by following the steps described in the Kubeflow docs, which use Python to generate a bcrypt hash. Alternatively, for simplicity, you can use an online tool like bcrypt-generator to create a new hash.
pip3 install bcrypt
PASSWORD="your_custom_password"
PASSWORD_HASH=$(PASSWORD="$PASSWORD" python3 -c "import os, bcrypt; print(bcrypt.hashpw(os.environ['PASSWORD'].encode(), bcrypt.gensalt()).decode())")
Update the password hash in the manifests/tls/dex-passwords.yaml secret:
sed -i "s|<YOUR_DEX_USER_PASSWORD>|$PASSWORD_HASH|g" manifests/tls/dex-passwords.yaml
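The substitution uses | as the sed delimiter because bcrypt hashes can contain / characters. The sketch below performs the same replacement and then verifies that the placeholder is gone and the hash is present. The set_dex_hash name is hypothetical, and writing through a temporary file is an assumption made to avoid sed -i, whose syntax differs between GNU and BSD sed.

```shell
# Replace the <YOUR_DEX_USER_PASSWORD> placeholder in FILE with HASH and
# confirm the result. Usage: set_dex_hash FILE HASH
set_dex_hash() {
  _file="$1"; _hash="$2"
  # Write to a temporary file, then move it into place.
  sed "s|<YOUR_DEX_USER_PASSWORD>|$_hash|g" "$_file" > "$_file.tmp" \
    && mv "$_file.tmp" "$_file" || return 1
  # Succeed only if the placeholder is gone and the hash is in the file.
  ! grep -q '<YOUR_DEX_USER_PASSWORD>' "$_file" \
    && grep -qF -- "$_hash" "$_file"
}

# Example: set_dex_hash manifests/tls/dex-passwords.yaml "$PASSWORD_HASH"
```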
Install Kubeflow
- Deploy Kubeflow
cd manifests/
while ! kustomize build tls | kubectl apply --server-side=true -f -; do echo "Retrying to apply resources"; sleep 10; done
The --server-side=true flag helps with large CRDs that may exceed annotation size limits. The retry loop handles dependency ordering issues during installation.
- Once the command has completed, check the pods are ready
kubectl get pods -n cert-manager
kubectl get pods -n istio-system
kubectl get pods -n auth
kubectl get pods -n knative-eventing
kubectl get pods -n knative-serving
kubectl get pods -n kubeflow
kubectl get pods -n kubeflow-user-example-com
Expose the Kubeflow dashboard using Ingress with TLS
There are a couple of options for exposing your Kubeflow cluster over HTTPS. See the “NodePort / LoadBalancer / Ingress” note in the Kubeflow docs. In this example we will use the nginx ingress controller, which is included as part of the app-routing-system addon in AKS Automatic.
Step 1: Create TLS Certificate
We can create a self-signed certificate for Kubeflow using either the IP address of the Nginx ingress LoadBalancer or an assigned DNS label.
Step 1: Find IP or DNS Label of Nginx ingress
- Obtain Nginx IP
NGINX_IP=$(kubectl get svc -n app-routing-system -o jsonpath='{.items[?(@.spec.type=="LoadBalancer")].status.loadBalancer.ingress[0].ip}')
echo "Nginx IP: $NGINX_IP"
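The jsonpath query returns an empty string while the load balancer is still provisioning, and several space-separated addresses if more than one LoadBalancer service exists in the namespace. A quick shape check before templating the certificate can catch both cases. The is_ipv4 name is hypothetical, and the check validates the dotted-quad shape only, not the numeric range of each octet.

```shell
# Return 0 if the argument is a single dotted-quad string such as 20.1.2.3.
# This is a shape check only: it does not verify that each octet is <= 255.
is_ipv4() {
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'
}

# Example:
#   is_ipv4 "$NGINX_IP" || echo "NGINX_IP is empty or not a single address"
```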
- Optional: Use Azure DNS for a friendly URL
You can also configure a custom domain name assigned to the Nginx ingress service using Azure DNS:
kubectl annotate service nginx -n app-routing-system \
service.beta.kubernetes.io/azure-dns-label-name=my-kubeflow-cluster
This will make Kubeflow accessible at: my-kubeflow-cluster.$LOCATION.cloudapp.azure.com
Step 2: Create TLS Certificate
If using the IP address, create the following certificate. It is created in the istio-system namespace so that the TLS secret is available to the Ingress defined there:
echo "apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-tls-cert
  namespace: istio-system
spec:
  secretName: kubeflow-tls-secret
  ipAddresses:
  - $NGINX_IP
  isCA: false
  issuerRef:
    name: kubeflow-self-signing-issuer
    kind: ClusterIssuer
    group: cert-manager.io" | kubectl apply -f -
If using a DNS label, use the following definition (replace my-kubeflow-cluster with your unique DNS label). It is created in the istio-system namespace so that the TLS secret is available to the Ingress defined there:
echo "apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: kubeflow-tls-cert
  namespace: istio-system
spec:
  secretName: kubeflow-tls-secret
  dnsNames:
  - my-kubeflow-cluster.$LOCATION.cloudapp.azure.com
  isCA: false
  issuerRef:
    name: kubeflow-self-signing-issuer
    kind: ClusterIssuer
    group: cert-manager.io" | kubectl apply -f -
- Deploy the certificate:
kubectl apply -f tls/certificate.yaml
Wait for the certificate to be ready:
kubectl wait --for=condition=Ready certificate/kubeflow-tls-cert -n istio-system --timeout=300s
Step 3: Configure Ingress
Create and apply an ingress manifest to expose the Kubeflow components:
echo 'apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: kubeflow-ingress
  namespace: istio-system
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "false"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
spec:
  ingressClassName: webapprouting.kubernetes.azure.com
  tls:
  - secretName: kubeflow-tls-secret
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: istio-ingressgateway
            port:
              number: 80' | kubectl apply -f -
Verify Ingress:
kubectl get ingress kubeflow-ingress -n istio-system
Wait for the ADDRESS field to show an external IP address (this may take a few minutes).
NAME CLASS HOSTS ADDRESS PORTS AGE
kubeflow-ingress webapprouting.kubernetes.azure.com * xxx.149.0.222 443 16m
Access Kubeflow Dashboard
You can now access the Kubeflow dashboard at https://$NGINX_IP or, if DNS was configured, at https://my-kubeflow-cluster.$LOCATION.cloudapp.azure.com.
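Before opening a browser, the endpoint can be probed from the shell. The sketch below builds the dashboard URL from either the DNS label or the IP address and prints the HTTP status code that curl receives. Both function names are hypothetical. The -k flag skips certificate verification, which is needed here only because the certificate is self-signed.

```shell
# Build the dashboard URL: use the DNS label when one is set, else the IP.
# Usage: kubeflow_url DNS_LABEL LOCATION IP
kubeflow_url() {
  _label="$1"; _location="$2"; _ip="$3"
  if [ -n "$_label" ]; then
    printf 'https://%s.%s.cloudapp.azure.com\n' "$_label" "$_location"
  else
    printf 'https://%s\n' "$_ip"
  fi
}

# Print the HTTP status code returned by the given URL.
# curl prints 000 when no HTTP response was received at all.
dashboard_status() {
  curl -k -s -o /dev/null -w '%{http_code}' "$1"
}

# Example: dashboard_status "$(kubeflow_url "" "$LOCATION" "$NGINX_IP")"
```

Any status other than 000 indicates that the ingress answered the request; the login itself is still best verified in the browser.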
Log in using:
- Email: user@example.com (or the email you configured)
- Password: The password you used to generate the hash
Testing the deployment with a Notebook server
You can test that the deployments worked by creating a new Notebook server using the GUI.
- Click on “Create a new Notebook” on the Kubeflow dashboard
- Click on “+ New Notebook” in the top right corner of the resulting page
- Enter a name for the server
- Leave the “jupyterlab” option selected
- Feel free to pick one of the images available, in this case we choose the default
- Set Requested CPU to 0.5 and requested memory in Gi to 1
- Under Data Volumes click on “+ Add new volume”
- Expand the resulting section
- Set the name to datavol-1. The default name provided would not work because it has characters that are not allowed
- Set the size in Gi to 1
- Uncheck “Use default class”
- Choose a class from the provided options. In this case I will choose “azurefile-premium”
- Choose ReadWriteMany as the Access mode. Your data volume config should look like the picture below
- Click on “Launch” at the bottom of the page. A successful deployment should have a green checkmark under status, after 1-2 minutes.
- Click on “Connect” to access your jupyter lab
- Under Notebook, click on Python 3 to access your jupyter notebook and start coding
Destroy the resources
Run the command below to destroy the resources you just created after you are done testing
az group delete -n $RGNAME