Deploy and take Flyte with an end-to-end ML orchestration solution on AKS

Learn about the open-source Flyte tools and how to deploy them securely on Azure Kubernetes Service (AKS)

Posted by Sachi Desai on November 20, 2024 (5 minute read)

Background

Data is often at the heart of application design and development - it fuels user-centric design, provides insights for feature enhancements, and represents the value of an application as a whole. In that case, shouldn’t we use data science tools and workflows that are flexible and scalable on a platform like Kubernetes, for a range of application types?

In collaboration with David Espejo and Shalabh Chaudhri from Union.ai, we’ll dive into an example using Flyte, a platform built on Kubernetes itself. Flyte can help you manage and scale out data processing and machine learning pipelines through a simple user interface.

What is a Flyte cluster, and who uses it?

A Flyte cluster provides you with an API endpoint to register, compile, and execute ML workflows on Kubernetes. It’s where the main Flyte components (user plane, data plane, and control plane) run, either in a single Pod or across multiple Pods.

Adopters of Flyte include organizations running large-scale data operations and ML operations, including social media platforms, music/video streaming services, and bioinformatics companies.

With a fully configured Flyte cluster, you can:

Process and visualize large, dynamic data sets and ensure up-to-date information retrieval
Train ML models with versioning, enabling reproducible experimentation
Leverage high levels of parallelism for efficient computations
… and more!

Starting with this reference implementation, you’ll be able to take a basic Azure resource group and build an end-to-end solution on a working Flyte cluster.

This example deploys Flyte as an open-source tool on your cluster. Currently, it’s not a managed feature on Azure Kubernetes Service (AKS).

Let’s get started

Before you begin, take a look at the prerequisites on Azure, including:

Azure subscription with at least Contributor role over all your resources using Azure RBAC.
Azure CLI version 2.0 or later installed and configured.
Terraform version 1.3.7 or later installed.
Helm version 3.15.4 or later installed.
Kubernetes command-line client, kubectl, installed and configured.

Now you can deploy the dependencies and install Flyte by following these steps, then come back to this blog for a deep dive into your solution on AKS!

Which Flyte back-end components will be installed, and what do they do?

In your end-to-end solution, each of the following Flyte backend components will run in its own pod:

Flyte component	Description
Data catalog	Service that simplifies data indexing and allows you to query data artifacts based on metadata and/or tags. Note that it’s only used when you enable caching on a Task.
Flyte pod webhook	Deployment that creates the Webhook Pod and is used to inject secrets into Task pods.
Flyte admin	Main Flyte API that processes client requests (see the API specification).
Flyte console	Web user interface for the Flyte platform, hosted in the same Flyte cluster as the Admin API.
Flyte propeller	Core engine that executes workflows within the Flyte data plane.
Flyte scheduler	Cloud-agnostic native scheduler for fixed-rate and cron-based schedules, defined at initialization time for your workflow and activated or deactivated using the FlyteAdmin API.
Sync resources	Type of agent that enables request/response services (e.g. APIs) to return outputs.

Once you’ve generated and applied the reference Flyte Terraform plan, you’ll receive an endpoint for your Flyte cluster, and you can verify the following pod statuses:

kubectl get pods -n flyte

NAME                                 READY   STATUS    RESTARTS   AGE
datacatalog-6864645db6-99msb         1/1     Running   0          6m45s
flyte-pod-webhook-848d7db899-8wltj   1/1     Running   0          6m45s
flyteadmin-6cc67b49b4-cmt7j          1/1     Running   0          6m45s
flyteconsole-68f677797f-p4s98        1/1     Running   0          6m45s
flytepropeller-b88f7bf6d-lqc8s       1/1     Running   0          6m45s
flytescheduler-844db4658c-hfrhv      1/1     Running   0          6m45s
syncresources-767d7fc77b-5mj6n       1/1     Running   0          6m45s
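
If you would rather script this check than read the table by eye, the sketch below parses text in the same shape as the kubectl output above and lists any pod that is not fully ready. It is a minimal Python illustration rather than part of the reference implementation: the function name unhealthy_pods and the Pending row in the sample are hypothetical.

```python
def unhealthy_pods(kubectl_output: str) -> list[str]:
    """Return the names of pods that are not Running or not fully ready.

    Expects the default column layout of `kubectl get pods`:
    NAME, READY (as "ready/total"), STATUS, ...
    """
    problems = []
    lines = [ln for ln in kubectl_output.strip().splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append(name)
    return problems


# Sample text: the first data row is copied from the output above,
# the second row is a made-up failure case for illustration.
SAMPLE = """\
NAME                                 READY   STATUS    RESTARTS   AGE
datacatalog-6864645db6-99msb         1/1     Running   0          6m45s
flytepropeller-b88f7bf6d-lqc8s       0/1     Pending   0          6m45s
"""

print(unhealthy_pods(SAMPLE))  # ['flytepropeller-b88f7bf6d-lqc8s']
```

An empty list means every listed pod is Running with all of its containers ready.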

Which AKS features are built into my Flyte solution?

Storage

Two Azure storage containers are created: one container stores the Terraform state and the other container stores both metadata and raw data queued up to be processed by Flyte.

Your metadata might consist of task inputs/outputs, data serialization format (protocol buffers), etc., while raw data will be unprocessed data and large objects that execution pods read from/write to the storage container.

Compute

You’ll start with a single-instance Standard_D2_v2 CPU node pool, with cluster autoscaler enabled on your AKS cluster to meet workload demands. As your workflow or application submits an increasing number of Flyte tasks, cluster autoscaler will watch for pending pods and scale up your node pool size (in this case, to a maximum of 10 nodes) to minimize downtime due to resource constraints.
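
To build some intuition for the scale-up behavior described above, here is a deliberately simplified Python model. It is not the cluster autoscaler's actual algorithm, which simulates scheduling against each pod's resource requests. The function name and the pods_per_node parameter are assumptions made for illustration; only the cap of 10 nodes comes from this reference implementation.

```python
import math


def desired_node_count(current_nodes: int, pending_pods: int,
                       pods_per_node: int, max_nodes: int = 10,
                       min_nodes: int = 1) -> int:
    """Toy estimate of the node pool size after a scale-up.

    Assumes every pending pod is the same size and that `pods_per_node`
    of them fit on one new node (a hypothetical simplification).
    """
    extra_nodes = math.ceil(pending_pods / pods_per_node)
    # Clamp the result to the node pool's configured bounds.
    return max(min_nodes, min(max_nodes, current_nodes + extra_nodes))


print(desired_node_count(current_nodes=1, pending_pods=9, pods_per_node=4))   # 4
print(desired_node_count(current_nodes=8, pending_pods=20, pods_per_node=4))  # 10
```

In the second call the uncapped estimate would be 13 nodes, so the result is held at the pool maximum and the remaining pods stay pending until capacity frees up.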

To create a GPU-enabled node pool using your aks.tf configuration file, set gpu_node_pool_count and gpu_machine_type in the locals block to your desired node count and instance type, respectively. If you specify an accelerator type, make sure to select an option that flytekit supports.

Identity management

Workload Identity Federation with Entra ID enables fine-grained security controls to be applied. The flytepropeller, flyteadmin, and datacatalog backend components use one user-assigned managed identity (MI), while the Flyte task execution pods use a separate user-assigned MI.

Networking and security

Ingress resources are automatically configured to simplify access through a single endpoint. TLS certificates are also automated using cert-manager to secure the communication.

This reference implementation uses open-source nginx for ingress, since Flyte relies on both HTTP and gRPC: HTTP for communication between the user and the control plane, and gRPC for serializing workflow code.

Automatic cluster upgrades

AKS cluster auto-upgrade is enabled by default to minimize workload downtime and keep the cluster up to date with the latest AKS patches. The reference Flyte Terraform plan sets automatic_upgrade_channel = "stable", ensuring that the AKS cluster it creates always remains on a supported version (i.e., within the N-2 rule).
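
As a rough illustration of the N-2 rule, the Python sketch below checks whether a cluster's minor version falls within the latest generally available minor version (N) and the two before it. This reading of the rule, the function name, and the version numbers used are assumptions for illustration; check the AKS documentation for the versions currently supported in your region.

```python
def is_within_n_minus_2(cluster_version: str, latest_ga_version: str) -> bool:
    """Return True if the cluster's minor version is N, N-1, or N-2.

    Versions are "major.minor" or "major.minor.patch" strings; the patch
    component is ignored because the rule is about minor versions.
    """
    c_major, c_minor = (int(part) for part in cluster_version.split(".")[:2])
    l_major, l_minor = (int(part) for part in latest_ga_version.split(".")[:2])
    if c_major != l_major:
        return False
    return 0 <= l_minor - c_minor <= 2


# Hypothetical version numbers, chosen only to exercise both outcomes.
print(is_within_n_minus_2("1.28.5", "1.30.0"))  # True
print(is_within_n_minus_2("1.27.9", "1.30.0"))  # False
```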

Container image management

Thanks to the integration with Workload Identity Federation, Azure Container Registry (ACR) is created out-of-box with permissions for you to securely:

Push all workflow output images to a private ACR.
Pull your custom images through task execution pods from ACR (otherwise, a default container image is used for each Flyte workflow execution).

Monitoring

Container Insights is configured out of the box, with system namespace logs and Flyte user workload logs ingested through the Azure Monitor pipeline, to help you:

Troubleshoot deployment issues faster through the Azure portal.
Query historical logs using Log Analytics.

Check out my Flyte environment on AKS

By following these steps, you can update the cluster endpoint in your Flyte config, execute a sample workflow, and verify its successful execution in your Flyte console.

What next?

Check out more ML and feature engineering tutorials with Flyte
Learn about best practices for your MLOps pipelines on Azure Kubernetes Service (AKS)
Explore the Flyte GitHub repository: https://github.com/flyteorg/flyte
Join the #flyte-on-azure Slack channel: https://slack.flyte.org

Acknowledgements

Shoutout to Erin Schaffer for her contributions to and feedback on this blog!
