Azure Availability Zones provide high availability by offering independent locations within a region, each equipped with its own power, cooling, and networking, so that applications and data are protected from datacenter-level failures.
AKS assigns the kubernetes.azure.com/mode: system label to nodes in system node pools, signaling that system pods should preferably be scheduled there. You can add the CriticalAddonsOnly=true:NoSchedule taint to your system nodes to prevent application pods from being scheduled on them.
Configure Azure CNI networking for dynamic allocation of IPs or use CNI overlay
Impact: Medium
Category: Scalability
APRL GUID: c22db132-399b-4e7c-995d-577a60881be8
Description:
Azure CNI improves cluster IP and network management: it allows dynamic IP allocation, scalable subnets, and direct pod-to-VNET connectivity, and it supports network policies for pods and nodes through Azure Network Policies and Calico, which improves network efficiency and security.
Potential Benefits:
Dynamic IP allocation, scalable subnets, direct VNET access
Enable the cluster auto-scaler on an existing cluster
Impact: High
Category: Scalability
APRL GUID: 902c82ff-4910-4b61-942d-0d6ef7f39b67
Description:
The cluster auto-scaler in AKS adjusts node counts based on pod resource needs and available capacity, enabling scaling as per demand to prevent outages.
AKS, popular for stateful apps needing backups, can now use Azure Backup to secure clusters and attached volumes through an installed Backup Extension, enabling backup and restore operations via a Backup Vault.
Azure Resource Graph query:
// Azure Resource Graph Query
// Find AKS clusters that do not have backup enabled
resources
| where type =~ 'Microsoft.ContainerService/managedClusters'
| extend lname = tolower(name)
| join kind=leftouter (
    recoveryservicesresources
    | where type =~ 'microsoft.dataprotection/backupvaults/backupinstances'
    | extend lname = tolower(tostring(split(properties.dataSourceInfo.resourceID, '/')[8]))
    | extend protectionState = properties.currentProtectionState
    | project lname, protectionState
) on lname
| where protectionState != 'ProtectionConfigured'
| extend param1 = iif(isnull(protectionState), 'Protection Not Configured', strcat('Protection State: ', protectionState))
| project recommendationId = "269a9f1a-6675-460a-831e-b05a887a8c4b", name, id, tags, param1
Use zone-redundant storage for persistent volumes when running multi-zone AKS
Impact: Medium
Category: High Availability
APRL GUID: d3111036-355d-431b-ab49-8ddad042800b
Description:
ZRS ensures data replication across three zones, protecting against zonal outages. It's available for Azure Disks, Container Storage, Files, and Blob by setting the SKU to ZRS in storage classes, enhancing multi-zone AKS clusters from v1.29.
Azure Resource Graph query:
// cannot-be-validated-with-arg
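As a minimal sketch of the storage class setting described above, the following Python snippet builds a StorageClass manifest for the Azure Disk CSI driver with a ZRS SKU and prints it as JSON, which kubectl also accepts. The class name `managed-csi-zrs` and the choice of `StandardSSD_ZRS` are illustrative assumptions, not values taken from this recommendation.

```python
import json


def zrs_disk_storage_class(name: str, sku: str = "StandardSSD_ZRS") -> dict:
    """Build a StorageClass manifest that requests zone-redundant Azure Disks."""
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},  # hypothetical name, chosen for illustration
        "provisioner": "disk.csi.azure.com",  # Azure Disk CSI driver
        "parameters": {"skuName": sku},
        "reclaimPolicy": "Delete",
        # Delay volume binding until a pod that uses the claim is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


storage_class = zrs_disk_storage_class("managed-csi-zrs")
print(json.dumps(storage_class, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`; persistent volume claims then reference the class by name.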
Upgrade Persistent Volumes using in-tree drivers to Azure CSI drivers
Impact: High
Category: Governance
APRL GUID: b002c030-72e6-4a37-8217-1cb276c43169
Description:
From Kubernetes 1.26, Azure Disk and Azure File in-tree drivers are deprecated in favor of CSI drivers. Existing deployments remain operational but untested; users should switch to CSI drivers for new features and SKUs.
Azure Resource Graph query:
// cannot-be-validated-with-arg
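Because this check cannot be run through Azure Resource Graph, one way to look for in-tree usage is to inspect the cluster's storage classes directly. The sketch below assumes the input is shaped like the `items` list returned by `kubectl get storageclass -o json`; the sample data and the function name are hypothetical.

```python
# In-tree provisioner names and the CSI drivers that replace them.
IN_TREE_TO_CSI = {
    "kubernetes.io/azure-disk": "disk.csi.azure.com",
    "kubernetes.io/azure-file": "file.csi.azure.com",
}


def find_in_tree_storage_classes(storage_classes: list) -> list:
    """Return (storage class name, suggested CSI provisioner) for in-tree classes."""
    findings = []
    for sc in storage_classes:
        provisioner = sc.get("provisioner", "")
        if provisioner in IN_TREE_TO_CSI:
            findings.append((sc["metadata"]["name"], IN_TREE_TO_CSI[provisioner]))
    return findings


# Hypothetical sample input for illustration only.
sample = [
    {"metadata": {"name": "legacy-disk"}, "provisioner": "kubernetes.io/azure-disk"},
    {"metadata": {"name": "managed-csi"}, "provisioner": "disk.csi.azure.com"},
]
print(find_in_tree_storage_classes(sample))  # [('legacy-disk', 'disk.csi.azure.com')]
```

This only flags storage classes; persistent volumes that were already provisioned by an in-tree driver need to be reviewed separately.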
Update AKS tier to Standard or Premium
Impact: High
Category: High Availability
APRL GUID: 0611251f-e70f-4243-8ddd-cfe894bec2e7
Description:
Production AKS clusters require the Standard or Premium tier for a financially backed SLA and enhanced node scalability, because the Free tier lacks these features. Use the Premium tier for mission-critical workloads.
Azure Monitor enables real-time health and performance insights for AKS by collecting events, capturing container logs, and gathering CPU/Memory data from the Metrics API. It allows data visualization using Azure Monitor Container Insights, Prometheus, Grafana, or others.
Ephemeral OS disks on AKS offer lower read/write latency due to local attachment, eliminating the need for replication seen with managed disks. This enhances performance and speeds up cluster operations such as scaling or upgrading due to quicker re-imaging and boot times.
Enable and remediate Azure Policies configured for AKS
Impact: Low
Category: Governance
APRL GUID: 26ebaf1f-c70d-4ebd-8641-4b60a0ce0094
Description:
Azure Policies in AKS clusters help enforce governance best practices concerning security, authentication, provisioning, networking, and more, ensuring a robust and secure environment for operations.
Use pod topology spread constraints to ensure that pods are spread across different nodes or zones
Impact: High
Category: High Availability
APRL GUID: 928fcc6f-5e9a-42d9-9bd4-260af42de2e5
Description:
Enhance availability and reliability by using pod topology spread constraints to control pod distribution based on node or zone topology, ensuring pods are spread across your cluster.
Azure Resource Graph query:
// cannot-be-validated-with-arg
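To make the spread constraints described above more concrete, here is a hedged Python sketch that builds a `topologySpreadConstraints` fragment for a pod spec and computes the skew of a given pod placement. The `app: web` label and the zone names are placeholders; `maxSkew`, `topologyKey`, `whenUnsatisfiable`, and `labelSelector` are the standard Kubernetes field names.

```python
import json
from collections import Counter


def spread_constraint(topology_key: str, app_label: str, max_skew: int = 1) -> dict:
    """One entry for a pod spec's topologySpreadConstraints list."""
    return {
        "maxSkew": max_skew,
        "topologyKey": topology_key,
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def skew(pod_domains: list, all_domains: list) -> int:
    """Difference in pod count between the fullest and emptiest domain."""
    counts = Counter(pod_domains)
    per_domain = [counts.get(domain, 0) for domain in all_domains]
    return max(per_domain) - min(per_domain)


# Spread across zones first, then across individual nodes.
pod_spec_fragment = {
    "topologySpreadConstraints": [
        spread_constraint("topology.kubernetes.io/zone", "web"),
        spread_constraint("kubernetes.io/hostname", "web"),
    ]
}
print(json.dumps(pod_spec_fragment, indent=2))

zones = ["zone-1", "zone-2", "zone-3"]  # placeholder zone names
print(skew(["zone-1", "zone-1", "zone-2"], zones))  # 2
```

A placement with a skew of 2 would exceed a `maxSkew` of 1, so with `DoNotSchedule` the scheduler would place the next matching pod only where the skew stays within the limit.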
Configure pod liveness, readiness, and startup probes
Impact: High
Category: High Availability
APRL GUID: cd6791b1-c60e-4b37-ac98-9897b1e6f4b8
Description:
The kubelet on AKS nodes uses liveness probes to check the health of containers and applications, so the system knows when to restart a container based on its health status.
Azure Resource Graph query:
// cannot-be-validated-with-arg
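As an illustration of the three probe types named in this recommendation, the following Python sketch builds a container spec with HTTP startup, liveness, and readiness probes. The container name, image, port, and the `/healthz` and `/ready` paths are hypothetical; the probe field names are the standard Kubernetes ones.

```python
import json


def http_probe(path: str, port: int, period_seconds: int, failure_threshold: int) -> dict:
    """Build an HTTP GET probe definition."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_seconds,
        "failureThreshold": failure_threshold,
    }


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time the container gets to start before it is restarted."""
    return probe["periodSeconds"] * probe["failureThreshold"]


container = {
    "name": "web",  # hypothetical container name
    "image": "example.azurecr.io/web:1.0",  # hypothetical image
    "ports": [{"containerPort": 8080}],
    # Startup probe: holds off the other probes until the app has started.
    "startupProbe": http_probe("/healthz", 8080, period_seconds=10, failure_threshold=30),
    # Liveness probe: repeated failures cause the kubelet to restart the container.
    "livenessProbe": http_probe("/healthz", 8080, period_seconds=10, failure_threshold=3),
    # Readiness probe: failures remove the pod from Service endpoints.
    "readinessProbe": http_probe("/ready", 8080, period_seconds=5, failure_threshold=3),
}
print(json.dumps(container, indent=2))
print(startup_budget_seconds(container["startupProbe"]))  # 300
```

Giving slow-starting applications a generous startup probe lets the liveness probe stay strict without restarting containers that are still initializing.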
Use deployments with multiple replicas in production applications to guarantee availability
Impact: High
Category: High Availability
APRL GUID: bcfe71f1-ebed-49e5-a84a-193b81ad5d27
Description:
Configuring multiple replicas in a Deployment or ReplicaSet manifest keeps a stable set of replica Pods running, so that a specified number of identical Pods is available at any given time.
Configuring the user node pool with at least two nodes is essential for applications needing high availability, ensuring they remain operational and accessible without interruption.
A Pod Disruption Budget is a Kubernetes resource that sets the minimum number or percentage of pods that must remain available during voluntary disruptions such as maintenance or scaling.
Azure Resource Graph query:
// cannot-be-validated-with-arg
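A Pod Disruption Budget of the kind described above could look roughly like the manifest built below. This is a sketch under stated assumptions: the budget name and the `app: web` label are hypothetical, and the helper only handles an integer `minAvailable`, not a percentage.

```python
import json


def pod_disruption_budget(name: str, app_label: str, min_available: int) -> dict:
    """Build a policy/v1 PodDisruptionBudget manifest with an integer minAvailable."""
    return {
        "apiVersion": "policy/v1",
        "kind": "PodDisruptionBudget",
        "metadata": {"name": name},
        "spec": {
            "minAvailable": min_available,
            "selector": {"matchLabels": {"app": app_label}},
        },
    }


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that may be evicted right now without violating the budget."""
    return max(0, healthy_pods - min_available)


pdb = pod_disruption_budget("web-pdb", "web", min_available=2)
print(json.dumps(pdb, indent=2))
print(allowed_disruptions(healthy_pods=3, min_available=2))  # 1
```

With three healthy replicas and `minAvailable: 2`, a node drain can evict one pod at a time; with only two healthy replicas, evictions are blocked until another pod becomes ready.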
Nodepool subnet size needs to accommodate maximum auto-scale settings
Impact: High
Category: High Availability
APRL GUID: e620fa98-7a40-41a0-bfc9-b4407297fb58
Description:
Nodepool subnets sized for max auto-scale settings enable AKS to efficiently scale out nodes, meeting increased demand while reducing resource constraints and potential service disruptions.
Azure Resource Graph query:
// Azure Resource Graph Query
// Returns each AKS cluster with nodepools that have user nodepools with a subnetmask that does not match autoscale configured max-nodes
// Subtracting the network address, broadcast address, and default 3 addresses Azure reserves within each subnet
resources
| where type == "microsoft.containerservice/managedclusters"
| extend nodePools = properties['agentPoolProfiles']
| mv-expand nodePools = properties.agentPoolProfiles
| where nodePools.enableAutoScaling == true
| extend nodePoolName = nodePools.name, maxNodes = nodePools.maxCount, subnetId = tostring(nodePools.vnetSubnetID)
| project clusterId = id, clusterName = name, nodePoolName = nodePools.name, toint(maxNodes), subnetId
| join kind=leftouter (
    resources
    | where type == 'microsoft.network/virtualnetworks'
    | extend subnets = properties.subnets
    | mv-expand subnets
    | project id = tostring(subnets.id), addressPrefix = tostring(subnets.properties['addressPrefix'])
    | extend subnetmask = toint(substring(addressPrefix, indexof(addressPrefix, '/') + 1, string_size(addressPrefix)))
    | extend possibleMaxNodeCount = toint(exp2(32 - subnetmask) - 5)
) on $left.subnetId == $right.id
| project-away id, subnetmask
| where possibleMaxNodeCount <= maxNodes
| extend param1 = strcat(nodePoolName, " autoscaler upper limit: ", maxNodes)
| extend param2 = strcat("ip addresses on subnet: ", possibleMaxNodeCount)
| project recommendationId = "e620fa98-7a40-41a0-bfc9-b4407297fb58", name = clusterName, id = clusterId, param1, param2
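The arithmetic behind the query can be reproduced locally when planning a subnet. The Python sketch below mirrors the query's calculation (2^(32 - mask) minus the 5 addresses that are unavailable in each subnet) and its filter; the function names and the example prefixes are illustrative assumptions.

```python
import ipaddress

# Network address, broadcast address, and the 3 addresses Azure reserves per subnet.
UNAVAILABLE_ADDRESSES = 5


def usable_addresses(address_prefix: str) -> int:
    """Addresses left for nodes in a subnet such as '10.240.0.0/24'."""
    network = ipaddress.ip_network(address_prefix, strict=False)
    return network.num_addresses - UNAVAILABLE_ADDRESSES


def subnet_too_small(address_prefix: str, max_nodes: int) -> bool:
    """Mirror the query's filter: flag when usable addresses <= autoscaler maximum."""
    return usable_addresses(address_prefix) <= max_nodes


print(usable_addresses("10.240.0.0/24"))       # 251
print(subnet_too_small("10.240.0.0/27", 30))   # True
print(subnet_too_small("10.240.0.0/24", 100))  # False
```

Like the query, this counts one address per node only. In network configurations where pods also draw addresses from the node subnet, the real requirement is higher, so treat the result as a lower bound.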
Subscription core quota should be increased if Node pool auto-scale settings exceed the quota
Impact: High
Category: High Availability
APRL GUID: a01afc4c-7439-4919-b2da-3565992ea2a7
Description:
Node pool settings should not exceed the subscription core quota to ensure AKS can scale out nodes efficiently, meeting increased demand while reducing resource constraints and potential service disruptions.
Azure Resource Graph query:
// cannot-be-validated-with-arg
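Since this check cannot be expressed as a Resource Graph query, a rough manual estimate may help. The sketch below sums the vCPUs each node pool would need at its autoscaler maximum and compares the total with a quota limit. It is a simplification: it treats the quota as a single number, whereas Azure applies vCPU quotas per region and per VM family, and the pool sizes and quota values shown are made up.

```python
from dataclasses import dataclass


@dataclass
class NodePool:
    name: str
    vcpus_per_node: int  # vCPUs of the pool's VM size
    max_count: int       # autoscaler maximum for the pool


def cores_at_max_scale(pools: list) -> int:
    """Total vCPUs needed if every pool scales to its maximum node count."""
    return sum(pool.vcpus_per_node * pool.max_count for pool in pools)


def quota_shortfall(pools: list, quota_limit: int, cores_used_elsewhere: int = 0) -> int:
    """vCPUs missing from the quota at full scale-out (0 if the quota suffices)."""
    needed = cores_at_max_scale(pools) + cores_used_elsewhere
    return max(0, needed - quota_limit)


# Hypothetical pools and quota figures for illustration.
pools = [NodePool("system", 4, 5), NodePool("user", 8, 20)]
print(cores_at_max_scale(pools))                                         # 180
print(quota_shortfall(pools, quota_limit=100, cores_used_elsewhere=10))  # 90
```

A non-zero shortfall suggests requesting a quota increase before the autoscaler reaches its upper limit, rather than discovering the constraint during a scale-out.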
Use Azure Linux for Linux nodepools
Impact: High
Category: High Availability
APRL GUID: f46b0d1d-56ef-4795-b98a-f6ee00cb341a
Description:
Azure Linux on AKS boosts resiliency with a native image using validated, source-built components. It's lightweight, reducing the attack surface and maintenance. A Microsoft-hardened kernel, optimized for Azure, enhances stability and security for container workloads.