Azure Proactive Resiliency Library v2

managedClusters

Summary

Recommendation | Impact | Category | Automation Available | In Azure Advisor
Deploy AKS cluster across availability zones | High | High Availability | Yes | No
Isolate system and application pods | High | High Availability | Yes | No
Configure Azure CNI networking for dynamic allocation of IPs or use CNI overlay | Medium | Scalability | Yes | No
Enable the cluster auto-scaler on an existing cluster | High | Scalability | Yes | No
Back up Azure Kubernetes Service | Low | Disaster Recovery | Yes | No
Use zone-redundant storage for persistent volumes when running multi-zone AKS | Medium | High Availability | No | No
Upgrade Persistent Volumes using in-tree drivers to Azure CSI drivers | High | Governance | No | No
Update AKS tier to Standard or Premium | High | High Availability | Yes | No
Enable AKS Monitoring | High | Monitoring and Alerting | Yes | No
Use Ephemeral OS disks on AKS clusters | Medium | Scalability | Yes | No
Enable and remediate Azure Policies configured for AKS | Low | Governance | Yes | No
Use pod topology spread constraints to ensure that pods are spread across different nodes or zones | High | High Availability | No | No
Configure Pod Liveness, Readiness, and Startup Probes | High | High Availability | No | No
Use deployments with multiple replicas in production applications to guarantee availability | High | High Availability | No | No
Configure system nodepool count | High | High Availability | Yes | No
Configure user nodepool count | High | High Availability | Yes | No
Configure pod disruption budgets (PDBs) | Medium | High Availability | No | No
Nodepool subnet size needs to accommodate maximum auto-scale settings | High | High Availability | Yes | No
Subscription core quota should be increased if Node pool auto-scale settings exceed the quota | High | High Availability | No | No
Use Azure Linux for Linux nodepools | High | High Availability | Yes | No

Details


Deploy AKS cluster across availability zones

Impact:  High Category:  High Availability

APRL GUID:  4f63619f-5001-439c-bacb-8de891287727

Description:

Azure Availability Zones provide high availability by offering independent locations within a region, each with its own power, cooling, and networking, so that applications and data are protected from datacenter-level failures.

Potential Benefits:

Enhanced fault tolerance for AKS
Learn More:
AKS Availability Zones
Zone Balancing

ARG Query:

// Azure Resource Graph Query
// Returns AKS clusters that do not have any availability zones enabled or only use a single zone
resources
| where type =~ "Microsoft.ContainerService/managedClusters"
| where location in~ ("australiaeast", "brazilsouth", "canadacentral", "centralindia", "centralus", "eastasia", "eastus", "eastus2", "francecentral", "germanywestcentral", "israelcentral", "italynorth", "japaneast", "japanwest", "koreacentral", "mexicocentral", "newzealandnorth", "northeurope", "norwayeast", "polandcentral", "qatarcentral", "southafricanorth", "southcentralus", "southeastasia", "spaincentral", "swedencentral", "switzerlandnorth", "uaenorth", "uksouth", "westeurope", "westus2", "westus3", "usgovvirginia", "chinanorth3")
| project id, name, tags, location, pools = properties.agentPoolProfiles
| mv-expand pool = pools
| extend
    numOfAvailabilityZones = iif(isnull(pool.availabilityZones), 0, array_length(pool.availabilityZones))
| where numOfAvailabilityZones < 2
| project
    recommendationId = "4f63619f-5001-439c-bacb-8de891287727",
    id,
    name,
    tags,
    param1 = strcat("NodePoolName: ", pool.name),
    param2 = strcat("Mode: ", pool.mode),
    param3 = strcat("AvailabilityZones: ", iif(numOfAvailabilityZones == 0, "None", strcat("Zone ", strcat_array(pool.availabilityZones, ", ")))),
    param4 = strcat("Location: ", location)


Isolate system and application pods

Impact:  High Category:  High Availability

APRL GUID:  5ee083cd-6ac3-4a83-8913-9549dd36cf56

Description:

AKS assigns the kubernetes.azure.com/mode: system label to nodes in system node pools, signaling that system pods should preferably be scheduled there. Adding the CriticalAddonsOnly=true:NoSchedule taint to your system nodes prevents application pods from being scheduled on them.
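
The effect of the taint can be illustrated with a minimal Python sketch of Kubernetes toleration matching. This is an illustration under assumptions, not AKS code: the function names are hypothetical, and the pod spec is assumed to be a plain dictionary such as the one obtained from `kubectl get pod -o json`. It only considers the taint; node selectors and affinity rules are ignored.

```python
# Taint named in the recommendation above.
SYSTEM_TAINT = {"key": "CriticalAddonsOnly", "value": "true", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a pod toleration matches the given node taint."""
    operator = toleration.get("operator", "Equal")
    key = toleration.get("key", "")
    # An empty key combined with operator Exists matches every taint key.
    key_ok = key == taint["key"] or (key == "" and operator == "Exists")
    # An empty effect matches every taint effect.
    effect = toleration.get("effect", "")
    effect_ok = effect == "" or effect == taint["effect"]
    # Exists ignores the value; Equal requires the same value.
    value_ok = operator == "Exists" or toleration.get("value", "") == taint.get("value", "")
    return key_ok and effect_ok and value_ok


def can_schedule_on_system_pool(pod_spec: dict) -> bool:
    """Hypothetical helper: does any toleration in the pod spec match the system taint?"""
    return any(tolerates(t, SYSTEM_TAINT) for t in pod_spec.get("tolerations", []))
```

An application pod without tolerations is rejected by the tainted system nodes, while a system add-on that declares a `CriticalAddonsOnly` toleration with operator `Exists` is still admitted.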

Potential Benefits:

Enhanced reliability via pod isolation
Learn More:
System and user node pools

ARG Query:

// Azure Resource Graph Query
// Returns each AKS system node pool that does not have the CriticalAddonsOnly taint
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| where agentPoolProfile.mode =~ 'System' // system node pools
| extend taints = tostring(agentPoolProfile.nodeTaints)
| extend nodePool = tostring(agentPoolProfile.name)
| where taints !has 'CriticalAddonsOnly' // pools with empty or missing taints are also returned
| project
    recommendationId="5ee083cd-6ac3-4a83-8913-9549dd36cf56",
    id,
    name,
    tags,
    param1=strcat("nodepoolName: ", nodePool)


Configure Azure CNI networking for dynamic allocation of IPs or use CNI overlay

Impact:  Medium Category:  Scalability

APRL GUID:  c22db132-399b-4e7c-995d-577a60881be8

Description:

Azure CNI improves cluster IP and network management. It allows dynamic IP allocation, scalable subnets, and direct pod-to-VNet connectivity, and it supports network policies for pods and nodes through Azure Network Policies and Calico, which improves network efficiency and security.

Potential Benefits:

Dynamic IP allocation, scalable subnets, direct VNET access
Learn More:
Configure Azure CNI networking
Configure Azure CNI Overlay networking

ARG Query:

// Azure Resource Graph Query
// Check AKS Clusters using kubenet network profile
resources
| where type == "microsoft.containerservice/managedclusters"
| extend networkProfile = tostring (parse_json(properties.networkProfile.networkPlugin))
| where networkProfile =="kubenet"
| project recommendationId="c22db132-399b-4e7c-995d-577a60881be8", name, id, tags, param1=strcat("networkProfile :",networkProfile)



Enable the cluster auto-scaler on an existing cluster

Impact:  High Category:  Scalability

APRL GUID:  902c82ff-4910-4b61-942d-0d6ef7f39b67

Description:

The cluster auto-scaler in AKS adjusts node counts based on pod resource needs and available capacity, enabling scaling as per demand to prevent outages.

Potential Benefits:

Optimizes scaling and prevents outages
Learn More:
Use the Cluster Autoscaler on AKS
Best practices for advanced scheduler features
Node pool scaling considerations and best practices
Best practices for basic scheduler features

ARG Query:

// Azure Resource Graph Query
// Find AKS node pools with auto-scaling disabled (every node pool is checked, not only the first)
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend autoScaling = tostring(agentPoolProfile.enableAutoScaling)
| where autoScaling != "true"
| project recommendationId="902c82ff-4910-4b61-942d-0d6ef7f39b67", name, id, tags, param1=strcat("nodePoolName: ", agentPoolProfile.name), param2=strcat("autoScaling: ", iff(isempty(autoScaling), "false", autoScaling))



Back up Azure Kubernetes Service

Impact:  Low Category:  Disaster Recovery

APRL GUID:  269a9f1a-6675-460a-831e-b05a887a8c4b

Description:

AKS is commonly used for stateful applications that require backups. Azure Backup can protect AKS clusters and their attached volumes through an installed Backup Extension, with backup and restore operations managed through a Backup Vault.

Potential Benefits:

Ensures data safety for AKS
Learn More:
AKS Backups
Best Practices for AKS Backups

ARG Query:

// Azure Resource Graph Query
// Find AKS clusters that do not have backup enabled

resources
| where type =~ 'Microsoft.ContainerService/managedClusters'
| extend lname = tolower(name)
| join kind=leftouter(recoveryservicesresources
    | where type =~ 'microsoft.dataprotection/backupvaults/backupinstances'
    | extend lname = tolower(tostring(split(properties.dataSourceInfo.resourceID, '/')[8]))
    | extend protectionState = properties.currentProtectionState
    | project lname, protectionState) on lname
| where protectionState != 'ProtectionConfigured'
| extend param1 = iif(isnull(protectionState), 'Protection Not Configured', strcat('Protection State: ', protectionState))
| project recommendationId = "269a9f1a-6675-460a-831e-b05a887a8c4b", name, id, tags, param1



Use zone-redundant storage for persistent volumes when running multi-zone AKS

Impact:  Medium Category:  High Availability

APRL GUID:  d3111036-355d-431b-ab49-8ddad042800b

Description:

ZRS ensures data replication across three zones, protecting against zonal outages. It's available for Azure Disks, Container Storage, Files, and Blob by setting the SKU to ZRS in storage classes, enhancing multi-zone AKS clusters from v1.29.

Potential Benefits:

Increases data durability and availability
Learn More:
Availability zones overview
Zone-redundant storage
ZRS disks
Convert a disk from LRS to ZRS
Enable multi-zone storage redundancy in Azure Container Storage

ARG Query:

// cannot-be-validated-with-arg
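
Because this recommendation cannot be checked with Azure Resource Graph, a storage class has to be reviewed or created inside the cluster. The following Python sketch builds a StorageClass manifest for the Azure Disk CSI driver with a ZRS SKU. The class name `managed-csi-zrs` and the function name are assumptions made for illustration; the binding mode and reclaim policy are example choices, not requirements stated by this page.

```python
import json


def zrs_disk_storage_class(name: str = "managed-csi-zrs", sku: str = "StandardSSD_ZRS") -> dict:
    """Build a StorageClass manifest that provisions zone-redundant Azure Disks.

    The name is a hypothetical example. The SKU must be a ZRS disk SKU,
    such as StandardSSD_ZRS or Premium_ZRS.
    """
    if not sku.endswith("_ZRS"):
        raise ValueError(f"{sku} is not a zone-redundant SKU")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        "provisioner": "disk.csi.azure.com",
        "parameters": {"skuName": sku},
        "reclaimPolicy": "Delete",
        # Delay binding until a pod is scheduled (an example choice).
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(zrs_disk_storage_class(), indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, after which persistent volume claims can reference the class by name.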



Upgrade Persistent Volumes using in-tree drivers to Azure CSI drivers

Impact:  High Category:  Governance

APRL GUID:  b002c030-72e6-4a37-8217-1cb276c43169

Description:

From Kubernetes 1.26, Azure Disk and Azure File in-tree drivers are deprecated in favor of CSI drivers. Existing deployments remain operational but untested; users should switch to CSI drivers for new features and SKUs.

Potential Benefits:

Ensures future compatibility
Learn More:
CSI Storage Drivers
CSI Migrate in Tree Volumes

ARG Query:

// cannot-be-validated-with-arg
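
Since this check has to run against the Kubernetes API instead of Azure Resource Graph, one possible approach is to inspect the output of `kubectl get pv -o json` and `kubectl get storageclass -o json`. The Python sketch below is an assumption about how such a review could be scripted: the function names are hypothetical, and the input is the parsed JSON list returned by those commands.

```python
# Volume source fields and provisioner names used by the in-tree Azure drivers.
IN_TREE_VOLUME_FIELDS = ("azureDisk", "azureFile")
IN_TREE_PROVISIONERS = ("kubernetes.io/azure-disk", "kubernetes.io/azure-file")


def in_tree_persistent_volumes(pv_list: dict) -> list:
    """Return names of PersistentVolumes whose spec uses an in-tree Azure volume source."""
    names = []
    for pv in pv_list.get("items", []):
        spec = pv.get("spec", {})
        if any(field in spec for field in IN_TREE_VOLUME_FIELDS):
            names.append(pv["metadata"]["name"])
    return names


def in_tree_storage_classes(sc_list: dict) -> list:
    """Return names of StorageClasses that still reference an in-tree Azure provisioner."""
    return [
        sc["metadata"]["name"]
        for sc in sc_list.get("items", [])
        if sc.get("provisioner") in IN_TREE_PROVISIONERS
    ]
```

Volumes provisioned by the CSI drivers carry a `csi` block in their spec (for example with driver `disk.csi.azure.com`) and are not returned.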



Update AKS tier to Standard or Premium

Impact:  High Category:  High Availability

APRL GUID:  0611251f-e70f-4243-8ddd-cfe894bec2e7

Description:

Production AKS clusters require the Standard or Premium tier for a financially backed SLA and enhanced node scalability, as the free service lacks these features. Use the Premium tier for mission-critical workloads.

Potential Benefits:

SLA guarantee and better scalability
Learn More:
Pricing Tiers
AKS Baseline Architecture

ARG Query:

// Azure Resource Graph Query
// Returns all AKS clusters not running on the Standard tier or the Premium tier.
resources
| where type =~ "Microsoft.ContainerService/managedClusters"
| where sku.tier !in~ ("Standard", "Premium")
| project recommendationId = "0611251f-e70f-4243-8ddd-cfe894bec2e7", id, name, tags, param1 = strcat("skuName: ", sku.name), param2 = strcat("skuTier: ", sku.tier)


Enable AKS Monitoring

Impact:  High Category:  Monitoring and Alerting

APRL GUID:  dcaf8128-94bd-4d53-9235-3a0371df6b74

Description:

Azure Monitor enables real-time health and performance insights for AKS by collecting events, capturing container logs, and gathering CPU/Memory data from the Metrics API. It allows data visualization using Azure Monitor Container Insights, Prometheus, Grafana, or others.

Potential Benefits:

Real-time AKS health/performance insights
Learn More:
Monitor AKS

ARG Query:

// Azure Resource Graph Query
// Returns AKS clusters where Azure Monitor metrics and/or Container Insights is not enabled
// A profile that is missing or explicitly set to false counts as not enabled
resources
| where type == "microsoft.containerservice/managedclusters"
| extend azureMonitor = tostring(properties.azureMonitorProfile.metrics.enabled)
| extend insights = tostring(properties.addonProfiles.omsagent.enabled)
| where azureMonitor != "true" or insights != "true"
| project recommendationId="dcaf8128-94bd-4d53-9235-3a0371df6b74", id, name, tags, param1=strcat("azureMonitorProfileEnabled: ", iff(isempty(azureMonitor), "false", azureMonitor)), param2=strcat("containerInsightsEnabled: ", iff(isempty(insights), "false", insights))



Use Ephemeral OS disks on AKS clusters

Impact:  Medium Category:  Scalability

APRL GUID:  a7bfcc18-b0d8-4d37-81f3-8131ed8bead5

Description:

Ephemeral OS disks on AKS offer lower read/write latency due to local attachment, eliminating the need for replication seen with managed disks. This enhances performance and speeds up cluster operations such as scaling or upgrading due to quicker re-imaging and boot times.

Potential Benefits:

Lower latency, faster re-imaging and booting
Learn More:
Ephemeral OS disk
Configure an AKS cluster
Everything you want to know about ephemeral OS disks and AKS

ARG Query:

// Azure Resource Graph Query
// Returns any AKS cluster nodepools that do not have Ephemeral Disks
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend osDiskType = tostring(agentPoolProfile.osDiskType)
| where osDiskType != 'Ephemeral'
| project recommendationId="a7bfcc18-b0d8-4d37-81f3-8131ed8bead5", name, id, tags, param1=strcat("nodePoolName: ", agentPoolProfile.name), param2=strcat("osDiskType: ", osDiskType)


Enable and remediate Azure Policies configured for AKS

Impact:  Low Category:  Governance

APRL GUID:  26ebaf1f-c70d-4ebd-8641-4b60a0ce0094

Description:

Azure Policies in AKS clusters help enforce governance best practices concerning security, authentication, provisioning, networking, and more, ensuring a robust and secure environment for operations.

Potential Benefits:

Enhanced AKS governance and security
Learn More:
AKS Baseline - Policy Management
Built-in Policy Definitions for AKS

ARG Query:

// Azure Resource Graph Query
// Returns a count of non-compliant policy items per AKS cluster
PolicyResources
| where type =~ 'Microsoft.PolicyInsights/PolicyStates'
| extend complianceState = tostring(properties.complianceState)
| where complianceState == 'NonCompliant'
| where properties.resourceType =~ 'Microsoft.ContainerService/managedClusters'
| extend
    id = tostring(properties.resourceId)
| summarize count() by id
| join kind=inner (
    resources
    | where type =~ 'Microsoft.ContainerService/managedClusters'
    | project id, name
) on id
| project recommendationId="26ebaf1f-c70d-4ebd-8641-4b60a0ce0094", id, name, param1=strcat("numNonCompliantAlerts: ", count_)


Use pod topology spread constraints to ensure that pods are spread across different nodes or zones

Impact:  High Category:  High Availability

APRL GUID:  928fcc6f-5e9a-42d9-9bd4-260af42de2e5

Description:

Enhance availability and reliability by using pod topology spread constraints to control pod distribution based on node or zone topology, ensuring pods are spread across your cluster.

Potential Benefits:

Ensures high availability and efficient use
Learn More:
Topology Spread Constraints
Assign Pod Node

ARG Query:

// cannot-be-validated-with-arg
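
As a rough illustration of how a zone-level spread constraint behaves, the Python sketch below builds a constraint entry for a pod template and evaluates the skew rule for a candidate zone. The function names and the `app` label are hypothetical. The check is a simplification of the scheduler's behavior: it assumes every eligible zone appears in the input, including zones that currently run zero matching pods.

```python
def zone_spread_constraint(app_label: str, max_skew: int = 1) -> dict:
    """Build one topologySpreadConstraints entry that spreads pods across zones."""
    return {
        "maxSkew": max_skew,
        "topologyKey": "topology.kubernetes.io/zone",
        "whenUnsatisfiable": "DoNotSchedule",
        "labelSelector": {"matchLabels": {"app": app_label}},
    }


def placement_allowed(pods_per_zone: dict, zone: str, max_skew: int = 1) -> bool:
    """Check whether adding one pod to `zone` keeps the skew within max_skew.

    Skew is the pod count in the candidate zone, including the new pod,
    minus the smallest pod count across all zones.
    """
    global_minimum = min(pods_per_zone.values())
    return pods_per_zone[zone] + 1 - global_minimum <= max_skew
```

With two pods in zone 1 and one pod each in zones 2 and 3, a new pod may be placed in zone 2 or 3 but not in zone 1 when `maxSkew` is 1.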



Configure Pod Liveness, Readiness, and Startup Probes

Impact:  High Category:  High Availability

APRL GUID:  cd6791b1-c60e-4b37-ac98-9897b1e6f4b8

Description:

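The AKS kubelet uses liveness probes to validate container and application health, so the system knows when to restart a container. Readiness probes determine when a container is ready to receive traffic, and startup probes hold back the other probes until a slow-starting application has finished initializing.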

Potential Benefits:

Enhances container health monitoring
Learn More:
Configure probes
Assign Pod Node

ARG Query:

// cannot-be-validated-with-arg
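
A container definition with all three probes might look like the following Python sketch, which builds the container entry of a pod spec as a dictionary. The endpoint paths `/healthz` and `/ready`, the port, the timing values, and the function names are assumptions chosen for illustration; real values depend on the application.

```python
def container_with_probes(name: str, image: str, port: int = 8080) -> dict:
    """Build a container spec with startup, liveness, and readiness probes."""
    return {
        "name": name,
        "image": image,
        "ports": [{"containerPort": port}],
        # Startup probe: the other probes are held back until it succeeds.
        "startupProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": 5,
            "failureThreshold": 30,
        },
        # Liveness probe: repeated failure makes the kubelet restart the container.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # Readiness probe: failure removes the pod from Service endpoints.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
            "failureThreshold": 3,
        },
    }


def startup_budget_seconds(container: dict) -> int:
    """Longest startup time tolerated: failureThreshold multiplied by periodSeconds."""
    probe = container["startupProbe"]
    return probe["failureThreshold"] * probe["periodSeconds"]
```

With these example values the application has up to 150 seconds (30 × 5) to start before the container is restarted.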



Use deployments with multiple replicas in production applications to guarantee availability

Impact:  High Category:  High Availability

APRL GUID:  bcfe71f1-ebed-49e5-a84a-193b81ad5d27

Description:

Configuring multiple replicas in a Deployment or ReplicaSet manifest maintains a stable set of replica Pods, ensuring that the specified number of identical Pods is running at any given time.

Potential Benefits:

Ensures stable pod availability
Learn More:
Replica Sets

ARG Query:

// cannot-be-validated-with-arg
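
One way to review this from inside the cluster is to scan the output of `kubectl get deployments --all-namespaces -o json` for Deployments that run fewer than two replicas. The Python sketch below assumes that parsed JSON as input; the function name and the threshold parameter are hypothetical.

```python
def single_replica_deployments(deployment_list: dict, minimum: int = 2) -> list:
    """Return 'namespace/name' for every Deployment with fewer than `minimum` replicas."""
    flagged = []
    for deployment in deployment_list.get("items", []):
        metadata = deployment.get("metadata", {})
        # Kubernetes defaults spec.replicas to 1 when the field is omitted.
        replicas = deployment.get("spec", {}).get("replicas", 1)
        if replicas < minimum:
            namespace = metadata.get("namespace", "default")
            flagged.append(f"{namespace}/{metadata.get('name')}")
    return flagged
```

Deployments that are intentionally scaled to zero or one replica, such as batch helpers, would also be listed and need manual review.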



Configure system nodepool count

Impact:  High Category:  High Availability

APRL GUID:  7f7ae535-a5ba-4665-b7e0-c451dbdda01f

Description:

The system node pool should be configured with a minimum node count of two to ensure critical system pods are resilient to node outages.

Potential Benefits:

Ensures pod resilience
Learn More:
System nodepools

ARG Query:

// Azure Resource Graph Query
// Returns each AKS system node pool with a minimum node count below 2
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend nodePool = tostring(agentPoolProfile.name)
| where agentPoolProfile.mode =~ "System" and agentPoolProfile.minCount < 2
| project recommendationId="7f7ae535-a5ba-4665-b7e0-c451dbdda01f", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)



Configure user nodepool count

Impact:  High Category:  High Availability

APRL GUID:  005ccbbd-aeab-46ef-80bd-9bd4479412ec

Description:

Configuring the user node pool with at least two nodes is essential for applications needing high availability, ensuring they remain operational and accessible without interruption.

Potential Benefits:

Ensures high app availability
Learn More:
Azure Well-Architected Framework review for Azure Kubernetes Service (AKS)

ARG Query:

// Azure Resource Graph Query
// Returns each AKS user node pool with a minimum node count below 2
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend nodePool = tostring(agentPoolProfile.name)
| where agentPoolProfile.mode =~ "User" and agentPoolProfile.minCount < 2
| project recommendationId="005ccbbd-aeab-46ef-80bd-9bd4479412ec", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)



Configure pod disruption budgets (PDBs)

Impact:  Medium Category:  High Availability

APRL GUID:  a08a06a0-e41a-4b99-83bb-69ce8bca54cb

Description:

A Pod Disruption Budget (PDB) is a Kubernetes resource that sets the minimum number or percentage of pods that must remain available during voluntary disruptions such as maintenance or scaling.

Potential Benefits:

Ensures cluster resiliency during disruptions
Learn More:
Configure PDBs
Plan availability using PDBs

ARG Query:

// cannot-be-validated-with-arg
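
The arithmetic behind a `minAvailable` budget can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only `minAvailable` (an integer or a percentage string); it relies on the documented Kubernetes behavior of rounding percentages up to the nearest whole pod.

```python
def required_available(min_available, expected_pods: int) -> int:
    """Translate minAvailable (int or 'NN%') into a number of pods."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Round up to the nearest whole pod using integer arithmetic.
        return (expected_pods * percent + 99) // 100
    return int(min_available)


def allowed_disruptions(healthy_pods: int, expected_pods: int, min_available) -> int:
    """Number of pods that may be evicted right now without violating the budget."""
    return max(0, healthy_pods - required_available(min_available, expected_pods))
```

For example, with 7 expected pods and `minAvailable` set to "50%", 4 pods must stay available, so at most 3 healthy pods can be evicted at once. A budget that equals the replica count allows no voluntary evictions, which can block node drains during upgrades.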



Nodepool subnet size needs to accommodate maximum auto-scale settings

Impact:  High Category:  High Availability

APRL GUID:  e620fa98-7a40-41a0-bfc9-b4407297fb58

Description:

Nodepool subnets sized for max auto-scale settings enable AKS to efficiently scale out nodes, meeting increased demand while reducing resource constraints and potential service disruptions.

Potential Benefits:

Efficient scaling, reduced disruptions
Learn More:
Azure CNI Dynamic IP Allocation

ARG Query:

// Azure Resource Graph Query
// Returns each AKS node pool whose subnet is too small for the autoscaler's configured maximum node count
// Usable addresses = 2^(32 - prefix length) - 5: Azure reserves five addresses in each subnet (the network address, the broadcast address, and three more for Azure services)

resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand nodePools = properties.agentPoolProfiles
| where nodePools.enableAutoScaling == true
| extend nodePoolName = tostring(nodePools.name), maxNodes = toint(nodePools.maxCount), subnetId = tostring(nodePools.vnetSubnetID)
| project clusterId = id, clusterName = name, nodePoolName, maxNodes, subnetId
| join kind = leftouter (
    resources
    | where type == 'microsoft.network/virtualnetworks'
    | extend subnets = properties.subnets
    | mv-expand subnets
    | project id = tostring(subnets.id), addressPrefix = tostring(subnets.properties['addressPrefix'])
    | extend subnetmask = toint(substring(addressPrefix, indexof(addressPrefix, '/')+1, string_size(addressPrefix)))
    | extend possibleMaxNodeCount = toint(exp2(32-subnetmask) - 5)
) on $left.subnetId == $right.id
| project-away id, subnetmask
| where possibleMaxNodeCount <= maxNodes
| extend param1 = strcat(nodePoolName, " autoscaler upper limit: ", maxNodes)
| extend param2 = strcat("ip addresses on subnet: ", possibleMaxNodeCount)
| project recommendationId="e620fa98-7a40-41a0-bfc9-b4407297fb58", name=clusterName, id=clusterId, param1, param2
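
The subnet arithmetic used by the query above can be reproduced locally with Python's standard `ipaddress` module. This is a minimal sketch with hypothetical function names. Like the query, it counts one IP address per node and subtracts the five addresses Azure reserves in each subnet; it does not account for pod IPs or for other resources sharing the subnet.

```python
import ipaddress

# Azure reserves five addresses in every subnet.
AZURE_RESERVED_ADDRESSES = 5


def usable_addresses(address_prefix: str) -> int:
    """Number of assignable IPv4 addresses in a subnet such as '10.0.0.0/24'."""
    network = ipaddress.ip_network(address_prefix, strict=False)
    return max(0, network.num_addresses - AZURE_RESERVED_ADDRESSES)


def subnet_fits_autoscaler(address_prefix: str, max_nodes: int) -> bool:
    """True when the subnet has more usable addresses than the autoscaler maximum.

    Mirrors the query, which flags pools where usable addresses <= maxCount.
    """
    return usable_addresses(address_prefix) > max_nodes
```

A /24 subnet yields 251 usable addresses, so a pool with a maximum of 100 nodes fits, whereas a /27 subnet (27 usable addresses) would be flagged for a pool that can scale to 27 or more nodes.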



Subscription core quota should be increased if Node pool auto-scale settings exceed the quota

Impact:  High Category:  High Availability

APRL GUID:  a01afc4c-7439-4919-b2da-3565992ea2a7

Description:

Node pool settings should not exceed the subscription core quota to ensure AKS can scale out nodes efficiently, meeting increased demand while reducing resource constraints and potential service disruptions.

Potential Benefits:

Reduced disruptions
Learn More:
Azure Quotas

ARG Query:

// cannot-be-validated-with-arg
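
A manual comparison between autoscaler limits and quota can be sketched in Python as shown below. Everything here is an assumption for illustration: the function names, the input shapes, and the simplification to a single vCPU quota figure. In practice, quotas apply per region and per VM family, and the vCPU count per VM size has to be supplied by the caller.

```python
def cores_at_max_scale(node_pools: list, vcpus_per_size: dict) -> int:
    """Total vCPUs needed if every node pool scales to its maxCount."""
    return sum(pool["maxCount"] * vcpus_per_size[pool["vmSize"]] for pool in node_pools)


def quota_shortfall(node_pools: list, vcpus_per_size: dict, quota_limit: int,
                    used_by_other_resources: int = 0) -> int:
    """vCPUs missing from the quota at maximum scale, or 0 if the quota is sufficient."""
    needed = cores_at_max_scale(node_pools, vcpus_per_size) + used_by_other_resources
    return max(0, needed - quota_limit)
```

For two pools that can scale to 5 nodes of 4 vCPUs and 10 nodes of 8 vCPUs, the cluster needs 100 vCPUs at maximum scale, so a quota of 64 vCPUs leaves a shortfall of 36 that should be requested before the autoscaler reaches its limit.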


Use Azure Linux for Linux nodepools

Impact:  High Category:  High Availability

APRL GUID:  f46b0d1d-56ef-4795-b98a-f6ee00cb341a

Description:

Azure Linux on AKS boosts resiliency with a native image using validated, source-built components. It's lightweight, reducing the attack surface and maintenance. A Microsoft-hardened kernel, optimized for Azure, enhances stability and security for container workloads.

Potential Benefits:

Reduced disruptions
Learn More:
Azure Linux

ARG Query:

// Azure Resource Graph Query
// Returns each AKS Linux node pool that is not using Azure Linux
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| where agentPoolProfile.osType == 'Linux' and agentPoolProfile.osSKU != 'AzureLinux'
| project recommendationId="f46b0d1d-56ef-4795-b98a-f6ee00cb341a", name, id, param1=strcat("nodePoolName: ", agentPoolProfile.name)