AKS
The resiliency recommendations in this guidance cover AKS and its associated settings.
Summary of Recommendations
Recommendations Details
AKS-1 - Deploy AKS cluster across availability zones
Category: Availability
Impact: High
Guidance
Azure Availability Zones are a high-availability offering that protects applications and data from datacenter-level failures. Availability Zones are unique physical locations within an Azure region that are equipped with independent power, cooling, and networking. Each Availability Zone is made up of one or more datacenters and is designed to be highly available and fault tolerant.
By deploying resources such as AKS clusters, virtual machines, storage, and databases across multiple Availability Zones in the same region, you can protect your applications and data from datacenter-level failures. If one Availability Zone goes down, the other Availability Zones in the region can continue to provide service.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns AKS clusters that do not have any availability zones enabled
resources
| where type == 'microsoft.containerservice/managedclusters'
| project id, name, tags, location, properties.agentPoolProfiles
| mv-expand properties_agentPoolProfiles
| where isempty(array_length(properties_agentPoolProfiles.availabilityZones))
| project recommendationId="aks-1", id, name, tags, param1=strcat("nodePoolName: ", properties_agentPoolProfiles.name), param2=strcat("orchestratorVersion: ", properties_agentPoolProfiles.orchestratorVersion), param3=strcat("currentOrchestratorVersion: ", properties_agentPoolProfiles.currentOrchestratorVersion), param4=strcat("numberOfZones: ", iff(isempty(array_length(properties_agentPoolProfiles.availabilityZones)), 0, array_length(properties_agentPoolProfiles.availabilityZones)))
AKS-2 - Isolate system and application pods
Category: Governance
Impact: High
Guidance
AKS automatically assigns the label kubernetes.azure.com/mode: system to nodes in a system node pool. This label signals to AKS that system pods should be scheduled on nodes in this pool. However, you can still schedule application pods on these nodes if you choose to do so.
To prevent misconfigured or rogue application pods from accidentally killing system pods, it is recommended that you isolate critical system pods from your application pods. This can be achieved by scheduling system pods on dedicated node pools or by using node selectors to ensure that system pods are only scheduled on nodes with the kubernetes.azure.com/mode: system label.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns each AKS cluster with nodepools that do not have taints set
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taint = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where isempty(taint)
| project recommendationId="aks-2", id, name, tags, param1=strcat("nodepoolName: ", nodePool), param2=strcat("taint: ", iff(isempty(taint), "None", taint))
AKS-3 - Disable local accounts
Category: Access & Security
Impact: High
Guidance
Local Kubernetes accounts provide a legacy non-auditable means of accessing an AKS cluster and are not recommended for use. Enabling Microsoft Entra integration on an AKS cluster provides several benefits for managing access to the cluster. By using Microsoft Entra, you can centralize user and group management, enforce multi-factor authentication, and enable role-based access control (RBAC) for fine-grained access control to cluster resources. Additionally, Microsoft Entra provides a secure and scalable authentication mechanism that can be integrated with other Azure services and third-party identity providers.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns AKS clusters with Kubernetes RBAC disabled; the AAD profile and local-account setting are reported for review
resources
| where type == "microsoft.containerservice/managedclusters"
| extend aadProfile = tostring (parse_json(properties.aadProfile))
| extend disablelocalAdmin = tostring(parse_json(properties.disableLocalAccounts))
| extend RBAC = tostring(parse_json(properties.enableRBAC))
| where RBAC == "false"
| project recommendationId="aks-3", name, id, tags, param1=strcat("aadProfile: ", aadProfile), param2=strcat("disablelocalAdmin: ",disablelocalAdmin), param3=strcat("RBAC: ", RBAC)
AKS-4 - Configure Azure CNI networking for dynamic allocation of IPs
Category: Networking
Impact: Medium
Guidance
The Azure CNI networking solution provides several benefits for managing IP addresses and network connectivity for cluster pods: dynamic allocation of IPs to pods, independent scaling of node and pod subnets, direct network connectivity between pods and resources in the VNet, and different network policies for pods and nodes. It also supports multiple network policy engines, including Azure Network Policies and Calico.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Check AKS Clusters using kubenet network profile
resources
| where type == "microsoft.containerservice/managedclusters"
| extend networkProfile = tostring (parse_json(properties.networkProfile.networkPlugin))
| where networkProfile =="kubenet"
| project recommendationId="aks-4", name, id, tags, param1=strcat("networkProfile: ", networkProfile)
AKS-5 - Enable the cluster auto-scaler on an existing cluster
Category: System Efficiency
Impact: High
Guidance
The cluster auto-scaler automatically scales the number of nodes in a node pool based on pod resource requests and the available capacity in the cluster. It helps ensure that the cluster can scale according to demand and prevent outages.
If the cluster has availability zones enabled, the following configuration changes need to be verified or established:
- Persistent Volumes - If the cluster is using persistent volumes backed by Azure Storage, ensure you have one nodepool per availability zone. Persistent volumes do not work across AZs and the auto-scaler could fail to create new pods if the nodepool cannot access the persistent volume.
- Multiple Nodepools per Zone - If the cluster has multiple nodepools per AZ, enable the --balance-similar-node-groups property through the auto-scaler profile. This feature detects similar nodepools and balances the number of nodes across them.
Resources
- Use the Cluster Autoscaler on AKS
- Best practices for advanced scheduler features
- Node pool scaling considerations and best practices
- Best practices for basic scheduler features
Resource Graph Query
// Azure Resource Graph Query
// Find AKS clusters whose first node pool has auto-scaling disabled
resources
| where type == "microsoft.containerservice/managedclusters"
| extend autoScaling = tostring(parse_json(properties.agentPoolProfiles[0].enableAutoScaling))
| where autoScaling == "false"
| project recommendationId="aks-5", name, id, tags, param1=strcat("autoScaling: ", autoScaling)
AKS-6 - Back up Azure Kubernetes Service
Category: Disaster Recovery
Impact: Low
Guidance
AKS is increasingly being used for stateful applications that require a backup strategy. Azure Backup now allows you to back up AKS clusters (cluster resources and persistent volumes attached to the cluster) using a backup extension, which must be installed in the cluster. The Backup vault communicates with the cluster via this backup extension to perform backup and restore operations.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Find AKS clusters that do not have backup enabled
resources
| where type =~ 'Microsoft.ContainerService/managedClusters'
| extend lname = tolower(name)
| join kind=leftouter(recoveryservicesresources
| where type =~ 'microsoft.dataprotection/backupvaults/backupinstances'
| extend lname = tolower(tostring(split(properties.dataSourceInfo.resourceID, '/')[8]))
| extend protectionState = properties.currentProtectionState
| project lname, protectionState) on lname
| where protectionState != 'ProtectionConfigured'
| extend param1 = iif(isnull(protectionState), 'Protection Not Configured', strcat('Protection State: ', protectionState))
| project recommendationId = "aks-6", name, id, tags, param1
AKS-7 - Plan an AKS version upgrade
Category: Compliance
Impact: High
Guidance
Minor version releases include new features and improvements. Patch releases are more frequent (sometimes weekly) and are intended for critical bug fixes within a minor version. Patch releases include fixes for security vulnerabilities or major bugs. If you’re running an unsupported Kubernetes version, you’ll be asked to upgrade when requesting support for the cluster. Clusters running unsupported Kubernetes releases aren’t covered by the AKS support policies.
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-8 - Use zone-redundant storage for persistent volumes when running multi-zone AKS
Category: Availability
Impact: Low
Guidance
For applications that need replication of data across availability zones to protect against zonal outages, customers should leverage zone-redundant storage (ZRS) with multi-zone AKS clusters. ZRS replicates data synchronously across three Azure availability zones in the primary region.
- Azure Disks: Use ZRS disks by setting the disk SKU to StandardSSD_ZRS or Premium_ZRS in a storage class. Also, starting from AKS v1.29 onward, multi-zone AKS clusters will have default storage classes that use ZRS disks.
- Azure Container Storage: Customers can leverage ZRS disks in Azure Container Storage by creating a storage pool and specifying StandardSSD_ZRS or Premium_ZRS as the SKU. Customers can also create a multi-zone storage pool where the total storage capacity will be distributed across zones.
- Azure Files: Use ZRS files by setting the SKU to Standard_ZRS or Premium_ZRS in a storage class.
- Azure Blob: Use ZRS blob by setting the SKU to Standard_ZRS or Premium_ZRS in a storage class.
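The Azure Disks option above can be sketched as a storage class manifest. This is an illustrative sketch rather than a tested configuration: the storage class name `managed-csi-zrs` is hypothetical, and the sketch assumes the Azure Disk CSI provisioner `disk.csi.azure.com` with its `skuName` parameter. The manifest is built in Python and printed as JSON, a format that `kubectl apply -f` accepts.

```python
import json


def zrs_disk_storage_class(name: str, sku: str = "Premium_ZRS") -> dict:
    """Build a StorageClass manifest that requests ZRS-backed Azure Disks."""
    if sku not in ("StandardSSD_ZRS", "Premium_ZRS"):
        raise ValueError(f"{sku} is not a ZRS disk SKU")
    return {
        "apiVersion": "storage.k8s.io/v1",
        "kind": "StorageClass",
        "metadata": {"name": name},
        # Assumption: the cluster uses the Azure Disk CSI driver.
        "provisioner": "disk.csi.azure.com",
        "parameters": {"skuName": sku},
        "reclaimPolicy": "Delete",
        # Delay binding until a pod is scheduled.
        "volumeBindingMode": "WaitForFirstConsumer",
        "allowVolumeExpansion": True,
    }


if __name__ == "__main__":
    # "managed-csi-zrs" is a hypothetical name chosen for this example.
    print(json.dumps(zrs_disk_storage_class("managed-csi-zrs"), indent=2))
```

Restricting the helper to the two ZRS SKUs named in the guidance keeps an LRS SKU from slipping into a storage class intended for multi-zone workloads.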
Resources
- Availability zones overview
- Zone-redundant storage
- ZRS disks
- Convert a disk from LRS to ZRS
- Enable multi-zone storage redundancy in Azure Container Storage
- ZRS files
- Change the redundancy configuration for a storage account
Resource Graph Query
// cannot-be-validated-with-arg
AKS-9 - Upgrade Persistent Volumes using in-tree drivers to Azure CSI drivers
Category: Storage
Impact: High
Guidance
From Kubernetes version 1.26 onward, the Azure Disk and Azure File in-tree drivers (persistent volume types with the provisioners kubernetes.io/azure-disk and kubernetes.io/azure-file) are no longer supported, because the Kubernetes community has deprecated in-tree storage drivers. Azure Storage is now provided by the Azure Disk and Azure File CSI drivers. While existing deployments using the in-tree drivers are not expected to break, they are no longer tested, and customers should update them to use the CSI drivers. Customers also need the CSI drivers to leverage new storage capabilities (new SKUs, features, etc.).
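Because this recommendation cannot be validated with Azure Resource Graph, one way to find affected storage classes is to scan them for the two in-tree provisioners. The following is a minimal sketch under stated assumptions: the input is a list of dictionaries shaped like the `items` array of `kubectl get storageclass -o json`, the sample data is hypothetical, and the CSI provisioner names `disk.csi.azure.com` and `file.csi.azure.com` are assumed to be the replacements.

```python
# Deprecated in-tree provisioners mapped to the assumed Azure CSI replacements.
IN_TREE_TO_CSI = {
    "kubernetes.io/azure-disk": "disk.csi.azure.com",
    "kubernetes.io/azure-file": "file.csi.azure.com",
}


def find_in_tree_storage_classes(storage_classes):
    """Return (name, in-tree provisioner, CSI replacement) for each match."""
    findings = []
    for sc in storage_classes:
        provisioner = sc.get("provisioner", "")
        if provisioner in IN_TREE_TO_CSI:
            findings.append(
                (sc["metadata"]["name"], provisioner, IN_TREE_TO_CSI[provisioner])
            )
    return findings


# Hypothetical sample data for illustration only.
sample = [
    {"metadata": {"name": "legacy-disk"}, "provisioner": "kubernetes.io/azure-disk"},
    {"metadata": {"name": "managed-csi"}, "provisioner": "disk.csi.azure.com"},
    {"metadata": {"name": "legacy-file"}, "provisioner": "kubernetes.io/azure-file"},
]

if __name__ == "__main__":
    for name, old, new in find_in_tree_storage_classes(sample):
        print(f"{name}: migrate from {old} to {new}")
```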
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-10 - Implement Resource Quota to ensure that Kubernetes resources do not exceed hard resource limits
Category: System Efficiency
Impact: Low
Guidance
A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per namespace. It can limit the quantity of objects that can be created in a namespace by type, as well as the total amount of compute resources that may be consumed by resources in that namespace.
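As an illustration, a ResourceQuota manifest might look like the sketch below. The namespace `team-a`, the quota name, and every limit value are hypothetical and need to be sized for the actual workload. The manifest is built in Python and printed as JSON, a format that `kubectl apply -f` accepts.

```python
import json


def resource_quota(namespace: str, name: str = "compute-quota") -> dict:
    """Build a ResourceQuota that caps aggregate compute and object counts."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "hard": {
                # Aggregate compute limits for the namespace (example values).
                "requests.cpu": "4",
                "requests.memory": "8Gi",
                "limits.cpu": "8",
                "limits.memory": "16Gi",
                # Object-count limits by type (example values).
                "pods": "20",
                "persistentvolumeclaims": "10",
            }
        },
    }


if __name__ == "__main__":
    # "team-a" is a hypothetical namespace chosen for this example.
    print(json.dumps(resource_quota("team-a"), indent=2))
```

Once a quota constrains CPU or memory in a namespace, pods created there must declare the corresponding requests or limits, so quotas are commonly paired with a LimitRange that supplies defaults.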
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-11 - Attach Virtual Nodes (ACI) to the AKS cluster
Category: System Efficiency
Impact: Low
Guidance
To rapidly scale application workloads in an AKS cluster, you can use virtual nodes. With virtual nodes, pods provision much faster than through the Kubernetes cluster auto-scaler.
If the cluster has availability zones enabled, the following configuration changes need to be verified or established:
- Persistent Volumes - If the cluster is using persistent volumes backed by Azure Storage, ensure you have one nodepool per availability zone. Persistent volumes do not work across AZs and the auto-scaler could fail to create new pods if the nodepool cannot access the persistent volume.
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-12 - Update AKS tier to Standard
Category: Availability
Impact: High
Guidance
Production AKS clusters should be configured with the Standard tier. The AKS Free tier doesn't offer a financially backed SLA, and its node scalability is limited. To obtain that SLA, the Standard tier must be selected.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns all AKS clusters not running on the Standard tier
resources
| where type == "microsoft.containerservice/managedclusters"
| where sku.tier != "Standard"
| project recommendationId="aks-12", id, name, tags, param1=strcat("skuName: ", sku.name), param2=strcat("skuTier: ", sku.tier)
AKS-13 - Enable AKS Monitoring
Category: Monitoring
Impact: High
Guidance
Azure Monitor collects events, captures container logs, collects CPU/Memory information from the Metrics API and allows the visualization of the data, to validate the near real time health and performance of AKS environments. The visualization tool can be Azure Monitor Container Insights, Prometheus, Grafana or others.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns AKS clusters where Azure Monitor metrics or Container Insights is not enabled
resources
| where type == "microsoft.containerservice/managedclusters"
| extend azureMonitor = tostring(parse_json(properties.azureMonitorProfile.metrics.enabled))
| extend insights = tostring(parse_json(properties.addonProfiles.omsagent.enabled))
| where azureMonitor !~ "true" or insights !~ "true"
| project recommendationId="aks-13",id, name, tags, param1=strcat("azureMonitorProfileEnabled: ", iff(isempty(azureMonitor), "false", azureMonitor)), param2=strcat("containerInsightsEnabled: ", iff(isempty(insights), "false", insights))
AKS-14 - Use Ephemeral OS disks on AKS clusters
Category: System Efficiency
Impact: Medium
Guidance
Ephemeral disks are ideal as OS disks for stateless applications since they provide better performance and improved reliability by decreasing IO incidents. Additionally, customers won’t incur additional storage costs for the OS, and they can get faster cluster operations like scale or upgrade thanks to faster re-imaging and boot times. If customers don’t explicitly request an Azure managed disk for the OS, AKS defaults to an ephemeral OS disk whenever the VM SKU selected for the node pool supports it.
Resources
- Ephemeral OS disk
- Configure an AKS cluster
- Everything you want to know about ephemeral OS disks and AKS
Resource Graph Query
// cannot-be-validated-with-arg
AKS-15 - Enable and remediate Azure Policies configured for AKS
Category: Governance
Impact: Low
Guidance
Azure Policies allow companies to enforce governance best practices in the AKS cluster around security, authentication, provisioning, networking, and other areas.
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-16 - Enable GitOps when using DevOps frameworks
Category: Automation
Impact: Low
Guidance
GitOps is an operating model for cloud-native applications that stores application and declarative infrastructure code in Git to be used as the source of truth for automated continuous delivery. With GitOps, you describe the desired state of your entire system in a Git repository, and a GitOps operator deploys it to your environment, which is often a Kubernetes cluster. To prevent potential outages or unsuccessful failovers, GitOps helps keep the configuration of all AKS clusters aligned with the intended state.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns AKS clusters where GitOps is not enabled
resources
| where type == "microsoft.containerservice/managedclusters"
| extend gitops = tostring (parse_json(properties.addOnProfiles.gitops.enabled))
| where isempty(gitops)
| project recommendationId="aks-16", id, name, tags, param1=strcat("gitopsEnabled: ", "false")
AKS-17 - Configure affinity or anti-affinity rules based on application requirements
Category: Availability
Impact: High
Guidance
Configure Topology Spread Constraints to control how Pods are spread across your cluster among failure-domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
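The effect of a topology spread constraint can be approximated with a small calculation. The sketch below is a simplified model of `maxSkew` with `whenUnsatisfiable: DoNotSchedule`, assuming every zone is an eligible domain. It ignores node eligibility, `minDomains`, and other scheduler details, and the zone names and pod counts are hypothetical.

```python
def placement_allowed(pods_per_zone: dict, target_zone: str, max_skew: int = 1) -> bool:
    """Return True if adding one pod to target_zone keeps skew within max_skew.

    Skew is modeled as the matching-pod count in the target domain (including
    the new pod) minus the smallest count across all domains.
    """
    global_min = min(pods_per_zone.values())
    skew = pods_per_zone[target_zone] + 1 - global_min
    return skew <= max_skew


# Hypothetical distribution of matching pods across three zones.
current = {"zone-1": 2, "zone-2": 2, "zone-3": 1}

if __name__ == "__main__":
    for zone in current:
        print(zone, placement_allowed(current, zone, max_skew=1))
```

With `maxSkew` set to 1, only the least-populated zone can take the next pod in this example, which is how the constraint keeps replicas evenly spread across failure domains.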
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-18 - Configure pod liveness, readiness, and startup probes
Category: Availability
Impact: High
Guidance
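The kubelet uses liveness probes to check the health of containers and applications, and restarts a container when its liveness probe fails. Readiness probes determine when a container is ready to receive traffic, and startup probes hold off the other probes until a slow-starting container has finished starting.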
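A container's probe configuration can be sketched as follows. This is an illustrative fragment under stated assumptions: the endpoint paths `/healthz` and `/ready`, the port, and all timing values are hypothetical. The helper estimates how long a container may take to start before the kubelet restarts it, computed as `failureThreshold * periodSeconds` of the startup probe.

```python
def container_probes(port: int = 8080) -> dict:
    """Return startup, liveness, and readiness probe settings for a container."""
    return {
        # Holds off the other probes until the application has started.
        "startupProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": 10,
            "failureThreshold": 30,
        },
        # The kubelet restarts the container when this probe keeps failing.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": 10,
            "failureThreshold": 3,
        },
        # The pod is removed from Service endpoints while this probe fails.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": 5,
            "failureThreshold": 3,
        },
    }


def max_startup_seconds(startup_probe: dict) -> int:
    """Approximate startup budget, ignoring any initialDelaySeconds."""
    return startup_probe["failureThreshold"] * startup_probe["periodSeconds"]


if __name__ == "__main__":
    probes = container_probes()
    print(max_startup_seconds(probes["startupProbe"]))  # 30 * 10 = 300
```

The returned dictionary is meant to be merged into a container entry of a pod template. Keeping the liveness threshold low and the startup threshold high lets slow-starting applications boot without weakening failure detection afterwards.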
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-19 - Configure pod replica sets in production applications to guarantee availability
Category: Availability
Impact: High
Guidance
Configure the replica count in Deployment (or ReplicaSet) manifests to maintain a stable set of replica Pods running at any given time. The controller then works to keep the specified number of identical Pods available.
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-20 - Configure system nodepool count
Category: Availability
Impact: High
Guidance
The system node pool should be configured with a minimum node count of two to ensure critical system pods are resilient to node outages.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns each AKS cluster that has system nodepools with a minimum count of fewer than 2 nodes
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taints = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where taints has "CriticalAddonsOnly=true:NoSchedule" and agentPoolProfile.minCount < 2
| project recommendationId="aks-20", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)
AKS-21 - Configure user nodepool count
Category: Availability
Impact: High
Guidance
The user node pool should be configured with a minimum node count of two if the application requires high availability.
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns each AKS cluster that has user nodepools with a minimum count of fewer than 2 nodes
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taints = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where taints !has "CriticalAddonsOnly=true:NoSchedule" and agentPoolProfile.minCount < 2
| project recommendationId="aks-21", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)
AKS-22 - Configure pod disruption budgets (PDBs)
Category: Availability
Impact: Medium
Guidance
A Pod Disruption Budget (PDB) is a Kubernetes resource that allows you to configure the minimum number or percentage of pods that should remain available during voluntary disruptions, such as maintenance or scaling events. To maintain the availability of applications, define Pod Disruption Budgets (PDBs) to make sure that a minimum number of pods are available in the cluster.
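The arithmetic behind a PDB can be sketched as follows. This is a simplified model rather than the actual controller logic: it assumes `minAvailable` is given either as an integer or as a percentage string, and that percentages are rounded up to the next whole pod. The function names are hypothetical.

```python
def required_available(min_available, expected_pods: int) -> int:
    """Resolve minAvailable (int or 'NN%') to a pod count, rounding up."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        percent = int(min_available[:-1])
        # Integer ceiling division avoids floating-point rounding surprises.
        return (expected_pods * percent + 99) // 100
    return int(min_available)


def allowed_disruptions(min_available, expected_pods: int, healthy_pods: int) -> int:
    """Number of pods that may be voluntarily evicted right now."""
    return max(0, healthy_pods - required_available(min_available, expected_pods))


if __name__ == "__main__":
    # 7 replicas with minAvailable "50%": 4 must stay up, so 3 may be evicted.
    print(allowed_disruptions("50%", expected_pods=7, healthy_pods=7))
```

A budget that resolves to zero allowed disruptions, for example `minAvailable: 2` on a workload with only two healthy replicas, blocks node drains, so the replica count and the budget need to be chosen together.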
Resources
Resource Graph Query
// cannot-be-validated-with-arg
AKS-23 - Nodepool subnet size needs to accommodate maximum auto-scale settings
Category: Availability
Impact: High
Guidance
Nodepool subnets should be sized to accommodate maximum auto-scale settings. By properly sizing the subnet, AKS can efficiently scale out nodes to meet increased demand, reducing the risk of resource constraints and potential service disruptions.
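The sizing check can be worked through with a short calculation. The sketch below mirrors the arithmetic used in the Resource Graph query for this recommendation: the number of addresses in the subnet minus the five addresses Azure reserves in each subnet. It counts one address per node only, so pod IPs (when pods draw from the same subnet) and surge nodes created during upgrades need additional headroom. The function names and sample CIDR ranges are hypothetical.

```python
import ipaddress

# Network address, broadcast address, and three Azure-reserved addresses.
AZURE_RESERVED_ADDRESSES = 5


def usable_addresses(cidr: str) -> int:
    """Addresses left for nodes after Azure's per-subnet reservations."""
    network = ipaddress.ip_network(cidr)
    return max(0, network.num_addresses - AZURE_RESERVED_ADDRESSES)


def subnet_fits(cidr: str, max_nodes: int) -> bool:
    """True if the subnet has more usable addresses than the autoscaler maximum."""
    return usable_addresses(cidr) > max_nodes


if __name__ == "__main__":
    print(usable_addresses("10.0.0.0/24"))           # 256 - 5 = 251
    print(subnet_fits("10.0.1.0/27", max_nodes=30))  # 32 - 5 = 27 usable, too small
```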
Resources
Resource Graph Query
// Azure Resource Graph Query
// Returns each AKS cluster with auto-scaling nodepools whose subnet is too small for the configured autoscale max-nodes
// Subtracting the network address, broadcast address, and default 3 addresses Azure reserves within each subnet
resources
| where type == "microsoft.containerservice/managedclusters"
| extend nodePools = properties['agentPoolProfiles']
| mv-expand nodePools = properties.agentPoolProfiles
| where nodePools.enableAutoScaling == true
| extend nodePoolName=nodePools.name, maxNodes = nodePools.maxCount, subnetId = tostring(nodePools.vnetSubnetID)
| project clusterId = id, clusterName=name, nodePoolName=nodePools.name, toint(maxNodes), subnetId
| join kind = leftouter (
resources
| where type == 'microsoft.network/virtualnetworks'
| extend subnets = properties.subnets
| mv-expand subnets
| project id = tostring(subnets.id), addressPrefix = tostring(subnets.properties['addressPrefix'])
| extend subnetmask = toint(substring(addressPrefix, indexof(addressPrefix, '/')+1, string_size(addressPrefix)))
| extend possibleMaxNodeCount = toint(exp2(32-subnetmask) - 5)
) on $left.subnetId == $right.id
| project-away id, subnetmask
| where possibleMaxNodeCount <= maxNodes
| extend param1 = strcat(nodePoolName, " autoscaler upper limit: ", maxNodes)
| extend param2 = strcat("ip addresses on subnet: ", possibleMaxNodeCount)
| project recommendationId="aks-23", name=clusterName, id=clusterId, param1, param2
AKS-24 - Enforce resource quotas at the namespace level
Category: Availability
Impact: High
Guidance
Enforcing namespace-level resource quotas is crucial for ensuring reliability by preventing resource exhaustion and maintaining cluster stability. This helps prevent individual applications or users from monopolizing resources, which can lead to degraded performance or outages for other applications in the cluster.
Resources
Resource Graph Query
// cannot-be-validated-with-arg