AKS


The resiliency recommendations in this guidance cover AKS and its associated settings.

Summary of Recommendations

Recommendations Details

AKS-1 - Deploy AKS cluster across availability zones

Category: Availability

Impact: High

Guidance

Azure Availability Zones are a high-availability offering that protects applications and data from datacenter-level failures. Availability Zones are unique physical locations within an Azure region that are equipped with independent power, cooling, and networking. Each Availability Zone is made up of one or more datacenters and is designed to be highly available and fault tolerant.

By deploying resources such as AKS clusters, virtual machines, storage, and databases across multiple Availability Zones in the same region, you can protect your applications and data from datacenter-level failures. If one Availability Zone goes down, the other Availability Zones in the region can continue to provide service.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns AKS clusters that do not have any availability zones enabled
resources
| where type == 'microsoft.containerservice/managedclusters'
| project id, name, tags, location, properties.agentPoolProfiles
| mv-expand properties_agentPoolProfiles
| where isempty(array_length(properties_agentPoolProfiles.availabilityZones))
| project recommendationId="aks-1", id, name, tags, param1=strcat("nodePoolName: ", properties_agentPoolProfiles.name), param2=strcat("orchestratorVersion: ", properties_agentPoolProfiles.orchestratorVersion), param3=strcat("currentOrchestratorVersion: ", properties_agentPoolProfiles.currentOrchestratorVersion), param4=strcat("numberOfZones: ", iff(isempty(array_length(properties_agentPoolProfiles.availabilityZones)), 0, array_length(properties_agentPoolProfiles.availabilityZones)))



AKS-2 - Isolate system and application pods

Category: Governance

Impact: High

Guidance

AKS automatically assigns the label kubernetes.azure.com/mode: system to nodes in a system node pool. This label signals to AKS that system pods should be scheduled on nodes in this pool. However, you can still schedule application pods on these nodes if you choose to do so.

To prevent misconfigured or rogue application pods from accidentally killing system pods, isolate critical system pods from your application pods. You can do this by scheduling system pods on dedicated node pools, by using node selectors so that system pods run only on nodes with the kubernetes.azure.com/mode: system label, and by tainting those node pools (for example with CriticalAddonsOnly=true:NoSchedule) so that application pods are kept off them.
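
The effect of tainting a system node pool can be illustrated with a small sketch. The Python below is a simplified model of how pod tolerations are matched against node taints; it is not the scheduler's actual code and only considers NoSchedule taints. It assumes the system node pool carries the CriticalAddonsOnly=true:NoSchedule taint, and the two pods are hypothetical.

```python
# Simplified model of Kubernetes taint/toleration matching (illustrative only).
# Assumption: the system node pool is tainted CriticalAddonsOnly=true:NoSchedule.

SYSTEM_POOL_TAINT = {"key": "CriticalAddonsOnly", "value": "true", "effect": "NoSchedule"}


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        # An empty key with Exists tolerates every taint.
        return not toleration.get("key") or toleration["key"] == taint["key"]
    # operator == "Equal": key and value must both match.
    return (toleration.get("key") == taint["key"]
            and toleration.get("value", "") == taint.get("value", ""))


def can_schedule(pod_tolerations: list, node_taints: list) -> bool:
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


# Hypothetical pods: an application pod without tolerations and a system pod
# that tolerates the CriticalAddonsOnly taint.
app_pod_tolerations = []
system_pod_tolerations = [{"key": "CriticalAddonsOnly", "operator": "Exists"}]

print(can_schedule(app_pod_tolerations, [SYSTEM_POOL_TAINT]))     # False
print(can_schedule(system_pod_tolerations, [SYSTEM_POOL_TAINT]))  # True
```

In this model the application pod is rejected because it has no toleration for the taint, which is the isolation this recommendation is after.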

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns each AKS cluster with nodepools that do not have taints set
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taint = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where isempty(taint)
| project recommendationId="aks-2", id, name, tags, param1=strcat("nodepoolName: ", nodePool), param2=strcat("taint: ", iff(isempty(taint), "None", taint))



AKS-3 - Disable local accounts

Category: Access & Security

Impact: High

Guidance

Local Kubernetes accounts provide a legacy non-auditable means of accessing an AKS cluster and are not recommended for use. Enabling Microsoft Entra integration on an AKS cluster provides several benefits for managing access to the cluster. By using Microsoft Entra, you can centralize user and group management, enforce multi-factor authentication, and enable role-based access control (RBAC) for fine-grained access control to cluster resources. Additionally, Microsoft Entra provides a secure and scalable authentication mechanism that can be integrated with other Azure services and third-party identity providers.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns AKS clusters with Kubernetes RBAC disabled, including their Microsoft Entra (AAD) profile and local account settings
resources
| where type == "microsoft.containerservice/managedclusters"
| extend aadProfile = tostring (parse_json(properties.aadProfile))
| extend disablelocalAdmin = tostring(parse_json(properties.disableLocalAccounts))
| extend RBAC = tostring(parse_json(properties.enableRBAC))
| where RBAC == "false"
| project recommendationId="aks-3", name, id, tags, param1=strcat("aadProfile: ", aadProfile), param2=strcat("disablelocalAdmin: ",disablelocalAdmin), param3=strcat("RBAC: ", RBAC)



AKS-4 - Configure Azure CNI networking for dynamic allocation of IPs

Category: Networking

Impact: Medium

Guidance

Azure CNI networking provides several benefits for managing IP addresses and network connectivity for cluster pods: dynamic allocation of IPs to pods, independent scaling of node and pod subnets, direct network connectivity between pods and resources in the VNet, and separate network policies for pods and nodes. It also supports multiple network policy engines, including Azure Network Policies and Calico.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns AKS clusters that use the kubenet network plugin
resources
| where type == "microsoft.containerservice/managedclusters"
| extend networkProfile = tostring (parse_json(properties.networkProfile.networkPlugin))
| where networkProfile =="kubenet"
| project recommendationId="aks-4", name, id, tags, param1=strcat("networkProfile :",networkProfile)



AKS-5 - Enable the cluster auto-scaler on an existing cluster

Category: System Efficiency

Impact: High

Guidance

The cluster auto-scaler automatically scales the number of nodes in a node pool based on pod resource requests and the available capacity in the cluster. It helps ensure that the cluster can scale according to demand and prevent outages.

If the cluster has availability zones enabled, the following configuration changes need to be verified or established:

  • Persistent Volumes - If the cluster is using persistent volumes backed by Azure Storage, ensure you have one nodepool per availability zone. Persistent volumes do not work across AZs and the auto-scaler could fail to create new pods if the nodepool cannot access the persistent volume.
  • Multiple Nodepools per Zone - If the cluster has multiple nodepools per AZ, enable the --balance-similar-node-groups property through the auto-scaler profile. This feature detects similar nodepools and balances the number of nodes across them.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Finds AKS clusters whose first node pool has auto-scaling disabled
resources
| where type == "microsoft.containerservice/managedclusters"
| extend autoScaling = tostring(parse_json(properties.agentPoolProfiles[0].enableAutoScaling))
| where autoScaling == "false"
| project recommendationId="aks-5", name, id, tags, param1=strcat("autoScaling :", autoScaling)



AKS-6 - Back up Azure Kubernetes Service

Category: Disaster Recovery

Impact: Low

Guidance

AKS is increasingly being used for stateful applications that require a backup strategy. Azure Backup now allows you to back up AKS clusters (cluster resources and persistent volumes attached to the cluster) using a backup extension, which must be installed in the cluster. The Backup vault communicates with the cluster through this backup extension to perform backup and restore operations.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Find AKS clusters that do not have backup enabled

resources
| where type =~ 'Microsoft.ContainerService/managedClusters'
| extend lname = tolower(name)
| join kind=leftouter(recoveryservicesresources
    | where type =~ 'microsoft.dataprotection/backupvaults/backupinstances'
    | extend lname = tolower(tostring(split(properties.dataSourceInfo.resourceID, '/')[8]))
    | extend protectionState = properties.currentProtectionState
    | project lname, protectionState) on lname
| where protectionState != 'ProtectionConfigured'
| extend param1 = iif(isnull(protectionState), 'Protection Not Configured', strcat('Protection State: ', protectionState))
| project recommendationId = "aks-6", name, id, tags, param1



AKS-7 - Plan an AKS version upgrade

Category: Compliance

Impact: High

Guidance

Minor version releases include new features and improvements. Patch releases are more frequent (sometimes weekly) and are intended for critical bug fixes within a minor version. Patch releases include fixes for security vulnerabilities or major bugs. If you’re running an unsupported Kubernetes version, you’ll be asked to upgrade when requesting support for the cluster. Clusters running unsupported Kubernetes releases aren’t covered by the AKS support policies.

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-8 - Use zone-redundant storage for persistent volumes when running multi-zone AKS

Category: Availability

Impact: Low

Guidance

For applications that need replication of data across availability zones to protect against zonal outages, customers should leverage zone-redundant storage (ZRS) with multi-zone AKS clusters. ZRS replicates data synchronously across three Azure availability zones in the primary region.

  • Azure Disks: Use ZRS disks by setting the disk SKU to StandardSSD_ZRS or Premium_ZRS in a storage class. Starting with AKS v1.29, multi-zone AKS clusters have default storage classes that use ZRS disks.
  • Azure Container Storage: Customers can leverage ZRS disks in Azure Container Storage by creating a storage pool and specifying StandardSSD_ZRS or Premium_ZRS as the SKU. Customers can also create a multi-zone storage pool where the total storage capacity is distributed across zones.
  • Azure Files: Use zone-redundant file shares by setting the SKU to Standard_ZRS or Premium_ZRS in a storage class.
  • Azure Blob: Use zone-redundant blob storage by setting the SKU to Standard_ZRS or Premium_ZRS in a storage class.
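
As an illustration of the Azure Disks option, the following Python sketch builds a StorageClass manifest that requests a ZRS disk SKU. It assumes the cluster uses the Azure Disk CSI driver (provisioner disk.csi.azure.com); the class name managed-csi-zrs is a hypothetical example, and the remaining settings should be adjusted to your workload.

```python
import json

# Sketch of a StorageClass that provisions zone-redundant Azure Disks.
# Assumptions: the class name "managed-csi-zrs" is hypothetical; the cluster
# uses the Azure Disk CSI driver (provisioner disk.csi.azure.com).
zrs_storage_class = {
    "apiVersion": "storage.k8s.io/v1",
    "kind": "StorageClass",
    "metadata": {"name": "managed-csi-zrs"},
    "provisioner": "disk.csi.azure.com",
    "parameters": {"skuName": "StandardSSD_ZRS"},  # or "Premium_ZRS"
    "reclaimPolicy": "Delete",
    # Delay volume creation until a pod that uses the claim is scheduled.
    "volumeBindingMode": "WaitForFirstConsumer",
    "allowVolumeExpansion": True,
}

# kubectl accepts JSON as well as YAML manifests.
print(json.dumps(zrs_storage_class, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f.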

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-9 - Upgrade Persistent Volumes using in-tree drivers to Azure CSI drivers

Category: Storage

Impact: High

Guidance

From Kubernetes version 1.26 onward, the Azure Disk and Azure File in-tree drivers (persistent volume types with the provisioners kubernetes.io/azure-disk and kubernetes.io/azure-file) are no longer supported, because the Kubernetes community has deprecated in-tree storage drivers. Azure Storage is now provided by the Azure Disk and Azure File CSI drivers. Existing deployments that use the in-tree drivers are not expected to break, but these drivers are no longer tested, so customers should update their deployments to use the CSI drivers. The CSI drivers are also needed to leverage new storage capabilities (new SKUs, features, etc.).
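
Because this recommendation cannot be validated with Azure Resource Graph, one way to find affected storage classes is to inspect them in the cluster. The Python sketch below assumes its input is the parsed items list from kubectl get storageclass -o json; the helper name and the sample data are hypothetical.

```python
# Sketch: flag StorageClasses that still use the in-tree Azure provisioners.
# Assumption: `storage_classes` is the parsed "items" list from
# `kubectl get storageclass -o json`; the sample data below is hypothetical.

IN_TREE_TO_CSI = {
    "kubernetes.io/azure-disk": "disk.csi.azure.com",
    "kubernetes.io/azure-file": "file.csi.azure.com",
}


def find_in_tree_storage_classes(storage_classes: list) -> list:
    """Return (name, in-tree provisioner, CSI replacement) for each legacy class."""
    findings = []
    for sc in storage_classes:
        provisioner = sc.get("provisioner", "")
        if provisioner in IN_TREE_TO_CSI:
            findings.append((sc["metadata"]["name"], provisioner, IN_TREE_TO_CSI[provisioner]))
    return findings


sample = [
    {"metadata": {"name": "legacy-disk"}, "provisioner": "kubernetes.io/azure-disk"},
    {"metadata": {"name": "managed-csi"}, "provisioner": "disk.csi.azure.com"},
    {"metadata": {"name": "legacy-file"}, "provisioner": "kubernetes.io/azure-file"},
]

for name, old, new in find_in_tree_storage_classes(sample):
    print(f"{name}: migrate from {old} to {new}")
```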

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-10 - Implement Resource Quota to ensure that Kubernetes resources do not exceed hard resource limits

Category: System Efficiency

Impact: Low

Guidance

A resource quota, defined by a ResourceQuota object, provides constraints that limit aggregate resource consumption per namespace. It can limit the quantity of objects that can be created in a namespace by type, as well as the total amount of compute resources that may be consumed by resources in that namespace.
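
A minimal sketch of such an object follows, built as a Python dictionary and printed as JSON that kubectl can apply. The quota name, the team-a namespace, and every limit are hypothetical values chosen for illustration, not recommended sizes.

```python
import json

# Sketch of a ResourceQuota manifest. The namespace name and all limits are
# hypothetical values chosen for illustration; size them for your workloads.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "team-a-quota", "namespace": "team-a"},
    "spec": {
        "hard": {
            # Aggregate compute resources across all pods in the namespace.
            "requests.cpu": "4",
            "requests.memory": "8Gi",
            "limits.cpu": "8",
            "limits.memory": "16Gi",
            # Object-count limits by type.
            "pods": "20",
            "persistentvolumeclaims": "10",
        }
    },
}

print(json.dumps(resource_quota, indent=2))
```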

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-11 - Attach Virtual Nodes (ACI) to the AKS cluster

Category: System Efficiency

Impact: Low

Guidance

To rapidly scale application workloads in an AKS cluster, you can use virtual nodes. With virtual nodes, pods provision much faster than through the Kubernetes cluster auto-scaler.

If the cluster has availability zones enabled, the following configuration changes need to be verified or established:

  • Persistent Volumes - If the cluster is using persistent volumes backed by Azure Storage, ensure you have one nodepool per availability zone. Persistent volumes do not work across AZs and the auto-scaler could fail to create new pods if the nodepool cannot access the persistent volume.

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-12 - Update AKS tier to Standard

Category: Availability

Impact: High

Guidance

Production AKS clusters should be configured with the Standard tier. The Free tier doesn’t offer a financially backed SLA, and its node scalability is limited. To obtain that SLA, select the Standard tier.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns all AKS clusters not running on the Standard tier
resources
| where type == "microsoft.containerservice/managedclusters"
| where sku.tier != "Standard"
| project recommendationId="aks-12", id, name, tags, param1=strcat("skuName: ", sku.name), param2=strcat("skuTier: ", sku.tier)



AKS-13 - Enable AKS Monitoring

Category: Monitoring

Impact: High

Guidance

Azure Monitor collects events, captures container logs, and collects CPU and memory information from the Metrics API. Visualizing this data lets you validate the near-real-time health and performance of AKS environments. The visualization tool can be Azure Monitor Container Insights, Prometheus, Grafana, or others.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns AKS clusters where Azure Monitor metrics or Container Insights is not enabled
resources
|  where type == "microsoft.containerservice/managedclusters"
|  extend azureMonitor = tostring(parse_json(properties.azureMonitorProfile.metrics.enabled))
|  extend insights = tostring(parse_json(properties.addonProfiles.omsagent.enabled))
|  where isempty(azureMonitor) or isempty(insights)
|  project recommendationId="aks-13",id, name, tags, param1=strcat("azureMonitorProfileEnabled: ", iff(isempty(azureMonitor), "false", azureMonitor)), param2=strcat("containerInsightsEnabled: ", iff(isempty(insights), "false", insights))



AKS-14 - Use Ephemeral OS disks on AKS clusters

Category: System Efficiency

Impact: Medium

Guidance

Ephemeral disks are ideal as OS disks for stateless applications because they provide better performance and improved reliability by decreasing IO incidents. Additionally, customers won’t incur additional storage costs for the OS, and cluster operations such as scale or upgrade complete faster thanks to faster re-imaging and boot times. Unless customers explicitly request an Azure managed disk for the OS, AKS defaults to an ephemeral OS disk when the VM SKU selected for the node pool supports it.

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-15 - Enable and remediate Azure Policies configured for AKS

Category: Governance

Impact: Low

Guidance

Azure Policies allow companies to enforce governance best practices in the AKS cluster around security, authentication, provisioning, networking and others.

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-16 - Enable GitOps when using DevOps frameworks

Category: Automation

Impact: Low

Guidance

GitOps is an operating model for cloud-native applications that stores application and declarative infrastructure code in Git to be used as the source of truth for automated continuous delivery. With GitOps, you describe the desired state of your entire system in a Git repository, and a GitOps operator deploys it to your environment, which is often a Kubernetes cluster. By keeping every AKS cluster aligned with its intended configuration, GitOps helps prevent potential outages and unsuccessful failovers.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns AKS clusters where GitOps is not enabled
resources
|  where type == "microsoft.containerservice/managedclusters"
|  extend gitops = tostring (parse_json(properties.addOnProfiles.gitops.enabled))
|  where isempty(gitops)
|  project recommendationId="aks-16", id, name, tags, param1=strcat("gitopsEnabled: ", "false")



AKS-17 - Configure affinity or anti-affinity rules based on application requirements

Category: Availability

Impact: High

Guidance

Configure Topology Spread Constraints to control how Pods are spread across your cluster among failure domains such as regions, zones, nodes, and other user-defined topology domains. This can help to achieve high availability as well as efficient resource utilization.
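
To make the effect of a spread constraint concrete, the Python sketch below models how maxSkew restricts placement when whenUnsatisfiable is DoNotSchedule. It is a simplified model rather than the scheduler's implementation, and the zone names, pod counts, and app: web label are hypothetical.

```python
# Sketch of how a topology spread constraint with whenUnsatisfiable: DoNotSchedule
# limits placement. Simplified model, not the scheduler's implementation.
# Assumption: zone names, pod counts, and the label below are hypothetical.

constraint = {
    "maxSkew": 1,
    "topologyKey": "topology.kubernetes.io/zone",
    "whenUnsatisfiable": "DoNotSchedule",
    "labelSelector": {"matchLabels": {"app": "web"}},
}


def allowed_zones(pods_per_zone: dict, max_skew: int) -> list:
    """Zones where one more matching pod keeps the skew within max_skew.

    Skew for a candidate zone = (pods in that zone + the new pod)
    minus the smallest pod count in any zone.
    """
    global_min = min(pods_per_zone.values())
    return sorted(
        zone for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


current = {"zone-1": 2, "zone-2": 2, "zone-3": 1}
print(allowed_zones(current, constraint["maxSkew"]))  # ['zone-3']
```

With two pods already in zone-1 and zone-2 and one in zone-3, the next pod can only land in zone-3, which keeps replicas balanced across zones.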

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-18 - Configure Pod liveness, readiness, and startup probes

Category: Availability

Impact: High

Guidance

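The kubelet uses liveness probes to check the health of containers and applications, and it restarts a container whose liveness probe fails. Readiness probes determine when a container is ready to accept traffic, and startup probes tell the kubelet when a slow-starting application has finished starting.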
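
The following Python sketch shows what the three probe types could look like on a single container, expressed as the dictionary form of a container spec. The image, paths, port, and timings are hypothetical and should be tuned to how the application actually behaves.

```python
import json

# Sketch of the three probe types on one container. The image, paths, port,
# and timings are hypothetical; tune them to the application's behavior.
container = {
    "name": "web",
    "image": "example.azurecr.io/web:1.0",
    "ports": [{"containerPort": 8080}],
    # Startup probe: holds off the other probes until the app has started.
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 30,
    },
    # Liveness probe: the kubelet restarts the container when this fails.
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    # Readiness probe: the pod receives Service traffic only while this passes.
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
}


def max_startup_seconds(probe: dict) -> int:
    """Longest time the app may take to start before the container is restarted."""
    return probe["failureThreshold"] * probe["periodSeconds"]


print(json.dumps(container, indent=2))
print(max_startup_seconds(container["startupProbe"]))  # 150
```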

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-19 - Configure pod replica sets in production applications to guarantee availability

Category: Availability

Impact: High

Guidance

Set the number of replicas in your Deployment or ReplicaSet manifests so that Kubernetes maintains a stable set of replica Pods running at any given time. This keeps a specified number of identical Pods available even when individual Pods fail.
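
As a minimal sketch, the Python below builds a Deployment manifest that asks Kubernetes to keep three identical Pods running. The name, label, and image are hypothetical.

```python
import json

# Sketch of a Deployment that keeps three identical Pods running.
# The name, label, and image are hypothetical.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web"},
    "spec": {
        "replicas": 3,
        # The selector must match the Pod template labels below.
        "selector": {"matchLabels": {"app": "web"}},
        "template": {
            "metadata": {"labels": {"app": "web"}},
            "spec": {
                "containers": [{"name": "web", "image": "example.azurecr.io/web:1.0"}]
            },
        },
    },
}

print(json.dumps(deployment, indent=2))
```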

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-20 - Configure system nodepool count

Category: Availability

Impact: High

Guidance

The system node pool should be configured with a minimum node count of two to ensure critical system pods are resilient to node outages.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns each AKS cluster with system nodepools that have a minimum node count of less than 2
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taints = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where taints has "CriticalAddonsOnly=true:NoSchedule" and agentPoolProfile.minCount < 2
| project recommendationId="aks-20", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)



AKS-21 - Configure user nodepool count

Category: Availability

Impact: High

Guidance

The user node pool should be configured with a minimum node count of two if the application requires high availability.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns each AKS cluster with user nodepools that have a minimum node count of less than 2
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand agentPoolProfile = properties.agentPoolProfiles
| extend taints = tostring(parse_json(agentPoolProfile.nodeTaints))
| extend nodePool = tostring(parse_json(agentPoolProfile.name))
| where taints !has "CriticalAddonsOnly=true:NoSchedule" and agentPoolProfile.minCount < 2
| project recommendationId="aks-21", id, name, param1=strcat("nodePoolName: ", nodePool), param2=strcat("nodePoolMinNodeCount: ", agentPoolProfile.minCount)



AKS-22 - Configure pod disruption budgets (PDBs)

Category: Availability

Impact: Medium

Guidance

A Pod Disruption Budget (PDB) is a Kubernetes resource that specifies the minimum number or percentage of pods that must remain available during voluntary disruptions, such as maintenance or scaling events. Define PDBs for your applications so that a minimum number of pods stays available in the cluster during these disruptions.
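
The Python sketch below pairs a PDB manifest with the arithmetic behind it. It is an illustration under stated assumptions: the name, label, and pod counts are hypothetical, and the helper is a simplified model of how many voluntary evictions the budget permits.

```python
import json
import math

# Sketch of a PodDisruptionBudget and the arithmetic behind it.
# The name, label, and numbers are hypothetical.
pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "web-pdb"},
    "spec": {
        "minAvailable": 2,  # may also be a percentage string such as "50%"
        "selector": {"matchLabels": {"app": "web"}},
    },
}


def allowed_disruptions(healthy_pods: int, expected_pods: int, min_available) -> int:
    """Pods that may be evicted voluntarily right now without breaching the budget."""
    if isinstance(min_available, str) and min_available.endswith("%"):
        # Percentages are resolved against the expected pod count, rounding up.
        required = math.ceil(expected_pods * int(min_available[:-1]) / 100)
    else:
        required = int(min_available)
    return max(0, healthy_pods - required)


print(json.dumps(pdb, indent=2))
print(allowed_disruptions(healthy_pods=3, expected_pods=3, min_available=2))      # 1
print(allowed_disruptions(healthy_pods=3, expected_pods=3, min_available="50%"))  # 1
```

With three healthy replicas and minAvailable set to 2, a node drain can evict only one pod at a time.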

Resources

Resource Graph Query

// cannot-be-validated-with-arg



AKS-23 - Nodepool subnet size needs to accommodate maximum auto-scale settings

Category: Availability

Impact: High

Guidance

Nodepool subnets should be sized to accommodate maximum auto-scale settings. By properly sizing the subnet, AKS can efficiently scale out nodes to meet increased demand, reducing the risk of resource constraints and potential service disruptions.
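
The sizing check can be worked through with a little arithmetic. The Python sketch below subtracts the 5 addresses Azure reserves in every subnet and compares the remainder with the auto-scaler maximum. The subnet ranges and node counts are hypothetical, and the check counts one IP per node only.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet: the network address, the
# broadcast address, and 3 addresses for Azure services.
AZURE_RESERVED_ADDRESSES = 5


def usable_addresses(subnet_cidr: str) -> int:
    """Addresses in the subnet that remain after Azure's reserved addresses."""
    return ipaddress.ip_network(subnet_cidr).num_addresses - AZURE_RESERVED_ADDRESSES


def subnet_fits_autoscaler(subnet_cidr: str, max_nodes: int) -> bool:
    """True if the subnet has at least one address per node at maximum scale.

    Simplification: counts one IP per node only. Surge nodes created during
    upgrades, and pod IPs when pods share the node subnet, need extra room.
    """
    return usable_addresses(subnet_cidr) >= max_nodes


# Hypothetical subnets and auto-scaler limits.
print(usable_addresses("10.240.0.0/24"))             # 251
print(subnet_fits_autoscaler("10.240.0.0/24", 100))  # True
print(subnet_fits_autoscaler("10.240.1.0/27", 50))   # False
```

A /27 subnet leaves 27 usable addresses, so a node pool allowed to scale to 50 nodes would run out of addresses before reaching its maximum.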

Resources

Resource Graph Query

// Azure Resource Graph Query
// Returns each AKS cluster with auto-scaling nodepools whose subnet is too small for the configured auto-scale max node count
// Usable addresses = subnet size minus the network address, the broadcast address, and the 3 addresses Azure reserves within each subnet
resources
| where type == "microsoft.containerservice/managedclusters"
| mv-expand nodePools = properties.agentPoolProfiles
| where nodePools.enableAutoScaling == true
| extend nodePoolName = nodePools.name, maxNodes = toint(nodePools.maxCount), subnetId = tostring(nodePools.vnetSubnetID)
| project clusterId = id, clusterName = name, nodePoolName, maxNodes, subnetId
| join kind = leftouter (
    resources
    | where type == 'microsoft.network/virtualnetworks'
    | extend subnets = properties.subnets
    | mv-expand subnets
    | project id = tostring(subnets.id), addressPrefix = tostring(subnets.properties['addressPrefix'])
    | extend subnetmask = toint(substring(addressPrefix, indexof(addressPrefix, '/')+1, string_size(addressPrefix)))
    | extend possibleMaxNodeCount = toint(exp2(32-subnetmask) - 5)
) on $left.subnetId == $right.id
| project-away id, subnetmask
| where possibleMaxNodeCount <= maxNodes
| extend param1 = strcat(nodePoolName, " autoscaler upper limit: ", maxNodes)
| extend param2 = strcat("ip addresses on subnet: ", possibleMaxNodeCount)
| project recommendationId="aks-23", name=clusterName, id=clusterId, param1, param2



AKS-24 - Enforce resource quotas at the namespace level

Category: Availability

Impact: High

Guidance

Enforcing namespace-level resource quotas is crucial for ensuring reliability by preventing resource exhaustion and maintaining cluster stability. This helps prevent individual applications or users from monopolizing resources, which can lead to degraded performance or outages for other applications in the cluster.

Resources

Resource Graph Query

// cannot-be-validated-with-arg