Configure Azure Service Health notifications and alerts for Azure VMware Solution
Impact: High
Category: Monitoring and Alerting
APRL GUID: 74fcb9f2-9a25-49a6-8c42-d32851c4afb7
Description:
Ensure Azure Service Health notifications are configured for Azure VMware Solution across all regions and subscriptions in use. These notifications communicate service and security issues as well as maintenance activities such as host replacements and upgrades, which reduces the need to submit service requests.
Azure Resource Graph query:
// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have one or more service health alerts covering AVS private clouds in the deployed subscription and region pairs.
//full list of private clouds
(resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend locale = tolower(location)
| extend subscriptionId = tolower(subscriptionId)
| project id, name, tags, subscriptionId, locale)
| join kind=leftouter
//Alert IDs that include all incident types filtered by AVS Service Health alerts
((resources
| where type == "microsoft.insights/activitylogalerts"
| extend alertproperties = todynamic(properties)
| where alertproperties.condition.allOf[0].field == "category" and alertproperties.condition.allOf[0].equals == "ServiceHealth"
| where alertproperties.condition.allOf[1].field == "properties.impactedServices[*].ServiceName" and set_has_element(alertproperties.condition.allOf[1].containsAny, "Azure VMware Solution")
| extend locale = strcat_array(split(tolower(alertproperties.condition.allOf[2].containsAny),' '), '')
| mv-expand todynamic(locale)
| where locale != "global"
| project subscriptionId, tostring(locale))
| union
//Alert IDs that include only some of the incident types after filtering by service health alerts covering AVS private clouds.
(resources
| where type == "microsoft.insights/activitylogalerts"
| extend subscriptionId = tolower(subscriptionId)
| extend alertproperties = todynamic(properties)
| where alertproperties.condition.allOf[0].field == "category" and alertproperties.condition.allOf[0].equals == "ServiceHealth"
| where alertproperties.condition.allOf[2].field == "properties.impactedServices[*].ServiceName" and set_has_element(alertproperties.condition.allOf[2].containsAny, "Azure VMware Solution")
| extend locale = strcat_array(split(tolower(alertproperties.condition.allOf[3].containsAny),' '), '')
| mv-expand todynamic(locale)
| mv-expand alertproperties.condition.allOf[1].anyOf
| extend incidentType = alertproperties_condition_allOf_1_anyOf.equals
| where locale != "global"
| project id, subscriptionId, locale, incidentType
| distinct subscriptionId, tostring(locale), tostring(incidentType)
| summarize incidentTypes=count() by subscriptionId, locale
| where incidentTypes == 5 //only include this subscription, region pair if it includes all the incident types.
| project subscriptionId, locale)) on subscriptionId, locale
| where subscriptionId1 == "" or locale1 == "" or isnull(subscriptionId1) or isnull(locale1)
| project recommendationId = "74fcb9f2-9a25-49a6-8c42-d32851c4afb7", name, id, tags, param1 = "avsServiceHealthAlertsAllIncidentTypesConfigured: False"
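The coverage rule the query applies can be hard to follow in KQL, so here is a minimal Python sketch of the same idea. It is an illustration only, not part of the APRL tooling: the function name, the input dictionaries, and the use of `None` to mean "alert covers all incident types" are assumptions made for this example. The constant 5 mirrors `where incidentTypes == 5` in the query.

```python
from collections import defaultdict

# Mirrors `where incidentTypes == 5` in the query above.
REQUIRED_INCIDENT_TYPE_COUNT = 5


def uncovered_private_clouds(private_clouds, alerts):
    """Return names of private clouds whose (subscription, region) pair
    has no Service Health alert coverage for all incident types.

    private_clouds: dicts with "name", "subscriptionId", "location" (hypothetical shape)
    alerts: dicts with "subscriptionId", "regions", and "incidentTypes"
            (None means the alert does not filter on incident type)
    """
    covered = set()
    partial = defaultdict(set)
    for alert in alerts:
        sub = alert["subscriptionId"].lower()
        for region in alert["regions"]:
            # "UK South" -> "uksouth", matching the locale normalization in the query
            locale = region.lower().replace(" ", "")
            if locale == "global":
                continue
            if alert["incidentTypes"] is None:
                covered.add((sub, locale))
            else:
                # Distinct incident types are pooled across alerts per pair
                partial[(sub, locale)].update(alert["incidentTypes"])
    for pair, types in partial.items():
        if len(types) == REQUIRED_INCIDENT_TYPE_COUNT:
            covered.add(pair)
    return [
        pc["name"]
        for pc in private_clouds
        if (pc["subscriptionId"].lower(), pc["location"].lower()) not in covered
    ]
```

A pair counts as covered either by one alert with no incident type filter or by several alerts whose incident types add up to all five, which is what the two branches of the `union` express.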
Monitor when Azure VMware Solution Private Cloud is reaching the capacity limit
Impact: Medium
Category: Monitoring and Alerting
APRL GUID: 29d7a115-dfb6-4df1-9205-04824109548f
Description:
Set an alert for when the node count in Azure VMware Solution Private Cloud hits or exceeds 90 hosts, enabling timely planning for a new private cloud.
Azure Resource Graph query:
//cannot-be-validated-with-arg
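Because this recommendation cannot be validated with Azure Resource Graph, the check has to be made elsewhere. As a rough sketch of the threshold arithmetic only, the following Python function sums per-cluster host counts and compares the total against the 90-host alert point from the description. The function name, its input (a list of host counts per cluster), and the returned dictionary are assumptions for illustration, and how the counts are obtained is left out.

```python
PRIVATE_CLOUD_HOST_ALERT_THRESHOLD = 90  # alert point stated in the description


def private_cloud_capacity_status(cluster_host_counts,
                                  alert_threshold=PRIVATE_CLOUD_HOST_ALERT_THRESHOLD):
    """Summarize how close a private cloud is to the host-count alert point.

    cluster_host_counts: hypothetical list with one host count per cluster.
    """
    total = sum(cluster_host_counts)
    return {
        "total_hosts": total,
        # The description says "hits or exceeds", so the comparison is >=
        "alert": total >= alert_threshold,
        "hosts_until_alert": max(alert_threshold - total, 0),
    }
```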
Monitor when Azure VMware Solution Cluster Size is approaching the host limit
Impact: Medium
Category: Monitoring and Alerting
APRL GUID: f86355e3-de7c-4dad-8080-1b0b411e66c8
Description:
Alert when the cluster size reaches 14 hosts. Set up periodic alerts for planning new clusters or datastores due to growth, especially from storage needs. Beyond 14 hosts, trigger alerts for each new host addition for proactive resource monitoring.
Azure Resource Graph query:
//cannot-be-validated-with-arg
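The alerting rule in the description (alert at 14 hosts, then again for each host added beyond 14) can be expressed as a small comparison between two observations of cluster size. The sketch below is one possible reading of that rule; the function name and the idea of comparing a previous and a current host count are assumptions for illustration.

```python
CLUSTER_HOST_ALERT_THRESHOLD = 14  # alert point stated in the description


def cluster_alert_needed(previous_hosts, current_hosts,
                         threshold=CLUSTER_HOST_ALERT_THRESHOLD):
    """Return True when a host addition brings the cluster to the threshold
    or grows it further beyond the threshold."""
    grew = current_hosts > previous_hosts
    return grew and current_hosts >= threshold
```

Under this reading, growing from 13 to 14 hosts alerts, growing from 14 to 15 alerts again, and an unchanged or shrinking cluster does not alert.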
Enable Stretched Clusters for Multi-AZ Availability of the vSAN Datastore
Impact: Low
Category: High Availability
APRL GUID: 9ec5b4c8-3dd8-473a-86ee-3273290331b9
Description:
For Azure VMware Solution, enabling Stretched Clusters offers a 99.99% SLA and synchronous storage replication (RPO=0), and spreads the vSAN datastore across two AZs. Stretched Clusters must be enabled at initial setup and need double the quota because the cluster extends across two AZs.
Azure Resource Graph query:
// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that aren't configured as stretched clusters and in supported regions.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend avsproperties = todynamic(properties)
| where avsproperties.availability.strategy != "DualZone"
| where location in ("uksouth", "westeurope", "germanywestcentral", "australiaeast")
| project recommendationId = "9ec5b4c8-3dd8-473a-86ee-3273290331b9", name, id, tags, param1 = "stretchClusters: Disabled"
Configure Azure Monitor Alert warning thresholds for vSAN datastore utilization
Impact: High
Category: Monitoring and Alerting
APRL GUID: 4232eb32-3241-4049-9e14-9b8005817b56
Description:
Ensure VMware vSAN datastore slack space is maintained for the SLA by monitoring storage utilization and setting alerts at 70% and 75% utilization to allow for capacity planning. To expand capacity, add hosts or, if CPU and RAM requirements are already met, add external storage such as Azure Elastic SAN or Azure NetApp Files.
Azure Resource Graph query:
// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a vSAN capacity critical alert with a threshold of 75% or a warning capacity of 70%.
(
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "DiskUsedPercentage"
| where threshold == 75
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "4232eb32-3241-4049-9e14-9b8005817b56", name, id, tags, param1 = "vsanCapacityCriticalAlert: isNull or threshold != 75"
)
| union (
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "DiskUsedPercentage"
| where threshold == 70
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "4232eb32-3241-4049-9e14-9b8005817b56", name, id, tags, param1 = "vsanCapacityWarningAlert: isNull or threshold != 70"
)
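The two branches of the query above check for the same thing twice with different thresholds. A minimal Python sketch of that check for a single private cloud follows; the function name and the input shape (a list of `(metric, threshold)` pairs taken from the metric alerts scoped to the private cloud) are assumptions for illustration, while the metric name and the 70 and 75 thresholds come from the query.

```python
# Alert labels and thresholds as used in the query's param1 values
REQUIRED_VSAN_ALERTS = {
    "vsanCapacityCriticalAlert": 75,
    "vsanCapacityWarningAlert": 70,
}


def missing_vsan_alerts(alert_criteria):
    """Return the labels of required vSAN capacity alerts that are not configured.

    alert_criteria: hypothetical list of (metric_name, threshold) pairs.
    """
    configured = {
        int(threshold)
        for metric, threshold in alert_criteria
        if metric == "DiskUsedPercentage"
    }
    return [
        label
        for label, threshold in REQUIRED_VSAN_ALERTS.items()
        if threshold not in configured
    ]
```

As in the query, only an exact threshold match counts, so an alert at 80% on `DiskUsedPercentage` would still leave both labels reported as missing.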
Configure Syslog in Diagnostic Settings for Azure VMware Solution
Impact: High
Category: Monitoring and Alerting
APRL GUID: fa4ab927-bced-429a-971a-53350de7f14b
Description:
Ensure Diagnostic Settings are configured for each private cloud to send syslogs to external sources for analysis and/or archiving. Azure VMware Solution Syslogs contain data for troubleshooting and performance, aiding quicker issue resolution and early detection of issues.
Azure Resource Graph query:
//cannot-be-validated-with-arg
Monitor CPU Utilization to ensure sufficient resources for workloads
Impact: High
Category: Monitoring and Alerting
APRL GUID: 4ee5d535-c47b-470a-9557-4a3dd297d62f
Description:
Ensure sufficient compute resources to avoid host resource exhaustion in Azure VMware Solution, which utilizes vSphere DRS and HA for dynamic workload resource management. However, sustained CPU utilization over 95% may increase CPU Ready times, impacting workloads.
Azure Resource Graph query:
// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a Cluster CPU capacity critical alert with a threshold of 95%.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "EffectiveCpuAverage"
| where threshold == 95
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "4ee5d535-c47b-470a-9557-4a3dd297d62f", name, id, tags, param1 = "hostCpuCriticalAlert: isNull or threshold != 95"
Monitor Memory Utilization to ensure sufficient resources for workloads
Impact: High
Category: Monitoring and Alerting
APRL GUID: 029208c8-5186-4a76-8ee8-6e3445fef4dd
Description:
Ensure sufficient memory resources to prevent host resource exhaustion in Azure VMware Solution, which uses vSphere DRS and vSphere HA for dynamic workload management. However, sustained memory utilization over 95% leads to disk swapping, which affects workloads.
Azure Resource Graph query:
// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a cluster host memory critical alert with a threshold of 95%.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "UsageAverage"
| where threshold == 95
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "029208c8-5186-4a76-8ee8-6e3445fef4dd", name, id, tags, param1 = "hostMemoryCriticalAlert: isNull or threshold != 95"
Apply Resource delete lock on the resource group hosting the private cloud
Impact: High
Category: Governance
APRL GUID: a5ef7c05-c611-4842-9af5-11efdc99123a
Description:
Applying a resource delete lock to the Azure VMware Solution Private Cloud resource group prevents unauthorized or accidental deletion by anyone with contributor access, ensuring the protection and reliability of the Azure VMware Solution Private Cloud.
Azure Resource Graph query:
//cannot-be-validated-with-arg
Use key autorotation for vSAN datastore customer-managed keys
Impact: High
Category: Security
APRL GUID: e0ac2f57-c8c0-4b8c-a7c8-19e5797828b5
Description:
When using customer-managed keys for encrypting vSAN datastores, leveraging Azure Key Vault for central management and accessing them via a managed identity linked to the private cloud is advised. The expiration of these keys can render the vSAN datastore and its associated workloads inaccessible.
Azure Resource Graph query:
//cannot-be-validated-with-arg
Use multiple DNS servers per private FQDN zone
Impact: High
Category: High Availability
APRL GUID: fcc2e257-23af-4c68-aac8-9cc03033c939
Description:
Azure VMware Solution private clouds support up to three DNS servers for a single FQDN, preventing a single DNS server from becoming a point of failure. It's crucial to use multiple DNS servers for on-premises FQDN resolution from each private cloud.
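As a simple pre-check of a planned DNS server list for one FQDN zone, the rule above (more than one server, at most three) can be sketched in Python as follows. The function name, the input list of IP address strings, and the wording of the returned messages are assumptions for illustration; the sketch only validates the list and does not query or configure a private cloud.

```python
import ipaddress

MAX_DNS_SERVERS = 3  # "up to three DNS servers for a single FQDN"


def check_dns_servers(servers):
    """Return a list of problems found in a planned DNS server list (empty if none)."""
    problems = []
    unique = set()
    for server in servers:
        try:
            unique.add(ipaddress.ip_address(server))
        except ValueError:
            problems.append(f"not a valid IP address: {server}")
    if len(unique) < 2:
        problems.append("fewer than two distinct DNS servers: single point of failure")
    if len(unique) > MAX_DNS_SERVERS:
        problems.append("more than three DNS servers configured")
    return problems
```

Duplicate addresses are collapsed before counting, since listing the same server twice does not remove the single point of failure.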