Azure VMware Solution


The resiliency recommendations in this guidance cover Azure VMware Solution and its associated resources and settings.

Summary of Recommendations

Recommendations Details

AVS-1 - Configure Azure Service Health notifications and alerts for Azure VMware Solution

Category: Monitoring

Impact: Medium

Recommendation/Guidance

Ensure Azure Service Health notifications and alerts are configured for the Azure VMware Solution service in the subscriptions and regions where Azure VMware Solution is deployed.

Azure Service Health is the mechanism used to inform customers of any service or security issues affecting their private cloud deployment. Additionally, Azure Service Health is used to inform customers of maintenance activities in their Azure VMware Solution environments including host replacements, upgrades, and any service updates which could potentially impact customer operations. Proper configuration of Azure Service Health notifications and alerts ensures that customers receive relevant notifications and can reduce service request submissions due to Azure VMware Solution maintenance.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have one or more service health alerts covering AVS private clouds in the deployed subscription and region pairs.
//full list of private clouds
(resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend locale = tolower(location)
| extend subscriptionId = tolower(subscriptionId)
| project id, name, tags, subscriptionId, locale)
| join kind=leftouter
//Alert ID's that include all incident types filtered by AVS Service Health alerts
((resources
| where type == "microsoft.insights/activitylogalerts"
| extend alertproperties = todynamic(properties)
| where alertproperties.condition.allOf[0].field == "category" and alertproperties.condition.allOf[0].equals == "ServiceHealth"
| where alertproperties.condition.allOf[1].field == "properties.impactedServices[*].ServiceName" and set_has_element(alertproperties.condition.allOf[1].containsAny, "Azure VMware Solution")
| extend locale = strcat_array(split(tolower(alertproperties.condition.allOf[2].containsAny),' '), '')
| mv-expand todynamic(locale)
| where locale != "global"
| project subscriptionId, tostring(locale) )
| union
//Alert ID's that include only some of the incident types after filtering by service health alerts covering AVS private clouds.
(resources
| where type == "microsoft.insights/activitylogalerts"
| extend subscriptionId = tolower(subscriptionId)
| extend alertproperties = todynamic(properties)
| where alertproperties.condition.allOf[0].field == "category" and alertproperties.condition.allOf[0].equals == "ServiceHealth"
| where alertproperties.condition.allOf[2].field == "properties.impactedServices[*].ServiceName" and set_has_element(alertproperties.condition.allOf[2].containsAny, "Azure VMware Solution")
| extend locale = strcat_array(split(tolower(alertproperties.condition.allOf[3].containsAny),' '), '')
| mv-expand todynamic(locale)
| mv-expand alertproperties.condition.allOf[1].anyOf
| extend incidentType = alertproperties_condition_allOf_1_anyOf.equals
| where locale != "global"
| project id, subscriptionId, locale, incidentType
| distinct subscriptionId, tostring(locale), tostring(incidentType)
| summarize incidentTypes=count() by subscriptionId, locale
| where incidentTypes == 5 //only include this subscription, region pair if it includes all the incident types.
| project subscriptionId, locale)) on subscriptionId, locale
| where subscriptionId1 == "" or locale1 == "" or isnull(subscriptionId1) or isnull(locale1)
| project recommendationId = "avs-1", name, id, tags, param1 = "avsServiceHealthAlertsAllIncidentTypesConfigured: False"



AVS-2 - Configure Syslog in Diagnostic Settings for Azure VMware Solution

Category: Monitoring

Impact: High

Recommendation/Guidance

Ensure Diagnostic Settings are configured for each private cloud to send the syslogs to one or more external sources for analysis and/or archiving.

Azure VMware Solution syslogs contain data that is useful for troubleshooting and performance analysis, which can speed up issue resolution and enable early detection of some kinds of issues. Configure Diagnostic Settings on the private cloud to send the syslogs to one or more external sources for querying and/or archiving in case of an audit.

Resources

Resource Graph Query

// cannot be validated with ARG



AVS-3 - Configure Azure Monitor Alert warning thresholds for vSAN datastore utilization

Category: Monitoring

Impact: High

Recommendation/Guidance

Ensure storage utilization is monitored and alerts are configured so that VMware vSAN datastore slack space is maintained at the level the service-level agreement (SLA) mandates.

For service-level agreement (SLA) purposes, Azure VMware Solution requires 25% slack space available on vSAN. vSAN storage utilization should be regularly monitored, and alerts should be configured at 70% utilization (30% slack space available on vSAN) and 75% utilization (25% slack space available on vSAN) to provide enough time for capacity planning.
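
The relationship between utilization and slack space in these thresholds can be illustrated with a minimal Python sketch. It is not part of the service: the function names and the severity labels are hypothetical, and only the 70% and 75% figures come from the guidance.

```python
# Thresholds taken from the guidance; everything else is illustrative.
WARNING_UTILIZATION_PCT = 70.0   # 30% slack space remaining
CRITICAL_UTILIZATION_PCT = 75.0  # 25% slack space remaining (SLA minimum)


def slack_space_pct(used_pct: float) -> float:
    """Return the percentage of vSAN capacity that is still free."""
    if not 0.0 <= used_pct <= 100.0:
        raise ValueError("used_pct must be between 0 and 100")
    return 100.0 - used_pct


def classify_vsan_utilization(used_pct: float) -> str:
    """Map a utilization percentage to a hypothetical alert severity."""
    slack_space_pct(used_pct)  # reuse the range validation
    if used_pct >= CRITICAL_UTILIZATION_PCT:
        return "critical"
    if used_pct >= WARNING_UTILIZATION_PCT:
        return "warning"
    return "ok"
```

For example, a datastore at 72% utilization has 28% slack space left and would fall into the warning band, leaving time to plan capacity before the 25% SLA boundary is reached.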

To expand the vSAN datastore, additional hosts can be added, up to the maximum supported cluster size (16 hosts). Note, you may need to request host quota. In addition, external storage can be added (e.g. Azure Elastic SAN, Azure NetApp Files, Pure Cloud Block Storage) if the CPU and RAM requirements are being met by the Azure VMware Solution cluster.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a vSAN capacity critical alert with a threshold of 75% or a warning capacity of 70%.
(
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "DiskUsedPercentage"
| where threshold == 75
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "avs-3", name, id, tags, param1 = "vsanCapacityCriticalAlert: isNull or threshold != 75"
)
| union (
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "DiskUsedPercentage"
| where threshold == 70
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "avs-3", name, id, tags, param1 = "vsanCapacityWarningAlert: isNull or threshold != 70"
)



AVS-4 - Enable Stretched Clusters for Multi-AZ Availability of the vSAN Datastore

Category: Availability

Impact: Low

Recommendation/Guidance

Consider Azure VMware Solution stretched clusters if the workload requires a multi-AZ deployment, a financially backed SLA of 99.99%, or synchronous storage replication between AZs (RPO=0). In a region that supports stretched clusters, enabling this feature spreads the VMware vSAN datastore across two availability zones. Note: An Azure VMware Solution private cloud can only be configured as a stretched cluster during initial implementation, and doing so requires twice the quota because the cluster extends into the second availability zone.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that are in supported regions but aren't configured as stretched clusters.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend avsproperties = todynamic(properties)
| where avsproperties.availability.strategy != "DualZone"
| where location in ("uksouth", "westeurope", "germanywestcentral", "australiaeast")
| project recommendationId = "avs-4", name, id, tags, param1 = "stretchClusters: Disabled"



AVS-5 - Monitor CPU Utilization to ensure sufficient resources for workloads

Category: Monitoring

Impact: Medium

Recommendation/Guidance

Ensure there are enough compute resources to avoid host resource exhaustion. Azure VMware Solution uses vSphere DRS and vSphere HA to manage workload resources dynamically. However, sustained host CPU utilization of over 95% can contribute to high CPU Ready times, which will impact running workloads.
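
The guidance does not define how long utilization must stay high to count as "sustained". As a minimal sketch, assuming it means a run of consecutive metric samples above the threshold, the check could look like the following Python. The function name and the default of three consecutive samples are hypothetical; only the 95% figure comes from the text.

```python
from typing import Sequence

CPU_CRITICAL_PCT = 95.0  # threshold from the guidance


def is_sustained_above(samples: Sequence[float],
                       threshold: float = CPU_CRITICAL_PCT,
                       min_consecutive: int = 3) -> bool:
    """Return True if at least `min_consecutive` samples in a row exceed `threshold`."""
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    run = 0
    for value in samples:
        # Extend the current run on a breach, otherwise start over.
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

A single spike resets nothing permanently: the run counter only grows while every sample stays above the threshold, so brief peaks that DRS rebalances away do not trigger. The same helper applies unchanged to the 95% host memory threshold in AVS-6.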

Resources

Resource Graph Query

// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a Cluster CPU capacity critical alert with a threshold of 95%.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "EffectiveCpuAverage"
| where threshold == 95
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "avs-5", name, id, tags, param1 = "hostCpuCriticalAlert: isNull or threshold != 95"



AVS-6 - Monitor Memory Utilization to ensure sufficient resources for workloads

Category: Monitoring

Impact: Medium

Recommendation/Guidance

Ensure there are enough memory resources to avoid host resource exhaustion. Azure VMware Solution uses vSphere DRS and vSphere HA to manage workload resources dynamically. However, sustained host memory utilization of over 95% can contribute to host memory swapping to disk, which will impact running workloads.

Resources

Resource Graph Query

// Azure Resource Graph Query
// Provides a list of Azure VMware Solution resources that don't have a cluster host memory critical alert with a threshold of 95%.
resources
| where ['type'] == "microsoft.avs/privateclouds"
| extend scopeId = tolower(tostring(id))
| project ['scopeId'], name, id, tags
| join kind=leftouter (
resources
| where type == "microsoft.insights/metricalerts"
| extend alertProperties = todynamic(properties)
| mv-expand alertProperties.scopes
| mv-expand alertProperties.criteria.allOf
| extend scopeId = tolower(tostring(alertProperties_scopes))
| extend metric = alertProperties_criteria_allOf.metricName
| extend threshold = alertProperties_criteria_allOf.threshold
| project scopeId, tostring(metric), toint(['threshold'])
| where metric == "UsageAverage"
| where threshold == 95
) on scopeId
| where isnull(['threshold'])
| project recommendationId = "avs-6", name, id, tags, param1 = "hostMemoryCriticalAlert: isNull or threshold != 95"



AVS-7 - Monitor when Azure VMware Solution Cluster Size is approaching the host limit

Category: Monitoring

Impact: Medium

Recommendation/Guidance

Alert when the cluster size of 14 hosts is reached. Additionally, periodic alerts should be set up to indicate when growth, especially driven by storage requirements, necessitates planning for a new cluster or the addition of extra datastores. Furthermore, beyond the threshold of 14 hosts, alerts should be triggered each time a new host is added to the cluster, allowing proactive monitoring and management of resource utilization.
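
The alerting rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than an Azure Monitor configuration: the function name and message format are hypothetical, the 14-host threshold comes from this recommendation, and the 16-host maximum is the supported cluster size stated in AVS-3.

```python
from typing import Optional

ALERT_HOST_COUNT = 14   # alert threshold from the guidance
MAX_CLUSTER_HOSTS = 16  # maximum supported cluster size


def cluster_size_alert(previous_hosts: int, current_hosts: int) -> Optional[str]:
    """Return an alert message when a cluster reaches 14 hosts, and again
    for every host added beyond that; return None otherwise."""
    grew = current_hosts > previous_hosts
    if current_hosts < ALERT_HOST_COUNT or not grew:
        return None
    remaining = max(0, MAX_CLUSTER_HOSTS - current_hosts)
    return f"cluster has {current_hosts} hosts; {remaining} more can be added"
```

Comparing against the previous host count is what produces one alert per added host past the threshold, instead of a continuous alert for as long as the cluster stays large.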

Resources

Resource Graph Query

// cannot be validated with ARG



AVS-8 - Monitor when Azure VMware Solution Private Cloud is reaching the capacity limit

Category: Monitoring

Impact: Medium

Recommendation/Guidance

Alert when the total node count is greater than or equal to 90 hosts so that it’s clear when to start planning for a new private cloud.
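
As a minimal sketch of this check, the total node count is the sum of the hosts in every cluster of the private cloud. The function name and the cluster names below are hypothetical; the 90-host threshold comes from the guidance.

```python
from typing import Mapping

PRIVATE_CLOUD_ALERT_HOSTS = 90  # threshold from the guidance


def private_cloud_needs_planning(hosts_per_cluster: Mapping[str, int]) -> bool:
    """Return True when the total host count across all clusters
    is greater than or equal to the alert threshold."""
    total_hosts = sum(hosts_per_cluster.values())
    return total_hosts >= PRIVATE_CLOUD_ALERT_HOSTS
```

For instance, six clusters of 15 hosts each total 90 hosts and would trigger the alert, while five clusters of 16 hosts plus one of 9 total 89 and would not.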

Resources

Resource Graph Query

// cannot be validated with ARG



AVS-9 - Apply Resource delete lock on the resource group hosting the private cloud

Category: Governance

Impact: High

Recommendation/Guidance

Anyone with contributor access to the resource group hosting the Azure VMware Solution private cloud can delete it. Apply a resource delete lock to that resource group to prevent deletion of the private cloud.

Resources

Resource Graph Query

// cannot be validated with ARG



AVS-10 - Align ExpressRoute configuration with best practices for circuit resilience

Category: Networking

Impact: High

Recommendation/Guidance

For critical workloads, Microsoft recommends deploying two (or more) ExpressRoute circuits in different ExpressRoute peering locations. Use Global Reach to connect multiple ExpressRoute circuits and your Azure VMware Solution private clouds. Review the APRL recommendations for ExpressRoute circuits in the Resources section below.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-11 - Deploy two or more circuits in different peering locations when using stretched clusters

Category: Networking

Impact: High

Recommendation/Guidance

Azure VMware Solution vSAN stretched clusters span two Availability Zones (AZs) in the region where they are deployed (plus a third AZ for the witness node). When using ExpressRoute to connect to the vSAN stretched clusters from on-premises, align the ExpressRoute implementation’s resilience to the clusters’ resilience by deploying two circuits in different peering locations (i.e., different sites/DC facilities). When using Global Reach, implement a mesh topology by connecting the on-premises circuits to the managed circuits provided by the Azure VMware Solution private cloud.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-12 - Deploy two Azure VMware Solution private clouds in different regions for geographical disaster recovery

Category: Disaster Recovery

Impact: High

Recommendation/Guidance

Two Azure VMware Solution private clouds can be deployed in different regions for business continuity. Implement a mesh network topology based on ExpressRoute Gateway Connections and Global Reach Connections.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-13 - Use the AVS Interconnect feature to connect private clouds in different availability zones

Category: Availability

Impact: High

Recommendation/Guidance

Use the Interconnect feature for direct communication between private clouds in different availability zones, enabling connectivity between the private clouds' management and workload networks. The IP address space of each private cloud should be unique, because AVS Interconnect does not check for overlapping addresses.
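
Because the overlap check has to be done by the operator, it can help to verify address plans before connecting private clouds. The following is a minimal sketch using Python's standard `ipaddress` module. The private cloud names and CIDR blocks are hypothetical, and the sketch assumes one IPv4 block per private cloud; workload segments would need the same comparison.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List, Tuple


def find_overlaps(address_blocks: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return pairs of private cloud names whose CIDR blocks overlap."""
    networks = {
        name: ipaddress.ip_network(cidr)
        for name, cidr in address_blocks.items()
    }
    return [
        (first, second)
        for first, second in combinations(sorted(networks), 2)
        if networks[first].overlaps(networks[second])
    ]


# Hypothetical address plan: pc-b sits inside pc-a's /22 block.
plan = {
    "pc-a": "10.0.0.0/22",
    "pc-b": "10.0.2.0/24",
    "pc-c": "10.1.0.0/22",
}
conflicts = find_overlaps(plan)
```

An empty result means no two blocks in the plan share addresses; any returned pair should be re-addressed before the interconnect is created.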

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-14 - Use key autorotation for vSAN datastore customer-managed keys

Category: Storage

Impact: High

Recommendation/Guidance

When using customer-managed keys to encrypt the vSAN datastore(s), use Azure Key Vault for centralized management and access them using a managed identity mapped to the private cloud. Key expiration can result in the vSAN datastore and its workloads becoming unavailable. Configure key autorotation to avoid unplanned outages due to key rotation not occurring before expiration.
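
Where autorotation cannot be enabled immediately, a simple expiry check can act as a stopgap. The sketch below is illustrative only: it does not call Azure Key Vault, the function name is hypothetical, and the 30-day warning window is an assumption that is not taken from the guidance.

```python
from datetime import datetime, timedelta, timezone

# Assumed warning window; choose a value that fits your rotation process.
DEFAULT_WARNING_WINDOW = timedelta(days=30)


def key_needs_rotation(expires_on: datetime,
                       now: datetime,
                       warning_window: timedelta = DEFAULT_WARNING_WINDOW) -> bool:
    """Return True if the key has expired or expires within the warning window."""
    return expires_on - now <= warning_window


# Example with fixed timestamps so the result does not depend on the clock.
checked_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiring_soon = key_needs_rotation(datetime(2024, 1, 20, tzinfo=timezone.utc), checked_at)
```

Passing `now` explicitly keeps the function deterministic and easy to test; both timestamps should be timezone-aware so the subtraction is well defined.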

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-15 - Configure LDAPS Identity integration with two sources for NSX and vCenter Server management consoles

Category: Access and Security

Impact: High

Recommendation/Guidance

Ensure that two external identity sources are configured for NSX and vCenter Server. The VMware vCenter Server and NSX Manager use identity sources to enable authentication using external identities. These sources can be temporarily unavailable during maintenance times. Having two sources ensures that administrators can continue to log in to the control surfaces when one source becomes unavailable.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-16 - Use HCX Network Extension High Availability

Category: Availability

Impact: High

Recommendation/Guidance

Enable Network Extension High Availability to provide appliance failure tolerance for the HCX Network Extension service. When Network Extension High Availability is enabled for a selected appliance, HCX pairs it with an eligible appliance and enables an Active/Standby resiliency configuration. This allows the configuration to remain in service in the event of an unplanned appliance-level failure. When either of the HA Active appliances fails, both Standby appliances take over. Network Extension High Availability is designed to recover within a few seconds after a single appliance has failed.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-17 - Verify Management Networks are not extended with HCX Network Extension

Category: Networking

Impact: High

Recommendation/Guidance

Do not extend the network on which the HCX Management devices are deployed.

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-18 - Use multiple DNS servers per private FQDN zone

Category: Networking

Impact: High

Recommendation/Guidance

Azure VMware Solution private clouds can support up to three DNS servers for a single FQDN. Using a single DNS server for DNS resolution creates a single point of failure. Ensure that multiple DNS servers are used for any on-premises FQDN resolution from each Azure VMware Solution private cloud.
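
A minimal sketch of this rule, applied to the list of DNS servers configured for one FQDN zone, might look like the following Python. The function name and result strings are hypothetical; the upper limit of three servers comes from the guidance, and the lower limit of two reflects the single-point-of-failure concern.

```python
from typing import Sequence

MIN_DNS_SERVERS = 2  # fewer than two is a single point of failure
MAX_DNS_SERVERS = 3  # upper limit stated in the guidance


def check_dns_zone(servers: Sequence[str]) -> str:
    """Classify the DNS server list configured for one FQDN zone."""
    # Duplicate entries add no redundancy, so count distinct servers only.
    distinct = set(servers)
    if len(distinct) > MAX_DNS_SERVERS:
        return "too many servers"
    if len(distinct) < MIN_DNS_SERVERS:
        return "single point of failure"
    return "ok"
```

Counting distinct addresses matters because listing the same server twice would otherwise pass the check without providing any resilience.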

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG



AVS-19 - Verify vSAN FTT configuration aligns with the cluster size

Category: Application Resilience

Impact: High

Recommendation/Guidance

The Azure VMware Solution service SLA also depends on the configured vSAN storage policies, which vary with the cluster size. In clusters with more than 6 hosts, the vSAN storage policy should be configured with an FTT-2 policy (RAID-1 or RAID-6). FTT stands for failures to tolerate, which in this case refers to how many hosts in a cluster can fail before there is potential data or VM impact.

The default storage policy is set to RAID-1 FTT-1, with Object Space Reservation set to Thin provisioning. Unless you adjust the storage policy or apply a new policy, the cluster grows with this configuration. Please note that the storage policy is not automatically updated based on cluster size. Similarly, changing the default does not automatically update the running VM policies.
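
Because the policy does not follow the cluster size automatically, a periodic comparison of the configured FTT against the cluster size can catch drift. The Python below is a minimal sketch of that comparison. The function names are hypothetical, and the host-count boundary is exposed as a parameter that defaults to the wording in this recommendation; confirm the exact boundary against the current SLA before relying on it.

```python
def recommended_ftt(host_count: int, ftt2_above_hosts: int = 6) -> int:
    """Return the FTT level suggested for a cluster of `host_count` hosts.

    `ftt2_above_hosts` is the cluster size above which FTT-2 applies.
    """
    if host_count < 1:
        raise ValueError("host_count must be positive")
    return 2 if host_count > ftt2_above_hosts else 1


def policy_is_aligned(host_count: int, configured_ftt: int) -> bool:
    """Return True if the configured FTT meets or exceeds the suggested level."""
    return configured_ftt >= recommended_ftt(host_count)
```

A cluster that has grown to 10 hosts while still using the default RAID-1 FTT-1 policy would fail this check, which is the drift the paragraph above warns about.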

Resources

Resource Graph Query/Scripts

// cannot be validated with ARG