3 - Test

This guidance presents the Microsoft Azure Well-Architected Framework recommendations for Reliability Stage “3 - Test (Workload Testing)”, along with the associated resources and their settings.

Before deploying the system, comprehensive tests are conducted to validate the design and implementation. This stage is crucial for identifying any weaknesses that could compromise reliability.

Summary of Recommendations

Recommendation	Category	Impact	State	ARG Query Available
WATS-1 - Test your applications for availability and resiliency	Application Resilience	High	Verified	No
WATS-2 - Consider building logic into your workload to handle errors	Application Resilience	High	Verified	No
WATS-3 - Perform disaster recovery tests regularly	Disaster Recovery	High	Verified	No
WATS-4 - Use chaos engineering to test Azure applications	Application Resilience	Medium	Verified	No
WATS-5 - Test application fault resiliency	Application Resilience	High	Verified	No

Definitions of states can be found here

Recommendations Details

WATS-1 - Test your applications for availability and resiliency

Category: Application Resilience

Impact: High

Recommendation/Guidance

Applications should be tested to ensure availability and resiliency. Availability describes the amount of time that an application runs in a healthy state without significant downtime. Resiliency describes how quickly an application recovers from failure.

Being able to measure availability and resiliency can answer questions like: How much downtime is acceptable? How much does potential downtime cost your business? What are your availability requirements? How much do you invest in making your application highly available? What is the risk versus the cost? Testing plays a critical role in making sure your applications can meet these requirements.
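To make the question "How much downtime is acceptable?" concrete, an availability target can be converted into a downtime budget. The following is a minimal sketch; the function names and the 30-day period are illustrative assumptions, not part of the guidance.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, period: timedelta) -> timedelta:
    """Return the downtime budget implied by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return timedelta(seconds=period.total_seconds() * unavailable_fraction)


def measured_availability(uptime: timedelta, period: timedelta) -> float:
    """Return observed availability as a percentage of the period."""
    return 100.0 * uptime.total_seconds() / period.total_seconds()


if __name__ == "__main__":
    month = timedelta(days=30)  # assumed reporting period
    for target in (99.0, 99.9, 99.99):
        print(f"{target}% over 30 days allows {allowed_downtime(target, month)}")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of downtime, which gives test results a threshold to be compared against.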

Key points:

Test regularly to validate existing thresholds, targets, and assumptions.
Automate testing as much as possible.
Perform testing on both key test environments and the production environment.
Verify how the end-to-end workload performs under intermittent failure conditions.
Test the application against critical functional and nonfunctional requirements for performance.
Conduct load testing with expected peak volumes to test scalability and performance under load.
Perform chaos testing by injecting faults.

Resources

Testing applications for availability and resiliency

WATS-2 - Consider building logic into your workload to handle errors

Category: Application Resilience

Impact: High

Recommendation/Guidance

In a distributed system, ensuring that your application can recover from errors is critical. You can test your applications to prevent errors and failure, but you need to prepare for a wide range of issues. Testing doesn’t always catch everything, so you should understand how to handle errors and prevent potential failure.

Many things in a distributed system, such as underlying cloud infrastructure and third-party runtime dependencies, are outside your span of control and your means to test. You can be sure something will fail eventually, so you need to be prepared.

Key points:

Implement retry logic to handle transient application failures and transient failures with internal or external dependencies.
Uncover issues or failures in your application’s retry logic.
Configure request timeouts to manage intercomponent calls.
Configure and test health probes for your load balancers and traffic managers.
Segregate read operations from update operations across application data stores.
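The retry point above can be sketched as a small helper that retries only transient failures, with exponential backoff and jitter. This is one common approach, not the method prescribed by the guidance; `TransientError`, `call_with_retry`, and the default delays are hypothetical names and values chosen for illustration. Request timeouts would be set inside the `operation` being called.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def call_with_retry(operation, max_attempts=4, base_delay=0.5,
                    max_delay=8.0, sleep=time.sleep):
    """Call operation(), retrying transient failures with backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # Double the ceiling on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter spreads retries so clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
```

Passing `sleep` as a parameter lets a test substitute a recorder for the real delay, which is one way to uncover issues in the retry logic itself without slowing the test suite.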

Resources

Error handling for resilient applications in Azure

WATS-3 - Perform disaster recovery tests regularly

Category: Disaster Recovery

Impact: High

Recommendation/Guidance

Disaster recovery is the process of restoring application functionality after a catastrophic loss. In cloud environments, we acknowledge up front that failures happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Testing is one way to minimize these effects. You should automate testing of your applications where possible, but you also need to be prepared for when they fail. When a failure happens, having backup and recovery strategies becomes important.

Your tolerance for reduced functionality during a disaster is a business decision that varies from one application to the next. It might be acceptable for some applications to be temporarily unavailable, or partially available with reduced functionality or delayed processing. For other applications, any reduced functionality is unacceptable.

Key points:

Create and test a disaster recovery plan regularly using key failure scenarios.
Design a disaster recovery strategy to run most applications with reduced functionality.
Design a backup strategy that’s tailored for the business requirements and circumstances of the application.
Automate failover and failback steps and processes.
Test and validate the failover and failback approach successfully at least once.
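A failover or failback test can be scripted as a timed step: trigger the action, poll a health check, and compare the elapsed time with a recovery time target. The sketch below assumes such a target exists for the application; `run_drill_step`, `DrillResult`, and the `action` and `health_check` callables are hypothetical placeholders for whatever the workload actually uses.

```python
import time
from dataclasses import dataclass


@dataclass
class DrillResult:
    step: str
    seconds: float
    within_target: bool


def run_drill_step(name, action, health_check, target_seconds,
                   poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Run one drill step and time how long the workload takes to be healthy."""
    start = clock()
    action()  # for example, start the failover (supplied by the caller)
    while not health_check():
        if clock() - start > target_seconds:
            # Give up once the target is exceeded and record the overrun.
            return DrillResult(name, clock() - start, False)
        sleep(poll_interval)
    elapsed = clock() - start
    return DrillResult(name, elapsed, elapsed <= target_seconds)
```

Running the same function once for failover and once for failback produces a comparable record for each direction, which can be kept alongside the disaster recovery plan.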

Resources

Backup and disaster recovery for Azure applications

WATS-4 - Use chaos engineering to test Azure applications

Category: Application Resilience

Impact: Medium

Recommendation/Guidance

Ideally, you should apply chaos principles continuously. There’s constant change in the environments in which software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on components, you can help expose issues early, before small problems are compounded by many other factors.

Apply chaos engineering principles when you:

Deploy new code.
Add dependencies.
Observe changes in usage patterns.
Mitigate problems.
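As a small-scale illustration of fault injection, the sketch below wraps a dependency so that a fraction of calls fail, then checks that the caller still returns a correct answer from a fallback. It is an in-process toy under stated assumptions, not a substitute for a chaos engineering service; every name in it (`with_faults`, `lookup_price`, `price_or_cached`, `run_experiment`) is hypothetical.

```python
import random


class InjectedFault(Exception):
    """Raised only by the experiment, never by real code."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected fault in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def lookup_price(item):
    """Hypothetical dependency standing in for a remote call."""
    return {"apple": 1.25}[item]


def price_or_cached(fetch, item, cache):
    """Serve the live price, or fall back to the last known value."""
    try:
        cache[item] = fetch(item)
        return cache[item], "live"
    except InjectedFault:
        return cache.get(item), "cached"


def run_experiment(calls, failure_rate, seed=0):
    """Inject faults and verify the steady state: every call returns a price."""
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    fetch = with_faults(lookup_price, failure_rate, rng)
    cache = {"apple": 1.25}  # assume the cache was warmed before the test
    outcomes = {"live": 0, "cached": 0}
    for _ in range(calls):
        price, source = price_or_cached(fetch, "apple", cache)
        assert price == 1.25, "steady state violated"
        outcomes[source] += 1
    return outcomes
```

Calling `run_experiment(200, 0.3)` reports how many requests were served live and how many from the fallback; a raised assertion would indicate that the fallback path does not hold under the injected faults.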

Resources

Use chaos engineering to test Azure applications

WATS-5 - Test application fault resiliency

Category: Application Resilience

Impact: High

Recommendation/Guidance

High availability is a fundamental part of the SQL Database platform that works transparently for your database application. However, you may want to test how the automatic failover operations initiated during planned or unplanned events would impact an application before you deploy it to production. You can manually trigger a failover by calling a special API to restart a database or an elastic pool.

In the case of a zone-redundant serverless or provisioned General Purpose database or elastic pool, the API call would result in redirecting client connections to the new primary in an Availability Zone different from the Availability Zone of the old primary. So in addition to testing how failover impacts existing database sessions, you can also verify if it changes the end-to-end performance due to changes in network latency. Because the restart operation is intrusive and a large number of them could stress the platform, only one failover call is allowed every 15 minutes for each database or elastic pool.
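Because only one failover call is allowed every 15 minutes for each database or elastic pool, a test harness may want to enforce that interval on the client side before calling the API. The following is a minimal sketch: `FailoverThrottle` and `FailoverTooSoon` are hypothetical names, and `trigger` stands for whatever function actually issues the failover request, which is not shown here.

```python
import time

MIN_INTERVAL_SECONDS = 15 * 60  # one failover per 15 minutes, per the text above


class FailoverTooSoon(RuntimeError):
    """Raised when a failover is requested inside the 15-minute window."""


class FailoverThrottle:
    """Track failover calls per database or elastic pool and block early repeats."""

    def __init__(self, trigger, clock=time.monotonic):
        self._trigger = trigger  # caller-supplied function that starts the failover
        self._clock = clock
        self._last_call = {}     # resource id -> time of the last failover

    def seconds_until_allowed(self, resource_id):
        last = self._last_call.get(resource_id)
        if last is None:
            return 0.0
        return max(0.0, MIN_INTERVAL_SECONDS - (self._clock() - last))

    def failover(self, resource_id):
        wait = self.seconds_until_allowed(resource_id)
        if wait > 0:
            raise FailoverTooSoon(
                f"wait {wait:.0f}s before failing over {resource_id} again")
        self._trigger(resource_id)
        self._last_call[resource_id] = self._clock()
```

The interval is tracked per resource, so a test run can still fail over a different database or elastic pool while the first one is inside its window.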

Resources

Test application fault resiliency
