3 - Test


The presented Microsoft Azure Well-Architected Framework recommendations in this guidance include Reliability Stage “3 - Test (Workload Testing)” and associated resources and their settings.

Before deploying the system, comprehensive tests are conducted to validate the design and implementation. This stage is crucial for identifying any weaknesses that could compromise reliability.

Summary of Recommendations

Recommendations Details

WATS-1 - Test your applications for availability and resiliency

Category: Application Resilience

Impact: High

Recommendation/Guidance

Applications should be tested to ensure availability and resiliency. Availability describes the amount of time that an application runs in a healthy state without significant downtime. Resiliency describes how quickly an application recovers from failure.

Being able to measure availability and resiliency can answer questions like: How much downtime is acceptable? How much does potential downtime cost your business? What are your availability requirements? How much do you invest in making your application highly available? What is the risk versus the cost? Testing plays a critical role in making sure your applications can meet these requirements.

Key points:

  • Test regularly to validate existing thresholds, targets, and assumptions.
  • Automate testing as much as possible.
  • Perform testing on both key Test environments and the production environment.
  • Verify how the end-to-end workload performs under intermittent failure conditions.
  • Test the application against critical functional and nonfunctional requirements for performance.
  • Conduct load testing with expected peak volumes to Test scalability and performance under load.
  • Perform chaos testing by injecting faults.

Resources



WATS-2 - Consider building logic into your workload to handle errors

Category: Application Resilience

Impact: High

Recommendation/Guidance

In a distributed system, ensuring that your application can recover from errors is critical. You can test your applications to prevent errors and failure, but you need to prepare for a wide range of issues. Testing doesn’t always catch everything, so you should understand how to handle errors and prevent potential failure.

Many things in a distributed system, such as underlying cloud infrastructure and third-party runtime dependencies, are outside your span of control and your means to test. You can be sure something will fail eventually, so you need to be prepared.

Key points:

  • Implement retry logic to handle transient application failures and transient failures with internal or external dependencies.
  • Uncover issues or failures in your application’s retry logic.
  • Configure request timeouts to manage intercomponent calls.
  • Configure and test health probes for your load balancers and traffic managers.
  • Segregate read operations from update operations across application data stores.

Resources



WATS-3 - Perform disaster recovery tests regularly

Category: Disaster Recovery

Impact: High

Recommendation/Guidance

Disaster recovery is the process of restoring application functionality after a catastrophic loss. In cloud environments, we acknowledge up front that failures happen. Instead of trying to prevent failures altogether, the goal is to minimize the effects of a single failing component. Testing is one way to minimize these effects. You should automate testing of your applications where possible, but you also need to be prepared for when they fail. When a failure happens, having backup and recovery strategies becomes important.

Your tolerance for reduced functionality during a disaster is a business decision that varies from one application to the next. It might be acceptable for some applications to be temporarily unavailable, or partially available with reduced functionality or delayed processing. For other applications, any reduced functionality is unacceptable.

Key points

  • Create and test a disaster recovery plan regularly using key failure scenarios.
  • Design a disaster recovery strategy to run most applications with reduced functionality.
  • Design a backup strategy that’s tailored for the business requirements and circumstances of the application.
  • Automate failover and failback steps and processes.
  • Test and validate the failover and failback approach successfully at least once.

Resources



WATS-4 - Use chaos engineering to test Azure applications

Category: Application Resilience

Impact: Medium

Recommendation/Guidance

Ideally, you should apply chaos principles continuously. There’s constant change in the environments in which software and hardware run, so monitoring the changes is key. By constantly applying stress or faults on components, you can help expose issues early, before small problems are compounded by many other factors.

Apply chaos engineering principles when you:

  • Deploy new code.
  • Add dependencies.
  • Observe changes in usage patterns.
  • Mitigate problems.

Resources



WATS-5 - Test application fault resiliency

Category: Application Resilience

Impact: High

Guidance

High availability is a fundamental part of the SQL Database platform that works transparently for your database application. However, we recognize that you may want to test how the automatic failover operations initiated during planned or unplanned events would impact an application before you deploy it to production. You can manually trigger a failover by calling a special API to restart a database, or an elastic pool.

In the case of a zone-redundant serverless or provisioned General Purpose database or elastic pool, the API call would result in redirecting client connections to the new primary in an Availability Zone different from the Availability Zone of the old primary. So in addition to testing how failover impacts existing database sessions, you can also verify if it changes the end-to-end performance due to changes in network latency. Because the restart operation is intrusive and a large number of them could stress the platform, only one failover call is allowed every 15 minutes for each database or elastic pool.

Resources