1 - Define

This guidance presents the Microsoft Azure Well-Architected Framework recommendations for Reliability Stage “1 - Define (Requirements)”, along with the associated resources and their settings.

In this initial stage, the objectives and requirements for system reliability are established. This often involves specifying availability and recovery targets, latency tolerances, criticality classifications, and disaster recovery objectives.

Summary of Recommendations

Recommendation	Category	Impact	State	ARG Query Available
WADF-1 - Ensure the Availability Targets are well defined and communicated across teams working on the Workload	Availability	High	Verified	No
WADF-2 - Ensure the Recovery Targets are well defined and communicated across teams working on the Workload	Disaster Recovery	High	Verified	No

Definitions of states can be found here

Recommendations Details

WADF-1 - Ensure the Availability Targets are well defined and communicated across teams working on the Workload

Category: Availability

Impact: High

Recommendation/Guidance

Ensure the Availability Targets (SLA, SLO, SLI) are well defined, tested, monitored and communicated across teams working on the Workload.

A Service Level Agreement (SLA) is an availability target that represents a commitment around performance and availability of the application. Understanding the SLA of individual components within the system is essential to defining reliability targets. Knowing the SLAs of dependencies also helps justify additional spend on making those dependencies highly available and on covering them with proper support contracts. Availability targets for any dependencies the application relies on should be understood and should ideally align with the application's own targets.
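To illustrate why the SLAs of individual components matter, here is a minimal sketch of a composite availability estimate. It assumes that every component is required for the application to work (a serial dependency chain) and that failures are independent, so the individual availabilities multiply. The component names and SLA values are hypothetical and are not taken from any published Azure SLA.

```python
from math import prod


def composite_availability(slas):
    """Estimate availability of a chain in which every component must be up.

    Assumes independent failures; each SLA is a fraction such as 0.9995.
    """
    slas = list(slas)
    if not slas:
        raise ValueError("at least one component SLA is required")
    if any(not 0.0 <= s <= 1.0 for s in slas):
        raise ValueError("SLAs must be fractions between 0 and 1")
    return prod(slas)


# Hypothetical component SLAs, for illustration only.
dependencies = {"web_app": 0.9995, "database": 0.9999, "cache": 0.999}

composite = composite_availability(dependencies.values())
print(f"Composite availability: {composite:.4%}")
```

With these assumed values the composite estimate is about 99.84%, which is lower than the SLA of any single component. That gap is what typically motivates adding redundancy to the weakest dependency.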

Understanding your availability expectations is vital to reviewing overall operations for the application.

For example, if you are striving to achieve an application Service Level Objective (SLO) of 99.999%, the level of operational effort the application requires is going to be far greater than if an SLO of 99.9% were the goal.
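The difference between these two SLOs is easier to see as a downtime budget. The sketch below converts an SLO percentage into the maximum downtime allowed over a measurement window. The 30-day window and the function name `downtime_budget` are assumptions chosen for illustration; use whatever window your SLO is actually measured over.

```python
from datetime import timedelta


def downtime_budget(slo_percent, period=timedelta(days=30)):
    """Return the maximum downtime allowed in `period` while meeting the SLO."""
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period * (1 - slo_percent / 100)


for slo in (99.9, 99.999):
    budget = downtime_budget(slo)
    print(f"SLO {slo}%: {budget.total_seconds():.1f} seconds of downtime per 30 days")
```

Over 30 days, 99.9% allows roughly 43 minutes of downtime, while 99.999% allows roughly 26 seconds, which leaves little room for manual intervention.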

Resources

WADF-2 - Ensure the Recovery Targets are well defined and communicated across teams working on the Workload

Category: Disaster Recovery

Impact: High

Recommendation/Guidance

Ensure the Recovery Targets are well defined and communicated across teams working on the Workload. Two important metrics to consider are the recovery time objective and recovery point objective, as they pertain to disaster recovery.

Recovery time objective (RTO) is the maximum acceptable time that an application can be unavailable after an incident. If your RTO is 90 minutes, you must be able to restore the application to a running state within 90 minutes from the start of a disaster. If you have a very low RTO, you might keep a second regional deployment continually running on standby in an active/passive configuration, to protect against a regional outage. In some cases, you might deploy an active/active configuration to achieve an even lower RTO.
Recovery point objective (RPO) is the maximum duration of data loss that is acceptable during a disaster. For example, if you store data in a single database, with no replication to other databases, and perform hourly backups, you could lose up to an hour of data.

RTO and RPO are non-functional requirements of a system and should be dictated by business requirements. To derive these values, it’s a good idea to conduct a risk assessment and clearly understand the cost of downtime or data loss.
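As a rough illustration of how these targets can be checked against a design, the sketch below compares a measured recovery time and a backup interval with the stated RTO and RPO. It assumes a single database with no replication, so the worst-case data loss equals the backup interval. The 90-minute RTO and hourly backups come from the examples above; the 30-minute RPO, the 75-minute recovery time, and all names in the code are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rto: timedelta  # maximum acceptable time to restore service
    rpo: timedelta  # maximum acceptable window of data loss


def meets_targets(targets, measured_recovery_time, backup_interval):
    """Return (rto_ok, rpo_ok) for a backup-only design without replication."""
    rto_ok = measured_recovery_time <= targets.rto
    # Without replication, data written since the last backup can be lost.
    rpo_ok = backup_interval <= targets.rpo
    return rto_ok, rpo_ok


targets = RecoveryTargets(rto=timedelta(minutes=90), rpo=timedelta(minutes=30))
result = meets_targets(
    targets,
    measured_recovery_time=timedelta(minutes=75),  # e.g. from a recovery drill
    backup_interval=timedelta(hours=1),
)
print(f"RTO met: {result[0]}, RPO met: {result[1]}")
```

In this assumed scenario the RTO is met but the RPO is not, which suggests either more frequent backups or replication would be needed to satisfy the business requirement.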

Monitoring and measuring application availability is vital to qualifying overall application health and progress towards defined targets. Make sure you measure and monitor key metrics such as:

Mean Time Between Failures (MTBF) — The average time between failures of a particular component.
Mean Time to Recover (MTTR) — The average time it takes to restore a component after a failure.
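One way to derive both metrics from an incident history is sketched below. It assumes the common definitions MTBF = uptime / number of failures and MTTR = downtime / number of failures, from which a steady-state availability estimate follows as MTBF / (MTBF + MTTR). The observation window and outage durations are made-up sample data, and the function name is hypothetical.

```python
def reliability_metrics(observed_hours, outage_hours):
    """Return (mtbf, mttr, availability) for one component.

    observed_hours: total length of the observation window, in hours.
    outage_hours:   duration of each failure in that window, in hours.
    """
    failures = len(outage_hours)
    if failures == 0:
        raise ValueError("no failures recorded; MTBF and MTTR are undefined")
    downtime = sum(outage_hours)
    if downtime > observed_hours:
        raise ValueError("total downtime exceeds the observation window")
    mtbf = (observed_hours - downtime) / failures  # average uptime between failures
    mttr = downtime / failures                     # average time to restore
    availability = mtbf / (mtbf + mttr)
    return mtbf, mttr, availability


# Sample data: a 30-day window (720 hours) with three outages.
mtbf, mttr, availability = reliability_metrics(720, [0.5, 1.0, 0.3])
print(f"MTBF: {mtbf:.1f} h, MTTR: {mttr:.2f} h, availability: {availability:.4%}")
```

Tracking these values over time shows whether the workload is moving towards or away from its availability target, and whether improvement should focus on failing less often (MTBF) or recovering faster (MTTR).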

Resources

Target functional and nonfunctional requirements
