Reliability

This section contains all recommendations from the Azure Well-Architected Framework’s Reliability pillar.

Summary

Recommendation	Impact	Category
RE:01 Design your workload to align with business objectives	Medium	OtherBestPractices
RE:02 Identify and rate user and system flows	Medium	HighAvailability
RE:03 Use failure mode analysis to identify and prioritize potential failures	Medium	OtherBestPractices
RE:04 Define reliability and recovery targets	Medium	HighAvailability
RE:05 Design for redundancy	Medium	HighAvailability
RE:05 Design for multi-region high availability	Medium	HighAvailability
RE:05 Design for high availability with availability zones	Medium	HighAvailability
RE:06 Design for data partitioning	Medium	HighAvailability
RE:06 Design for reliable scaling	Medium	Scalability
RE:07 Implement self-preservation and self-healing measures	Medium	HighAvailability
RE:08 Design a reliability testing strategy	Medium	OtherBestPractices
RE:09 Implement business continuity and disaster recovery plan	Medium	DisasterRecovery
RE:10 Design a reliable monitoring and alerting strategy	Medium	MonitoringAndAlerting

Details

RE:01 Design your workload to align with business objectives

Impact: Medium Category: OtherBestPractices

APRL GUID: 8c0a0a4c-9e34-41af-9f6d-89d8dc00370e

Description:

Design your workload to align with business objectives and avoid unnecessary complexity or overhead. Use a practical and balanced approach to make design decisions that deliver the desired results. Limit your design to what is necessary to reduce inefficiencies and potential problems.

Potential Benefits:

Meet business requirements

Learn More:

RE:01 Simplicity and efficiency

RE:02 Identify and rate user and system flows

Impact: Medium Category: HighAvailability

APRL GUID: 74415e66-7baf-43f3-8def-164bc7b48215

Description:

Identify and rate user and system flows. Use a criticality scale based on your business requirements to prioritize the flows.
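
As a rough illustration of rating flows, the sketch below sorts a small inventory of flows by a criticality scale. The four-level scale, the Flow type, and the example flow names are all hypothetical; a real scale should come from your business requirements.

```python
from dataclasses import dataclass
from enum import IntEnum


class Criticality(IntEnum):
    # Hypothetical scale; derive the real one from business requirements.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class Flow:
    name: str
    kind: str  # "user" or "system"
    criticality: Criticality


def prioritize(flows):
    """Return flows ordered from most to least critical.

    sorted() is stable, so flows with equal ratings keep their input order.
    """
    return sorted(flows, key=lambda f: f.criticality, reverse=True)


# Illustrative inventory; the names are invented for this example.
flows = [
    Flow("browse catalog", "user", Criticality.MEDIUM),
    Flow("checkout", "user", Criticality.CRITICAL),
    Flow("nightly report export", "system", Criticality.LOW),
    Flow("payment reconciliation", "system", Criticality.HIGH),
]

ranked = prioritize(flows)
for flow in ranked:
    print(f"{flow.criticality.name:<9} {flow.kind:<7} {flow.name}")
```

The ranked list can then drive where redundancy, testing, and monitoring effort is spent first.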

Potential Benefits:

Align architecture with reliability goals

Learn More:

RE:02 Critical flows

RE:03 Use failure mode analysis to identify and prioritize potential failures

Impact: Medium Category: OtherBestPractices

APRL GUID: f5fbe3d4-7196-46b8-9b09-0e29e7cf43ac

Description:

Use failure mode analysis (FMA) to identify and prioritize potential failures in your solution components. Perform FMA to help you assess the risk and effect of each failure mode. Determine how the workload responds and recovers.
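
One common way to prioritize the output of an FMA exercise is a simple risk score. The sketch below assumes a likelihood-times-impact score on 1 to 5 scales, which is an illustrative convention and not something this recommendation prescribes; the components and ratings are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    description: str
    likelihood: int  # 1 (rare) to 5 (frequent); hypothetical scale
    impact: int      # 1 (negligible) to 5 (severe); hypothetical scale
    response: str    # how the workload is expected to respond and recover

    @property
    def risk(self) -> int:
        # Assumed scoring rule: likelihood multiplied by impact.
        return self.likelihood * self.impact


def prioritize_failure_modes(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Invented examples for illustration only.
modes = [
    FailureMode("database", "zone outage", 2, 5, "fail over to replica"),
    FailureMode("cache", "eviction storm", 3, 2, "fall back to database reads"),
    FailureMode("external API", "throttling", 4, 3, "retry with backoff, then queue"),
]

for mode in prioritize_failure_modes(modes):
    print(f"risk={mode.risk:>2}  {mode.component}: {mode.description} -> {mode.response}")
```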

Potential Benefits:

Reduce risk of unpredicted behavior

Learn More:

RE:03 Failure mode analysis

RE:04 Define reliability and recovery targets

Impact: Medium Category: HighAvailability

APRL GUID: 2c41b97c-af27-47b5-aafb-81bbf95fe8ba

Description:

Define reliability and recovery targets for the components, the flows, and the overall solution. Use the defined targets to build the health model. The health model defines what healthy, degraded, and unhealthy states look like.
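
To make the relationship between targets and the health model concrete, the sketch below converts an availability target into a downtime budget and maps a measured success rate to one of the three health states. The function names and the threshold values are assumptions chosen for illustration.

```python
def allowed_downtime_minutes(availability_target: float, period_days: int = 30) -> float:
    """Downtime budget implied by an availability target over a period."""
    if not 0.0 < availability_target <= 1.0:
        raise ValueError("availability_target must be in (0, 1]")
    return (1.0 - availability_target) * period_days * 24 * 60


def health_state(success_rate: float,
                 healthy_at: float = 0.999,
                 degraded_at: float = 0.99) -> str:
    """Classify a measured success rate; thresholds are hypothetical."""
    if success_rate >= healthy_at:
        return "healthy"
    if success_rate >= degraded_at:
        return "degraded"
    return "unhealthy"


# A 99.9% target over 30 days leaves roughly 43 minutes of downtime.
print(round(allowed_downtime_minutes(0.999), 1))
print(health_state(0.9995), health_state(0.995), health_state(0.95))
```

In practice each component and flow would carry its own thresholds, derived from the targets agreed with stakeholders.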

Potential Benefits:

Communicate reliability expectations with stakeholders

Learn More:

RE:04 Target metrics

RE:05 Design for redundancy

Impact: Medium Category: HighAvailability

APRL GUID: e404ef3f-e427-4e43-a1df-09da987e744f

Description:

Add redundancy at different levels, especially for critical flows. Apply redundancy to the compute, data, network, and other infrastructure tiers in accordance with the identified reliability targets.
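
The effect of redundancy on a reliability target can be estimated with basic probability. The sketch below assumes that component failures are independent, which real deployments often violate (shared dependencies, correlated outages), so treat the results as an optimistic upper bound. The function names are illustrative.

```python
import math


def serial_availability(availabilities):
    """Availability of a chain where every component must be up."""
    return math.prod(availabilities)


def redundant_availability(availability: float, replicas: int) -> float:
    """Availability when any one of `replicas` identical instances suffices.

    Assumes independent failures.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - availability) ** replicas


# Two tiers in series: the composite is lower than either tier alone.
print(serial_availability([0.999, 0.9995]))

# Two replicas of a 99% instance: the pair is available about 99.99% of the time.
print(redundant_availability(0.99, 2))
```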

Potential Benefits:

Optimize for resiliency

Learn More:

RE:05 Redundancy

RE:05 Design for multi-region high availability

Impact: Medium Category: HighAvailability

APRL GUID: df93ae26-260e-408f-860c-42cd189f8bf8

Description:

High availability is a foundational tenet of designing for reliability. A highly available architecture can help you avoid downtime as much as possible and recover efficiently if downtime does occur.

Potential Benefits:

Minimize downtime from regional outages

Learn More:

RE:05 High-availability multi-region design

RE:05 Design for high availability with availability zones

Impact: Medium Category: HighAvailability

APRL GUID: 3d6adb0a-042f-47f7-a7ea-db2e360903d5

Description:

High availability is a foundational tenet of designing for reliability. A highly available architecture can help you avoid downtime as much as possible and recover efficiently if downtime does occur.

Potential Benefits:

Minimize downtime from zonal outages

Learn More:

Regions and availability zones

RE:06 Design for data partitioning

Impact: Medium Category: HighAvailability

APRL GUID: 7f0b9ea3-0159-4ea7-b854-a4313fe76d7f

Description:

Partitioning data improves scalability, reduces contention, and optimizes performance. Implement data partitioning to divide data by usage pattern.
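
As one possible partitioning scheme, the sketch below shows horizontal (hash-based) partitioning, where a stable hash of a partition key selects the partition. This is only one option; the recommendation itself calls for dividing data by usage pattern, which may instead favor range-based or functional partitioning. The function name and key format are hypothetical.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a partition key to a partition index in [0, partition_count).

    Uses SHA-256 rather than the built-in hash(), because hash() of a str
    is randomized per process and would not give a stable placement.
    """
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


# The same key always lands on the same partition.
for customer_id in ("customer-001", "customer-002", "customer-003"):
    print(customer_id, "->", partition_for(customer_id, 4))
```

Note that changing the partition count remaps most keys under this simple modulo scheme, so the count should be chosen with growth in mind.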

Potential Benefits:

Improve data estate reliability

Learn More:

RE:06 Data partitioning

RE:06 Design for reliable scaling

Impact: Medium Category: Scalability

APRL GUID: 340fe5c3-d599-448a-8e52-15e96771a3f0

Description:

Implement a timely and reliable scaling strategy at the application, data, and infrastructure levels.
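
A scaling decision at the infrastructure level can be sketched as a proportional rule: scale the instance count by the ratio of observed load to target load, then clamp it to configured bounds. The function name, parameters, and default bounds below are assumptions for illustration, not the behavior of any specific Azure autoscale feature.

```python
import math


def desired_instances(current: int,
                      observed_load: float,
                      target_load: float,
                      min_instances: int = 2,
                      max_instances: int = 10) -> int:
    """Proportional scaling rule with lower and upper bounds.

    `observed_load` and `target_load` are per-instance values of the same
    metric, for example average CPU percent.
    """
    if current < 1 or target_load <= 0:
        raise ValueError("current must be >= 1 and target_load must be > 0")
    raw = math.ceil(current * observed_load / target_load)
    return max(min_instances, min(max_instances, raw))


print(desired_instances(4, observed_load=90, target_load=60))   # scale out
print(desired_instances(4, observed_load=10, target_load=60))   # floor applies
print(desired_instances(4, observed_load=600, target_load=60))  # ceiling applies
```

The lower bound keeps enough instances for redundancy even when load is low, and the upper bound caps cost and protects downstream dependencies.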

Potential Benefits:

Dynamically handle increased load

Learn More:

RE:06 Scaling

RE:07 Implement self-preservation and self-healing measures

Impact: Medium Category: HighAvailability

APRL GUID: 7b5008cf-1853-44c4-827d-bca091678c3f

Description:

Strengthen the resiliency and recoverability of your workload by implementing self-preservation and self-healing measures. Self-healing capabilities help you avoid downtime by building in failure detection and automatic corrective actions to respond to different failure types.
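
One widely used self-healing measure for transient faults is retrying with exponential backoff and jitter. The following is a minimal sketch under assumed defaults; the function name, parameters, and the choice of which exceptions count as transient are illustrative, and production code would usually rely on the retry support built into the client library in use.

```python
import random
import time


def call_with_retries(operation,
                      max_attempts: int = 4,
                      base_delay: float = 0.5,
                      max_delay: float = 8.0,
                      retry_on=(ConnectionError, TimeoutError),
                      sleep=time.sleep):
    """Call `operation`, retrying transient failures with backoff and jitter.

    Exceptions not listed in `retry_on` propagate immediately. After the
    final attempt the last transient exception is re-raised.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise
            # Exponential backoff capped at max_delay, with full jitter so
            # that many clients do not retry in lockstep.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))


# Usage: an operation that fails twice, then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


print(call_with_retries(flaky_operation, sleep=lambda seconds: None))
```

Retries suit only transient faults and idempotent operations; for persistent failures, a circuit breaker or fallback path is the more appropriate measure.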

Potential Benefits:

Reduce the likelihood of outages

Learn More:

RE:07 Self-preservation

RE:08 Design a reliability testing strategy

Impact: Medium Category: OtherBestPractices

APRL GUID: 7db74a6a-4062-46a8-a0cd-18684fb0ec08

Description:

Test resiliency and availability scenarios by applying the principles of chaos engineering in your test and production environments. Perform active malfunction and simulated load testing to verify that your graceful degradation implementation and scaling strategies are effective.
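
The idea of active malfunction testing can be illustrated in miniature with a fault-injection wrapper: deliberately make a dependency fail and check that the caller degrades gracefully. Everything below (the InjectedFault exception, the wrapper, and the price lookup with its fallback) is a hypothetical sketch; real chaos experiments typically inject faults at the infrastructure level with dedicated tooling.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency failure."""


def with_fault_injection(func, failure_rate: float, rng=None):
    """Wrap `func` so that a fraction of calls raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be in [0, 1]")
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        # rng.random() is in [0, 1), so a rate of 1.0 always fails
        # and a rate of 0.0 never fails.
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def price_with_fallback(fetch_price, cached_price: float) -> float:
    """Graceful degradation: serve a cached price if the live lookup fails."""
    try:
        return fetch_price()
    except InjectedFault:
        return cached_price


def live_price() -> float:
    return 19.99


# Experiment: with the dependency always failing, the caller still answers.
broken = with_fault_injection(live_price, failure_rate=1.0)
print(price_with_fallback(broken, cached_price=18.50))
```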

Potential Benefits:

Validate and optimize workload reliability

Learn More:

RE:08 Testing

RE:09 Implement business continuity and disaster recovery plan

Impact: Medium Category: DisasterRecovery

APRL GUID: 5f95df03-cae2-4761-90b7-7afd657ac124

Description:

Implement structured, tested, and documented business continuity and disaster recovery (BCDR) plans that align with the recovery targets. Plans must cover all components and the system as a whole.

Potential Benefits:

Reliable disaster recovery

Learn More:

RE:09 Disaster recovery

RE:10 Design a reliable monitoring and alerting strategy

Impact: Medium Category: MonitoringAndAlerting

APRL GUID: 90adebf7-bc90-4939-9aa8-119c46bee0fc

Description:

Measure and publish the solution's health indicators. Continuously capture uptime and other reliability data from the workload as a whole, from individual components, and from key flows.
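
As a small illustration of turning captured uptime data into an alerting signal, the sketch below computes measured availability from a series of health probe results and checks it against a target. The function names and the 99.9% default target are assumptions; a real implementation would query a monitoring platform instead of an in-memory list.

```python
def measured_availability(probe_results) -> float:
    """Fraction of successful probes in a window of boolean results."""
    results = list(probe_results)
    if not results:
        raise ValueError("at least one probe result is required")
    return sum(1 for ok in results if ok) / len(results)


def breaches_target(probe_results, target: float = 0.999) -> bool:
    """True when measured availability falls below the target."""
    return measured_availability(probe_results) < target


# 1,000 probes with two failures: measured availability is 99.8%.
window = [True] * 998 + [False] * 2
print(measured_availability(window))
print(breaches_target(window))
```

Running the same calculation per component and per key flow, and not only for the workload as a whole, makes it easier to locate which part is consuming the downtime budget.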

Potential Benefits:

Observability into workload health

Learn More:

RE:10 Monitoring and alerting