Azure Proactive Resiliency Library v2

Reliability

This section contains all recommendations from the Azure Well-Architected Framework’s Reliability pillar.

Summary

| Recommendation | Impact | Category |
| --- | --- | --- |
| RE:01 Design your workload to align with business objectives | Medium | Other Best Practices |
| RE:02 Identify and rate user and system flows | Medium | High Availability |
| RE:03 Use failure mode analysis to identify and prioritize potential failures | Medium | Other Best Practices |
| RE:04 Define reliability and recovery targets | Medium | High Availability |
| RE:05 Design for redundancy | Medium | High Availability |
| RE:05 Design for multi-region high availability | Medium | High Availability |
| RE:05 Design for high availability with availability zones | Medium | High Availability |
| RE:06 Design for data partitioning | Medium | High Availability |
| RE:06 Design for reliable scaling | Medium | Scalability |
| RE:07 Use background jobs | Medium | Other Best Practices |
| RE:07 Implement self-preservation and self-healing measures | Medium | High Availability |
| RE:07 Handle transient faults | Medium | High Availability |
| RE:08 Design a reliability testing strategy | Medium | Other Best Practices |
| RE:09 Implement business continuity and disaster recovery plan | Medium | Disaster Recovery |
| RE:10 Design a reliable monitoring and alerting strategy | Medium | Monitoring and Alerting |

Details


RE:01 Design your workload to align with business objectives

Impact: Medium
Category: Other Best Practices

APRL GUID:  8c0a0a4c-9e34-41af-9f6d-89d8dc00370e

Description:

Design your workload to align with business objectives and avoid unnecessary complexity or overhead. Use a practical and balanced approach to make design decisions that deliver the desired results. Limit your design to what is necessary to reduce inefficiencies and potential problems.

Potential Benefits:

Meet business requirements

Learn More:

RE:01 Simplicity and efficiency


RE:02 Identify and rate user and system flows

Impact: Medium
Category: High Availability

APRL GUID:  74415e66-7baf-43f3-8def-164bc7b48215

Description:

Identify and rate user and system flows. Use a criticality scale based on your business requirements to prioritize the flows.

Potential Benefits:

Align architecture with reliability goals

Learn More:

RE:02 Critical flows


RE:03 Use failure mode analysis to identify and prioritize potential failures

Impact: Medium
Category: Other Best Practices

APRL GUID:  f5fbe3d4-7196-46b8-9b09-0e29e7cf43ac

Description:

Use failure mode analysis (FMA) to identify and prioritize potential failures in your solution components. Perform FMA to help you assess the risk and effect of each failure mode. Determine how the workload responds and recovers.

Potential Benefits:

Reduce risk of unpredicted behavior

Learn More:

RE:03 Failure mode analysis
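
The prioritization step in RE:03 can be sketched as a simple scoring exercise. The recommendation does not prescribe a scoring scheme, so the 1 to 5 likelihood and impact scales, the likelihood times impact product, and the component names below are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str      # hypothetical component name
    description: str
    likelihood: int     # assumed scale: 1 (rare) to 5 (frequent)
    impact: int         # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        # Simple likelihood x impact product; other weightings are possible.
        return self.likelihood * self.impact


def prioritize(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


modes = [
    FailureMode("database", "primary replica unavailable", likelihood=2, impact=5),
    FailureMode("cache", "cache node evicted", likelihood=4, impact=2),
    FailureMode("api", "dependency timeout", likelihood=4, impact=4),
]

for mode in prioritize(modes):
    print(f"{mode.risk:>2}  {mode.component}: {mode.description}")
```

A ranked list like this is only a starting point; the response and recovery behavior for each mode still has to be worked out per component.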


RE:04 Define reliability and recovery targets

Impact: Medium
Category: High Availability

APRL GUID:  2c41b97c-af27-47b5-aafb-81bbf95fe8ba

Description:

Define reliability and recovery targets for the components, the flows, and the overall solution. Use the defined targets to build the health model. The health model defines what healthy, degraded, and unhealthy states look like.

Potential Benefits:

Communicate reliability expectations with stakeholders

Learn More:

RE:04 Target metrics
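
To make the targets in RE:04 concrete, the sketch below converts an availability target into a downtime budget and classifies a measured value into the three health states named in the description. The function names, the 30-day period, and the `degraded_margin` tolerance are assumptions for illustration, not values taken from the framework.

```python
def allowed_downtime_minutes(availability_pct: float, period_days: int = 30) -> float:
    """Minutes of downtime an availability target permits over the period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


def health_state(measured_pct: float, target_pct: float, degraded_margin: float = 0.5) -> str:
    """Classify measured availability against the target.

    degraded_margin is a hypothetical tolerance (in percentage points) below
    the target that still counts as degraded rather than unhealthy.
    """
    if measured_pct >= target_pct:
        return "healthy"
    if measured_pct >= target_pct - degraded_margin:
        return "degraded"
    return "unhealthy"


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% over 30 days allows {minutes:.1f} minutes of downtime")
```

For example, a 99.9% target over 30 days leaves roughly 43 minutes of allowed downtime, which is a useful figure to share with stakeholders when agreeing on targets.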


RE:05 Design for redundancy

Impact: Medium
Category: High Availability

APRL GUID:  e404ef3f-e427-4e43-a1df-09da987e744f

Description:

Add redundancy at different levels, especially for critical flows. Apply redundancy to the compute, data, network, and other infrastructure tiers in accordance with the identified reliability targets.

Potential Benefits:

Optimize for resiliency

Learn More:

RE:05 Redundancy
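
A rough way to reason about how much redundancy a reliability target calls for is the standard availability arithmetic below. It assumes that replicas fail independently, which is often only approximately true in practice, and the function names are illustrative.

```python
import math


def redundant_availability(single: float, replicas: int) -> float:
    """Availability of N independent replicas where any one can serve traffic."""
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be a fraction between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The group is down only when every replica is down at the same time.
    return 1 - (1 - single) ** replicas


def serial_availability(parts: list[float]) -> float:
    """Availability of a flow that needs every listed component to be up."""
    return math.prod(parts)


one = 0.99
print(f"1 replica:  {redundant_availability(one, 1):.6f}")
print(f"2 replicas: {redundant_availability(one, 2):.6f}")
print(f"two 99.9% tiers in series: {serial_availability([0.999, 0.999]):.6f}")
```

The serial case shows why redundancy matters most on critical flows: every additional dependency in the path lowers the combined figure.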


RE:05 Design for multi-region high availability

Impact: Medium
Category: High Availability

APRL GUID:  df93ae26-260e-408f-860c-42cd189f8bf8

Description:

High availability is a foundational tenet of designing for reliability. A highly available architecture can help you avoid downtime as much as possible and recover efficiently if downtime does occur.

Potential Benefits:

Minimize downtime from regional outages

Learn More:

RE:05 High-availability multi-region design


RE:05 Design for high availability with availability zones

Impact: Medium
Category: High Availability

APRL GUID:  3d6adb0a-042f-47f7-a7ea-db2e360903d5

Description:

High availability is a foundational tenet of designing for reliability. A highly available architecture can help you avoid downtime as much as possible and recover efficiently if downtime does occur.

Potential Benefits:

Minimize downtime from zonal outages

Learn More:

Regions and availability zones


RE:06 Design for data partitioning

Impact: Medium
Category: High Availability

APRL GUID:  7f0b9ea3-0159-4ea7-b854-a4313fe76d7f

Description:

Partitioning data improves scalability, reduces contention, and optimizes performance. Implement data partitioning to divide data by usage pattern.

Potential Benefits:

Improve data estate reliability

Learn More:

RE:06 Data partitioning
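
As one possible illustration of RE:06, the sketch below assigns records to partitions by hashing a key. Hash partitioning is only one strategy and is chosen here as an assumption; the recommendation itself speaks of dividing data by usage pattern, which may instead call for range or functional partitioning. The key format is hypothetical.

```python
import hashlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition index in the range [0, partition_count)."""
    if partition_count < 1:
        raise ValueError("partition_count must be at least 1")
    # A cryptographic digest gives a stable result across processes, unlike
    # the built-in hash(), which is randomized per process for strings.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


for customer_id in ("customer-1", "customer-2", "customer-3"):
    print(customer_id, "->", partition_for(customer_id, partition_count=4))
```

Note that changing `partition_count` remaps most keys with this simple modulo scheme, so a real design would need a plan for repartitioning.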


RE:06 Design for reliable scaling

Impact: Medium
Category: Scalability

APRL GUID:  340fe5c3-d599-448a-8e52-15e96771a3f0

Description:

Implement a timely and reliable scaling strategy at the application, data, and infrastructure levels.

Potential Benefits:

Dynamically handle increased load

Learn More:

RE:06 Scaling
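
A scaling strategy ultimately reduces to a rule that turns an observed metric into an instance count. The sketch below uses a proportional rule with lower and upper bounds; the CPU metric, the target value, and the bounds are assumptions for illustration and do not describe how any particular Azure autoscale feature works.

```python
import math


def desired_instances(
    current: int,
    observed_cpu: float,
    target_cpu: float,
    minimum: int = 2,
    maximum: int = 10,
) -> int:
    """Scale the instance count in proportion to observed versus target load."""
    if target_cpu <= 0:
        raise ValueError("target_cpu must be positive")
    raw = math.ceil(current * observed_cpu / target_cpu)
    # Clamp so that the workload keeps a redundancy floor and a cost ceiling.
    return max(minimum, min(maximum, raw))


print(desired_instances(current=4, observed_cpu=90, target_cpu=60))  # scale out
print(desired_instances(current=4, observed_cpu=30, target_cpu=60))  # scale in
```

Keeping a minimum above one instance ties the scaling rule back to the redundancy targets, so that scaling in never removes the last spare instance.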


RE:07 Use background jobs

Impact: Medium
Category: Other Best Practices

APRL GUID:  4e1094dd-2d85-4a1a-8ca8-1e6ea21206fb

Description:

Background jobs help minimize the load on the application UI, which improves availability and reduces interactive response time.

Potential Benefits:

Minimize application load

Learn More:

RE:07 Background jobs
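
The idea behind RE:07 background jobs can be shown with an in-process queue and a worker thread: the interactive path only enqueues work and returns. This is a minimal sketch using the Python standard library; a production workload would more likely use a durable queue service, and the `start_worker` helper is a hypothetical name.

```python
import queue
import threading


def start_worker(jobs: "queue.Queue", results: list) -> threading.Thread:
    """Start a daemon thread that runs queued callables until it sees None."""

    def run() -> None:
        while True:
            job = jobs.get()
            try:
                if job is None:  # sentinel: stop the worker
                    return
                results.append(job())
            except Exception as exc:
                # Record the failure so one bad job does not stop the worker.
                results.append(exc)
            finally:
                jobs.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


jobs: "queue.Queue" = queue.Queue()
results: list = []
worker = start_worker(jobs, results)

# The request path only enqueues work and returns immediately.
jobs.put(lambda: sum(range(1000)))
jobs.put(None)

jobs.join()    # wait until every queued item has been processed
worker.join()
print(results)
```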


RE:07 Implement self-preservation and self-healing measures

Impact: Medium
Category: High Availability

APRL GUID:  7b5008cf-1853-44c4-827d-bca091678c3f

Description:

Strengthen the resiliency and recoverability of your workload by implementing self-preservation and self-healing measures. Self-healing capabilities help you avoid downtime by building in failure detection and automatic corrective actions to respond to different failure types.

Potential Benefits:

Reduce the likelihood of outages

Learn More:

RE:07 Self-preservation
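
One widely used self-preservation measure is a circuit breaker, which detects repeated failures and stops calling a failing dependency for a cool-down period. The sketch below is a simplified illustration; the class name, the thresholds, and the injectable `clock` parameter are assumptions, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after      # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None              # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=2, reset_after=5.0)
print(breaker.call(lambda: "dependency responded"))
```

Failing fast while the circuit is open protects both the caller, which avoids waiting on timeouts, and the dependency, which gets room to recover.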


RE:07 Handle transient faults

Impact: Medium
Category: High Availability

APRL GUID:  66ae4a5c-7f58-4293-bed8-5caa4f9f34e2

Description:

Build capabilities into the solution by using infrastructure-based reliability patterns and software-based design patterns to handle component failures and transient errors.

Potential Benefits:

Reduce the likelihood of outages

Learn More:

RE:07 Transient faults
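
A common software-based pattern for transient errors is retrying with exponential backoff and jitter. The following is a minimal sketch under stated assumptions: `TransientError` is a hypothetical marker exception, the delay values are arbitrary, and the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def retry(func, attempts=4, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call func, retrying on TransientError with exponential backoff and full jitter."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TransientError:
            if attempt == attempts:
                raise  # retries exhausted: surface the error to the caller
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))  # jitter spreads out retry storms


state = {"calls": 0}


def flaky_operation():
    # Fails twice, then succeeds, to imitate a short-lived fault.
    state["calls"] += 1
    if state["calls"] < 3:
        raise TransientError("temporarily unavailable")
    return "ok"


print(retry(flaky_operation, base_delay=0.01))
```

Only errors known to be transient should be retried; retrying a permanent failure just adds load and delays the error report.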


RE:08 Design a reliability testing strategy

Impact: Medium
Category: Other Best Practices

APRL GUID:  7db74a6a-4062-46a8-a0cd-18684fb0ec08

Description:

Test resiliency and availability scenarios by applying the principles of chaos engineering in your test and production environments. Perform active malfunction and simulated load testing to verify that your graceful degradation implementation and scaling strategies are effective.

Potential Benefits:

Validate and optimize workload reliability

Learn More:

RE:08 Testing
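
At the code level, active malfunction testing can be approximated by wrapping a dependency so that it fails on purpose and then checking that the caller degrades gracefully. The sketch below is an in-process illustration only; `with_faults`, `lookup_price`, and `price_or_default` are hypothetical names, and a real chaos experiment would typically inject faults at the infrastructure level.

```python
import random


class InjectedFault(RuntimeError):
    """Raised by the wrapper to simulate a dependency failure."""


def with_faults(func, failure_rate: float, rng: random.Random):
    """Wrap func so that a fraction of calls fail with InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")

    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")
        return func(*args, **kwargs)

    return wrapper


def lookup_price(sku: str) -> float:
    return 10.0  # stand-in for a call to a real dependency


def price_or_default(fetch, sku: str, default: float = 0.0) -> float:
    """Graceful degradation: fall back to a default when the dependency fails."""
    try:
        return fetch(sku)
    except Exception:
        return default


always_failing = with_faults(lookup_price, failure_rate=1.0, rng=random.Random(0))
print(price_or_default(always_failing, "sku-1"))  # falls back to the default
```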


RE:09 Implement business continuity and disaster recovery plan

Impact: Medium
Category: Disaster Recovery

APRL GUID:  5f95df03-cae2-4761-90b7-7afd657ac124

Description:

Implement structured, tested, and documented business continuity and disaster recovery (BCDR) plans that align with the recovery targets. Plans must cover all components and the system as a whole.

Potential Benefits:

Reliable disaster recovery

Learn More:

RE:09 Disaster recovery


RE:10 Design a reliable monitoring and alerting strategy

Impact: Medium
Category: Monitoring and Alerting

APRL GUID:  90adebf7-bc90-4939-9aa8-119c46bee0fc

Description:

Measure and publish the solution's health indicators. Continuously capture uptime and other reliability data from across the workload as well as from individual components and key flows.

Potential Benefits:

Observability into workload health

Learn More:

RE:10 Monitoring and alerting
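
Capturing uptime per component, as RE:10 describes, can be sketched as an aggregation over health probe results. The sample format of `(component, ok)` pairs and the component names are assumptions for illustration; in practice this data would come from a monitoring platform rather than an in-memory list.

```python
from collections import defaultdict
from typing import Iterable


def uptime_percent(samples: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Return the percentage of successful probes for each component."""
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for component, ok in samples:
        totals[component] += 1
        if ok:
            successes[component] += 1
    return {name: 100.0 * successes[name] / totals[name] for name in totals}


probe_results = [
    ("api", True),
    ("api", True),
    ("api", False),
    ("api", True),
    ("database", True),
]

for name, pct in uptime_percent(probe_results).items():
    print(f"{name}: {pct:.1f}% of probes succeeded")
```

Publishing these per-component figures alongside the workload-level number makes it easier to see which component is eroding the overall target.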