Cosmos DB
The resiliency recommendations in this guidance cover Cosmos DB and its associated resources and settings.
Summary of Recommendations
Recommendation | Category | Impact | State | ARG Query Available |
---|---|---|---|---|
COSMOS-1 - Configure at least two regions for high availability | Availability | High | Verified | Yes |
COSMOS-2 - Enable service-managed failover for multi-region accounts with single write region | Disaster Recovery | High | Verified | Yes |
COSMOS-3 - Evaluate multi-region write capability | Disaster Recovery | High | Verified | Yes |
COSMOS-4 - Choose appropriate consistency mode reflecting data durability requirements | Disaster Recovery | High | Preview | No |
COSMOS-5 - Configure continuous backup mode | Disaster Recovery | High | Verified | Yes |
COSMOS-6 - Ensure query results are fully drained | System Efficiency | High | Verified | No |
COSMOS-7 - Maintain singleton pattern in your client | System Efficiency | Medium | Verified | No |
COSMOS-8 - Implement retry logic in your client | Application Resilience | Medium | Verified | No |
COSMOS-9 - Monitor Cosmos DB health and set up alerts | Monitoring | Medium | Verified | No |
Recommendation Details
COSMOS-1 - Configure at least two regions for high availability
Category: Availability
Impact: High
Guidance
Azure implements a multi-tier isolation approach with rack, datacenter, zone, and region isolation levels. Cosmos DB is highly resilient by default because it runs four replicas, but it is still susceptible to failures affecting an entire region or availability zone. It is therefore crucial to enable at least one secondary region on your Cosmos DB account to achieve a higher SLA. Doing so incurs no downtime and is as easy as selecting a pin on a map. Cosmos DB accounts that use Strong consistency need at least three regions to retain write availability if one region fails.
Resources
- Distribute data globally with Azure Cosmos DB | Microsoft Learn
- Tips for building highly available applications | Microsoft Learn
Resource Graph Query
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) < 2 or
(array_length(properties.locations) < 3 and properties.consistencyPolicy.defaultConsistencyLevel == 'Strong')
| project recommendationId='cosmos-1', name, id, tags
COSMOS-2 - Enable service-managed failover for multi-region accounts with single write region
Category: Disaster Recovery
Impact: High
Guidance
Cosmos DB is a battle-tested service with extremely high uptime and resiliency, but even the most resilient systems occasionally run into problems. Should a region become unavailable, the service-managed failover option allows Azure Cosmos DB to fail over automatically to the next available region with no user action needed.
Resources
Resource Graph Query
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) > 1 and
tobool(properties.enableAutomaticFailover) == false and
tobool(properties.enableMultipleWriteLocations) == false
| project recommendationId='cosmos-2', name, id, tags
COSMOS-3 - Evaluate multi-region write capability
Category: Disaster Recovery
Impact: High
Guidance
Multi-region write capability enables you to design a multi-region application that is inherently highly available by virtue of being active in multiple regions. This, however, requires that you pay close attention to consistency requirements and handle potential write conflicts by way of a conflict resolution policy. Blindly enabling this configuration can decrease availability through unexpected application behavior and can corrupt data through unhandled conflicts.
Resources
- Distribute data globally with Azure Cosmos DB | Microsoft Learn
- Conflict resolution types and resolution policies in Azure Cosmos DB | Microsoft Learn
Resource Graph Query
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
array_length(properties.locations) > 1 and
properties.enableMultipleWriteLocations == false
| project recommendationId='cosmos-3', name, id, tags
COSMOS-4 - Choose appropriate consistency mode reflecting data durability requirements
Category: Disaster Recovery
Impact: High
Guidance
Within a globally distributed database environment, there is a direct relationship between the consistency level and data durability in the presence of a region-wide outage. As you develop your business continuity plan, you need to understand the maximum period of recent data updates the application can tolerate losing when recovering after a disruptive event. We recommend using Session consistency unless you have established that a stronger consistency mode is needed, you are willing to tolerate higher write latencies, and you understand that outages in read-only regions can affect the ability of the write region to accept writes.
Resources
Resource Graph Query
// under-development
COSMOS-5 - Configure continuous backup mode
Category: Disaster Recovery
Impact: High
Guidance
Cosmos DB automatically backs up your data, and there is no way to turn backups off. In short, you are always protected. But should a mishap occur (a process that went haywire and deleted data it shouldn’t have, customer data that was overwritten by accident, and so on), minimizing the time it takes to revert the changes is of the essence. With continuous mode, you can self-serve restore your database or collection to a point in time before the mishap occurred. With periodic mode, however, you must contact Microsoft support, which, despite our efforts to provide speedy help, will inevitably increase the restore time.
Resources
Resource Graph Query
Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
properties.backupPolicy.type == 'Periodic' and
properties.enableMultipleWriteLocations == false and
properties.enableAnalyticalStorage == false
| project recommendationId='cosmos-5', name, id, tags
COSMOS-6 - Ensure query results are fully drained
Category: System Efficiency
Impact: High
Guidance
Cosmos DB limits a single response to 4 MB. If your query requests a large amount of data, or data from multiple backend partitions, the results will span multiple pages, and a separate request must be issued for each page. Each result page indicates whether more results are available and provides a continuation token to access the next page. Your code must include a while loop that traverses the pages until no more results are available.
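The paging loop described above can be sketched as follows. This is a minimal illustration in plain Python, not the SDK's API: `fetch_page`, `drain_query`, and `make_fake_fetcher` are hypothetical names, and the in-memory fetcher only stands in for the service so the loop can be run on its own.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical page fetcher: takes a continuation token (None for the first
# page) and returns (items, next_token); next_token is None on the last page.
Page = Tuple[List[dict], Optional[str]]
FetchPage = Callable[[Optional[str]], Page]


def drain_query(fetch_page: FetchPage) -> Iterator[dict]:
    """Yield every item from every page until no continuation token remains."""
    token: Optional[str] = None
    while True:
        items, token = fetch_page(token)
        # Do not stop on an empty page; only a missing token ends the loop.
        yield from items
        if token is None:
            break


def make_fake_fetcher(pages: List[List[dict]]) -> FetchPage:
    """In-memory stand-in for the service, used only to exercise the loop."""
    def fetch(token: Optional[str]) -> Page:
        index = 0 if token is None else int(token)
        next_token = str(index + 1) if index + 1 < len(pages) else None
        return pages[index], next_token
    return fetch
```

The key design choice in this sketch is that the loop terminates on the absence of a continuation token rather than on an empty page, so results that follow an empty page are not silently dropped.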
Resources
Resource Graph Query
// under-development
COSMOS-7 - Maintain singleton pattern in your client
Category: System Efficiency
Impact: Medium
Guidance
Not only is establishing a new database connection expensive, so is maintaining it. As such, it is critical to maintain only one instance, a so-called “singleton”, of the SDK client per account per application. Connections, both HTTP and TCP, are scoped to the client instance. Most compute environments limit the number of connections that can be open at the same time, and when these limits are reached, connectivity is affected.
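One way to keep a single client per account is a small, lock-protected registry keyed by account endpoint. The sketch below assumes nothing about the real SDK: `FakeClient` is a hypothetical stand-in for an SDK client, and `get_client` is an illustrative helper name.

```python
import threading
from typing import Dict


class FakeClient:
    """Hypothetical stand-in for an SDK client; a real one would own connections."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint


_clients: Dict[str, FakeClient] = {}
_lock = threading.Lock()


def get_client(endpoint: str) -> FakeClient:
    """Return the single shared client for an account endpoint, creating it once."""
    with _lock:  # the lock prevents two threads from creating duplicate clients
        client = _clients.get(endpoint)
        if client is None:
            client = FakeClient(endpoint)
            _clients[endpoint] = client
        return client
```

Callers would request the client through `get_client` everywhere instead of constructing their own, so the number of client instances stays at one per account regardless of how many requests the application serves.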
Resources
- Designing resilient applications with Azure Cosmos DB SDKs | Microsoft Learn
Resource Graph Query
// under-development
COSMOS-8 - Implement retry logic in your client
Category: Application Resilience
Impact: Medium
Guidance
By default, Cosmos DB SDKs handle a large number of transient errors and automatically retry operations where possible. That said, your application should add retry policies for certain well-defined cases that the SDKs cannot handle generically.
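An application-level retry policy might look like the sketch below, which retries with exponential backoff and jitter. Everything here is an assumption for illustration: `TransientError` is a hypothetical marker for whatever failures your application has decided are safe to retry, and the attempt count and delays are placeholder values, not recommended settings.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures the application considers safe to retry."""


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying TransientError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay))
    raise AssertionError("unreachable")
```

Only operations that are safe to repeat should be wrapped this way, and the policy should sit on top of the SDK's built-in retries rather than replace them.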
Resources
Resource Graph Query
// under-development
COSMOS-9 - Monitor Cosmos DB health and set up alerts
Category: Monitoring
Impact: Medium
Guidance
It is good practice to monitor the availability and responsiveness of your Azure Cosmos DB resources and have alerts in place for your workload to stay proactive in case an unforeseen event occurs.
Resources
Resource Graph Query
// under-development