Cosmos DB


The resiliency recommendations in this guidance cover Cosmos DB and its associated resources and settings.

Summary of Recommendations

Recommendations Details

COSMOS-1 - Configure at least two regions for high availability

Category: Availability

Impact: High

Guidance

Azure implements a multi-tier isolation approach with rack, datacenter, zone, and region isolation levels. Cosmos DB is highly resilient by default because it runs four replicas, but it is still susceptible to failures that affect an entire region or availability zone. It is therefore crucial to add at least one secondary region to your Cosmos DB account to achieve a higher SLA. Adding a region incurs no downtime and is as easy as selecting a pin on a map. Cosmos DB accounts that use Strong consistency need at least three regions to retain write availability if one region fails.

Resources

Resource Graph Query

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) < 2 or
    (array_length(properties.locations) < 3 and properties.consistencyPolicy.defaultConsistencyLevel == 'Strong')
| project recommendationId='cosmos-1', name, id, tags



COSMOS-2 - Enable service-managed failover for multi-region accounts with single write region

Category: Disaster Recovery

Impact: High

Guidance

Cosmos DB is a battle-tested service with extremely high uptime and resiliency, but even the most resilient systems occasionally run into problems. Should a region become unavailable, the service-managed failover option allows Azure Cosmos DB to fail over automatically to the next available region with no user action needed.

Resources

Resource Graph Query

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) > 1 and
    tobool(properties.enableAutomaticFailover) == false and
    tobool(properties.enableMultipleWriteLocations) == false
| project recommendationId='cosmos-2', name, id, tags



COSMOS-3 - Evaluate multi-region write capability

Category: Disaster Recovery

Impact: High

Guidance

Multi-region write capability enables you to design a multi-region application that is inherently highly available by virtue of being active in multiple regions. This, however, requires that you pay close attention to consistency requirements and handle potential write conflicts through a conflict resolution policy. Blindly enabling this configuration can decrease availability because of unexpected application behavior, and can corrupt data because of unhandled conflicts.

Resources

Resource Graph Query

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    array_length(properties.locations) > 1 and
    properties.enableMultipleWriteLocations == false
| project recommendationId='cosmos-3', name, id, tags



COSMOS-4 - Choose appropriate consistency mode reflecting data durability requirements

Category: Disaster Recovery

Impact: High

Guidance

Within a globally distributed database environment, there is a direct relationship between the consistency level and data durability in the presence of a region-wide outage. As you develop your business continuity plan, you need to understand the maximum period of recent data updates that the application can tolerate losing when it recovers after a disruptive event. We recommend using Session consistency unless you have established that a stronger consistency mode is needed, you are willing to tolerate higher write latencies, and you understand that outages in read-only regions can affect the ability of the write region to accept writes.

Resources

Resource Graph Query

// under-development



COSMOS-5 - Configure continuous backup mode

Category: Disaster Recovery

Impact: High

Guidance

Cosmos DB automatically backs up your data, and there is no way to turn backups off. In short, you are always protected. But should a mishap occur (for example, a runaway process deletes data it shouldn’t, or customer data is overwritten by accident), minimizing the time it takes to revert the changes is of the essence. With continuous mode, you can restore your database or collection yourself to a point in time before the mishap occurred. With periodic mode, however, you must contact Microsoft support, which inevitably increases the restore time even though we strive to respond quickly.

Resources

Resource Graph Query

Resources
| where type =~ 'Microsoft.DocumentDb/databaseAccounts'
| where
    properties.backupPolicy.type == 'Periodic' and
    properties.enableMultipleWriteLocations == false and
    properties.enableAnalyticalStorage == false
| project recommendationId='cosmos-5', name, id, tags



COSMOS-6 - Ensure query results are fully drained

Category: System Efficiency

Impact: High

Guidance

Cosmos DB limits a single response to 4 MB. If your query requests a large amount of data, or data from multiple backend partitions, the results will span multiple pages, and a separate request must be issued for each page. Each result page indicates whether more results are available and provides a continuation token to access the next page. You must include a while loop in your code and traverse the pages until no more results are available.
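
The paging loop described above can be sketched as follows. This is a minimal, SDK-independent illustration in Python, not the API of any Cosmos DB SDK: `Page`, `drain_query`, and `make_fake_backend` are hypothetical names, and `fetch_page` stands in for whatever call your SDK uses to run a query with an optional continuation token. Many SDKs already wrap this loop in an iterator, in which case draining simply means iterating to the end.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Page:
    """Illustrative result page: items plus an optional continuation token."""
    items: List[dict]
    continuation: Optional[str]  # None means there are no more pages


def drain_query(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every item of a paged query by following continuation tokens."""
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        yield from page.items
        token = page.continuation
        # Stop on a missing token, not on an empty page.
        if token is None:
            break


def make_fake_backend(pages: List[List[dict]]) -> Callable[[Optional[str]], Page]:
    """Hypothetical in-memory backend used only to exercise the loop."""
    def fetch_page(token: Optional[str]) -> Page:
        index = 0 if token is None else int(token)
        items = pages[index] if index < len(pages) else []
        next_token = str(index + 1) if index + 1 < len(pages) else None
        return Page(items=items, continuation=next_token)
    return fetch_page


if __name__ == "__main__":
    backend = make_fake_backend([[{"id": 1}], [], [{"id": 2}, {"id": 3}]])
    for item in drain_query(backend):
        print(item)
```

The loop ends only when the continuation token is absent. It does not treat an empty page as the end of the results, because a page can be empty while more results remain.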

Resources

Resource Graph Query

// under-development



COSMOS-7 - Maintain singleton pattern in your client

Category: System Efficiency

Impact: Medium

Guidance

Establishing a new database connection is expensive, and so is maintaining it. It is therefore critical to maintain only one instance, a so-called “singleton”, of the SDK client per account per application. Connections, both HTTP and TCP, are scoped to the client instance. Most compute environments limit the number of connections that can be open at the same time, and when these limits are reached, connectivity is affected.
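
One way to enforce a single client per account is a small process-wide registry. The following Python sketch is an illustration under stated assumptions rather than an SDK feature: `ClientRegistry` and `FakeClient` are hypothetical names, and the `factory` callable stands in for however your application constructs the real SDK client for an account endpoint.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

T = TypeVar("T")


class ClientRegistry(Generic[T]):
    """Keeps one client instance per account endpoint for the whole process."""

    def __init__(self, factory: Callable[[str], T]) -> None:
        self._factory = factory            # builds a client for an endpoint
        self._clients: Dict[str, T] = {}
        self._lock = threading.Lock()      # guards creation across threads

    def get(self, endpoint: str) -> T:
        """Return the shared client for endpoint, creating it on first use."""
        with self._lock:
            client = self._clients.get(endpoint)
            if client is None:
                client = self._factory(endpoint)
                self._clients[endpoint] = client
            return client


class FakeClient:
    """Hypothetical stand-in for an SDK client; holds only the endpoint."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint


# Module-level registry: import this object wherever a client is needed.
registry = ClientRegistry(FakeClient)

if __name__ == "__main__":
    a = registry.get("https://account-one.example")
    b = registry.get("https://account-one.example")
    print(a is b)  # the same instance is reused
```

Creating the registry once at module level and calling `registry.get(...)` from request handlers avoids constructing a new client, and therefore new connections, for each request.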

Resources

Resource Graph Query

// under-development



COSMOS-8 - Implement retry logic in your client

Category: Application Resilience

Impact: Medium

Guidance

By default, the Cosmos DB SDKs handle a large number of transient errors and automatically retry operations where possible. That said, your application should add retry policies for certain well-defined cases that the SDKs cannot handle generically.
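
An application-level retry policy can be as simple as a wrapper with a bounded number of attempts and exponential backoff. The Python sketch below is a generic illustration, not the retry mechanism of any Cosmos DB SDK: `TransientError` is a hypothetical exception that stands in for whichever failures your application has decided are safe to retry, and `with_retries`, `max_attempts`, and `base_delay` are illustrative names and defaults.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for failures the application considers retryable."""


def with_retries(
    operation: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying TransientError with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # The delay doubles on each attempt; jitter spreads out
            # retries from clients that failed at the same moment.
            delay = base_delay * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

A call site would wrap a single operation, for example `with_retries(lambda: save_order(order))`, where `save_order` is a hypothetical application function. Retrying is only safe when repeating the operation cannot cause duplicate or conflicting writes, so the set of errors mapped to `TransientError` should stay narrow.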

Resources

Resource Graph Query

// under-development



COSMOS-9 - Monitor Cosmos DB health and set up alerts

Category: Monitoring

Impact: Medium

Guidance

It is good practice to monitor the availability and responsiveness of your Azure Cosmos DB resources and to configure alerts for your workload, so that you can respond quickly when an unforeseen event occurs.

Resources

Resource Graph Query

// under-development