Azure Proactive Resiliency Library v2

workspaces

Summary

Recommendation | Impact | Category | Automation Available | PG Verified
Databricks runtime version is not latest or is not LTS version | Medium | Governance | No | Verified
Use Databricks Pools | High | Scalability | No | Verified
Use SSD backed VMs for Worker VM Type and Driver type | Medium | Scalability | No | Verified
Enable autoscaling for batch workloads | High | Scalability | No | Verified
Enable autoscaling for SQL warehouse | High | Scalability | No | Verified
Use Delta Live Tables enhanced autoscaling | Medium | Scalability | No | Verified
Automatic Job Termination is enabled, ensure there are no user-defined local processes | Medium | High Availability | No | Verified
Enable Logging-Cluster log delivery | Medium | Monitoring and Alerting | No | Verified
Use Delta Lake for higher reliability | High | High Availability | No | Verified
Use Photon Acceleration | Low | High Availability | No | Verified
Automatically rescue invalid or nonconforming data with Databricks Auto Loader or Delta Live Tables | Low | Business Continuity | No | Verified
Configure jobs for automatic retries and termination | High | High Availability | No | Verified
Use a scalable and production-grade model serving infrastructure | High | Scalability | No | Verified
Use a layered storage architecture | Medium | High Availability | No | Verified
Improve data integrity by reducing data redundancy | Low | Business Continuity | No | Verified
Actively manage schemas | Medium | Other Best Practices | No | Verified
Use constraints and data expectations | Low | Business Continuity | No | Verified
Create regular backups | Low | Disaster Recovery | No | Verified
Recover from Structured Streaming query failures | High | High Availability | No | Verified
Recover ETL jobs based on Delta time travel | Medium | Disaster Recovery | No | Verified
Use Databricks Workflows and built-in recovery | Low | Disaster Recovery | No | Verified
Configure a disaster recovery pattern | High | Disaster Recovery | No | Preview
Automate deployments and workloads | High | Other Best Practices | No | Preview
Set up monitoring, alerting, and logging | High | Monitoring and Alerting | No | Preview
Deploy workspaces in separate Subscriptions | High | Scalability | No | Preview
Isolate each workspace in its own VNet | High | Scalability | No | Preview
Do not Store any Production Data in Default DBFS Folders | High | High Availability | No | Preview
Do not use Azure Spot VMs for critical Production workloads | High | High Availability | No | Preview
Evaluate regional isolation for workspaces | High | High Availability | No | Preview
Define alternate VM SKUs | Medium | Personalized | No | Preview

Details


Databricks runtime version is not latest or is not LTS version

Impact:  Medium Category:  Governance PG Verified:  Verified

APRL GUID:  0e835cc2-2551-a247-b1f1-3c5f25c9cb70

Description:

Databricks recommends migrating workloads to the latest or LTS version of its runtime for enhanced stability and support. If on Runtime 11.3 LTS or above, move directly to the latest 12.x version. If below, first migrate to 11.3 LTS, then to the latest 12.x version as per the migration guide.

Potential Benefits:

Enhanced stability and support
Learn More:
Databricks runtime support lifecycles

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use Databricks Pools

Impact:  High Category:  Scalability PG Verified:  Verified

APRL GUID:  c166602e-0804-e34b-be8f-09b4d56e1fcd

Description:

Databricks pools pre-provision VMs, reducing risks of provisioning errors during cluster start or scale, enhancing reliability.
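As a rough illustration, a cluster can take its nodes from a pool by referencing the pool in its specification. The Python sketch below builds such a request body. It assumes the Clusters API field names `instance_pool_id` and `driver_instance_pool_id`; the cluster name, pool ID, and runtime version are placeholders, not values from this page.

```python
import json


def pooled_cluster_spec(pool_id: str, spark_version: str, num_workers: int) -> dict:
    """Build a cluster spec whose driver and workers both come from one pool."""
    # node_type_id is deliberately absent: the pool determines the VM type.
    return {
        "cluster_name": "pooled-cluster",  # hypothetical name
        "spark_version": spark_version,
        "instance_pool_id": pool_id,
        "driver_instance_pool_id": pool_id,
        "num_workers": num_workers,
    }


spec = pooled_cluster_spec("<pool-id>", "<runtime-version>", 4)
print(json.dumps(spec, indent=2))
```

Using the same pool for the driver and the workers is only one option; a separate driver pool could be passed in instead.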

Potential Benefits:

Reduces provisioning errors
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use SSD backed VMs for Worker VM Type and Driver type

Impact:  Medium Category:  Scalability PG Verified:  Verified

APRL GUID:  5877a510-8444-7a4c-8412-a8dab8662f7e

Description:

Upgrade HDDs in premium VMs to SSDs for better speed and reliability. Premium SSDs boost IO-heavy applications, while Standard SSDs balance cost and performance. The upgrade is ideal for critical workloads; it improves connectivity and requires only a brief reboot. Consider it for vital VMs.

Potential Benefits:

Faster, reliable VM performance
Learn More:
Azure managed disk types

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Enable autoscaling for batch workloads

Impact:  High Category:  Scalability PG Verified:  Verified

APRL GUID:  5c72f0d6-55ec-d941-be84-36c194fa78c0

Description:

Autoscaling adjusts cluster sizes automatically based on workload demands, which benefits many use cases in terms of both cost and performance. The linked best practices include guidance on when and how to best use autoscaling. For streaming workloads, Delta Live Tables with autoscaling is advised.
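A minimal sketch of what an autoscaling cluster specification might look like, assuming the Clusters API `autoscale` block with `min_workers` and `max_workers`. The VM size and runtime version are placeholders, and the bounds check is a choice made for this example only.

```python
def autoscaling_cluster_spec(
    node_type_id: str, spark_version: str, min_workers: int, max_workers: int
) -> dict:
    """Build a cluster spec that scales between min_workers and max_workers."""
    if not 1 <= min_workers <= max_workers:
        raise ValueError("expected 1 <= min_workers <= max_workers")
    return {
        "spark_version": spark_version,
        "node_type_id": node_type_id,
        # An autoscale block replaces the fixed num_workers setting.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
    }


batch_spec = autoscaling_cluster_spec("<vm-size>", "<runtime-version>", 2, 8)
print(batch_spec["autoscale"])
```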

Potential Benefits:

Cost and performance optimization
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Enable autoscaling for SQL warehouse

Impact:  High Category:  Scalability PG Verified:  Verified

APRL GUID:  362ad2b6-b92c-414f-980a-0cf69467ccce

Description:

The scaling parameter of a SQL warehouse defines the min and max number of clusters for distributing queries. By default, it's set to one. Increasing the cluster count can accommodate more concurrent users effectively.
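The scaling range could be expressed as follows. This is a sketch that assumes the SQL Warehouses API fields `min_num_clusters` and `max_num_clusters`; the warehouse name and size are illustrative.

```python
def warehouse_scaling(
    name: str, cluster_size: str, max_clusters: int, min_clusters: int = 1
) -> dict:
    """Build a SQL warehouse payload with an explicit scaling range."""
    if not 1 <= min_clusters <= max_clusters:
        raise ValueError("expected 1 <= min_clusters <= max_clusters")
    return {
        "name": name,
        "cluster_size": cluster_size,
        "min_num_clusters": min_clusters,
        # Raising the maximum lets the warehouse serve more concurrent users.
        "max_num_clusters": max_clusters,
    }


payload = warehouse_scaling("reporting-warehouse", "Small", max_clusters=4)
print(payload)
```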

Potential Benefits:

Improves concurrency and efficiency
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use Delta Live Tables enhanced autoscaling

Impact:  Medium Category:  Scalability PG Verified:  Verified

APRL GUID:  cd77db98-9b13-6e4b-bd2b-74c2cb538628

Description:

Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
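For orientation, enhanced autoscaling is selected per pipeline cluster. The sketch below assumes the pipeline settings accept an `autoscale` block with `mode` set to `ENHANCED`; the pipeline name and worker counts are placeholders.

```python
def enhanced_autoscaling_cluster(min_workers: int, max_workers: int) -> dict:
    """Build a pipeline cluster entry that requests enhanced autoscaling."""
    return {
        "label": "default",
        "autoscale": {
            "min_workers": min_workers,
            "max_workers": max_workers,
            "mode": "ENHANCED",
        },
    }


pipeline_settings = {
    "name": "example-pipeline",  # hypothetical pipeline name
    "clusters": [enhanced_autoscaling_cluster(1, 5)],
}
print(pipeline_settings["clusters"][0]["autoscale"]["mode"])
```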

Potential Benefits:

Optimized resource use and minimal latency
Learn More:
Best practices for reliability
Databricks enhanced autoscaling

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Automatic Job Termination is enabled, ensure there are no user-defined local processes

Impact:  Medium Category:  High Availability PG Verified:  Verified

APRL GUID:  3d3e53b5-ebd1-db42-b43b-d4fad74824ec

Description:

To conserve resources, you can terminate a cluster; its configuration is stored so that it can be reused or autostarted by jobs later. Clusters can also auto-terminate after a period of inactivity, but the inactivity check tracks only Spark jobs, not user-defined local processes, which might still be running after the Spark jobs end.
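One way to audit this setting is to inspect each cluster's specification. The helper below is a sketch that assumes the Clusters API field `autotermination_minutes`, where 0 means auto-termination is disabled. It cannot detect local processes, which still have to be reviewed separately.

```python
def auto_termination_enabled(cluster: dict) -> bool:
    """Return True when the cluster spec sets a positive inactivity timeout."""
    # A missing value is treated as disabled in this sketch.
    return cluster.get("autotermination_minutes", 0) > 0


clusters = [
    {"cluster_name": "etl", "autotermination_minutes": 30},
    {"cluster_name": "adhoc", "autotermination_minutes": 0},
]
without_timeout = [c["cluster_name"] for c in clusters if not auto_termination_enabled(c)]
print(without_timeout)
```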

Potential Benefits:

Saves cluster resources, avoids idle use
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Enable Logging-Cluster log delivery

Impact:  Medium Category:  Monitoring and Alerting PG Verified:  Verified

APRL GUID:  7fb90127-5364-bb4d-86fa-30778ed713fb

Description:

When creating a Databricks cluster, you can set a log delivery location for the Spark driver, worker nodes, and events. Logs are delivered every 5 minutes and archived hourly. Upon cluster termination, all logs generated until that point are guaranteed to be delivered.
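A sketch of adding a log delivery location to an existing cluster specification, assuming the Clusters API `cluster_log_conf` block with a DBFS destination. The destination path is a placeholder.

```python
def with_log_delivery(cluster_spec: dict, destination: str) -> dict:
    """Return a copy of the spec with cluster log delivery configured."""
    updated = dict(cluster_spec)  # leave the caller's dict untouched
    updated["cluster_log_conf"] = {"dbfs": {"destination": destination}}
    return updated


base = {"cluster_name": "etl", "num_workers": 2}
logged = with_log_delivery(base, "dbfs:/cluster-logs")
print(logged["cluster_log_conf"])
```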

Potential Benefits:

Improved troubleshooting and audit
Learn More:
Create a cluster

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use Delta Lake for higher reliability

Impact:  High Category:  High Availability PG Verified:  Verified

APRL GUID:  da4ea916-4df3-8c4d-8060-17b49da45977

Description:

Delta Lake is an open source storage format enhancing data lakes' reliability with ACID transactions, schema enforcement, and scalable metadata handling.

Potential Benefits:

Enhances data reliability and processing
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use Photon Acceleration

Impact:  Low Category:  High Availability PG Verified:  Verified

APRL GUID:  892ca809-e2b5-9a47-924a-71132bf6f902

Description:

Apache Spark in Databricks Lakehouse ensures resilient distributed data processing by automatically rescheduling failed tasks, aiding in overcoming external issues like network problems or revoked VMs.

Potential Benefits:

Boosts speed and reliability for Spark tasks
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Automatically rescue invalid or nonconforming data with Databricks Auto Loader or Delta Live Tables

Impact:  Low Category:  Business Continuity PG Verified:  Verified

APRL GUID:  7e52d64d-8cc0-8548-a593-eb49ab45630d

Description:

Invalid or nonconforming data can crash workloads dependent on specific data formats. Best practices recommend filtering such data at ingestion to improve end-to-end resilience, ensuring no data is lost or missed.

Potential Benefits:

Enhanced data resilience and integrity
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Configure jobs for automatic retries and termination

Impact:  High Category:  High Availability PG Verified:  Verified

APRL GUID:  84e44da6-8cd7-b349-b02c-c8bf72cf587c

Description:

Use Databricks and MLflow for deploying models as Spark UDFs for job scheduling, retries, autoscaling. Model serving offers scalable infrastructure, processes models using MLflow, and serves them via REST API using serverless compute managed in Databricks cloud.
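For the retry and termination settings named in this recommendation's title, a job task might be configured roughly as below. The sketch assumes the Jobs API task fields `max_retries`, `min_retry_interval_millis`, `retry_on_timeout`, and `timeout_seconds`; the task key, notebook path, and default values are illustrative only.

```python
def resilient_task(
    task_key: str,
    notebook_path: str,
    max_retries: int = 3,
    retry_wait_s: int = 60,
    timeout_s: int = 3600,
) -> dict:
    """Build a job task that retries on failure and stops after a timeout."""
    return {
        "task_key": task_key,
        "notebook_task": {"notebook_path": notebook_path},
        "max_retries": max_retries,
        "min_retry_interval_millis": retry_wait_s * 1000,
        "retry_on_timeout": True,
        # Terminates a hung run instead of letting it occupy the cluster.
        "timeout_seconds": timeout_s,
    }


task = resilient_task("nightly_load", "/Jobs/nightly_load")
print(task["max_retries"], task["timeout_seconds"])
```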

Potential Benefits:

Enhanced reliability and autoscaling
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use a scalable and production-grade model serving infrastructure

Impact:  High Category:  Scalability PG Verified:  Verified

APRL GUID:  4cbb7744-ff3d-0447-badb-baf068c95696

Description:

Use Databricks and MLflow for deploying models as Apache Spark UDFs, benefiting from job scheduling, retries, autoscaling, etc.

Potential Benefits:

Enhances scalability and reliability
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use a layered storage architecture

Impact:  Medium Category:  High Availability PG Verified:  Verified

APRL GUID:  1b0d0893-bf0e-8f4c-9dc6-f18f145c1ecf

Description:

Curate data by creating a layered architecture to increase data quality across layers. Start with a raw layer for ingested source data, continue with a curated layer for cleansed and refined data, and finish with a final layer catered to business needs, focusing on security and performance.

Potential Benefits:

Enhances data quality and trust
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Improve data integrity by reducing data redundancy

Impact:  Low Category:  Business Continuity PG Verified:  Verified

APRL GUID:  e93fe702-e385-d741-ba37-1f1656482ecd

Description:

Copying data creates redundancy and can lead to lost integrity, lost lineage, and access issues, all of which affect lakehouse data quality. Temporary copies are useful for agility and innovation, but they can turn into problematic operational data silos that raise questions about which copy is the master and how current it is.

Potential Benefits:

Enhanced data integrity and quality
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Actively manage schemas

Impact:  Medium Category:  Other Best Practices PG Verified:  Verified

APRL GUID:  b7e1d13f-54c9-1648-8a52-34c0abe8ce16

Description:

Uncontrolled schema changes can lead to invalid data and failing jobs. Databricks validates and enforces schema through Delta Lake, which prevents bad records during ingestion, and Auto Loader, which detects new columns and supports schema evolution to maintain data integrity.

Potential Benefits:

Prevents invalid data and job failures
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use constraints and data expectations

Impact:  Low Category:  Business Continuity PG Verified:  Verified

APRL GUID:  a42297c4-7e4f-8b41-8d4b-114033263f0e

Description:

Delta tables verify data quality automatically with SQL constraints, triggering an error for violations. Delta Live Tables enhance this by defining expectations for data quality, utilizing Python or SQL, to manage actions for record failures, ensuring data integrity and compliance.

Potential Benefits:

Ensures data quality and integrity
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Create regular backups

Impact:  Low Category:  Disaster Recovery PG Verified:  Verified

APRL GUID:  932d45d6-b46d-e341-abfb-d97bce832f1f

Description:

To recover from a failure, regular backups are needed. The Databricks Labs project migrate lets admins create backups by exporting workspace assets using the Databricks CLI/API. These backups help in restoring or migrating workspaces.

Potential Benefits:

Ensures data recovery and migration
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Recover from Structured Streaming query failures

Impact:  High Category:  High Availability PG Verified:  Verified

APRL GUID:  12e9d852-5cdc-2743-bffe-ee21f2ef7781

Description:

Structured Streaming ensures fault-tolerance and data consistency in streaming queries. With Azure Databricks workflows, you can set up your queries to automatically restart after failure, picking up precisely where they left off.
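A streaming job that restarts after failure might be set up as in the sketch below. It assumes the Jobs API semantics in which `max_retries` of -1 means retry indefinitely, and it limits the job to one concurrent run. Resuming where the query left off also depends on the query writing to a checkpoint location, which is not shown here.

```python
def streaming_job_settings(name: str, task: dict) -> dict:
    """Wrap a task in job settings suited to a long-running streaming query."""
    restarting_task = {**task, "max_retries": -1}  # -1: retry indefinitely
    return {
        "name": name,
        # Avoid two copies of the same stream running at once.
        "max_concurrent_runs": 1,
        "tasks": [restarting_task],
    }


settings = streaming_job_settings(
    "orders-stream",  # hypothetical job name
    {"task_key": "stream", "notebook_task": {"notebook_path": "/Jobs/orders_stream"}},
)
print(settings["tasks"][0]["max_retries"])
```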

Potential Benefits:

Fault-tolerance and auto-restart for queries
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Recover ETL jobs based on Delta time travel

Impact:  Medium Category:  Disaster Recovery PG Verified:  Verified

APRL GUID:  a18d60f8-c98c-ba4e-ad6e-2fac72879df1

Description:

Despite thorough testing, a production job can fail or produce unexpected data. Sometimes the repair is an additional job that runs once the issue is understood and the pipeline has been fixed.

Potential Benefits:

Easy rollback and fix for ETL jobs
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Use Databricks Workflows and built-in recovery

Impact:  Low Category:  Disaster Recovery PG Verified:  Verified

APRL GUID:  c0e22580-3819-444d-8546-a80e4ed85c83

Description:

Databricks Workflows enable efficient error recovery in multi-task jobs by offering a matrix view for issue examination. Fixes can be applied to initiate repair runs targeting only failed and dependent tasks, preserving successful outcomes and thereby saving time and money.
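Repair runs can also be requested programmatically. The sketch below builds a request body for the Jobs API 2.1 `runs/repair` endpoint, assuming its `run_id`, `rerun_tasks`, and `rerun_all_failed_tasks` fields; the run ID and task key are made up for the example.

```python
from __future__ import annotations


def repair_request(run_id: int, task_keys: list[str] | None = None) -> dict:
    """Build a repair-run body for specific tasks or for all failed tasks."""
    if task_keys:
        return {"run_id": run_id, "rerun_tasks": list(task_keys)}
    # With no explicit tasks, ask for every failed task to be rerun.
    return {"run_id": run_id, "rerun_all_failed_tasks": True}


print(repair_request(123456))
print(repair_request(123456, ["transform"]))
```

Tasks that already succeeded are left out of the request, so their results are reused rather than recomputed.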

Potential Benefits:

Saves time and money with smart recovery
Learn More:
Best practices for reliability

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Configure a disaster recovery pattern

Impact:  High Category:  Disaster Recovery PG Verified:  Preview

APRL GUID:  4fdb7112-4531-6f48-b60e-c917a6068d9b

Description:

Implementing a disaster recovery pattern is vital for Azure Databricks, ensuring data teams' access even during rare regional outages.

It is important to note that the Azure Databricks service is not entirely zone redundant and does support zonal failover.

Potential Benefits:

Ensures service continuity during disasters
Learn More:
Azure Databricks Best Practices

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Automate deployments and workloads

Impact:  High Category:  Other Best Practices PG Verified:  Preview

APRL GUID:  42aedaa8-6151-424d-b782-b8666c779969

Description:

The Databricks Terraform provider manages Azure Databricks workspaces and cloud infrastructure flexibly and powerfully.

Potential Benefits:

Efficient, reliable automation
Learn More:
Best practices for operational excellence

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Set up monitoring, alerting, and logging

Impact:  High Category:  Monitoring and Alerting PG Verified:  Preview

APRL GUID:  20193ff9-dbcd-a74e-b197-71d7d9d3c1e6

Description:

The Databricks Terraform provider is a flexible, powerful tool for managing Azure Databricks workspaces and cloud infrastructure.

Potential Benefits:

Enhanced reliability and automation
Learn More:
Best practices for operational excellence

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Deploy workspaces in separate Subscriptions

Impact:  High Category:  Scalability PG Verified:  Preview

APRL GUID:  397cdebb-9d6e-ab4f-83a1-8c481de0a3a7

Description:

Customers often naturally divide workspaces by teams or departments. However, it's crucial to also consider Azure Subscription and Azure Databricks (ADB) Workspace limits when partitioning.

Potential Benefits:

Enhanced limits management, team separation
Learn More:
Azure Databricks Best Practices

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Isolate each workspace in its own VNet

Impact:  High Category:  Scalability PG Verified:  Preview

APRL GUID:  5e722c4f-415a-9b4c-bd4c-96b74dce29ad

Description:

Deploying only one Databricks Workspace per VNet aligns with Azure Databricks' isolation model.

Potential Benefits:

Enhanced security and resource isolation
Learn More:
Azure Databricks Best Practices

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Do not Store any Production Data in Default DBFS Folders

Impact:  High Category:  High Availability PG Verified:  Preview

APRL GUID:  14310ba6-77ad-3641-a2db-57a2218b9bc7

Description:

Driven by security and data availability concerns, each Azure Databricks Workspace comes with a default DBFS designed for system-level artifacts like libraries and Init scripts, not for production data.

Potential Benefits:

Enhanced security, data protection
Learn More:
Azure Databricks Best Practices

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Do not use Azure Spot VMs for critical Production workloads

Impact:  High Category:  High Availability PG Verified:  Preview

APRL GUID:  b5af7e26-3939-1b48-8fba-f8d4a475c67a

Description:

Azure Spot VMs are not suitable for critical production workloads needing high availability and reliability. They are meant for fault-tolerant tasks and can be evicted with 30 seconds' notice if Azure needs the capacity, with no SLA guarantees.
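A simple audit over cluster specifications can flag Spot usage. This sketch assumes the Clusters API field `azure_attributes.availability` and the values `SPOT_AZURE`, `SPOT_WITH_FALLBACK_AZURE`, and `ON_DEMAND_AZURE`; treating a missing value as on-demand is an assumption of the example.

```python
SPOT_AVAILABILITY = {"SPOT_AZURE", "SPOT_WITH_FALLBACK_AZURE"}


def uses_spot(cluster: dict) -> bool:
    """Return True when the cluster spec requests Spot-backed nodes."""
    availability = cluster.get("azure_attributes", {}).get("availability")
    # A missing value is treated as not Spot in this sketch.
    return availability in SPOT_AVAILABILITY


clusters = [
    {"cluster_name": "prod-etl", "azure_attributes": {"availability": "SPOT_AZURE"}},
    {"cluster_name": "prod-api", "azure_attributes": {"availability": "ON_DEMAND_AZURE"}},
]
flagged = [c["cluster_name"] for c in clusters if uses_spot(c)]
print(flagged)
```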

Potential Benefits:

Ensures high reliability for production
Learn More:
Use Azure Spot Virtual Machines

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development



Evaluate regional isolation for workspaces

Impact:  High Category:  High Availability PG Verified:  Preview

APRL GUID:  8aa63c34-dd9d-49bd-9582-21ec310dfbdd

Description:

Move workspaces to an in-region control plane for increased regional isolation. Identify the current control plane region using the workspace URL and nslookup. When the region in the CNAME differs from the workspace region and an in-region control plane is available, consider migrating with the tools listed below.

Potential Benefits:

Improves resilience and data sovereignty
Learn More:
Azure Databricks control plane addresses
Migrate - maintained by Databricks Inc.
Databricks Terraform Exporter - maintained by Databricks Inc. (Experimental)

ARG Query:

Click the Azure Resource Graph tab to view the query

// cannot-be-validated-with-arg


Define alternate VM SKUs

Impact:  Medium Category:  Personalized PG Verified:  Preview

APRL GUID:  028593be-956e-4736-bccf-074cb10b92f4

Description:

Azure Databricks planning should include VM SKU swap strategies for capacity issues. VMs are regional, and allocation failures may occur, shown by a "CLOUD PROVIDER" error.
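A SKU swap strategy can be as simple as an ordered fallback list. The sketch below is generic Python, not a Databricks API: `CapacityError`, `create_with_fallback`, and the `create` callback are hypothetical names, and the SKU names are examples only. In practice the callback would call whatever provisioning mechanism is in use and raise when an allocation failure is reported.

```python
from typing import Callable, Sequence


class CapacityError(Exception):
    """Hypothetical error raised when a SKU cannot be allocated."""


def create_with_fallback(skus: Sequence[str], create: Callable[[str], str]) -> str:
    """Try each SKU in order and return the first successful result."""
    last_error = None
    for sku in skus:
        try:
            return create(sku)
        except CapacityError as err:
            last_error = err  # remember the failure and try the next SKU
    raise RuntimeError("no SKU in the list could be allocated") from last_error


def fake_create(sku: str) -> str:
    # Stand-in for a real provisioning call: the first SKU has no capacity.
    if sku == "Standard_D4ds_v5":
        raise CapacityError(sku)
    return f"cluster-on-{sku}"


print(create_with_fallback(["Standard_D4ds_v5", "Standard_D4s_v3"], fake_create))
```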

Potential Benefits:

Ensures service availability
Learn More:
Compute configuration best practices
GPU-enabled compute

ARG Query:

Click the Azure Resource Graph tab to view the query

// under-development