Databricks runtime version is not the latest or is not an LTS version
Impact: Medium
Category: Governance
APRL GUID: 0e835cc2-2551-a247-b1f1-3c5f25c9cb70
Description:
Databricks recommends migrating workloads to the latest or an LTS version of its runtime for enhanced stability and support. If you are on Runtime 11.3 LTS or above, move directly to the latest 12.x version. If you are below 11.3 LTS, first migrate to 11.3 LTS, then to the latest 12.x version, as per the migration guide.
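A minimal check sketch, assuming the workspace URL and a personal access token live in the hypothetical environment variables DATABRICKS_HOST and DATABRICKS_TOKEN; the list of LTS lines is illustrative, so confirm the current list in the Databricks docs:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-...azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]

    # Illustrative LTS runtime lines; verify against the Databricks docs.
    lts_prefixes = ("11.3.", "12.2.", "13.3.")

    resp = requests.get(f"{host}/api/2.0/clusters/list",
                        headers={"Authorization": f"Bearer {token}"})
    for cluster in resp.json().get("clusters", []):
        version = cluster.get("spark_version", "")
        if not version.startswith(lts_prefixes):
            print(f"{cluster['cluster_name']}: {version} may not be an LTS runtime")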
Click the Azure Resource Graph tab to view the query
//under-development
Use SSD-backed VMs for Worker VM Type and Driver VM Type
Impact: Medium
Category: Scalability
APRL GUID: 5877a510-8444-7a4c-8412-a8dab8662f7e
Description:
Upgrade HDDs in premium VMs to SSDs for better speed and reliability. Premium SSDs boost IO-intensive applications, while Standard SSDs balance cost and performance. Upgrading requires only a brief reboot and is worth considering for critical workloads and vital VMs.
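As one hedged example, an SSD-backed node type can be requested in the cluster spec sent to the Clusters API; Standard_E8ds_v4 (the "d" suffix denotes a local SSD) is purely illustrative:

    # Cluster spec sketch; POST to /api/2.0/clusters/create.
    cluster_spec = {
        "cluster_name": "ssd-backed-cluster",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_E8ds_v4",         # worker VM type (assumed SKU)
        "driver_node_type_id": "Standard_E8ds_v4",  # driver VM type (assumed SKU)
        "num_workers": 4,
    }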
Click the Azure Resource Graph tab to view the query
//under-development
Enable autoscaling for batch workloads
Impact: High
Category: Scalability
APRL GUID: 5c72f0d6-55ec-d941-be84-36c194fa78c0
Description:
Autoscaling adjusts cluster sizes automatically based on workload demand, improving both cost and performance for many use cases. Databricks provides guidance on when and how to best utilize autoscaling. For streaming workloads, Delta Live Tables with enhanced autoscaling is advised.
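A minimal sketch of a cluster spec with an autoscale range instead of a fixed worker count; the bounds are illustrative:

    cluster_spec = {
        "cluster_name": "batch-autoscaling",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        # The cluster grows toward max_workers under load and shrinks when idle.
        "autoscale": {"min_workers": 2, "max_workers": 8},
    }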
Click the Azure Resource Graph tab to view the query
//under-development
Enable autoscaling for SQL warehouse
Impact: High
Category: Scalability
APRL GUID: 362ad2b6-b92c-414f-980a-0cf69467ccce
Description:
The scaling parameter of a SQL warehouse sets the minimum and maximum number of clusters across which queries are distributed. By default, it is set to one. Raising the maximum cluster count lets the warehouse serve more concurrent users effectively.
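A hedged sketch that creates a warehouse scaling between one and four clusters via the SQL Warehouses API; the name, size, and bounds are illustrative:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]

    requests.post(
        f"{host}/api/2.0/sql/warehouses",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "name": "analytics-wh",    # illustrative
            "cluster_size": "Small",
            "min_num_clusters": 1,     # the default is a single cluster
            "max_num_clusters": 4,     # allows fan-out for concurrent users
        },
    )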
Click the Azure Resource Graph tab to view the query
//under-development
Use Delta Live Tables enhanced autoscaling
Impact: Medium
Category: Scalability
APRL GUID: cd77db98-9b13-6e4b-bd2b-74c2cb538628
Description:
Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
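A sketch of the pipeline cluster settings that switch on enhanced autoscaling; the worker bounds are illustrative:

    # Fragment of a Delta Live Tables pipeline definition.
    pipeline_clusters = [
        {
            "label": "default",
            "autoscale": {
                "min_workers": 1,
                "max_workers": 5,
                "mode": "ENHANCED",   # enhanced autoscaling for DLT
            },
        }
    ]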
Click the Azure Resource Graph tab to view the query
//under-development
Automatic Job Termination is enabled; ensure there are no user-defined local processes
Impact: Medium
Category: High Availability
APRL GUID: 3d3e53b5-ebd1-db42-b43b-d4fad74824ec
Description:
To conserve cluster resources, you can terminate a cluster; its configuration is stored so that it can be reused, or autostarted for jobs, at a later time. Clusters can auto-terminate after a period of inactivity, but inactivity tracking covers only Spark jobs, not user-defined local processes, which might still be running even after Spark jobs end.
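A sketch of a cluster spec with auto-termination after 60 idle minutes (the value is illustrative); remember that idleness is judged by Spark activity only:

    cluster_spec = {
        "cluster_name": "auto-terminating",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        # Local processes on the driver will NOT keep the cluster alive.
        "autotermination_minutes": 60,
    }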
Click the Azure Resource Graph tab to view the query
//under-development
Enable Logging - Cluster log delivery
Impact: Medium
Category: Monitoring and Alerting
APRL GUID: 7fb90127-5364-bb4d-86fa-30778ed713fb
Description:
When creating a Databricks cluster, you can set a log delivery location for the Spark driver, worker nodes, and events. Logs are delivered every 5 minutes and archived hourly. Upon cluster termination, all logs generated up to that point are guaranteed to be delivered.
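A sketch of a cluster spec wiring log delivery to a DBFS path via cluster_log_conf; the destination is illustrative:

    cluster_spec = {
        "cluster_name": "logged-cluster",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 2,
        # Driver, worker, and event logs land under this path,
        # in a subfolder named after the cluster ID.
        "cluster_log_conf": {"dbfs": {"destination": "dbfs:/cluster-logs"}},
    }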
Click the Azure Resource Graph tab to view the query
//under-development
Use Delta Lake for higher reliability
Impact: High
Category: High Availability
APRL GUID: da4ea916-4df3-8c4d-8060-17b49da45977
Description:
Delta Lake is an open-source storage format that enhances data lake reliability with ACID transactions, schema enforcement, and scalable metadata handling.
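A minimal PySpark sketch (the spark session is predefined in Databricks notebooks; the path and table name are illustrative):

    # Write ingested data in Delta format to get ACID transactions,
    # schema enforcement, and scalable metadata handling.
    df = spark.read.json("/mnt/raw/events")   # assumed source location
    df.write.format("delta").mode("append").saveAsTable("events_bronze")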
Click the Azure Resource Graph tab to view the query
//under-development
Use Photon Acceleration
Impact: Low
Category: High Availability
APRL GUID: 892ca809-e2b5-9a47-924a-71132bf6f902
Description:
Apache Spark in the Databricks Lakehouse ensures resilient distributed data processing by automatically rescheduling failed tasks, overcoming external issues such as network problems or decommissioned VMs. Photon, the native vectorized query engine on Azure Databricks, builds on this foundation to accelerate SQL and DataFrame workloads.
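Photon is switched on per cluster; a hedged spec sketch using the Clusters API runtime_engine field (SKU and sizing are illustrative):

    cluster_spec = {
        "cluster_name": "photon-cluster",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_E8ds_v4",   # assumed SKU
        "num_workers": 4,
        "runtime_engine": "PHOTON",           # enable Photon Acceleration
    }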
Click the Azure Resource Graph tab to view the query
//under-development
Automatically rescue invalid or nonconforming data with Databricks Auto Loader or Delta Live Tables
Impact: Low
Category: Business Continuity
APRL GUID: 7e52d64d-8cc0-8548-a593-eb49ab45630d
Description:
Invalid or nonconforming data can crash workloads dependent on specific data formats. Best practices recommend filtering such data at ingestion to improve end-to-end resilience, ensuring no data is lost or missed.
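A hedged Auto Loader sketch: with the rescue schema-evolution mode, nonconforming records land in the _rescued_data column instead of failing the stream (paths are illustrative):

    df = (spark.readStream
            .format("cloudFiles")                                # Auto Loader
            .option("cloudFiles.format", "json")
            .option("cloudFiles.schemaLocation", "/mnt/schemas/events")
            .option("cloudFiles.schemaEvolutionMode", "rescue")  # keep bad data
            .load("/mnt/landing/events"))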
Click the Azure Resource Graph tab to view the query
//under-development
Configure jobs for automatic retries and termination
Impact: High
Category: High Availability
APRL GUID: 84e44da6-8cd7-b349-b02c-c8bf72cf587c
Description:
Deploy models as Spark UDFs with Databricks and MLflow to benefit from job scheduling, automatic retries, and autoscaling. Model Serving offers scalable infrastructure, processes models using MLflow, and serves them via a REST API on serverless compute managed in the Databricks cloud.
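A sketch of Jobs API task settings adding retries and a hard timeout; all values and the notebook path are illustrative:

    task_settings = {
        "task_key": "score_model",
        "notebook_task": {"notebook_path": "/Jobs/score_model"},  # assumed path
        "max_retries": 3,                    # automatic retries on failure
        "min_retry_interval_millis": 60000,  # wait a minute between attempts
        "retry_on_timeout": True,
        "timeout_seconds": 3600,             # terminate runs stuck beyond 1 h
    }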
Click the Azure Resource Graph tab to view the query
//under-development
Use a layered storage architecture
Impact: Medium
Category: High Availability
APRL GUID: 1b0d0893-bf0e-8f4c-9dc6-f18f145c1ecf
Description:
Curate data by creating a layered architecture to increase data quality across layers. Start with a raw layer for ingested source data, continue with a curated layer for cleansed and refined data, and finish with a final layer catered to business needs, focusing on security and performance.
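A compact PySpark sketch of the layered flow; table names and transformations are illustrative:

    from pyspark.sql import functions as F

    # Raw layer -> curated layer: deduplicate and drop malformed rows.
    bronze = spark.read.table("events_bronze")
    silver = (bronze.dropDuplicates(["event_id"])
                    .filter(F.col("event_ts").isNotNull()))
    silver.write.format("delta").mode("overwrite").saveAsTable("events_silver")

    # Curated layer -> business layer: aggregate for consumption.
    gold = silver.groupBy("country").agg(F.count("*").alias("event_count"))
    gold.write.format("delta").mode("overwrite").saveAsTable("events_by_country_gold")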
Click the Azure Resource Graph tab to view the query
//under-development
Improve data integrity by reducing data redundancy
Impact: Low
Category: Business Continuity
APRL GUID: e93fe702-e385-d741-ba37-1f1656482ecd
Description:
Copying data creates redundancy, loses integrity and lineage, and causes access issues, all of which degrade lakehouse data quality. Temporary copies are useful for agility and innovation, but they can grow into problematic operational data silos in which the master status and currency of the data become unclear.
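Where a copy is tempting, a reference is often enough; a hedged sketch using a view, or a Delta shallow clone for short-lived experimentation (names are illustrative):

    # A view exposes a subset without duplicating data.
    spark.sql("CREATE OR REPLACE VIEW sales_emea AS "
              "SELECT * FROM sales WHERE region = 'EMEA'")

    # A shallow clone copies metadata only, not the underlying files.
    spark.sql("CREATE TABLE sales_experiment SHALLOW CLONE sales")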
Click the Azure Resource Graph tab to view the query
//under-development
Actively manage schemas
Impact: Medium
Category: Other Best Practices
APRL GUID: b7e1d13f-54c9-1648-8a52-34c0abe8ce16
Description:
Uncontrolled schema changes can lead to invalid data and failing jobs. Databricks validates and enforces schema through Delta Lake, which prevents bad records during ingestion, and Auto Loader, which detects new columns and supports schema evolution to maintain data integrity.
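A sketch of both behaviors in PySpark (table and path are illustrative): schema enforcement rejects drifting writes, and evolution is an explicit opt-in:

    df = spark.read.json("/mnt/landing/new_events")   # assumed incoming batch

    # Fails if df's schema does not match the target table (enforcement).
    df.write.format("delta").mode("append").saveAsTable("events")

    # Deliberate, reviewed schema evolution: new columns are added.
    (df.write.format("delta").mode("append")
       .option("mergeSchema", "true")
       .saveAsTable("events"))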
Click the Azure Resource Graph tab to view the query
//under-development
Use constraints and data expectations
Impact: Low
Category: Business Continuity
APRL GUID: a42297c4-7e4f-8b41-8d4b-114033263f0e
Description:
Delta tables verify data quality automatically with SQL constraints, raising an error on violations. Delta Live Tables extends this with expectations, defined in Python or SQL, that specify how to handle records that fail validation, ensuring data integrity and compliance.
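A hedged sketch of both mechanisms; the constraint name, expression, and tables are illustrative, and the dlt module is only available inside a Delta Live Tables pipeline:

    # SQL CHECK constraint on a Delta table: violating writes raise an error.
    spark.sql("ALTER TABLE events ADD CONSTRAINT valid_ts "
              "CHECK (event_ts IS NOT NULL)")

    import dlt

    @dlt.table
    @dlt.expect_or_drop("valid_ts", "event_ts IS NOT NULL")  # drop bad records
    def clean_events():
        return spark.read.table("events_bronze")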
Click the Azure Resource Graph tab to view the query
//under-development
Create regular backups
Impact: Low
Category: Disaster Recovery
APRL GUID: 932d45d6-b46d-e341-abfb-d97bce832f1f
Description:
To recover from a failure, regular backups are needed. The Databricks Labs project "migrate" lets admins create backups by exporting workspace assets using the Databricks CLI/API. These backups help in restoring or migrating workspaces.
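A minimal sketch shelling out to the legacy Databricks CLI to export all notebooks as a point-in-time backup; paths are illustrative, and the "migrate" project covers far more asset types:

    import subprocess

    subprocess.run(
        ["databricks", "workspace", "export_dir", "/", "./workspace-backup"],
        check=True,  # raise if the export fails
    )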
Click the Azure Resource Graph tab to view the query
//under-development
Recover from Structured Streaming query failures
Impact: High
Category: High Availability
APRL GUID: 12e9d852-5cdc-2743-bffe-ee21f2ef7781
Description:
Structured Streaming ensures fault-tolerance and data consistency in streaming queries. With Azure Databricks workflows, you can set up your queries to automatically restart after failure, picking up precisely where they left off.
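A sketch of a checkpointed streaming write (paths and tables are illustrative); pair it with a job configured to retry on failure so the restarted query resumes from the checkpoint:

    df = spark.readStream.table("events_bronze")   # assumed streaming source

    (df.writeStream
       .format("delta")
       .option("checkpointLocation", "/mnt/checkpoints/events")  # enables recovery
       .outputMode("append")
       .toTable("events_stream"))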
Click the Azure Resource Graph tab to view the query
//under-development
Recover ETL jobs based on Delta time travel
Impact: Medium
Category: Disaster Recovery
APRL GUID: a18d60f8-c98c-ba4e-ad6e-2fac72879df1
Description:
Despite thorough testing, a production job can fail or yield unexpected data. Sometimes this is repaired by adding corrective jobs after the issue is identified and the pipeline is fixed. Delta time travel simplifies recovery: roll the affected table back to a known-good version, correct the pipeline, and reprocess.
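A minimal time-travel sketch (the table name and version number are illustrative):

    # Inspect the table's history, then roll back to a known-good version.
    spark.sql("DESCRIBE HISTORY events").show()
    spark.sql("RESTORE TABLE events TO VERSION AS OF 42")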
Click the Azure Resource Graph tab to view the query
//under-development
Use Databricks Workflows and built-in recovery
Impact: Low
Category: Disaster Recovery
APRL GUID: c0e22580-3819-444d-8546-a80e4ed85c83
Description:
Databricks Workflows enable efficient error recovery in multi-task jobs by offering a matrix view for issue examination. Fixes can be applied to initiate repair runs targeting only failed and dependent tasks, preserving successful outcomes and thereby saving time and money.
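A hedged sketch of triggering a repair run through the Jobs API, re-running only the failed tasks; the run_id and task keys are illustrative:

    import os
    import requests

    host = os.environ["DATABRICKS_HOST"]
    token = os.environ["DATABRICKS_TOKEN"]

    requests.post(
        f"{host}/api/2.1/jobs/runs/repair",
        headers={"Authorization": f"Bearer {token}"},
        json={"run_id": 123456, "rerun_tasks": ["transform", "publish"]},
    )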
Click the Azure Resource Graph tab to view the query
//under-development
Configure a disaster recovery pattern
Impact: High
Category: Disaster Recovery
APRL GUID: 4fdb7112-4531-6f48-b60e-c917a6068d9b
Description:
Implementing a disaster recovery pattern is vital for Azure Databricks, ensuring data teams' access even during rare regional outages.
It is important to note that the Azure Databricks service is not entirely zone redundant and does not support zonal failover.
Click the Azure Resource Graph tab to view the query
//under-development
Deploy workspaces in separate Subscriptions
Impact: High
Category: Scalability
APRL GUID: 397cdebb-9d6e-ab4f-83a1-8c481de0a3a7
Description:
Customers often naturally divide workspaces by teams or departments. However, it's crucial to also consider Azure Subscription and Azure Databricks (ADB) Workspace limits when partitioning.
Click the Azure Resource Graph tab to view the query
//under-development
Do not Store any Production Data in Default DBFS Folders
Impact: High
Category: High Availability
APRL GUID: 14310ba6-77ad-3641-a2db-57a2218b9bc7
Description:
Driven by security and data availability concerns, each Azure Databricks Workspace comes with a default DBFS designed for system-level artifacts like libraries and Init scripts, not for production data.
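A sketch writing production data to external ADLS Gen2 storage instead of the default DBFS root; the account, container, and path are illustrative:

    df = spark.read.table("events_silver")   # assumed production dataset

    (df.write.format("delta").mode("append")
       .save("abfss://prod@mystorageaccount.dfs.core.windows.net/events"))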
Click the Azure Resource Graph tab to view the query
//under-development
Do not use Azure Spot VMs for critical Production workloads
Impact: High
Category: High Availability
APRL GUID: b5af7e26-3939-1b48-8fba-f8d4a475c67a
Description:
Azure Spot VMs are not suitable for critical production workloads that need high availability and reliability. They are meant for fault-tolerant tasks and can be evicted with 30 seconds' notice when Azure needs the capacity back, with no SLA guarantees.
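A sketch pinning a critical cluster to on-demand capacity via azure_attributes; for fault-tolerant jobs, SPOT_WITH_FALLBACK_AZURE is the safer spot option since it falls back to on-demand VMs:

    cluster_spec = {
        "cluster_name": "critical-prod",
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4,
        # ON_DEMAND_AZURE avoids spot evictions entirely.
        "azure_attributes": {"availability": "ON_DEMAND_AZURE"},
    }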
Click the Azure Resource Graph tab to view the query
//under-development
Evaluate regional isolation for workspaces
Impact: High
Category: High Availability
APRL GUID: 8aa63c34-dd9d-49bd-9582-21ec310dfbdd
Description:
Move workspaces to an in-region control plane for increased regional isolation. Identify the current control plane region using the workspace URL and nslookup. When the region from the CNAME differs from the workspace region and an in-region control plane is available, consider migration using the tools provided below.
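A small lookup sketch equivalent to nslookup; the workspace hostname is illustrative, and the canonical name in the result hints at the control plane region:

    import socket

    host = "adb-1234567890123456.7.azuredatabricks.net"  # assumed workspace URL
    canonical, aliases, addresses = socket.gethostbyname_ex(host)
    print(canonical, aliases, addresses)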
Click the Azure Resource Graph tab to view the query
//cannot-be-validated-with-arg
Define alternate VM SKUs
Impact: Medium
Category: Personalized
APRL GUID: 028593be-956e-4736-bccf-074cb10b92f4
Description:
Azure Databricks capacity planning should include VM SKU swap strategies for capacity issues. VMs are regional resources, and allocation failures may occur, surfaced as a "CLOUD PROVIDER" error.
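A hedged sketch of a SKU-fallback loop: because allocation failures surface while the cluster is starting rather than at create time, each candidate is polled until it runs or terminates. The SKU names, environment-variable handling, and polling interval are illustrative:

    import os
    import time
    import requests

    HOST = os.environ["DATABRICKS_HOST"]
    HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

    def start_with_fallback(base_spec, sku_choices):
        for sku in sku_choices:
            spec = {**base_spec, "node_type_id": sku}
            cluster_id = requests.post(
                f"{HOST}/api/2.0/clusters/create", headers=HEADERS, json=spec
            ).json()["cluster_id"]
            while True:
                state = requests.get(
                    f"{HOST}/api/2.0/clusters/get",
                    headers=HEADERS, params={"cluster_id": cluster_id},
                ).json()["state"]
                if state == "RUNNING":
                    return cluster_id
                if state in ("TERMINATED", "ERROR"):
                    break  # e.g. a capacity ("CLOUD PROVIDER") failure; try next SKU
                time.sleep(30)
        raise RuntimeError("no SKU in the fallback list could be allocated")

    # e.g. start_with_fallback(base_spec,
    #          ["Standard_D8s_v5", "Standard_D8as_v5", "Standard_E8s_v5"])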