Azure Databricks
This guidance presents resiliency recommendations for Azure Databricks and its dependent resources and settings.
Summary of Recommendations
Recommendations Details
DBW-1 - Databricks runtime version is not latest or is not LTS version
Category: Governance
Impact: Medium
Guidance
Use Databricks Runtime 12.2 LTS or later (a sketch for finding clusters on older runtimes follows the list below). Databricks recommends that you migrate your workloads in the following order:
- If your workloads are currently running on Databricks Runtime 11.3 LTS or above, you can migrate directly to the latest version of Databricks Runtime 12.x, as described later in this article.
- If your workloads are currently running on Databricks Runtime 11.3 LTS or below, do the following:
- Migrate to Databricks Runtime 11.3 LTS first. See the Databricks Runtime 11.x migration guide.
- Follow the guidance in this article to migrate from Databricks Runtime 11.3 LTS to the latest version of Databricks Runtime 12.x.
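As a quick way to find clusters pinned to older runtimes, the sketch below lists clusters and their spark_version values through the Databricks Clusters REST API. It is a minimal sketch: the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables and the version comparison are illustrative assumptions, not part of this guidance.
# Minimal sketch: list clusters and flag runtimes older than 12.2 LTS.
# DATABRICKS_HOST / DATABRICKS_TOKEN are placeholder environment variables.
import os
import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.get(f"{host}/api/2.0/clusters/list",
                    headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()

for cluster in resp.json().get("clusters", []):
    version = cluster.get("spark_version", "")
    # Runtime strings look like "12.2.x-scala2.12"; compare the numeric prefix.
    major_minor = tuple(int(p) for p in version.split(".")[:2] if p.isdigit())
    if major_minor and major_minor < (12, 2):
        print(f"{cluster['cluster_name']}: {version} -> consider 12.2 LTS or later")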
Resources
Resource Graph Query
// under-development
DBW-2 - Use Databricks Pools
Category: System Efficiency
Impact: High
Guidance
Databricks pools are a standard feature of the service. Pre-provisioning VMs in a pool, instead of spinning them up on demand, greatly reduces the risk of provisioning errors when starting or scaling clusters.
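As an illustration, the cluster payload below attaches a cluster to a pre-created instance pool so that worker and driver VMs are drawn from pre-provisioned capacity. This is a minimal sketch; the pool ID, runtime version, and worker count are placeholders. When a pool is referenced, the node type comes from the pool definition, so node_type_id is omitted.
# Minimal sketch of a Clusters API payload that draws VMs from a Databricks pool.
# instance_pool_id is a placeholder; create the pool first and use its real ID.
cluster_spec = {
    "cluster_name": "etl-cluster",
    "spark_version": "12.2.x-scala2.12",
    "instance_pool_id": "1234-567890-pool123",         # worker VMs come from this pool
    "driver_instance_pool_id": "1234-567890-pool123",  # optional: driver from the same pool
    "num_workers": 4,
}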
Resources
Resource Graph Query
// under-development
DBW-3 - Use SSD backed VMs for Worker VM Type and Driver type
Category: System Efficiency
Impact: Medium
Guidance
If your clusters use standard HDDs with premium-capable virtual machines, consider upgrading the standard HDD disks to standard SSD or premium SSD disks. For any single-instance virtual machine that uses premium storage for all operating system and data disks, Azure guarantees virtual machine connectivity of at least 99.9%. Consider two factors when making the upgrade decision: upgrading requires a VM reboot, which takes 3-5 minutes to complete, and for mission-critical production VMs you should weigh the improved availability against the cost of premium disks.
- Premium SSD disks offer high-performance, low-latency disk support for I/O-intensive applications and production workloads.
- Standard SSD Disks are a cost effective storage option optimized for workloads that need consistent performance at lower IOPS levels.
- Use Standard HDD disks for Dev/Test scenarios and less critical workloads at lowest cost.
Standard SSDs are acceptable for some Production workloads as well.
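For example, the worker and driver VM types of a cluster are set with node_type_id and driver_node_type_id. This is a minimal sketch; the SKU name is an illustrative assumption, so pick an SSD-backed, premium-storage-capable size that is available in your region.
# Minimal sketch: choose SSD-backed, premium-storage-capable VM sizes for workers and driver.
# The SKU name is illustrative; verify availability and pricing in your region.
cluster_spec = {
    "cluster_name": "ssd-backed-cluster",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_E8ds_v4",         # workers
    "driver_node_type_id": "Standard_E8ds_v4",  # driver
    "num_workers": 4,
}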
Resources
Resource Graph Query
// under-development
DBW-4 - Enable autoscaling for batch workloads
Category: System Efficiency
Impact: High
Guidance
Autoscaling allows clusters to resize automatically based on workloads. Autoscaling can benefit many use cases and scenarios from both a cost and performance perspective. The documentation provides considerations for determining whether to use Autoscaling and how to get the most benefit.
For streaming workloads, Databricks recommends using Delta Live Tables with autoscaling.
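A cluster becomes autoscaling when the fixed num_workers is replaced with an autoscale range, as in this minimal sketch; the worker bounds and SKU are placeholders.
# Minimal sketch: autoscaling cluster definition for a batch workload.
cluster_spec = {
    "cluster_name": "batch-autoscaling",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",   # placeholder SKU
    "autoscale": {
        "min_workers": 2,    # lower bound keeps short stages responsive
        "max_workers": 16,   # upper bound caps cost during peaks
    },
}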
Resources
Resource Graph Query
// under-development
DBW-5 - Enable autoscaling for SQL warehouse
Category: System Efficiency
Impact: High
Guidance
The scaling parameter of a SQL warehouse sets the minimum and the maximum number of clusters over which queries sent to the warehouse are distributed. The default is a minimum of one and a maximum of one cluster.
To handle more concurrent users for a given warehouse, increase the cluster count.
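A minimal sketch, assuming the SQL Warehouses REST API (/api/2.0/sql/warehouses) and its min_num_clusters/max_num_clusters fields; the warehouse name, size, counts, host, and token are placeholders.
# Minimal sketch: create a SQL warehouse that scales between 1 and 4 clusters.
import os
import requests

host = os.environ["DATABRICKS_HOST"]     # placeholder
token = os.environ["DATABRICKS_TOKEN"]   # placeholder

warehouse = {
    "name": "analytics-wh",
    "cluster_size": "Small",
    "min_num_clusters": 1,   # the default is 1/1; raise the maximum to absorb concurrency
    "max_num_clusters": 4,
    "auto_stop_mins": 30,
}
resp = requests.post(f"{host}/api/2.0/sql/warehouses",
                     headers={"Authorization": f"Bearer {token}"}, json=warehouse)
resp.raise_for_status()
print(resp.json())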
Resources
Resource Graph Query
// under-development
DBW-6 - Use Delta Live Tables enhanced autoscaling
Category: System Efficiency
Impact: Medium
Guidance
Databricks enhanced autoscaling optimizes cluster utilization by automatically allocating cluster resources based on workload volume, with minimal impact on the data processing latency of your pipelines.
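A minimal sketch of the cluster settings inside a Delta Live Tables pipeline definition, assuming the autoscale mode field is what selects enhanced autoscaling; the worker bounds are placeholders.
# Minimal sketch: DLT pipeline cluster settings using enhanced autoscaling.
pipeline_cluster = {
    "label": "default",
    "autoscale": {
        "min_workers": 1,
        "max_workers": 8,
        "mode": "ENHANCED",   # enhanced autoscaling rather than legacy cluster autoscaling
    },
}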
Resources
Resource Graph Query
// under-development
DBW-7 - Automatic Job Termination is enabled, ensure there are no user-defined local processes
Category: Availability
Impact: Medium
Guidance
To save cluster resources, you can terminate a cluster. The terminated cluster’s configuration is stored so that it can be reused (or, in the case of jobs, autostarted) at a later time. You can manually terminate a cluster or configure the cluster to terminate automatically after a specified period of inactivity. When the number of terminated clusters exceeds 150, the oldest clusters are deleted. You can also set auto termination for a cluster. During cluster creation, you can specify an inactivity period in minutes after which you want the cluster to terminate. However, the auto termination feature monitors only Spark jobs, not user-defined local processes. Therefore, if all Spark jobs have completed, a cluster may be terminated even if local processes are running.
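Auto termination is a single cluster setting; the minimal sketch below terminates the cluster after 60 idle minutes (the value, SKU, and runtime are placeholders). Only Spark activity resets the idle timer, not local processes.
# Minimal sketch: cluster definition with auto termination after 60 minutes of inactivity.
cluster_spec = {
    "cluster_name": "interactive-dev",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",   # placeholder SKU
    "num_workers": 2,
    "autotermination_minutes": 60,       # idle period (Spark jobs only) before termination
}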
Resources
Resource Graph Query
// under-development
DBW-8 - Enable Logging-Cluster log delivery
Category: Monitoring
Impact: Medium
Guidance
When you create a cluster, you can specify a location to deliver the logs for the Spark driver node, worker nodes, and events. Logs are delivered every five minutes and archived hourly in your chosen destination. When a cluster is terminated, Azure Databricks guarantees to deliver all logs generated up until the cluster was terminated.
The destination of the logs depends on the cluster ID. If the specified destination is dbfs:/cluster-log-delivery, cluster logs for 0630-191345-leap375 are delivered to dbfs:/cluster-log-delivery/0630-191345-leap375.
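Log delivery is configured per cluster through cluster_log_conf, as in this minimal sketch; the destination path, SKU, and runtime are placeholders.
# Minimal sketch: deliver driver, worker, and event logs to a DBFS folder.
# Logs land under <destination>/<cluster-id>/ as described above.
cluster_spec = {
    "cluster_name": "logged-cluster",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",   # placeholder SKU
    "num_workers": 2,
    "cluster_log_conf": {
        "dbfs": {"destination": "dbfs:/cluster-log-delivery"},
    },
}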
Resources
Resource Graph Query
// under-development
DBW-9 - Use Delta Lake for higher reliability
Category: Availability
Impact: High
Guidance
Delta Lake is an open source storage format that brings reliability to data lakes. Delta Lake provides ACID transactions, schema enforcement, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake on Databricks allows you to configure Delta Lake based on your workload patterns.
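As a minimal PySpark sketch, writing a DataFrame in Delta format (the source path and table name are placeholders) provides ACID transactions and schema enforcement on top of the existing data lake.
# Minimal sketch: land data as a Delta table instead of plain Parquet/CSV files.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # provided automatically in Databricks notebooks

df = spark.read.json("/mnt/raw/events/")     # placeholder source
df.write.format("delta").mode("append").saveAsTable("bronze_events")

bronze = spark.table("bronze_events")        # reads go through the same transaction log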
Resources
Resource Graph Query
// under-development
DBW-10 - Use Photon Acceleration
Category: Availability
Impact: Low
Guidance
Apache Spark, as the compute engine of the Databricks Lakehouse, is based on resilient distributed data processing. In case of an internal Spark task not returning a result as expected, Apache Spark automatically reschedules the missing tasks and continues with the execution of the entire job. This is helpful for failures outside the code, like a short network issue or a revoked spot VM. Working with both the SQL API and the Spark DataFrame API comes with this resilience built into the engine.
In the Databricks Lakehouse, Photon, a native vectorized engine written entirely in C++, is a high-performance compute engine that is compatible with Apache Spark APIs.
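On the Clusters API, Photon is selected with the runtime_engine field, as in this minimal sketch; the SKU and runtime version are placeholders.
# Minimal sketch: enable Photon on a cluster via the runtime engine setting.
cluster_spec = {
    "cluster_name": "photon-cluster",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_E8ds_v4",  # placeholder SKU
    "num_workers": 4,
    "runtime_engine": "PHOTON",          # "STANDARD" would disable Photon
}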
Resources
Resource Graph Query
// under-development
DBW-11 - Automatically rescue invalid or nonconforming data with Databricks Auto Loader or Delta Live Tables
Category: Application Resilience
Impact: Low
Guidance
Invalid or nonconforming data can lead to crashes of workloads that rely on an established data format. To increase the end-to-end resilience of the whole process, it is best practice to filter out invalid and nonconforming data at ingestion. Supporting rescued data ensures you never lose or miss out on data during ingest or ETL. The rescued data column contains any data that wasn’t parsed, either because it was missing from the given schema, because there was a type mismatch, or the column casing in the record or file didn’t match that in the schema.
- Databricks Auto Loader: Auto Loader is the ideal tool for streaming the ingestion of files. It supports rescued data for JSON and CSV (see the sketch after this list).
- Delta Live Tables: Another option to build workflows for resilience is using Delta Live Tables with quality constraints.
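The minimal sketch below shows Auto Loader ingesting JSON with schema inference; records or fields that do not match the inferred schema land in the rescued data column (_rescued_data by default) instead of failing the stream. All paths and table names are placeholders.
# Minimal sketch: Auto Loader ingestion that keeps nonconforming data in _rescued_data.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw = (spark.readStream
       .format("cloudFiles")
       .option("cloudFiles.format", "json")
       .option("cloudFiles.schemaLocation", "/mnt/chk/schema/events")  # stores the inferred schema
       .load("/mnt/landing/events/"))                                  # placeholder landing path

(raw.writeStream
    .option("checkpointLocation", "/mnt/chk/bronze/events")            # placeholder checkpoint
    .toTable("bronze_events"))   # the _rescued_data column carries unparsed or mismatched fields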
Resources
Resource Graph Query
// under-development
DBW-12 - Configure jobs for automatic retries and termination
Category: Availability
Impact: High
Guidance
Databricks Jobs support automatic retries and timeouts at the task level. Configure retries so that transient failures, such as a short network issue or a revoked spot VM, are retried automatically instead of failing the whole job, and set a timeout so that hung or runaway runs are terminated rather than consuming resources indefinitely.
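A minimal sketch of the relevant Jobs API (2.1) task settings; the notebook path, cluster reference, and numeric values are placeholders.
# Minimal sketch: Jobs API task settings with automatic retries and a timeout.
job_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Repos/prod/etl/nightly"},   # placeholder path
    "job_cluster_key": "etl_cluster",                                # placeholder cluster reference
    "max_retries": 3,                      # retry transient failures automatically
    "min_retry_interval_millis": 60000,    # wait a minute between attempts
    "retry_on_timeout": False,
    "timeout_seconds": 7200,               # terminate runs that exceed two hours
}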
Resources
Resource Graph Query
// under-development
DBW-13 - Use a scalable and production-grade model serving infrastructure
Category: System Efficiency
Impact: High
Guidance
For batch and streaming inference, use Databricks jobs and MLflow to deploy models as Apache Spark UDFs to leverage job scheduling, retries, autoscaling, and so on. Model serving provides a scalable and production-grade model real-time serving infrastructure. It processes your machine learning models using MLflow and exposes them as REST API endpoints. This functionality uses serverless compute, which means that the endpoints and associated compute resources are managed and run in the Databricks cloud account.
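For the batch-inference part of this recommendation, a registered MLflow model can be applied at scale as a Spark UDF, roughly as sketched below; the model URI, table, and column names are placeholders.
# Minimal sketch: batch inference with an MLflow model deployed as a Spark UDF.
import mlflow.pyfunc
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

predict = mlflow.pyfunc.spark_udf(spark, model_uri="models:/churn_model/Production",  # placeholder
                                  result_type="double")

features = spark.table("silver_customer_features")                                    # placeholder
feature_cols = [c for c in features.columns if c != "customer_id"]
scored = features.withColumn("churn_score", predict(*[features[c] for c in feature_cols]))
scored.write.format("delta").mode("overwrite").saveAsTable("gold_churn_scores")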
Resources
Resource Graph Query
// under-development
DBW-14 - Use a layered storage architecture
Category: Application Resilience
Impact: Medium
Guidance
Curate data by creating a layered architecture and ensuring data quality increases as data moves through the layers. A common layering approach is:
Raw layer (bronze): Source data is ingested into the first layer of the lakehouse and should be persisted there. When all downstream data is created from the raw layer, rebuilding the subsequent layers from this layer is possible if needed.
Curated layer (silver): The purpose of the second layer is to hold cleansed, refined, filtered and aggregated data. The goal of this layer is to provide a sound, reliable foundation for analyses and reports across all roles and functions.
Final layer (gold): The third layer is created around business or project needs. It provides a different view as data products to other business units or projects, preparing data around security needs (such as anonymized data) or optimizing for performance (such as with preaggregated views). The data products in this layer are seen as the truth for the business.
The final layer should only contain high-quality data and can be fully trusted from a business point of view.
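A minimal PySpark sketch of the flow between layers; the table names, cleansing rules, and aggregation are placeholders.
# Minimal sketch: bronze -> silver -> gold flow with Delta tables.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

bronze = spark.table("bronze_events")                     # raw data, persisted as ingested

silver = (bronze
          .where(F.col("event_id").isNotNull())           # cleanse / filter
          .dropDuplicates(["event_id"]))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_events")

gold = (silver.groupBy("event_date")                      # business-level aggregate
        .agg(F.count("*").alias("event_count")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_daily_event_counts")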
Resources
Resource Graph Query
// under-development
DBW-15 - Improve data integrity by reducing data redundancy
Category: Application Resilience
Impact: Low
Guidance
Copying or duplicating data creates data redundancy and leads to lost integrity, lost data lineage, and often different access permissions. This decreases the quality of the data in the lakehouse. A temporary or throwaway copy of data is not harmful on its own; it is sometimes necessary for boosting agility, experimentation, and innovation. However, when these copies become operational and are regularly used for business decisions, they become data silos. When data silos get out of sync, there is a significant negative impact on data integrity and quality, raising questions such as “Which data set is the master?” or “Is the data set up to date?”.
Resources
Resource Graph Query
// under-development
DBW-16 - Actively manage schemas
Category: Governance
Impact: Medium
Guidance
Uncontrolled schema changes can lead to invalid data and failing jobs that use these data sets. Databricks has several methods to validate and enforce the schema:
- Delta Lake supports schema validation and schema enforcement by automatically handling schema variations to prevent the insertion of bad records during ingestion.
- Auto Loader detects the addition of new columns as it processes your data. By default, the addition of a new column causes your streams to stop with an UnknownFieldException. Auto Loader supports several modes for schema evolution.
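For example, Auto Loader’s behavior when new columns appear is chosen with the cloudFiles.schemaEvolutionMode option; the minimal sketch below rescues unexpected columns instead of stopping the stream. Paths and table names are placeholders.
# Minimal sketch: Auto Loader with an explicit schema evolution mode.
# "rescue" keeps unexpected columns in _rescued_data instead of stopping the stream;
# the default "addNewColumns" stops the stream once so the new column can be added on restart.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

orders = (spark.readStream
          .format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/mnt/chk/schema/orders")   # placeholder
          .option("cloudFiles.schemaEvolutionMode", "rescue")
          .load("/mnt/landing/orders/"))                                   # placeholder

(orders.writeStream
    .option("checkpointLocation", "/mnt/chk/bronze/orders")                # placeholder
    .toTable("bronze_orders"))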
Resources
Resource Graph Query
// under-development
DBW-17 - Use constraints and data expectations
Category: Application Resilience
Impact: Low
Guidance
Delta tables support standard SQL constraint management clauses that ensure that the quality and integrity of data added to a table are automatically verified. When a constraint is violated, Delta Lake throws an InvariantViolationException error to signal that the new data can’t be added. See Constraints on Azure Databricks.
To further improve this handling, Delta Live Tables supports expectations: expectations define data quality constraints on the contents of a data set. An expectation consists of a description, an invariant, and an action to take when a record fails the invariant. You apply expectations to queries by using Python decorators or SQL constraint clauses. See Manage data quality with Delta Live Tables.
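The minimal sketch below shows both mechanisms; the table names, columns, and conditions are placeholders, and the two parts run in different contexts (the constraint in any notebook or job, the expectation inside a Delta Live Tables pipeline).
# Minimal sketch: a Delta CHECK constraint plus a Delta Live Tables expectation.
import dlt                                 # available only inside a DLT pipeline
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# 1) Delta constraint: violating writes raise InvariantViolationException.
spark.sql("ALTER TABLE silver_orders ADD CONSTRAINT positive_amount CHECK (amount > 0)")

# 2) DLT expectation: drop records that fail the invariant instead of failing the pipeline.
@dlt.table
@dlt.expect_or_drop("valid_order_id", "order_id IS NOT NULL")
def clean_orders():
    return dlt.read("raw_orders")          # placeholder upstream data set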
Resources
Resource Graph Query
// under-development
DBW-18 - Create regular backups
Category: Disaster Recovery
Impact: Low
Guidance
To recover from a failure, regular backups need to be available. The Databricks Labs project migrate allows workspace admins to create backups by exporting most of the assets of their workspaces (the tool uses the Databricks CLI/API in the background). See Databricks Migration Tool. Backups can be used either for restoring workspaces or for importing into a new workspace in case of a migration.
Resources
Resource Graph Query
// under-development
DBW-19 - Recover from Structured Streaming query failures
Category: Availability
Impact: High
Guidance
Structured Streaming provides fault-tolerance and data consistency for streaming queries. Using Azure Databricks workflows, you can easily configure your Structured Streaming queries to restart on failure automatically. The restarted query continues where the failed one left off.
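Recovery relies on a stable checkpoint location: when the query (or the job that wraps it) restarts, it resumes from the last committed offsets. A minimal sketch with placeholder table names and paths follows.
# Minimal sketch: a Structured Streaming query that can be restarted safely.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

stream = spark.readStream.table("bronze_events")              # placeholder source table

(stream.writeStream
    .option("checkpointLocation", "/mnt/chk/silver/events")   # must stay stable across restarts
    .outputMode("append")
    .toTable("silver_events"))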
Resources
Resource Graph Query
// under-development
DBW-20 - Recover ETL jobs based on Delta time travel
Category: Disaster Recovery
Impact: Medium
Guidance
Despite thorough testing, a job in production can fail or produce some unexpected, even invalid, data. Sometimes this can be fixed with an additional job after understanding the source of the issue and fixing the pipeline that led to the issue in the first place. However, often this is not straightforward, and the respective job should be rolled back. Using Delta Time travel allows users to easily roll back changes to an older version or timestamp, repair the pipeline, and restart the fixed pipeline.
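For example, an earlier version of a Delta table can be inspected with time travel and, once the pipeline is fixed, the table can be rolled back with RESTORE. This is a minimal sketch; the table name and version number are placeholders.
# Minimal sketch: inspect an older version of a Delta table and roll back to it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("DESCRIBE HISTORY gold_daily_revenue").show(truncate=False)   # find a known-good version

good = spark.sql("SELECT * FROM gold_daily_revenue VERSION AS OF 42")   # placeholder version
good.show(5)

# Roll the table back once the faulty pipeline run has been identified and fixed.
spark.sql("RESTORE TABLE gold_daily_revenue TO VERSION AS OF 42")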
Resources
Resource Graph Query
// under-development
DBW-21 - Use Databricks Workflows and built-in recovery
Category: Disaster Recovery
Impact: Low
Guidance
Databricks Workflows are built for recovery. When a task in a multi-task job fails (and, as such, all dependent tasks), Azure Databricks Workflows provide a matrix view of the runs that lets you examine the issue that led to the failure. See View runs for a job. Whether it was a short network issue or a real issue in the data, you can fix it and start a repair run in Azure Databricks Workflows. It runs only the failed and dependent tasks and keeps the successful results from the earlier run, saving time and money.
Resources
Resource Graph Query
// under-development
DBW-22 - Configure a disaster recovery pattern
Category: Disaster Recovery
Impact: High
Guidance
A clear disaster recovery pattern is critical for a cloud-native data analytics platform like Azure Databricks. For some companies, it’s critical that your data teams can use the Databricks platform even in the rare case of a regional service-wide cloud-service provider outage, whether caused by a regional disaster like a hurricane or earthquake or another source.
Resources
Resource Graph Query
// under-development
DBW-23 - Automate deployments and workloads
Category: Automation
Impact: High
Guidance
The Databricks Terraform provider is a flexible, powerful tool for managing your Azure Databricks workspaces and the associated cloud infrastructure. The goal of the Databricks Terraform provider is to support all Azure Databricks REST APIs, supporting automation of the most complicated aspects of deploying and managing your data platforms. The Databricks Terraform provider is the recommended tool to reliably deploy and manage clusters and jobs, provision Azure Databricks workspaces, and configure data access.
Resources
Resource Graph Query
// under-development
DBW-24 - Set up monitoring, alerting, and logging
Category: Monitoring
Impact: High
Guidance
Set up monitoring, alerting, and logging for your Azure Databricks workspaces so that failures are detected quickly rather than discovered by users. Azure Databricks diagnostic logs (available on the Premium plan) can be sent to a Log Analytics workspace, an Event Hub, or a storage account through Azure Monitor diagnostic settings, which enables querying and alerting on workspace audit events. Combine this with cluster log delivery (see DBW-8) and job run notifications so that operational issues surface as alerts.
Resources
Resource Graph Query
// under-development
DBW-25 - Deploy workspaces in separate Subscriptions
Category: System Efficiency
Impact: High
Guidance
Customers commonly partition workspaces based on teams or departments and arrive at that division naturally. However, it is also important to partition with Azure subscription and Azure Databricks workspace limits in mind.
Resources
Resource Graph Query
// under-development
DBW-26 - Isolate each workspace in its own Vnet
Category: System Efficiency
Impact: High
Guidance
While you can deploy more than one workspace in a VNet by keeping the associated subnet pairs separate from other workspaces, we recommend that you deploy only one workspace in any VNet. Doing so aligns with Azure Databricks' workspace-level isolation model. Organizations often consider putting multiple workspaces in the same VNet so that they can share common networking resources, such as DNS, that are also placed in that VNet, because the private address space in a VNet is shared by all resources. You can achieve the same result while keeping the workspaces separate by following the hub-and-spoke model and using VNet peering to extend the private IP space of the workspace VNet.
Resources
Resource Graph Query
// under-development
DBW-27 - Do not Store any Production Data in Default DBFS Folders
Category: Availability
Impact: High
Guidance
This recommendation is driven by security and data availability concerns. Every Workspace comes with a default DBFS, primarily designed to store libraries and other system-level configuration artifacts such as Init scripts. You should not store any production data in it, because:
- The lifecycle of default DBFS is tied to the Workspace. Deleting the workspace will also delete the default DBFS and permanently remove its contents.
- One can’t restrict access to this default folder and its contents.
Resources
Resource Graph Query
// under-development
DBW-28 - Do not use Azure Spot VMs for critical Production workloads
Category: Availability
Impact: High
Guidance
Azure Spot VMs are not recommended for critical production workloads that require high availability and reliability. Azure Spot VMs are designed for workloads that are fault-tolerant and can tolerate interruptions. The amount of available capacity can vary based on size, region, time of day, and more. When deploying Azure Spot Virtual Machines, Azure will allocate the VMs if there’s capacity available, but there’s no SLA for these VMs. An Azure Spot Virtual Machine offers no high availability guarantees. At any point in time when Azure needs the capacity back, the Azure infrastructure will evict Azure Spot Virtual Machines with 30-seconds notice.
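On Azure Databricks clusters, spot usage is controlled through azure_attributes. For critical production clusters, keep availability on on-demand VMs, as in this minimal sketch; the SKU, runtime, and counts are placeholders.
# Minimal sketch: keep a production cluster on on-demand VMs instead of spot capacity.
cluster_spec = {
    "cluster_name": "prod-etl",
    "spark_version": "12.2.x-scala2.12",
    "node_type_id": "Standard_DS4_v2",       # placeholder SKU
    "num_workers": 8,
    "azure_attributes": {
        "availability": "ON_DEMAND_AZURE",   # no spot evictions for critical workloads
        # For fault-tolerant, non-critical workloads, "SPOT_WITH_FALLBACK_AZURE" with
        # "first_on_demand": 1 keeps at least the driver on an on-demand VM.
    },
}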
Resources
Resource Graph Query
// under-development
DBW-29 - Migrate Legacy Workspaces
Category: Availability
Impact: High
Guidance
Azure Databricks initially launched with a shared control plane, where some regions shared control plane resources with another region. This shared control plane model has since evolved to dedicated in-region control planes (e.g., North Europe, Central US, East US) to ensure that a regional outage does not impact customer workspaces in other regions.
Regions that now have their dedicated control plane have workspaces running in two configurations:
- Legacy Workspaces - these are workspaces created before the dedicated control plane was available.
- Workspaces - these are workspaces created after the dedicated control plane was available.
The path for migrating legacy workspaces to use the in-region control plane is to redeploy.
Review the list of network addresses used in each region in the Microsoft documentation and determine which regions are sharing a control plane. For example, we can look up Canada East in the table and see that the address for its SCC relay is “tunnel.canadacentral.azuredatabricks.net”. Since the relay address is in Canada Central, we know that “Canada East” is using the control plane in another region.
Some regions list two different addresses in the Azure Databricks Control plane networking table. For example, North Europe lists both “tunnel.westeurope.azuredatabricks.net” and “tunnel.northeuropec2.azuredatabricks.net” for the SCC relay address. This is because North Europe once shared the West Europe control plane, but it now has its own independent control plane. There are still some old, legacy workspaces in North Europe tied to the old control plane, but all workspaces created since the switch-over will be using the new control plane.
Once a new Azure Databricks workspace is created, it should be configured to match the original legacy workspace. Databricks, Inc. recommends that customers use the Databricks Terraform Exporter for both the initial copy and for maintaining the workspace. However, this exporter is still in the experimental phase. Customers that prefer not to rely on experimental projects, or that do not want to use Terraform, can use the “Migrate” tool that Databricks, Inc. maintains on GitHub. This is a collection of scripts that export all of the objects (notebooks, cluster definitions, metadata, etc.) from one workspace and then import them to another workspace. Customers can use the “Migrate” tool to initially populate the new workspace and then use their CI/CD deployment process to keep the workspace in sync.
Pro Tip: If you need to determine where the control plane is located for a particular Databricks workspace, you can use the “nslookup” console command on Windows or Linux with the workspace address. The result will tell you where the control plane is located.
Resources
- Azure Databricks regions - IP addresses and domains
- Migrate - maintained by Databricks Inc.
- Databricks Terraform Exporter - maintained by Databricks Inc. (Experimental)
DBW-30 - Define alternate VM SKUs
Category: System Efficiency
Impact: Medium
Guidance
Azure Databricks availability planning should include plans for swapping VM SKUs based on capacity constraints.
Azure Databricks creates its VMs as regional VMs and depends on Azure to choose the best availability zone for the VM. In the past, there have been rare instances where compute could not be allocated due to zonal or regional VM capacity constraints, resulting in a “cloud provider” error.
In these situations, customers have two options:
- Use Databricks Pools. To manage costs, customers should choose pool sizes carefully, because Azure VMs are billed even while they sit idle in the pool. A Databricks pool can contain only one VM SKU; you cannot mix multiple SKUs in the same pool. To reduce the number of pools to manage, customers should settle on a few SKUs that can service their jobs instead of using a different VM SKU for each job.
- Plan for alternative SKUs in their preferred region(s), as sketched below.
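A minimal sketch of the second option: try the preferred SKU first and fall back to a pre-approved alternate if cluster creation fails. The host and token environment variables, SKU list, and error handling are illustrative assumptions; in practice some capacity errors surface asynchronously in cluster events rather than in the create call.
# Minimal sketch: create a cluster with a preferred VM SKU, falling back to alternates.
import os
import requests

host = os.environ["DATABRICKS_HOST"]     # placeholder
token = os.environ["DATABRICKS_TOKEN"]   # placeholder
headers = {"Authorization": f"Bearer {token}"}

preferred_and_alternates = ["Standard_E8ds_v4", "Standard_D8s_v5", "Standard_DS4_v2"]  # placeholders

for sku in preferred_and_alternates:
    spec = {
        "cluster_name": "etl-cluster",
        "spark_version": "12.2.x-scala2.12",
        "node_type_id": sku,
        "num_workers": 4,
    }
    resp = requests.post(f"{host}/api/2.0/clusters/create", headers=headers, json=spec)
    if resp.ok:
        print(f"Cluster created with {sku}: {resp.json()['cluster_id']}")
        break
    print(f"{sku} failed ({resp.status_code}): {resp.text}; trying the next SKU")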
Resources
Resource Graph Query
// under-development