Plan for a multi-regional deployment of Azure Machine Learning and associated resources
Impact: High
Category: Disaster Recovery
APRL GUID:a86ed26a-59d9-47bd-b440-6bc71b843978
Description:
Ensure you have a recovery strategy defined. Check regional availability and paired regions. Machine Learning doesn't have auto failover. Therefore, you must design a strategy that encompasses the workspace and all its dependencies, such as Key Vault, Azure Storage, and Container Registry.
Deploy an Azure Machine Learning workspace in a secondary region
Impact: High
Category: Disaster Recovery
APRL GUID:675d249a-9486-45e3-8e89-863f5802782d
Description:
If your primary workspace is unavailable, switch to the secondary workspace to continue work. Azure Machine Learning doesn't auto-submit jobs to the secondary workspace during an outage. Update your code configuration to point to the new workspace resource.
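The switch described above can be sketched as a small selection step in client code. This is a minimal illustration, not an Azure API: `WorkspaceRef`, `select_workspace`, the health probe, and every name and region value are hypothetical placeholders. The selected values would then be handed to whatever client your code uses to connect to the workspace.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WorkspaceRef:
    """Identifies one workspace; all values used below are placeholders."""
    subscription_id: str
    resource_group: str
    workspace_name: str
    region: str


def select_workspace(
    candidates: Sequence[WorkspaceRef],
    is_available: Callable[[WorkspaceRef], bool],
) -> WorkspaceRef:
    """Return the first candidate the health probe reports as available.

    Candidates are listed in priority order (primary first).
    """
    for ws in candidates:
        if is_available(ws):
            return ws
    raise RuntimeError("No workspace is currently available")


# Hypothetical configuration: primary first, then secondary.
WORKSPACES = [
    WorkspaceRef("<subscription-id>", "rg-ml-primary", "mlw-primary", "eastus"),
    WorkspaceRef("<subscription-id>", "rg-ml-secondary", "mlw-secondary", "westus"),
]


def primary_down(ws: WorkspaceRef) -> bool:
    """Simulated probe: pretends the primary region is unavailable.

    A real probe is an assumption here and would be built on your own
    monitoring signal.
    """
    return ws.region != "eastus"


active = select_workspace(WORKSPACES, primary_down)
print(active.workspace_name)
```

Keeping both workspace references in configuration, rather than hard-coding the primary, means a failover becomes a configuration decision instead of a code change.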
Create Machine Learning compute resources in the secondary region
Impact: High
Category: Disaster Recovery
APRL GUID:13794a63-8d95-47ce-acbd-5925ede5b208
Description:
Create compute resources for training Machine Learning models in the selected regions. Ensure both regions have sufficient compute quota for dynamic scaling. You must configure high availability (HA) across zones for attached compute resources such as AKS, Azure Databricks, and Container Instances.
Ensure checkpoints are used when training AI models
Impact: High
Category: Disaster Recovery
APRL GUID:98f15850-f31e-4fb2-8874-74f5aabbcf91
Description:
Checkpoint optimization for large model training is crucial for disaster recovery. It reduces training time, increases reliability, improves cost efficiency, enhances resource utilization, and supports scalability by saving model states periodically to resume training from the last saved point.
Potential Benefits:
Reduces cost and training time, and increases reliability.
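The idea of saving model state periodically and resuming from the last saved point can be sketched with the standard library alone. This is a toy illustration under stated assumptions: the `weights` value stands in for real model parameters, the function names are hypothetical, and a real job would use its training framework's own checkpoint format and write to durable storage rather than a temporary directory.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state atomically so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def train(path, total_epochs, checkpoint_every=2, fail_at=None):
    """Toy training loop that resumes from the checkpoint at `path`."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        if fail_at is not None and epoch == fail_at:
            raise RuntimeError("simulated outage")
        state["weights"] += 0.5  # placeholder for one epoch of training
        state["epoch"] = epoch + 1
        if state["epoch"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
try:
    train(ckpt, total_epochs=10, fail_at=5)  # interrupted partway through
except RuntimeError:
    pass
resumed_from = load_checkpoint(ckpt)["epoch"]  # last epoch saved before the outage
final = train(ckpt, total_epochs=10)  # continues from the checkpoint
print(resumed_from, final)
```

The checkpoint interval is a trade-off: a shorter interval loses less work after an interruption but spends more time writing state.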
When selecting regions for BCDR, ensure that both regions offer adequate compute quotas
Impact: High
Category: Disaster Recovery
APRL GUID:6e4f0fd1-1853-4b94-9736-6d6d239d2694
Description:
When selecting regions for BCDR, ensure that both regions offer adequate compute quota for the same SKU to meet your requirements. This ensures that you can fail over to the secondary region without running short of capacity.
Potential Benefits:
Ensures that enough compute resources are available in the secondary region.
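A quota comparison of this kind can be sketched as a simple pre-failover check. Everything here is illustrative: `QuotaUsage`, `regions_short_of_quota`, the region and SKU family labels, and the numbers are hypothetical placeholders, and real limits and usage would have to come from your subscription's quota data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotaUsage:
    """Quota limit and current usage, in cores, for one SKU family in one region."""
    limit: int
    used: int

    @property
    def available(self) -> int:
        return self.limit - self.used


def regions_short_of_quota(quota_by_region, sku_family, required_cores):
    """Return the regions that cannot host `required_cores` of `sku_family`."""
    short = []
    for region, families in quota_by_region.items():
        usage = families.get(sku_family)
        # A region with no quota entry for the family counts as short.
        if usage is None or usage.available < required_cores:
            short.append(region)
    return short


# Placeholder numbers for illustration only.
QUOTAS = {
    "primary-region": {"family-a": QuotaUsage(limit=200, used=120)},
    "secondary-region": {"family-a": QuotaUsage(limit=100, used=40)},
}

shortfall = regions_short_of_quota(QUOTAS, "family-a", required_cores=64)
print(shortfall)
```

Checking the same SKU family in both regions matters because quota granted for one family does not help a workload that needs another.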
Choose SKUs with longer terms and avoid those nearing retirement
Impact: Medium
Category: Service Upgrade and Retirement
APRL GUID:6e2af91f-477d-46a5-b8ce-6cd1b8176550
Description:
When choosing SKUs, opt for those that support longer terms and steer clear of any SKUs that are nearing retirement. This ensures that you can continue to use the SKU for a longer period of time.
Make Azure Machine Learning quota requests through the Azure Machine Learning Studio
Impact: High
Category: Other Best Practices
APRL GUID:48ea6480-6263-40ba-8937-326d790e63f6
Description:
Requests for additional Azure Machine Learning quota should be made through the Azure Machine Learning Studio instead of the subscription level in the Azure portal.