Azure Proactive Resiliency Library v2

workspaces

Summary

| Recommendation | Impact | Category | Automation Available | In Azure Advisor |
|---|---|---|---|---|
| Plan for a multi-regional deployment of Azure Machine Learning and associated resources | High | Disaster Recovery | No | No |
| Deploy Azure Machine Learning workspace in secondary region | High | Disaster Recovery | No | No |
| Create Machine Learning compute resources in the secondary region | High | Disaster Recovery | No | No |
| Ensure checkpoints are used for AI training models | High | Disaster Recovery | No | No |
| When selecting regions for BCDR, ensure that both regions offer adequate compute quotas | High | Disaster Recovery | No | No |
| Choose SKUs with longer terms and avoid those nearing retirement | Medium | Service Upgrade and Retirement | No | No |
| Avoid NC and NC_Promo series Azure VMs for machine learning quotas; migrate to newer versions | High | Service Upgrade and Retirement | No | No |
| Make Azure Machine Learning quota requests through the Azure Machine Learning Studio | High | Other Best Practices | No | No |

Details


Plan for a multi-regional deployment of Azure Machine Learning and associated resources

Impact: High
Category: Disaster Recovery

APRL GUID:  a86ed26a-59d9-47bd-b440-6bc71b843978

Description:

Define a recovery strategy before you need it. Check regional availability and paired regions for the services you use. Azure Machine Learning does not fail over automatically, so the strategy must cover the workspace and all of its dependencies, such as Key Vault, Azure Storage, and Container Registry.

Potential Benefits:

Provides a multi-region disaster recovery strategy.

Learn More:

Plan for multi-regional deployment




Deploy Azure Machine Learning workspace in secondary region

Impact: High
Category: Disaster Recovery

APRL GUID:  675d249a-9486-45e3-8e89-863f5802782d

Description:

If your primary workspace is unavailable, switch to the secondary workspace to continue work. Azure Machine Learning doesn't auto-submit jobs to the secondary workspace during an outage. Update your code configuration to point to the new workspace resource.
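Pointing code at the secondary workspace can be as simple as choosing between two sets of connection details. The following is a minimal sketch, not a prescribed implementation: the `WorkspaceTarget` type, the `select_workspace` helper, and all resource names and regions are hypothetical placeholders, and the availability check is supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class WorkspaceTarget:
    """Connection details for one workspace (all values here are placeholders)."""
    subscription_id: str
    resource_group: str
    workspace_name: str
    region: str


def select_workspace(
    targets: Sequence[WorkspaceTarget],
    is_available: Callable[[WorkspaceTarget], bool],
) -> WorkspaceTarget:
    """Return the first target, in priority order, that passes the availability check."""
    for target in targets:
        if is_available(target):
            return target
    raise RuntimeError("No workspace is currently reachable")


# Hypothetical primary and secondary workspaces in two different regions.
PRIMARY = WorkspaceTarget("<subscription-id>", "rg-ml-primary", "mlw-primary", "eastus")
SECONDARY = WorkspaceTarget("<subscription-id>", "rg-ml-secondary", "mlw-secondary", "westus")
```

The selected target's fields would then be handed to whatever client your code uses to submit jobs. Because Azure Machine Learning does not resubmit jobs on its own, any job that was running in the primary workspace has to be submitted again against the secondary one.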

Potential Benefits:

Provides recovery from regional outages.

Learn More:

Failover for business continuity and disaster recovery




Create Machine Learning compute resources in the secondary region

Impact: High
Category: Disaster Recovery

APRL GUID:  13794a63-8d95-47ce-acbd-5925ede5b208

Description:

Create the compute resources used for training Machine Learning models in each selected region. Ensure both regions have sufficient compute quota for dynamic scaling. For attached compute resources such as AKS, Azure Databricks, and Container Instances, you must configure high availability across zones yourself.

Potential Benefits:

Provides high availability and disaster recovery.

Learn More:

Failover for business continuity and disaster recovery




Ensure checkpoints are used for AI training models

Impact: High
Category: Disaster Recovery

APRL GUID:  98f15850-f31e-4fb2-8874-74f5aabbcf91

Description:

Checkpoint optimization is crucial for disaster recovery when training large models. Saving model state periodically lets training resume from the last saved point instead of restarting, which reduces training time, increases reliability, improves cost efficiency and resource utilization, and supports scalability.
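The save-and-resume pattern can be illustrated with a toy training loop. This is a sketch under stated assumptions: the function names, the JSON state format, and the `checkpoint_every` parameter are all illustrative, and the "training step" is a stand-in for a real optimizer update. A real job would use its framework's own checkpoint format and write to storage that survives a regional outage.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the state to a temp file, then rename it, so a crash never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)  # atomic when source and target share a filesystem


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"epoch": 0, "weights": 0.0}


def train(path: str, total_epochs: int, checkpoint_every: int = 2) -> dict:
    """Toy loop: resume from the checkpoint and save every `checkpoint_every` epochs."""
    state = load_checkpoint(path)
    for epoch in range(state["epoch"], total_epochs):
        state["weights"] += 0.5  # stand-in for a real optimizer step
        state["epoch"] = epoch + 1
        if state["epoch"] % checkpoint_every == 0 or state["epoch"] == total_epochs:
            save_checkpoint(path, state)
    return state
```

If the job is interrupted and `train` is called again with the same path, the loop starts at the saved epoch rather than at zero, so at most `checkpoint_every - 1` epochs of work are repeated.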

Potential Benefits:

Reduces costs and training time, and increases reliability.

Learn More:

Importance of checkpoint optimization




When selecting regions for BCDR, ensure that both regions offer adequate compute quotas

Impact: High
Category: Disaster Recovery

APRL GUID:  6e4f0fd1-1853-4b94-9736-6d6d239d2694

Description:

When selecting regions for BCDR, ensure that both regions offer enough compute quota for the same SKU to meet your requirements. Otherwise, a failover to the secondary region can stall because the compute you depend on cannot be allocated there.
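A quota comparison across regions can be expressed as a simple shortfall check. The sketch below assumes you have already collected the available cores per VM family for each region by some means; the `QuotaSnapshot` shape, the function name, and the family names in the usage example are hypothetical.

```python
from typing import Dict, List

# Hypothetical snapshot shape: region -> VM family -> available vCPU cores.
QuotaSnapshot = Dict[str, Dict[str, int]]


def regions_short_of_quota(
    available: QuotaSnapshot,
    required_cores: Dict[str, int],
    regions: List[str],
) -> Dict[str, Dict[str, int]]:
    """Return, per region, each VM family whose available cores fall short, and by how many."""
    shortfalls: Dict[str, Dict[str, int]] = {}
    for region in regions:
        region_quota = available.get(region, {})
        for family, needed in required_cores.items():
            # A family missing from the snapshot is treated as zero available cores.
            missing = needed - region_quota.get(family, 0)
            if missing > 0:
                shortfalls.setdefault(region, {})[family] = missing
    return shortfalls
```

For example, with `required_cores = {"familyA": 48}`, a primary region offering 96 cores and a secondary offering 24 would yield `{"<secondary>": {"familyA": 24}}`, which signals that the secondary region cannot absorb the workload until more quota is granted.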

Potential Benefits:

Ensures enough compute resources are available in the secondary region.

Learn More:

Manage resource quotas




Choose SKUs with longer terms and avoid those nearing retirement

Impact: Medium
Category: Service Upgrade and Retirement

APRL GUID:  6e2af91f-477d-46a5-b8ce-6cd1b8176550

Description:

When choosing SKUs, prefer those with a longer remaining support lifetime and avoid any that are nearing retirement. This lets you keep using the SKU for longer without having to migrate.

Potential Benefits:

Supportability and longer-term support.

Learn More:

What are compute targets in Azure Machine Learning




Avoid NC and NC_Promo series Azure VMs for machine learning quotas; migrate to newer versions

Impact: High
Category: Service Upgrade and Retirement

APRL GUID:  cf2569bb-1cf2-46ce-8885-d742dc6f4a4c

Description:

Avoid selecting NC and NC_Promo series Azure virtual machines for machine learning VM quotas, and migrate any existing use of them to newer VM series.
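One way to spot affected compute is to scan the VM size names you already use. The sketch below is an assumption-laden illustration: the regular expression encodes an assumed naming pattern for the original NC series and its promo variants (for example `Standard_NC6`, `Standard_NC24r`, `Standard_NC6_Promo`), and the helper name is hypothetical. Verify the pattern against the current list of retiring sizes before relying on it.

```python
import re
from typing import Iterable, List

# Assumed naming pattern: "Standard_NC" + core count + optional "r" + optional "_Promo".
# Newer series carry extra suffixes (such as "s_v3") and are deliberately not matched.
_LEGACY_NC = re.compile(r"^Standard_NC\d+r?(_Promo)?$", re.IGNORECASE)


def find_legacy_nc_sizes(vm_sizes: Iterable[str]) -> List[str]:
    """Return the sizes that look like original NC or NC_Promo series VMs."""
    return [size for size in vm_sizes if _LEGACY_NC.match(size.strip())]
```

Running the helper over the VM sizes configured on your compute clusters gives a list of candidates to migrate.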

Potential Benefits:

Avoids service disruption and provides longer-term support.

Learn More:

Migration Guide for GPU Compute Workloads in Azure




Make Azure Machine Learning quota requests through the Azure Machine Learning Studio

Impact: High
Category: Other Best Practices

APRL GUID:  48ea6480-6263-40ba-8937-326d790e63f6

Description:

Request additional Azure Machine Learning quota through the Azure Machine Learning Studio rather than at the subscription level in the Azure portal.

Potential Benefits:

Scalability and capacity planning.

Learn More:

Manage and increase quotas and limits for resources with Azure Machine Learning
