Azure High Performance Computing


The presented resiliency recommendations in this guidance include Azure High Performance Computing and associated resources and settings.

Summary of Recommendations

Recommendations Details

HPC-1 - Ensure file shares that store job metadata are accessible from all head nodes

Category: Application Resilience

Impact: High

Recommendation/Guidance

Currently, all HPC Pack ARM templates create the cluster share on one of the head nodes, which is not highly available. If that head node goes down, the share becomes inaccessible to the HPC services running on the other head nodes.

The following file shares can be moved to Azure Files shares with SMB permissions to make them highly available:

  • \\<HN3>\REMINST
  • \\<HN3>\HpcServiceRegistration
  • \\<HN3>\Runtime$
  • \\<HN3>\TraceRepository
  • \\<HN3>\Diagnostics
  • \\<HN3>\CcpSpoolDir

With this setup, all nodes can access the file shares independently of the head nodes.
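As a minimal sketch of the migration step, the snippet below derives valid Azure Files share names from the head-node share names above (Azure Files names must be lowercase and cannot contain characters such as "$") and creates them through the azure-storage-file-share SDK. The connection string, storage account, and the one-to-one naming scheme are assumptions for illustration, not a prescribed layout.

```python
# Sketch: replace head-node cluster shares with Azure Files shares.
# The connection string and naming scheme are illustrative assumptions.
import re

CLUSTER_SHARES = [
    "REMINST", "HpcServiceRegistration", "Runtime$",
    "TraceRepository", "Diagnostics", "CcpSpoolDir",
]

def to_azure_share_name(share: str) -> str:
    """Map a head-node share name to a valid Azure Files share name
    (lowercase letters, digits, hyphens; '$' and other chars dropped)."""
    name = re.sub(r"[^a-z0-9-]", "", share.lower())
    return name[:63]

def create_shares(connection_string: str) -> list[str]:
    """Create one Azure Files share per cluster share (requires the
    azure-storage-file-share package and a real storage account)."""
    from azure.storage.fileshare import ShareServiceClient
    service = ShareServiceClient.from_connection_string(connection_string)
    created = []
    for share in CLUSTER_SHARES:
        name = to_azure_share_name(share)
        service.create_share(name)  # then mount via \\<account>.file.core.windows.net\<name>
        created.append(name)
    return created
```

After creation, each node would mount the share over SMB from the storage account endpoint instead of from a head node, which is what removes the single-head-node dependency.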

Resources

Resource Graph Query/Scripts

// under-development



HPC-2 - Automatically grow and shrink HPC Pack cluster resources

Category: System Efficiency

Impact: Medium

Recommendation/Guidance

By deploying Azure “burst” nodes (both Windows and Linux) in your HPC Pack cluster, or by creating your HPC Pack cluster in Azure, you can automatically grow or shrink the cluster’s resources, such as nodes or cores, according to the workload on the cluster. Scaling the cluster in this way lets jobs run without interruption and uses resources efficiently.
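To illustrate the kind of decision an autoscaler makes, here is a minimal sketch of a grow/shrink rule driven by queue depth and idle capacity. The thresholds, the tasks-per-node ratio, and the inputs are illustrative assumptions, not HPC Pack's actual auto grow/shrink algorithm.

```python
# Sketch of an autoscale grow/shrink decision; all parameters are
# illustrative assumptions, not HPC Pack's built-in policy.
def scale_delta(queued_tasks: int, idle_nodes: int, busy_nodes: int,
                tasks_per_node: int = 4, max_nodes: int = 100) -> int:
    """Return how many nodes to add (positive) or remove (negative)."""
    if queued_tasks > 0:
        # Grow: enough nodes to drain the queue, capped by remaining capacity.
        needed = -(-queued_tasks // tasks_per_node)  # ceiling division
        room = max_nodes - (idle_nodes + busy_nodes)
        return min(needed, room)
    # Shrink: release idle nodes when nothing is queued.
    return -idle_nodes
```

The real service evaluates such a rule periodically; the key property shown here is that capacity tracks the workload in both directions, so you pay only for nodes the queue actually needs.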

Resources

Resource Graph Query/Scripts

// under-development



HPC-3 - Use multiple head nodes for HPC Pack

Category: Application Resilience

Impact: Medium

Recommendation/Guidance

Establish a cluster with a minimum of two head nodes. In the event of a head node failure, the active HPC Service will be automatically transferred from the affected head node to another functioning one.
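From a client's point of view, failover across head nodes amounts to trying each configured head node until one responds. The sketch below shows that pattern; the head node names and the probe function are assumptions for illustration, not the HPC Pack client API.

```python
# Sketch: client-side failover across multiple head nodes.
# Head node names and the probe callable are illustrative assumptions.
def first_reachable(head_nodes, probe):
    """Return the first head node for which probe(node) succeeds, else None."""
    for node in head_nodes:
        try:
            if probe(node):
                return node
        except OSError:
            continue  # treat a connection failure as "try the next head node"
    return None
```

With at least two head nodes configured, the loss of one leaves `first_reachable` a working target, which is the behavior the automatic HPC Service transfer provides on the server side.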

Resources

// under-development



HPC-4 - Use HPC Pack Azure AD Integration or other highly available AD configuration

Category: Application Resilience

Impact: High

Recommendation/Guidance

If the HPC services cannot connect to the domain controller, administrators and users cannot connect to the HPC Service, and therefore cannot manage the cluster or submit jobs. New jobs also cannot start on the domain-joined compute nodes, because the NodeManager service cannot validate the job’s credentials. Consider the following options:

  • Deploy a highly available domain controller alongside your HPC Pack cluster in Azure.

  • Use Azure AD Domain Services. During cluster deployment, join all your cluster nodes to this domain and get a highly available domain service from Azure.

  • Use the HPC Pack Azure AD integration solution without joining the cluster nodes to any domain; the cluster remains functional as long as the HPC Service has connectivity to Azure AD.
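Whichever option you choose, the failure mode to guard against is loss of connectivity to the identity provider. A minimal sketch of a reachability probe is shown below; the host names and ports (e.g. LDAP 389 for a domain controller, HTTPS 443 for Azure AD) are illustrative assumptions.

```python
# Sketch: best-effort check that the HPC Service can reach its identity
# provider; endpoints and ports are illustrative assumptions.
import socket

def can_reach(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running such a probe from a monitoring job gives early warning that job submission and credential validation are about to fail, before users hit the errors described above.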

Resources



Resource Graph Query/Scripts

// under-development