Node Health Check

Compute node health checks are implemented using the LBNL NHC framework. NHC runs a series of quick tests to verify that a node is working properly. When a node is found to be "unhealthy", the error message is logged in the CycleCloud GUI and the node is shut down. The actual set of tests depends on the VM type.

Invocation

If the cluster is configured with the PBS queue manager, NHC is run automatically on node creation.

If the cluster is configured with SLURM, NHC will run:

NHC is implemented in bash and can be extended with custom scripts. Different checks can be run depending on the type of the Azure VM.
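As an illustration of how checks could be selected per VM type, the sketch below maps an Azure VM size to a per-VM-type configuration file name. The helper name `nhc_conf_for_vm` is hypothetical, and the mapping rule (lowercase the size and drop the `Standard_` prefix) is an assumption inferred from file names such as nhc_nd96asr_v4.conf listed under "Adding custom tests to NHC"; it is not taken from the deployment code.

```shell
#!/bin/bash
# Sketch only: derive a per-VM-type NHC config file name from an Azure VM size.
# ASSUMPTION: the file name is "nhc_" + lowercase VM size without "Standard_".

# Hypothetical helper: prints the config file name for the VM size in $1.
nhc_conf_for_vm() {
    local vm_size="$1"
    local lower
    # Lowercase with tr so the sketch also works on bash versions before 4.
    lower=$(printf '%s' "$vm_size" | tr '[:upper:]' '[:lower:]')
    # Drop the leading "standard_" if present.
    printf 'nhc_%s.conf\n' "${lower#standard_}"
}

# On a real node the size could be read from the Azure Instance Metadata
# Service; the query is left commented out so the sketch runs anywhere:
# vm_size=$(curl -s -H Metadata:true \
#   "http://169.254.169.254/metadata/instance/compute/vmSize?api-version=2021-02-01&format=text")

nhc_conf_for_vm "Standard_ND96asr_v4"   # prints nhc_nd96asr_v4.conf
```

Keeping the mapping in a small function makes it easy to test without access to a VM; whether a matching file exists for a given VM type would still need to be checked before it is used.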

Common Checks

The following checks are performed on all VMs regardless of the type:

VM-specific checks

All HBv2, HBv3, HC:

All NV (except NVv4), NC, ND:

ND96asr_v4:

Adding custom tests to NHC

Configuration files related to NHC are located in playbooks/roles/cyclecloud_cluster/common/cluster-init/files/nhc

| File | Description |
|------|-------------|
| nhc_common.conf.j2 | Configuration file with the common checks (to be run on all nodes) |
| nhc_nd96asr_v4.conf | Additional tests for ND96asr_v4 |
| nhc_hb120rs_v3.conf | Additional tests for HB120rs_v3 |
| nhc_vm_type.conf | Additional tests for any VM type (lowercase) |
| scripts/ | Directory with custom scripts (bash) |
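A custom test could look roughly like the sketch below. It follows the usual NHC convention that a check is a bash function which calls `die` on failure and returns 0 on success. The check name `check_dir_writable` is hypothetical, and the small `die` stand-in exists only so the sketch can be run outside NHC; inside NHC the framework's own `die` is used.

```shell
#!/bin/bash
# Sketch of a custom NHC check (hypothetical example, not part of this repo).

# Stand-in for NHC's die(), defined only when running outside NHC so the
# function below can be tested on its own.
if ! declare -F die >/dev/null; then
    die() {
        local rc="$1"; shift
        echo "ERROR: $*" >&2
        return "$rc"
    }
fi

# Hypothetical check: fail if the directory given in $1 is missing or
# not writable by the user running NHC.
check_dir_writable() {
    local dir="$1"
    if [[ ! -d "$dir" || ! -w "$dir" ]]; then
        die 1 "$FUNCNAME: $dir is missing or not writable"
        return 1
    fi
    return 0
}
```

Assuming the file is saved with the `.nhc` extension that NHC loads from its scripts directory, the check would then be enabled by a configuration line of the form ` * || check_dir_writable /tmp` in one of the .conf files above, where `*` matches all nodes. How the scripts/ directory is deployed to the nodes is not described here, so verify the target location before relying on this.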

References

LBNL NHC documentation: https://github.com/mej/nhc#table-of-contents-by-gh-md-toc