HPC Monitoring and Alerting
This page provides the alert settings for HPC infrastructure. We may update these settings as we continue to work with a broad range of customers.
Alert Name | Component | Metric | Aggregation | Operator | Threshold | Window | Frequency | Severity | Scope | Support for Multiple Resources | Verified | References |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Microsoft.Batch/batchAccounts | UnusableNodeCount | Total | GreaterThan | 2.5 | PT5M | PT1M | 2 | No | N | |||
Microsoft.Batch/batchAccounts | OfflineNodeCount | Total | GreaterThan | 0 | PT5M | PT1M | 3 | No | N | |||
Microsoft.Batch/batchAccounts | TaskFailEvent | Total | GreaterThan | 0 | PT5M | PT1M | 3 | No | N | |||
Microsoft.Batch/batchAccounts | RebootingNodeCount | Total | GreaterThan | 0 | PT5M | PT1M | 1 | No | N | |||
Microsoft.Batch/batchAccounts | PreemptedNodeCount | Total | GreaterThan | 0 | PT5M | PT1M | 1 | No | N | |||
Microsoft.Compute/virtualMachineScaleSets | Percentage CPU | Average | GreaterThan | 90 | PT5M | PT1M | 3 | No | N | Supported Metrics for Microsoft.Compute/virtualMachineScaleSets | ||
Microsoft.Compute/virtualMachineScaleSets | Available Memory Bytes | Average | LessThan | 1e+09 | PT5M | PT1M | 2 | No | N | Supported Metrics for Microsoft.Compute/virtualMachineScaleSets | ||
Microsoft.Compute/virtualMachineScaleSets | Network In | Average | LessThan | 1 | PT5M | PT1M | 2 | No | N | |||
Microsoft.Compute/virtualMachines | Available Memory Bytes | Average | LessThan | 1000000000 | PT5M | PT5M | 3 | No | Y | |||
Microsoft.Compute/virtualMachines | VmAvailabilityMetric | Average | LessThan | 1 | PT5M | PT5M | 3 | No | Y | |||
Microsoft.Compute/virtualMachines | Data Disk Queue Depth | Average | GreaterThan | 100 | PT5M | PT1M | 2 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | VolumeConsumedSizePercentage | Average | GreaterThan | 80 | PT5M | PT1M | 3 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | VolumeLogicalSize | Average | GreaterThan | 8.589934592e+10 | PT1H | PT30M | 2 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | AverageWriteLatency | Average | GreaterThan | 20 | PT5M | PT1M | 3 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | AverageReadLatency | Average | GreaterThan | 20 | PT5M | PT1M | 3 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | CbsVolumeOperationComplete | Average | LessThan | 1 | PT30M | PT30M | 2 | No | N | |||
Microsoft.NetApp/netAppAccounts/capacityPools/volumes | VolumeAllocatedSize | Average | GreaterThan | 1.073741824e+11 | PT5M | PT1M | 3 | No | N | |||
Microsoft.Storage/storageAccounts | Availability | Average | LessThan | 100 | PT5M | PT5M | 1 | No | Y | Monitoring Availability Supported metrics for Microsoft.Storage/storageAccounts | ||
Microsoft.Storage/storageAccounts/fileServices | Transactions | Total | GreaterThanOrEqual | 1 | PT15M | PT5M | 2 | No | N | High latency, low throughput, or low IOPS | ||
Microsoft.Storage/storageAccounts | UsedCapacity | Average | GreaterThan | 2.2518e+15 | PT1H | PT1H | 3 | No | N | Account Level Metrics Azure Storage Metric - Used Capacity | ||
Microsoft.Storage/storageAccounts | Egress | Total | GreaterThan | 6e+07 | PT5M | PT5M | 2 | No | N | Transaction Metrics Storage Account Metric Dimensions (all storage) | ||
Microsoft.Storage/storageAccounts | Ingress | Total | GreaterThan | 1.073741824e+09 | PT5M | PT5M | 3 | No | N | Transaction Metrics Storage Account Metric Dimensions (all storage) | ||
Microsoft.Storage/storageAccounts/blobServices | SuccessE2ELatency | Average | GreaterThan | 1000 | PT5M | PT1M | 3 | No | N | Verify throughput and latency metrics for a storage account Troubleshoot performance in Azure storage accounts | ||
Microsoft.Storage/storageAccounts/blobServices | SuccessServerLatency | Average | GreaterThan | 1000 | PT5M | PT1M | 2 | No | N | Troubleshoot performance in Azure storage accounts Verify throughput and latency metrics for a storage account Storage Transaction Metrics | ||
Microsoft.Storage/storageAccounts/fileServices | Transactions | Total | GreaterThan | 10 | PT5M | PT1M | 3 | No | N | Identify storage accounts with no or low use Monitor the use of a container Storage Transaction Metrics | ||
Microsoft.StorageCache/caches | Uptime | Total | LessThan | 99 | PT5M | PT1M | 1 | No | N | Monitor HPC Cache with metrics and alerts |
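
To make the Aggregation, Operator, Threshold, and Window columns concrete, the following is a minimal Python sketch of how a single row could be read as a rule and checked against metric samples. It is an illustration only: the `AlertRule` class, the `duration_minutes` helper, and the sample values are hypothetical and not part of any Azure SDK. Actual evaluation is performed by Azure Monitor, and its semantics may differ from this simplified model.

```python
import operator
import re
from dataclasses import dataclass
from statistics import mean

# Operators as they appear in the table's Operator column.
OPERATORS = {
    "GreaterThan": operator.gt,
    "GreaterThanOrEqual": operator.ge,
    "LessThan": operator.lt,
}

# Aggregations as they appear in the table's Aggregation column.
AGGREGATIONS = {"Total": sum, "Average": mean}


def duration_minutes(value: str) -> int:
    """Convert the ISO 8601 durations used in the table (PT1M, PT1H, ...) to minutes."""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?", value)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {value!r}")
    hours, minutes = (int(g) if g else 0 for g in match.groups())
    return hours * 60 + minutes


@dataclass
class AlertRule:
    """Hypothetical container for one table row (not an Azure SDK type)."""
    component: str
    metric: str
    aggregation: str
    operator: str
    threshold: float
    window: str
    frequency: str
    severity: int

    def fires(self, samples: list[float]) -> bool:
        """Aggregate the samples from one window and compare them to the threshold."""
        aggregated = AGGREGATIONS[self.aggregation](samples)
        return OPERATORS[self.operator](aggregated, self.threshold)


# Row taken from the table: volume consumed size above 80 percent.
rule = AlertRule(
    "Microsoft.NetApp/netAppAccounts/capacityPools/volumes",
    "VolumeConsumedSizePercentage", "Average", "GreaterThan", 80, "PT5M", "PT1M", 3,
)

# Hypothetical one-minute samples covering the five-minute window.
samples = [78.0, 79.5, 81.0, 83.5, 84.0]
print(duration_minutes(rule.window), rule.fires(samples))
```

Under this reading, the Window value (`PT5M`) is the period over which samples are aggregated, and the Frequency value (`PT1M`) is how often the check is repeated.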