GPU Operator
This guide details recommended configurations for GPU Operator to enable GPU workloads, with specific settings for GPUDirect RDMA integration.
This guide assumes a basic understanding of GPU Operator and its role in Kubernetes clusters. Readers unfamiliar with GPU Operator are advised to review the official guide before proceeding. The concepts and recommended configurations presented here build on that foundation to enable GPU workloads and GPUDirect RDMA in AKS.
GPU Drivers: AKS-managed vs. GPU Operator-managed
AKS-managed GPU drivers and GPU Operator-managed GPU drivers are mutually exclusive and cannot coexist. When you create a nodepool without the --skip-gpu-driver-install flag, AKS provisions it with a node image that includes pre-installed NVIDIA drivers and the NVIDIA container runtime. Installing GPU Operator subsequently replaces this setup by deploying its own nvidia-container-toolkit, overriding the AKS-managed configuration. Upon uninstalling GPU Operator, the toolkit cannot revert to the original AKS containerd configuration, as it lacks awareness of the prior state, potentially disrupting the node's container runtime and impairing workload execution.
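For illustration, a nodepool that opts out of the AKS-managed driver is created with the --skip-gpu-driver-install flag. The sketch below is only an example: the resource group, cluster name, nodepool name, and VM size are placeholders, and the flag is a preview feature that may require the aks-preview Azure CLI extension.

az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_ND96asr_v4 \
  --skip-gpu-driver-install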
When provisioning GPU nodepools in an AKS cluster, the cluster administrator can either rely on the default GPU driver installation managed by AKS or delegate driver management to GPU Operator. This decision impacts cluster setup, maintenance, and compatibility.
| | AKS-managed GPU Driver (Without GPU Operator) | GPU Operator-managed GPU Driver (--skip-gpu-driver-install) |
|---|---|---|
| Automation | AKS-managed drivers; cluster administrator needs to manually deploy device plugins | Automates installation of driver, device plugins, and container runtimes via GPU Operator |
| Complexity | Simple; no additional components except device plugins | More complex; requires GPU Operator and additional components |
| Support | Fully supported by AKS; no preview features | --skip-gpu-driver-install is a preview feature; limited support available |
Read more about the GPU driver installation options in AKS and the NVIDIA GPU Operator in the AKS documentation and the GPU Operator documentation.
GPU Operator Deployment
Please proceed with GPU Operator installation only if you have created the nodepool with the --skip-gpu-driver-install flag, as described in the prerequisites documentation.
Operator
GPU Operator is deployed using Helm, and the default Helm values are customized to align with Network Operator and AKS requirements. Key adjustments to the Helm values disable redundant components such as Node Feature Discovery (NFD) and enable RDMA support.
GPU Operator deploys pods that require privileged access to the host system. To ensure proper operation, the gpu-operator namespace must be labeled with pod-security.kubernetes.io/enforce=privileged.
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
Save the following YAML configuration to a file named values.yaml:
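The exact values depend on your cluster and chart version; as a minimal sketch, assuming the upstream gpu-operator chart keys nfd.enabled, driver.enabled, and driver.rdma.enabled, the configuration below reflects the intent described above (NFD disabled, RDMA support enabled, driver managed by GPU Operator) and should be adjusted for your environment:

# values.yaml -- illustrative sketch, not an exhaustive configuration
nfd:
  enabled: false # NFD is disabled here as a redundant component
driver:
  enabled: true # GPU Operator manages the driver (nodepool used --skip-gpu-driver-install)
  rdma:
    enabled: true # enable GPUDirect RDMA support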
Deploy GPU Operator with the following command:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install \
--create-namespace -n gpu-operator \
gpu-operator nvidia/gpu-operator \
-f values.yaml \
--version v25.3.0
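To confirm the deployment, the operator pods and the ClusterPolicy status can be checked; for example, the ClusterPolicy reports a ready state once all operands are healthy:

# Watch the operator and its operands come up
kubectl get pods -n gpu-operator
# The ClusterPolicy status turns ready when reconciliation completes
kubectl get clusterpolicy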
Usage of GPUDirect RDMA
Once GPU Operator and its operands are installed, configure pods to claim both GPUs and the InfiniBand resources exposed by one of the device plugins managed via Network Operator.
1. SR-IOV Device Plugin
Here is an example for a GPUDirect RDMA workload using SR-IOV Device Plugin:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpudirect-rdma
spec:
  containers:
    - name: gpudirect-rdma
      image: images.my-company.example/app:v4
      securityContext:
        capabilities:
          # A pod without this capability has a low locked memory limit
          # (ulimit -l reports "64"); IPC_LOCK raises it to "unlimited".
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
          rdma/ib: 8 # Claims 8 NICs; adjust to match the node's NIC count
        limits:
          nvidia.com/gpu: 8
          rdma/ib: 8
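Once the pod is running, a quick sanity check is possible; the commands below assume the container image ships a shell and the rdma-core utilities, which may not hold for every image:

# The locked memory limit should now report "unlimited"
kubectl exec gpudirect-rdma -- sh -c 'ulimit -l'
# List the InfiniBand devices visible to the pod (requires rdma-core in the image)
kubectl exec gpudirect-rdma -- ibv_devices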
2. RDMA Shared Device Plugin
Here is an example for a GPUDirect RDMA workload using RDMA Shared Device Plugin:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpudirect-rdma
spec:
  containers:
    - name: gpudirect-rdma
      image: images.my-company.example/app:v4
      securityContext:
        capabilities:
          # A pod without this capability has a low locked memory limit
          # (ulimit -l reports "64"); IPC_LOCK raises it to "unlimited".
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
          rdma/shared_ib: 1 # Claims 1 of 63 pod slots; all NICs accessible
        limits:
          nvidia.com/gpu: 8
          rdma/shared_ib: 1
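To see how the shared resource is advertised on the node (the pool of 63 slots comes from the RDMA Shared Device Plugin configuration managed by Network Operator), the node's allocatable resources can be inspected; <node-name> is a placeholder:

# Prints the node's allocatable resources, including rdma/shared_ib
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'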
3. IP over InfiniBand (IPoIB)
Here is an example for a GPUDirect RDMA workload using IPoIB:
apiVersion: v1
kind: Pod
metadata:
  name: ib-pod
  annotations:
    # This name should match the IPoIBNetwork object created earlier.
    # You can find this config by running `kubectl get IPoIBNetwork`.
    k8s.v1.cni.cncf.io/networks: aks-infiniband
spec:
  containers:
    - name: ib
      image: images.my-company.example/app:v4
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
        limits:
          nvidia.com/gpu: 8
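After the pod starts, Multus injects a secondary IPoIB interface; the interface name (commonly net1) and the presence of the ip utility in the image are assumptions of this check:

# The pod should show an extra interface with an address from the IPoIBNetwork range
kubectl exec ib-pod -- ip addr show
# The k8s.v1.cni.cncf.io/network-status annotation on the pod lists the attached networks
kubectl describe pod ib-pod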
Order of Operations
The installation process follows this sequence:
- GPU Operator Deployment: The Helm chart installs GPU Operator, including its controller manager deployment to manage ClusterPolicy reconciliation.
- ClusterPolicy Reconciliation: The GPU Operator controller manager reconciles ClusterPolicy, a custom resource that defines the desired state of GPU Operator and its components. The operator continuously monitors the cluster for changes and ensures that the actual state matches the desired state defined in the ClusterPolicy. The operator creates the following notable DaemonSets:
  - nvidia-driver-daemonset: Installs NVIDIA drivers on GPU nodes, blocking other components until complete, as the container runtime depends on it.
  - nvidia-container-toolkit-daemonset: Configures containerd with nvidia-container-runtime as the default container runtime for containers created going forward. Creates the nvidia RuntimeClass, enabling subsequent deployments.
  - nvidia-device-plugin-daemonset: Registers GPUs as claimable node resources (nvidia.com/gpu) via the Device Plugin framework.
  - nvidia-dcgm-exporter: Exports GPU telemetry as Prometheus metrics, enabling monitoring of GPU utilization and other metrics.
  - nvidia-operator-validator: Validates the GPU Operator installation and configuration, ensuring that all components are functioning correctly.
  - gpu-feature-discovery (GFD): Discovers GPU features and labels nodes with GPU attributes (e.g., nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB) for scheduling.
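The resulting objects can be observed directly once reconciliation has finished; the label name below follows the GFD convention shown above:

# DaemonSets created through ClusterPolicy reconciliation
kubectl get daemonsets -n gpu-operator
# RuntimeClass created by the container toolkit component
kubectl get runtimeclass nvidia
# Node labels added by GPU Feature Discovery
kubectl get nodes -L nvidia.com/gpu.product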