GPU Operator
This guide details recommended configurations for GPU Operator to enable GPU workloads, with specific settings for GPUDirect RDMA integration.
This guide assumes a basic understanding of GPU Operator and its role in Kubernetes clusters. Readers unfamiliar with GPU Operator are advised to review the official guide before proceeding. The concepts and recommended configurations presented here build on that foundation to enable GPU workloads and GPUDirect RDMA in AKS.
GPU Drivers: AKS-managed vs. GPU Operator-managed
AKS-managed GPU drivers and GPU Operator-managed GPU drivers are mutually exclusive and cannot coexist. When you create a nodepool without the --skip-gpu-driver-install flag, AKS provisions it with a node image that includes pre-installed NVIDIA drivers and the NVIDIA container runtime. Installing GPU Operator subsequently replaces this setup by deploying its own nvidia-container-toolkit, overriding the AKS-managed configuration. Upon uninstalling GPU Operator, the toolkit cannot revert to the original AKS containerd configuration, as it lacks awareness of the prior state, potentially disrupting the node's container runtime and impairing workload execution.
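For illustration, a nodepool that opts out of the AKS-managed driver is created with the --skip-gpu-driver-install flag. The sketch below is only an example: the resource group, cluster name, nodepool name, and VM size are placeholders, and the flag is a preview feature that may require the aks-preview Azure CLI extension.

az aks nodepool add \
  --resource-group <resource-group> \
  --cluster-name <cluster-name> \
  --name gpunp \
  --node-count 1 \
  --node-vm-size Standard_ND96asr_v4 \
  --skip-gpu-driver-install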
When provisioning GPU nodepools in an AKS cluster, the cluster administrator can either rely on the default GPU driver installation managed by AKS or delegate driver management to GPU Operator. This decision impacts cluster setup, maintenance, and compatibility.
| | AKS-managed GPU Driver (Without GPU Operator) | GPU Operator-managed GPU Driver (--skip-gpu-driver-install) |
|---|---|---|
| Automation | AKS-managed drivers; cluster administrator needs to manually deploy device plugins | Automates installation of driver, device plugins, and container runtimes via GPU Operator |
| Complexity | Simple; no additional components except device plugins | More complex; requires GPU Operator and additional components |
| Support | Fully supported by AKS; no preview features | --skip-gpu-driver-install is a preview feature; limited support available |
Read more about the GPU driver installation options in AKS and the NVIDIA GPU Operator in the AKS documentation and the GPU Operator documentation.
GPU Operator Deployment
Please proceed with GPU Operator installation only if you have created the nodepool with the --skip-gpu-driver-install flag, as described in the prerequisites documentation.
Operator
GPU Operator is deployed using Helm, and the default Helm values are customized to align with Network Operator and AKS requirements. Key adjustments to the Helm values disable redundant components such as Node Feature Discovery (NFD) and enable RDMA support.
GPU Operator deploys pods that require privileged access to the host system. To ensure proper operation, the gpu-operator namespace must be labeled with pod-security.kubernetes.io/enforce=privileged.
kubectl create ns gpu-operator
kubectl label --overwrite ns gpu-operator pod-security.kubernetes.io/enforce=privileged
Save the following YAML configuration to a file named values.yaml:
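The exact values depend on your cluster and chart version; as a minimal sketch, assuming the upstream gpu-operator chart keys nfd.enabled, driver.enabled, and driver.rdma.enabled, the configuration below reflects the intent described above (NFD disabled, RDMA support enabled, driver managed by GPU Operator) and should be adjusted for your environment:

# values.yaml -- illustrative sketch, not an exhaustive configuration
nfd:
  enabled: false # NFD is disabled here as a redundant component
driver:
  enabled: true # GPU Operator manages the driver (nodepool used --skip-gpu-driver-install)
  rdma:
    enabled: true # enable GPUDirect RDMA support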
Deploy GPU Operator with the following command:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install \
--create-namespace -n gpu-operator \
gpu-operator nvidia/gpu-operator \
-f values.yaml \
--version v25.3.0
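To confirm the deployment, the operator pods and the ClusterPolicy status can be checked; for example, the ClusterPolicy reports a ready state once all operands are healthy:

# Watch the operator and its operands come up
kubectl get pods -n gpu-operator
# The ClusterPolicy status turns ready when reconciliation completes
kubectl get clusterpolicy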
Usage of GPUDirect RDMA
Once GPU Operator and its operands are installed, configure pods to claim both GPUs and the InfiniBand resources exposed by one of the device plugins managed via Network Operator.
1. SR-IOV Device Plugin
Here is an example for a GPUDirect RDMA workload using SR-IOV Device Plugin:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpudirect-rdma
spec:
  containers:
    - name: gpudirect-rdma
      image: images.my-company.example/app:v4
      securityContext:
        capabilities:
          # A pod without this capability has a low locked memory limit
          # (ulimit -l reports "64"); IPC_LOCK raises it to "unlimited".
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
          rdma/ib: 8 # Claims 8 NICs; adjust to match the node's NIC count
        limits:
          nvidia.com/gpu: 8
          rdma/ib: 8
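Once the pod is running, a quick sanity check is possible; the commands below assume the container image ships a shell and the rdma-core utilities, which may not hold for every image:

# The locked memory limit should now report "unlimited"
kubectl exec gpudirect-rdma -- sh -c 'ulimit -l'
# List the InfiniBand devices visible to the pod (requires rdma-core in the image)
kubectl exec gpudirect-rdma -- ibv_devices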
2. RDMA Shared Device Plugin
Here is an example for a GPUDirect RDMA workload using RDMA Shared Device Plugin:
---
apiVersion: v1
kind: Pod
metadata:
  name: gpudirect-rdma
spec:
  containers:
    - name: gpudirect-rdma
      image: images.my-company.example/app:v4
      securityContext:
        capabilities:
          # A pod without this capability has a low locked memory limit
          # (ulimit -l reports "64"); IPC_LOCK raises it to "unlimited".
          add: ["IPC_LOCK"]
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
          rdma/shared_ib: 1 # Claims 1 of 63 pod slots; all NICs accessible
        limits:
          nvidia.com/gpu: 8
          rdma/shared_ib: 1
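To see how the shared resource is advertised on the node (the pool of 63 slots comes from the RDMA Shared Device Plugin configuration managed by Network Operator), the node's allocatable resources can be inspected; <node-name> is a placeholder:

# Prints the node's allocatable resources, including rdma/shared_ib
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'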
3. IP over InfiniBand (IPoIB)
Here is an example for a GPUDirect RDMA workload using IPoIB:
apiVersion: v1
kind: Pod
metadata:
  name: ib-pod
  annotations:
    # This name should match the IPoIBNetwork object created earlier.
    # You can find this config by running `kubectl get IPoIBNetwork`.
    k8s.v1.cni.cncf.io/networks: aks-infiniband
spec:
  containers:
    - name: ib
      image: images.my-company.example/app:v4
      resources:
        requests:
          nvidia.com/gpu: 8 # Claims all GPUs on the node
        limits:
          nvidia.com/gpu: 8
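After the pod starts, Multus injects a secondary IPoIB interface; the interface name (commonly net1) and the presence of the ip utility in the image are assumptions of this check:

# The pod should show an extra interface with an address from the IPoIBNetwork range
kubectl exec ib-pod -- ip addr show
# The k8s.v1.cni.cncf.io/network-status annotation on the pod lists the attached networks
kubectl describe pod ib-pod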
Order of Operations
The installation process follows this sequence:
- GPU Operator Deployment: The Helm chart installs GPU Operator, including its controller manager deployment to manage ClusterPolicy reconciliation.
- ClusterPolicy Reconciliation: The GPU Operator controller manager reconciles ClusterPolicy, a custom resource that defines the desired state of GPU Operator and its components. The operator continuously monitors the cluster for changes and ensures that the actual state matches the desired state defined in the ClusterPolicy. The operator creates the following notable DaemonSets:
  - nvidia-driver-daemonset: Installs NVIDIA drivers on GPU nodes, blocking other components until complete, as the container runtime depends on it.
  - nvidia-container-toolkit-daemonset: Configures containerd with nvidia-container-runtime as the default container runtime for containers created going forward. Creates the nvidia RuntimeClass, enabling subsequent deployments.
  - nvidia-device-plugin-daemonset: Registers GPUs as claimable node resources (nvidia.com/gpu) via the Device Plugin framework.
  - nvidia-dcgm-exporter: Exports GPU telemetry as Prometheus metrics, enabling monitoring of GPU utilization and other metrics.
  - nvidia-operator-validator: Validates the GPU Operator installation and configuration, ensuring that all components are functioning correctly.
  - gpu-feature-discovery (GFD): Discovers GPU features and labels nodes with GPU attributes (e.g., nvidia.com/gpu.product=NVIDIA-A100-SXM4-40GB) for scheduling.
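The resulting objects can be observed directly once reconciliation has finished; the label name below follows the GFD convention shown above:

# DaemonSets created through ClusterPolicy reconciliation
kubectl get daemonsets -n gpu-operator
# RuntimeClass created by the container toolkit component
kubectl get runtimeclass nvidia
# Node labels added by GPU Feature Discovery
kubectl get nodes -L nvidia.com/gpu.product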