GPU Monitoring
==============

.. contents:: Table of Contents
   :local:

NVIDIA DCGM Exporter
--------------------

.. code-block:: sh

    # Check if the dcgm-exporter is successfully deployed
    $ kubectl get pods -n gpu-operator-resources -l app=nvidia-dcgm-exporter

    # Check GPU metrics locally
    $ dcgm_pod_ip=$(kubectl get pods -n gpu-operator-resources -o wide -l app=nvidia-dcgm-exporter | tail -n 1 | awk '{print $6}')
    $ curl $dcgm_pod_ip:9400/gpu/metrics

Deploying with Prometheus
-------------------------

.. code-block:: sh

    # To scrape GPU metrics from Prometheus, add the dcgm endpoint to Prometheus via a configmap
    $ tee dcgmScrapeConfig.yaml <