To collect and visualize NVIDIA GPU metrics in a Kubernetes cluster, use the provided Helm chart to deploy DCGM-Exporter.
For full instructions on setting up Prometheus (using kube-prometheus-stack) and Grafana with DCGM-Exporter, review the documentation.
First, install Helm v3 using the official script:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 && \
chmod 700 get_helm.sh && \
./get_helm.sh
Next, set up the Helm repo:
helm repo add gpu-helm-charts \
https://nvidia.github.io/gpu-monitoring-tools/helm-charts
Update the repo:
helm repo update
Install the official chart for DCGM-Exporter:
helm install \
--generate-name \
gpu-helm-charts/dcgm-exporter
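Once the chart is installed, it can be useful to confirm that the exporter is actually serving GPU metrics before wiring up Prometheus. The sketch below is one way to do that and is not part of the chart. The helper name `gpu_util_lines` is hypothetical, the service name is a placeholder to look up in your own cluster, and port 9400 and the `DCGM_FI_DEV_GPU_UTIL` metric are assumed to match the exporter's default configuration.

```shell
# Hypothetical helper (not shipped with the chart): reads Prometheus text
# format on stdin and keeps only the GPU utilization samples.
# Assumes DCGM_FI_DEV_GPU_UTIL is enabled in the exporter's metric set.
gpu_util_lines() {
  grep '^DCGM_FI_DEV_GPU_UTIL'
}

# Assumed usage against a live cluster (run these by hand):
#   kubectl get pods -A | grep dcgm-exporter    # check the pods are Running
#   kubectl get svc -A | grep dcgm-exporter     # find the generated service name
#   kubectl port-forward svc/<service-name> 9400:9400 &
#   curl -s localhost:9400/metrics | gpu_util_lines
```

If the last command prints one line per GPU, the exporter is up. An empty result usually means the metric is not enabled or the pod is not scheduled on a GPU node.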
We provide an official dashboard on Grafana: https://grafana.com/grafana/dashboards/12239