Troubleshooting noDpuLogsWarning Alerts

The Forge noDpuLogsWarning alert fires under the following conditions:

Forge has received logs from the DPU ARM OS within the last 30 days
It has not received any forge-dpu-agent.service log events within the last 10 minutes
And the opentelemetry-collector-prom endpoint running on the DPU ARM OS has been down for more than 5 minutes

The format of the alert name is "<Forge site ID>-noDpuLogsWarning (<Forge site ID> <DPU ARM OS hostname> forge-monitoring/forge-monitoring-<Forge site ID>-prometheus warning)"

Common Causes of these alerts

The machine is currently being re-provisioned and is taking longer than expected to complete provisioning.
The machine is being worked on by another SRE team member. The machine might be powered off, undergoing maintenance or might have been force-deleted.
Issues with systemd services on the DPU ARM OS.
On the DPU ARM OS, check that node-exporter, otelcol-contrib and forge-dpu-otel-agent services are running and not reporting errors:

systemctl status node-exporter otelcol-contrib forge-dpu-otel-agent
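
When checking several DPUs, a per-service summary can be quicker to read than the full status output. The following is a minimal sketch, not part of the official tooling: the function name check_dpu_services is hypothetical, and it assumes the three unit names listed above. It uses systemctl is-active to report each unit's state and prints the last 20 journal lines for any unit that is not active.

```shell
# Hypothetical helper: summarize the state of the services named in this section.
# Returns 0 if all units are active, 1 otherwise.
check_dpu_services() {
    rc=0
    for unit in node-exporter otelcol-contrib forge-dpu-otel-agent; do
        # is-active prints the state and exits non-zero when the unit is not active.
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        echo "$unit: ${state:-unknown}"
        if [ "$state" != "active" ]; then
            rc=1
            # Show recent log lines for the unit that is not active.
            journalctl -u "$unit" -n 20 --no-pager || true
        fi
    done
    return $rc
}
```

Run check_dpu_services on the DPU ARM OS after defining it; a non-zero return code means at least one of the services needs attention.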

Hostname is not picked up by the OpenTelemetry Collector service
Query the OpenTelemetry Collector metrics port to confirm that metrics are being generated and to look for any errors:

curl 127.0.0.1:9999/metrics | grep telemetry_stats
...
telemetry_stats_log_records_total{component="telemetry_stats",grouping="logs_by_component",host_name="localhost",http_scheme="http",instance="127.0.0.1:8890",job="log-stats",log_component="journald",machine_id="fm100dsekkqjprbu96gq67vd6p24rc1uqnct6dv15opjka9he3qlbk3doc0",net_host_port="8890",service_instance_id="127.0.0.1:8890",service_name="log-stats",source="telemetrystatsprocessor:0.0.1",systemd_unit="kernel"} 272
...

In the example above, the hostname used by the otelcol-contrib service is set to localhost (host_name="localhost"). The host_name should be set to the hostname of the DPU ARM OS. To resolve this issue, restart the OpenTelemetry Collector service:

systemctl restart otelcol-contrib

Wait for 5 minutes after restarting the service and check the metrics again:

curl http://127.0.0.1:9999/metrics | grep telemetry_stats
...
telemetry_stats_log_records_total{component="telemetry_stats",grouping="logs_by_component",host_name="192-168-134-165.nico.example.org",http_scheme="http",instance="127.0.0.1:8890",job="log-stats",log_component="journald",machine_id="fm100ds5eue9nh4kmhb2mkdh1jrthqso8r3lve4jvn51biitt509s86e8gg",net_host_port="8890",service_instance_id="127.0.0.1:8890",service_name="log-stats",source="telemetrystatsprocessor:0.0.1",systemd_unit="kernel"} 20
...

In this example the host_name is now set to 192-168-134-165.nico.example.org.
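
The host_name check above can also be done without reading the full metric lines. The sketch below is illustrative only: the function names telemetry_host_names and hostname_is_set are hypothetical, and it assumes the metrics text has the same shape as the examples shown. It extracts the host_name label from every telemetry_stats series and fails if no series is found or if any series still reports localhost.

```shell
# Hypothetical helper: print the distinct host_name label values found in
# telemetry_stats metrics read from stdin.
telemetry_host_names() {
    sed -n '/^telemetry_stats/s/.*host_name="\([^"]*\)".*/\1/p' | sort -u
}

# Hypothetical helper: succeed only when at least one telemetry_stats series
# was read and none of them reports host_name="localhost".
hostname_is_set() {
    names=$(telemetry_host_names)
    [ -n "$names" ] && ! printf '%s\n' "$names" | grep -qx 'localhost'
}
```

For example, curl -s http://127.0.0.1:9999/metrics | hostname_is_set && echo "host_name is set" prints the message only once the collector has picked up a hostname other than localhost.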

Check the carbide-hardware-health pod logs for errors scraping information from the IP address of the DPU:

kubectl logs carbide-hardware-health-67c95c7775-bd4mw -n forge-system --timestamps

If errors are being reported against the endpoint but it is available on the network (you can ping it, you can ssh to the DPU ARM OS, and all services appear to be running with no errors), you can attempt to restart the carbide-hardware-health pod to see if this resolves the issue:

kubectl delete pod carbide-hardware-health-67c95c7775-bd4mw -n forge-system
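
The pod name in the commands above includes a generated suffix that differs between sites and restarts. The following sketch shows one way to look up the current pod name and narrow the logs to a single DPU. The function names are hypothetical, the 30-minute window is an arbitrary choice, and it assumes that the relevant log lines contain the DPU's IP address, which may not hold for every error message.

```shell
# Hypothetical helper: list carbide-hardware-health pods by name, so the
# generated suffix does not need to be hard-coded.
hardware_health_pods() {
    kubectl get pods -n forge-system -o name | grep 'carbide-hardware-health'
}

# Hypothetical helper: print recent log lines that mention the given DPU IP.
# Assumes the scrape errors include the IP address as a literal string.
hardware_health_errors_for() {
    dpu_ip=$1
    for pod in $(hardware_health_pods); do
        kubectl logs "$pod" -n forge-system --timestamps --since=30m \
            | grep -F "$dpu_ip" || true
    done
}
```

Calling hardware_health_errors_for with the DPU's IP address as the only argument prints the matching lines, if any. An empty result does not prove the scrape is healthy, so check the unfiltered logs before concluding that no errors are present.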
