WaitingForNetworkConfig and DPU health
Whenever the configuration of a ManagedHost changes (Instance gets created,
Instance gets deleted, Provisioning), Forge requires the forge-dpu-agent to
acknowledge that the desired DPU configuration is applied and that the DPU and
services running on it (like HBN) are in a healthy state.
This feedback mechanism works in the following fashion:
forge-dpu-agentperiodically callsGetManagedHostNetworkConfig. It thereby obtains the latest configuration for all interfaces, including the configuration which states whether the Host should get attached to an admin or tenant network. The configuration includes Version numbers, which increase whenever the configuration changes.forge-dpu-agentreports the version numbers of the currently applied configurations back to Carbide using theRecordDpuNetworkStatusAPI. This report also includes the DPUs health in the form of aHealthReport.
If the DPU has not recently reported that it is up, healthy and that the latest desired configuration is applied, the state will not be advanced.
If a ManagedHost is stuck due to this check, you can inspect which condition is not met by inspecting the last report from the Host and DPUs
- via carbide-admin-cli:
-
carbide-admin-cli managed-host show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg -
carbide-admin-cli machine show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg -
carbide-admin-cli machine show fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg -
carbide-admin-cli machine network status
-
E.g. in the following report
/opt/carbide/carbide-admin-cli managed-host show fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
Hostname : 192-168-18-95
State : DPUInitializing/WaitingForNetworkConfig
Time in State : 296 days and 29 minutes
State SLA : 30 minutes
In State > SLA: true
Reason : The object is in the state for longer than defined by the SLA. Handler outcome: Wait("Waiting for DPU agent to apply network config and report healthy network for DPU fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg")
Host:
----------------------------------------
ID : fm100psa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
Memory : Unknown
Admin IP : 192.168.18.95
Admin MAC : B8:3F:D2:B7:70:64
Health
Probe Alerts : HeartbeatTimeout [Target: forge-dpu-agent]:
Overrides
BMC
Version : Unknown
Firmware Version : Unknown
IP : Unknown
MAC : Unknown
DPU0:
----------------------------------------
ID : fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg
State : DPUInitializing/WaitingForNetworkConfig
Primary : true
Failure details : Unknown
Last reboot : 2023-12-13 16:38:08.180734 UTC
Last reboot requested : Unknown/
Last seen : 2023-12-13 17:24:15.454965 UTC
Serial Number : MT2244XZ022R
BIOS Version : BlueField:3.9.3-7-g8f2d8ca
Admin IP : 192.168.134.233
Admin MAC : B8:3F:D2:B7:70:72
BMC
Version : 1
Firmware Version : 2.08
IP : 192.168.134.234
MAC : B8:3F:D2:B7:70:66
Health
Probe Alerts : HeartbeatTimeout [Target: forge-dpu-agent]: No health data was received from DPU
- The
Healthfield indicates whether any of the health checks failed. In this case we can see an alert of theHeartbeatTimeoutprobe - with targetforge-dpu-agent. That indicates noHealthReporthad been received fromforge-dpu-agentvia aRecordDpuNetworkStatusAPI call for a certain amount of time. - The aggregate
Healthof a Host is the aggregation of Health states from monitoring byforge-dpu-agent, out of band BMC monitoring (hardware-health), and the results of validation tests. If the health check failure also shows up in theHealthfield of the DPU, then the failure is related to the DPU, and/or has been reported byforge-dpu-agent. If a health-check has failed, then the root-caused for the failed health-check needs to be remediated. - "Last seen" indicates whether the DPU (and
forge-dpu-agent) is up and running. If the timestamp is too old, it might indicate the DPU agent has crashed or the whole DPU is no longer online. In such a case aHeartbeatTimeoutalert on the DPU and Host would be raised too.
The network status details show:
/opt/carbide/carbide-admin-cli machine network status
+-------------------------+-------------------------------------------------------------+------------------------+----------+--------------------------------------------+---------------------------------+
| Observed at | DPU machine ID | Network config version | Healthy? | Health Probe Alerts | Agent version |
+=========================+=============================================================+========================+==========+============================================+=================================+
| 2023-12-13 17:24:15.454 | fm100dsa0aqpqvll7vi4jfrvtqv058mo8ifb0vtg761j06sqhq466b0slmg | V2-T1702485344893918 | false | HeartbeatTimeout [Target: forge-dpu-agent] | v2023.12-rc1-43-g3322d125f |
+-------------------------+-------------------------------------------------------------+------------------------+----------+--------------------------------------------+---------------------------------+
In this case we learn that the DPU was alive before, and acknowledged network config version V2-T1702485344893918. This is still the desired network configuration version for
this DPU. The target configuration for a DPU can be found on the Network Config block the DPU page in the admin Web UI.
The summary for this example is that the Machine is stuck because the DPU
- is either not healthy at all (e.g. not booted)
- is not running
forge-dpu-agent forge-dpu-agentis not reporting back to NICo
Follow-up investigation steps
Checking DPU liveliness
Operators can try SSHing to the DPU, using the DPU OOB address that is shown on ManagedHost pages and DPU details pages. If SSH fails, the DPU might not be up and running.
If directly SSHing to the DPU does not work, it can be accessed via its BMC and rshim to investigate its state.
TODO: Document the BMC path
Checking DPU agent logs
In case the DPU is running, forge-dpu-agent logs can be inspected in order to
learn why it can not communicate with carbide, or why the configuration application
might have failed. There are various options for this.
Checking logs via Grafana & Loki
forge-dpu-agent logs are forwarded via OpenTelemetry to the site controller logging infrastructure. They can be queried from there via Loki.
Search strings for DPU can be:
{systemd_unit="forge-dpu-agent.service", machine_id="fm100ds006eliqt3u4h65ou9ebrqfq9th2jf39qqki68k9ueu2amearv47g"}
{systemd_unit="forge-dpu-agent.service", host_name="192-168-155-135.nico.example.org"}
Note that the query using the MachineId will only work if the DPU once had been fully ingested
and is aware of its Machine ID. Otherwise only searches by host_name will work.
In case the DPU problem affects log forwarding, DPU logs need to be checked directly on the DPU.
Checking logs on the DPU:
The dpu agent logs are stored in the systemd journal on the DPU. They can be queried using
journalctl -u forge-dpu-agent.service -e --no-pager
Checking additional logs
Depending on the problems that are found in dpu-agent logs, it can be useful to check other logs that are available on the DPU. Examples are
nl2docalogs:{machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/nl2docad.log"}syslog:{machine_id="fm100ds02e5g65099ov37rmho1gnge0c99ihdisvluo4fls1ba3br9bksg0", log_file_path="/var/log/doca/hbn/syslog"}nvuelogsfrrlogs
Potential Mitigations
Power Cycling the Host
⚠️ Note that while a tenant uses a Machine as an instance, powercycling the Host will interrupt their workloads. Only perform these step if its clear that the Tenant no longer requires the Machine (is stuck in termination), or if the Tenant agrees with this action.
If the DPU is unresponsive, powering off the Host and back on can help. This will restart the DPU.
The Host can be powercycled using the Explored-Endpoint view in the Admin Web UI, The DPU Machine details page will link to the explored endpoint by clicking on the DPU BMC IP.
Restarting forge-dpu-agent
If forge-dpu-agent is not even started, then it needs to be started (systemctl enable forge-dpu-agent.service).
This should however never be necessary, since the agent gets restarted on all
crashes.
If forge-dpu-agent should just be restarted, use
systemctl restart forge-dpu-agent.service
Reloading forge-dpu-agent configurations
In rare situations, it might be useful to restart forge-dpu-agent using latest dpu-agent systemd config files. To do so:
systemctl daemon-reload
systemctl restart forge-dpu-agent.service
Mitigations for specific Health Probe Alerts
BgpStats
The BgpStats health probe indicates that BGP peering with the TOR or route server is not successful. This might either indicate a link issue or a configuration issue. The BGP details can be checked on the DPU using
sudo crictl exec -ti $(sudo crictl ps |grep doca-hbn |awk '{print $1}') vtysh -c 'show bgp summary'
TODO: Provide more details on the next steps here
ServiceRunning
Indicates that mandatory DPU services are not running. Next steps in the investigation
can be to check whether the HBN container is running on the DPU (crictl ps should
show doca-hbn container), and to search for associated logs.
DhcpRelay/DhcpServer
Indicates that the DHCP Relay or Server that Forge deploys on the DPU in order to respond to the DHCP requests from the Host are not running as intended. In these conditions, the Host would not be able to boot since nothing would respond to the DHCP request.
Next steps in the investigation would be to check forge-dpu-agent logs for details.
PostConfigCheckWait
This alert is only raised for a brief time after each configuration change in order to wait for the configuration to settle on the DPU. The alert should always settle down after less than a minute. In case the alert keeps raised, it can indicate that new configurations are applied in every dpu-agent eventloop iteration. In this case it would need to be debugged what in the configurations changed, and the source of the unnecessary configuration changes would need to be fixed.