Common Mitigations - NCX Infra Controller Documentation

Stuck Object Mitigations

Unfortunately there does not exist a common mitigation to all kinds of problems that show up. Many issues will require a unique mitigation that is tailored to the root cause of the object being stuck.

Therefore operators are required to understand the requirements for state transitions and how Forge system components work together. The previous sections of this runbook should help with this.

However there exists a few common requirements for state transitions, and repeated reasons on why those might be failing. This section provides an overview for those.

4.1 Common requirements and failures for `ManagedHost` state transitions

4.1.1 Machine reboots

Various state transitions require a machine (Host or DPU) to be rebooted. The reboot is indicated by the forge-scout performing a ForgeAgentControl call on startup of the machine.

The following issues might prevent this call from happening:

The reboot request never succeeds due to the Machine being powered down, not reachable via redfish, or due to issues during credential loading. These errors should all show up in carbide-api logs.
The machine reboots, but can either not obtain an IP address via DHCP or can not PXE boot. The serial console that is accessible via the BMC of a machine or via forge-ssh-console can be used to determine whether the Machine booted successfully, or whether it bootloops and not obtain an IP or load an image. If the boot process does not succeed, check carbide-dhcp and carbide-pxe for further logs. TODO: Better runbooks for DHCP failures
The machine boots into the discovery image (or BFB for DPUs), but the execution inside forge-scout will fail. For this case check the carbide-api logs on whether scout was able to send a ReportForgeScoutError call which indicates the source of the problem. If the machine is not able to enumerate hardware, or if carbide-api is not accessible to the machine, such an error report will not be available. You can however access the host via serial console and check the logfile that forge-scout generates (/var/log/forge/forge-scout.log) in order to further investigate the problem.

4.1.2 Feedback from forge-dpu-agent

Whenever the configuration of a ManagedHost changes (Instance gets created, Instance gets deleted, Provisioning), Forge requires the forge-dpu-agent to acknowledge that the desired DPU configuration is applied and that the DPU and services running on it (like HBN) are in a healthy state.

This often happens within a state called WaitingForNetworkConfig. For details about this see WaitingForNetworkConfig.

Optional Step 5: Mitigation by deleting the object using the Forge Web UI or API

In order to fix the problem of instance or subnet stuck in provisioning, it often seems appealing to just delete the object and retry.

This mitigation will however only work if the object has not even been created on the Forge Site and if the source of the creation problem is within the scope of the Forge Cloud Backend.

If the object was already created on the site and is stuck in a certain provisioning state there, then the deletion attempt will not help getting the object unstuck. The lifecycle of any object is fully linear with no shortcuts. If the object isn't getting Ready it will also never be deleted. The object lifecycle is implemented this way in Forge in order to avoid any important object creation or deletion steps accidentally being skipped due to skipping states.

Due to this reason, it is usually not helpful to initiate deletion of objects stuck in Provisioning. Instead of this, the reason for an object stuck in provisioning should be inspected and the underlying issue being resolved.

Keyboard shortcuts

NCX Infra Controller Documentation

Stuck Object Mitigations

4.1 Common requirements and failures for ManagedHost state transitions

4.1.1 Machine reboots

4.1.2 Feedback from forge-dpu-agent

Optional Step 5: Mitigation by deleting the object using the Forge Web UI or API

4.1 Common requirements and failures for `ManagedHost` state transitions