Force deleting and rebuilding NICo hosts

In various cases, it might be necessary to force-delete knowledge about hosts from the database and to restart the discovery process for those hosts. The following are use-cases where force-delete can be helpful:

  • If a host managed by NCX Infra Controller (NICo) has entered an erroneous state from which it cannot automatically recover.
  • If a non-backward-compatible software update requires the host to go through the discovery phase again.

Important note

This is not a site-provider-facing workflow: force-deleting a machine skips any cleanup on the machine and leaves it in an undefined state in which the tenant's OS could still be running. Force-deleting machines is purely an operational tool. The operator who executes the command must either make sure that no tenant image is still running or take additional steps (such as rebooting the machine) to interrupt the image. Site providers are expected to get a safe version of this workflow later on that moves the machine through all necessary cleanup steps.

Force-Deletion Steps

The following steps can be used to force-delete knowledge about a NICo host:

1. Obtain access to carbide-admin-cli

See carbide-admin-cli access on a Carbide cluster.

2. Execute the carbide-admin-cli machine force-delete command

Executing carbide-admin-cli machine force-delete will wipe most knowledge about machines and instances running on top of them from the database, and clean up associated CRDs. It accepts the machine-id, hostname, MAC or IP of either the managed host or DPU as input, and will delete information about both of them (since they are heavily coupled).

It returns all machine-ids and instance-ids it acted on, as well as the BMC information for the host.

Example:

/opt/carbide/carbide-admin-cli -c https://127.0.0.1:1079 machine force-delete --machine="60cef902-9779-4666-8362-c9bb4b37184f"
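Because force-delete is destructive, it can help to print the exact command and require a confirmation before running it. The following is a minimal sketch, not part of carbide-admin-cli: the function names and the CARBIDE_CLI / CARBIDE_API variables are hypothetical, and the binary path, API URL, and --machine flag are copied from the example above.

```shell
#!/bin/sh
# Hypothetical wrapper around the documented force-delete call.
# Defaults mirror the example above; override them for your cluster.
CARBIDE_CLI="${CARBIDE_CLI:-/opt/carbide/carbide-admin-cli}"
CARBIDE_API="${CARBIDE_API:-https://127.0.0.1:1079}"

# Print the force-delete command for a machine identifier without running it.
# The identifier may be a machine-id, hostname, MAC or IP (see above).
force_delete_cmd() {
  fd_machine="$1"
  if [ -z "$fd_machine" ]; then
    echo "usage: force_delete_cmd <machine-id|hostname|MAC|IP>" >&2
    return 1
  fi
  printf '%s -c %s machine force-delete --machine="%s"\n' \
    "$CARBIDE_CLI" "$CARBIDE_API" "$fd_machine"
}

# Show the command, ask the operator to retype the identifier,
# and only then execute the real CLI.
force_delete() {
  fd_target="$1"
  force_delete_cmd "$fd_target" || return 1
  printf 'Retype the identifier to confirm: ' >&2
  read -r fd_answer
  if [ "$fd_answer" != "$fd_target" ]; then
    echo "aborted: identifier did not match" >&2
    return 1
  fi
  "$CARBIDE_CLI" -c "$CARBIDE_API" machine force-delete --machine="$fd_target"
}
```

After sourcing the file, force_delete_cmd can be used on its own as a dry run, while force_delete performs the actual deletion once the identifier has been retyped.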

3. Use the returned BMC IP/port and machine-id to reboot the host

See Rebooting a machine. Supply the BMC IP and port of the managed host, as well as its machine_id as parameters.

Force-deleting a machine will not delete its last set of credentials from vault. Therefore, the site controller can still access them.

Once a reboot is triggered, the DPU of the machine should boot into the NICo discovery image again. This should initiate DPU discovery. A second reboot is required to initiate host discovery. After those steps, the host should be fully rebuilt and available.

Reinstall OS Steps

Deleting and recreating a NICo instance can take upwards of 1.5 hours. However, if you do not need to change the PXE image, you can reinstall the OS in place and reuse your allocated system. All the other information about your instance will stay the same. This procedure will delete any data on the host!

The following steps can be used to reinstall the host OS on a NICo host:

1. Obtain access to the carbide-admin-cli tool

See carbide-admin-cli access on a Carbide cluster.

2. Execute the carbide-admin-cli instance reboot --custom-pxe command

carbide-admin-cli -f json -c https://127.0.0.1:1079/ instance reboot --custom-pxe -i 26204c21-83ac-445e-8ea7-b9130deb6315
Reboot for instance 26204c21-83ac-445e-8ea7-b9130deb6315 (machine fm100hti4deucakqqgteo692efnfo7egh7pq1lkl7vkgas4o6e0c42hnb80) is requested successfully!
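When the reinstall is scripted, the confirmation line can be checked before moving on. The sketch below assumes the CLI prints a line in exactly the format shown above. The function names and the CARBIDE_CLI / CARBIDE_API variables are hypothetical, and the flags are copied from the example.

```shell
#!/bin/sh
# Hypothetical helper around the documented reinstall call.
CARBIDE_CLI="${CARBIDE_CLI:-carbide-admin-cli}"
CARBIDE_API="${CARBIDE_API:-https://127.0.0.1:1079/}"

# Read CLI output on stdin and print "<instance-id> <machine-id>" for a
# confirmation line in the assumed format. Prints nothing if no line matches.
parse_reboot_confirmation() {
  sed -n 's/^Reboot for instance \([^ ]*\) (machine \([^)]*\)) is requested successfully!$/\1 \2/p'
}

# Request an in-place OS reinstall and fail unless the confirmation is seen.
# WARNING: this deletes any data on the host.
reinstall_os() {
  ri_instance="$1"
  if [ -z "$ri_instance" ]; then
    echo "usage: reinstall_os <instance-id>" >&2
    return 1
  fi
  ri_ids="$("$CARBIDE_CLI" -f json -c "$CARBIDE_API" instance reboot --custom-pxe -i "$ri_instance" | parse_reboot_confirmation)"
  if [ -z "$ri_ids" ]; then
    echo "no confirmation received for instance $ri_instance" >&2
    return 1
  fi
  echo "$ri_ids"
}
```

Calling reinstall_os with an instance ID prints the instance ID and machine ID on success and returns a non-zero status otherwise, so a calling script can stop if the request was not accepted.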