Adding New Machines to an Existing Site
This guide is intended to cover some of the basic things you should check to get a machine into a a basic state where it can be discovered by Forge auto-ingestion.
Some of the configuration items that should be considered which could potentially cause issues:
- Host BMC Password Requirements
- Updating the Host BMC and UEFI Firmware (Not covered in this document at this time)
- DPU BMC Password Requirements
- Updating DPU BMC Firmware
- DPU ARM OS Check Secure Boot status
Host BMC Password Requirements
Note: New servers should be using the default username for the server type e.g. USERID for Lenovo, admin for NVIDIA/Vikings, root for Dell
You should check both the expected machines DB and the site vault pod data store for any existing data. If entries exist in both expected machines and vault, you should consider the password stored in vault as the password that should be used.
Check Host BMC exists in Expected Machines DB
If there is an existing data in expected machines for the machine, you can either update the password in expected machines or change the password on the Host BMC to match.
-
Use
carbide-admin-clito check if there is an existing entry for the host BMC:carbide-admin-cli expected-machine show |grep <Host BMC IP Address|Host BMC MAC Address> -
If an entry exists for the machine, display the details using
carbide-admin-cli:carbide-admin-cli expected-machine show <Host BMC MAC address> -
To update an existing expected machines data:
carbide-admin-cli expected-machine add --bmc-mac-address <BMC MAC Address> --bmc-username <BMC Username> --bmc-password <BMC Password --chassis-serial-number <Chassis Serial Number>Note: If you only need to update the BMC password, you just need to supply the BMC MAC Address and BMC Password
-
To add a new machine to the expected machines DB:
carbide-admin-cli expected-machine update --bmc-mac-address <BMC_MAC_ADDRESS> <--bmc-username <BMC_USERNAME> --bmc-password <BMC_PASSWORD> --chassis-serial-number <CHASSIS_SERIAL_NUMBER>
Checking site vault data
To check of the Host BMC has currently any passwords in vault on a site:
-
Connect to the Kubernetes environment for the site you are working on
-
Retrieve the decoded vault secret for the site:
kubectl get secret -n forge-system carbide-vault-token -oyaml | yq '.data.token' | base64 -d ; echo -
Connect to the vault pod for the site and paste in the decoded vault secret at the Token prompt:
kubectl --namespace vault exec -it vault-0 -- /bin/sh vault login --tls-skip-verify Token (will be hidden): -
List the secrets in vault:
vault secrets list --tls-skip-verify -
Look for the site BMC:
vault kv list --tls-skip-verify secrets/machines/bmc/ |grep <Host BMC MAC Address> -
Get the current credentials set for the host bmc if they exist:
vault kv get --tls-skip-verify secrets/machines/bmc/<BMC MAC Address>/rootEnsure these credentials match the credentials currently set on the host BMC. It is easier to just update the Host BMC to match vault rather than attempting to update the secret in vault.
DPU BMC Password Requirements
For a new/undiscovered DPU BMC, ensure that it is set to the default BMC username/password
Resting DPU BMC password to default - From DPU BMC
To reset to factory defaults from the DPU BMC:
-
Log into the DPU BMC.
-
Run the following command to reset to factory defaults:
ipmitool raw 0x32 0x66 -
Reboot the DPU BMC:
reboot
Resetting DPU BMC password to default - From DPU ARM OS
If you don't know the BMC password, but have access to the DPU ARM OS, you can reset to defaults as follows:
-
Log into the DPU ARM OS
-
Switch to root:
sudo -i -
Restore DPU BMC defaults:
ipmitool raw 0x32 0x66 -
Restart DPU BMC:
ipmitool mc reset cold
Updating DPU firmware
Determine the DPU model
Log on to the DPU ARM OS and attempt to run the following command:
sudo mlxfwmanager --query -d /dev/mst/*_pciconf0
For Bluefield 2 DPUs you should expect the output similar to the following:
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: BlueField2
Part Number: MBF2H536C-CECO_Ax_Bx
Description: BlueField-2 P-Series DPU 100GbE Dual-Port QSFP56; integrated BMC; PCIe Gen4 x16; Secure Boot Enabled; Crypto Enabled; 32GB on-board DDR; 1GbE OOB management; FHHL
PSID: MT_0000000768
PCI Device Name: /dev/mst/mt41686_pciconf0
Base GUID: a088c20300ea8240
Base MAC: a088c2ea8240
Versions: Current Available
FW 24.40.1000 N/A
FW (Running) 24.35.2000 N/A
PXE 3.6.0805 N/A
UEFI 14.28.0016 N/A
UEFI Virtio blk 22.4.0010 N/A
UEFI Virtio net 21.4.0010 N/A
For Bluefield 3 DPUs you should expect the output similar to the following:
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: BlueField3
Part Number: 900-9D3B6-00CV-A_Ax
Description: NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
PSID: MT_0000000884
PCI Device Name: /dev/mst/mt41692_pciconf0
Base MAC: a088c232137a
Versions: Current Available
FW 32.41.1000 N/A
PXE 3.7.0400 N/A
UEFI 14.34.0012 N/A
UEFI Virtio blk 22.4.0013 N/A
UEFI Virtio net 21.4.0013 N/A
Status: No matching image found
Checking Bluefield Firmware Versions
To check the current Bluefield firmware versions installed on a DPU:
-
Log into the staging server for the site
-
Set up IP, password and token environment variables:
export DPUBMCIP=<DPU BMC IP> export BMCPASS=<BMC Password> export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'` -
Check the current DPU BMC Firmware Versions:
Bluefield 2 DPUs:
curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory # Use the Firmware ID from the first command to complete the firmware ID needed for the following command: curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/<firmware_id>_BMC_Firmware | jq -r ' .Version'Bluefield 3 DPUs:
curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'
Updating the Bluefield Firmware Versions
Note: If discovery is failing due to the firmware revision being too low, confirm with Forge Dev team what version you should update to before proceeding
DPU Firmware versions can be downloaded from the following locations:
BF2: https://confluence.nvidia.com/display/SW/BF2+BMC+Firmware+release
BF3: https://confluence.nvidia.com/display/SW/BF3+BMC+Firmware+release
For the examples below, we are installing FW version 24.01-5, but confirm this with Forge Development team for your specific install before proceeding
-
Download the relevant packages for your DPU type:
Bluefield 2:
wget https://urm.nvidia.com/artifactory/sw-bmc-generic-local/BF2/BF2BMC-24.01-5/OPN/bf2-bmc-ota-24.01-5-opn.tarBluefield 3:
wget https://urm.nvidia.com/artifactory/sw-bmc-generic-local/BF3/BF3BMC-24.01-5/OPN/bf3-bmc-24.01-5_opn.fwpkg -
Copy the firmware package to the staging server for the site
-
Set up IP, password and token environment variables:
export DPUBMCIP=<DPU BMC IP> export BMCPASS=<BMC Password> export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'` -
Initiate the DPU BMC FW Upgrade:
Bluefield 2:
curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/octet-stream" -X POST -T bf2-bmc-ota-24.01-5-opn.tar https://$DPUBMCIP/redfish/v1/UpdateService/updateBluefield 3:
curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/octet-stream" -X POST -T bf3-bmc-24.01-5_opn.fwpkg https://$DPUBMCIP/redfish/v1/UpdateService/update -
Monitor the firmware update progress:
# List the running tasks: curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/TaskService/Tasks { "@odata.id": "/redfish/v1/TaskService/Tasks", "@odata.type": "#TaskCollection.TaskCollection", "Members": [ { "@odata.id": "/redfish/v1/TaskService/Tasks/0" } ], "Members@odata.count": 1, "Name": "Task Collection" } # Display the current progress curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/TaskService/Tasks/0 | jq -r ' .PercentComplete' 30 -
Once the progress has reached 100% complete, initiate a reboot of the BMC:
curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/json" -X POST -d '{"ResetType": "GracefulRestart"}' https://$DPUBMCIP/redfish/v1/Managers/Bluefield_BMC/Actions/Manager.Reset -
Once the DPU BMC has rebooted, retrieve a new BMC Token and check the installed firmware version:
Bluefield 2:
export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'` curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory # Use the Firmware ID from the first command to complete the firmware ID needed for the following command: curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/<firmware_id>_BMC_Firmware | jq -r ' .Version'Bluefield 3:
export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'` curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'
DPU ARM OS: Checking Secure Boot Status
To successfully boot from the Forge BFB image, the DPU ARM OS needs to have Secure Boot disabled and configured for HTTP PXE boot.
Check current secure boot settings
-
Log in to the staging server for the site
-
Set up the DPU IP, password environment variables:
export DPUBMCIP='BMC IP' export BMCPASS='BMC password' -
Check the current Secure Boot settings:
curl -k -u root:"$BMCPASS" -X GET https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBootNote: If you do not see the
SecureBootCurrentBootoption listed, you should install DOCA version 2.5.0If you see the following output, secure boot is enabled and it needs to be disabled:
{ "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot", "@odata.type": "#SecureBoot.v1_1_0.SecureBoot", "Description": "The UEFI Secure Boot associated with this system.", "Id": "SecureBoot", "Name": "UEFI Secure Boot", "SecureBootCurrentBoot": "Enabled", "SecureBootDatabases": { "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot/SecureBootDatabases" }, "SecureBootEnable": true, "SecureBootMode": "UserMode" }If you see
"SecureBootCurrentBoot": "Disabled",no action is required. You should attempt to boot the DPU ARM OS over the network:{ "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot", "@odata.type": "#SecureBoot.v1_1_0.SecureBoot", "Description": "The UEFI Secure Boot associated with this system.", "Id": "SecureBoot", "Name": "UEFI Secure Boot", "SecureBootCurrentBoot": "Disabled", "SecureBootDatabases": { "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot/SecureBootDatabases" }, "SecureBootEnable": true, "SecureBootMode": "UserMode" }
Disable Secure Boot
To disable if Secure Boot if it is enabled:
-
Run the command to disable Secure Boot:
curl -k -u root:"$BMCPASS" -X PATCH -H 'Content-Type: application/json' https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBoot -d '{"SecureBootEnable":false}' -
Restart the DPU ARM OS:
curl -k -u root:"$BMCPASS" -X POST -H 'Content-Type: application/json' https://$DPUBMCIP/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset -d '{"ResetType" : "GracefulRestart"}' -
Wait for the DPU ARM OS to boot and check if Secure Boot is enabled now:
curl -k -u root:"$BMCPASS" -X GET https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBootNote: You may need to run this step several times to disable secure boot. It may take up to 3 cycles of this for the setting to stick
If the "SecureBootCurrentBoot" setting is not shown, attempt to install DOCA 2.5.0:
-
Download the BFB image on the staging server:
mkdir DOCA cd DOCA wget https://image.azure.nvmetal.net/mirror/forge/DOCA_2.5.0_BSP_4.5.0_Ubuntu_22.04-1.23-10.prod.bfb --no-check-certificate -
Install the BFB image to the DPU ARM OS via the DPU BMC from the server with the BFB image:
export DPUBMCIP='BMC IP' export BMCPASS='BMC password' sshpass -p $BMCPASS scp -o StrictHostKeyChecking=no DOCA_2.5.0_BSP_4.5.0_Ubuntu_22.04-1.23-10.prod.bfb root@$DPUBMCIP:/dev/rshim0/boot -
Log on to the DPU BMC and reboot the DPU ARM OS:
echo SW_RESET 1 > /dev/rshim0/misc -
After the DPU ARM OS boots, log into the DPU ARM OS using the default password
-
Switch to root and set the default username passwod back to the default
-
Ensure that the DOCA firmware is up to date:
sudo -i bfvcheck /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl -
Check that the DPU ARM OS is configured for HTTPs boot. Log into the DPU ARM OS and switch to root,
-
List the current boot order:
efibootmgr -
If the boot order is set to something similar the following, no action is needed and you should reboot the DPU ARM OS:
BootCurrent: 0040 Timeout: 3 seconds BootOrder: 0000,0040,0001,0002,0003 Boot0000* NET-OOB-IPV4-HTTP Boot0001* NET-OOB.4040-IPV4 Boot0002* UiApp Boot0003* EFI Internal Shell Boot0040* ubuntu0 -
To set the correct boot order, create the /etc/bf.cfg file with the following contents:
echo "BOOT0=NET-OOB-IPV4-HTTP BOOT1=DISK BOOT2=NET-OOB.4040-IPV4" >> /etc/bf.cfg -
Run the bfcfg command to update the boot order:
bfcfg -
Verify that the boot order is no set to NET-OOB-IPV4-HTTP as default:
efibootmgr -
Reboot the DPU ARM OS from the RSHIM console and monitor the reboot/provisioning process
Note: If you see an error similar to the following during PXE boot, verify that Secure Boot is disabled correctly:
EFI stub: Booting Linux Kernel... EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary EFI stub: UEFI Secure Boot is enabled. EFI stub: Using DTB from configuration table