Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Adding New Machines to an Existing Site

This guide is intended to cover some of the basic things you should check to get a machine into a a basic state where it can be discovered by Forge auto-ingestion.

Some of the configuration items that should be considered which could potentially cause issues:

  1. Host BMC Password Requirements
  2. Updating the Host BMC and UEFI Firmware (Not covered in this document at this time)
  3. DPU BMC Password Requirements
  4. Updating DPU BMC Firmware
  5. DPU ARM OS Check Secure Boot status

Host BMC Password Requirements

Note: New servers should be using the default username for the server type e.g. USERID for Lenovo, admin for NVIDIA/Vikings, root for Dell

You should check both the expected machines DB and the site vault pod data store for any existing data. If entries exist in both expected machines and vault, you should consider the password stored in vault as the password that should be used.

Check Host BMC exists in Expected Machines DB

If there is an existing data in expected machines for the machine, you can either update the password in expected machines or change the password on the Host BMC to match.

  1. Use carbide-admin-cli to check if there is an existing entry for the host BMC:

    carbide-admin-cli expected-machine show |grep <Host BMC IP Address|Host BMC MAC Address>
    
  2. If an entry exists for the machine, display the details using carbide-admin-cli:

    carbide-admin-cli expected-machine show <Host BMC MAC address>
    
  3. To update an existing expected machines data:

    carbide-admin-cli expected-machine add --bmc-mac-address <BMC MAC Address> --bmc-username <BMC Username> --bmc-password <BMC Password --chassis-serial-number <Chassis Serial Number>
    

    Note: If you only need to update the BMC password, you just need to supply the BMC MAC Address and BMC Password

  4. To add a new machine to the expected machines DB:

    carbide-admin-cli expected-machine update --bmc-mac-address <BMC_MAC_ADDRESS> <--bmc-username <BMC_USERNAME> --bmc-password <BMC_PASSWORD> --chassis-serial-number <CHASSIS_SERIAL_NUMBER>
    

Checking site vault data

To check of the Host BMC has currently any passwords in vault on a site:

  1. Connect to the Kubernetes environment for the site you are working on

  2. Retrieve the decoded vault secret for the site:

    kubectl get secret -n forge-system carbide-vault-token -oyaml | yq '.data.token' | base64 -d ; echo
    
  3. Connect to the vault pod for the site and paste in the decoded vault secret at the Token prompt:

    kubectl --namespace vault exec -it vault-0 -- /bin/sh
    vault login --tls-skip-verify
    Token (will be hidden):
    
  4. List the secrets in vault:

    vault secrets list --tls-skip-verify
    
  5. Look for the site BMC:

    vault kv list --tls-skip-verify secrets/machines/bmc/ |grep <Host BMC MAC Address>
    
  6. Get the current credentials set for the host bmc if they exist:

    vault kv get --tls-skip-verify secrets/machines/bmc/<BMC MAC Address>/root
    

    Ensure these credentials match the credentials currently set on the host BMC. It is easier to just update the Host BMC to match vault rather than attempting to update the secret in vault.

DPU BMC Password Requirements

For a new/undiscovered DPU BMC, ensure that it is set to the default BMC username/password

Resting DPU BMC password to default - From DPU BMC

To reset to factory defaults from the DPU BMC:

  1. Log into the DPU BMC.

  2. Run the following command to reset to factory defaults:

    ipmitool raw 0x32 0x66
    
  3. Reboot the DPU BMC:

    reboot
    

Resetting DPU BMC password to default - From DPU ARM OS

If you don't know the BMC password, but have access to the DPU ARM OS, you can reset to defaults as follows:

  1. Log into the DPU ARM OS

  2. Switch to root:

    sudo -i
    
  3. Restore DPU BMC defaults:

    ipmitool raw 0x32 0x66
    
  4. Restart DPU BMC:

    ipmitool mc reset cold
    

Updating DPU firmware

Determine the DPU model

Log on to the DPU ARM OS and attempt to run the following command:

sudo mlxfwmanager --query -d /dev/mst/*_pciconf0

For Bluefield 2 DPUs you should expect the output similar to the following:

Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      BlueField2
  Part Number:      MBF2H536C-CECO_Ax_Bx
  Description:      BlueField-2 P-Series DPU 100GbE Dual-Port QSFP56; integrated BMC; PCIe Gen4 x16; Secure Boot Enabled; Crypto Enabled; 32GB on-board DDR; 1GbE OOB management; FHHL
  PSID:             MT_0000000768
  PCI Device Name:  /dev/mst/mt41686_pciconf0
  Base GUID:        a088c20300ea8240
  Base MAC:         a088c2ea8240
  Versions:         Current        Available
     FW             24.40.1000     N/A
     FW (Running)   24.35.2000     N/A
     PXE            3.6.0805       N/A
     UEFI           14.28.0016     N/A
     UEFI Virtio blk   22.4.0010      N/A
     UEFI Virtio net   21.4.0010      N/A

For Bluefield 3 DPUs you should expect the output similar to the following:

Querying Mellanox devices firmware ...

Device #1:
----------

Device Type:      BlueField3
  Part Number:      900-9D3B6-00CV-A_Ax
  Description:      NVIDIA BlueField-3 B3220 P-Series FHHL DPU; 200GbE (default mode) / NDR200 IB; Dual-port QSFP112; PCIe Gen5.0 x16 with x16 PCIe extension option; 16 Arm cores; 32GB on-board DDR; integrated BMC; Crypto Enabled
  PSID:             MT_0000000884
  PCI Device Name:  /dev/mst/mt41692_pciconf0
  Base MAC:         a088c232137a
  Versions:         Current        Available
     FW             32.41.1000     N/A
     PXE            3.7.0400       N/A
     UEFI           14.34.0012     N/A
     UEFI Virtio blk   22.4.0013      N/A
     UEFI Virtio net   21.4.0013      N/A

  Status:           No matching image found

Checking Bluefield Firmware Versions

To check the current Bluefield firmware versions installed on a DPU:

  1. Log into the staging server for the site

  2. Set up IP, password and token environment variables:

    export DPUBMCIP=<DPU BMC IP>
    export BMCPASS=<BMC Password>
    export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'`
    
  3. Check the current DPU BMC Firmware Versions:

    Bluefield 2 DPUs:

    curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory
    
    # Use the Firmware ID from the first command to complete the firmware ID needed for the following command:
    curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/<firmware_id>_BMC_Firmware | jq -r ' .Version'
    

    Bluefield 3 DPUs:

    curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'
    

Updating the Bluefield Firmware Versions

Note: If discovery is failing due to the firmware revision being too low, confirm with Forge Dev team what version you should update to before proceeding

DPU Firmware versions can be downloaded from the following locations:
BF2: https://confluence.nvidia.com/display/SW/BF2+BMC+Firmware+release
BF3: https://confluence.nvidia.com/display/SW/BF3+BMC+Firmware+release

For the examples below, we are installing FW version 24.01-5, but confirm this with Forge Development team for your specific install before proceeding

  1. Download the relevant packages for your DPU type:

    Bluefield 2:

    wget https://urm.nvidia.com/artifactory/sw-bmc-generic-local/BF2/BF2BMC-24.01-5/OPN/bf2-bmc-ota-24.01-5-opn.tar
    

    Bluefield 3:

    wget https://urm.nvidia.com/artifactory/sw-bmc-generic-local/BF3/BF3BMC-24.01-5/OPN/bf3-bmc-24.01-5_opn.fwpkg
    
  2. Copy the firmware package to the staging server for the site

  3. Set up IP, password and token environment variables:

    export DPUBMCIP=<DPU BMC IP>
    export BMCPASS=<BMC Password>
    export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'`
    
  4. Initiate the DPU BMC FW Upgrade:

    Bluefield 2:

    curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/octet-stream" -X POST -T bf2-bmc-ota-24.01-5-opn.tar https://$DPUBMCIP/redfish/v1/UpdateService/update
    

    Bluefield 3:

    curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/octet-stream" -X POST -T bf3-bmc-24.01-5_opn.fwpkg https://$DPUBMCIP/redfish/v1/UpdateService/update
    
  5. Monitor the firmware update progress:

    # List the running tasks:
    curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/TaskService/Tasks
    {
      "@odata.id": "/redfish/v1/TaskService/Tasks",
      "@odata.type": "#TaskCollection.TaskCollection",
      "Members": [
        {
          "@odata.id": "/redfish/v1/TaskService/Tasks/0"
        }
      ],
      "Members@odata.count": 1,
      "Name": "Task Collection"
    }
    
    # Display the current progress
    curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/TaskService/Tasks/0 | jq -r ' .PercentComplete'
    30
    
  6. Once the progress has reached 100% complete, initiate a reboot of the BMC:

    curl -k -H "X-Auth-Token: $BMCTOKEN" -H "Content-Type: application/json" -X POST -d '{"ResetType": "GracefulRestart"}' https://$DPUBMCIP/redfish/v1/Managers/Bluefield_BMC/Actions/Manager.Reset
    
  7. Once the DPU BMC has rebooted, retrieve a new BMC Token and check the installed firmware version:

    Bluefield 2:

    export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'`
    
    curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory
    # Use the Firmware ID from the first command to complete the firmware ID needed for the following command:
    curl -k -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/<firmware_id>_BMC_Firmware | jq -r ' .Version'
    
    

    Bluefield 3:

    export BMCTOKEN=`curl -k -H "Content-Type: application/json" -X POST https://$DPUBMCIP/login -d "{\"username\": \"root\", \"password\": \"$BMCPASS\"}" | grep token | awk '{print $2;}' | tr -d '"'`
    
    curl -ks -H "X-Auth-Token: $BMCTOKEN" -X GET https://$DPUBMCIP/redfish/v1/UpdateService/FirmwareInventory/BMC_Firmware | jq -r ' .Version'
    

DPU ARM OS: Checking Secure Boot Status

To successfully boot from the Forge BFB image, the DPU ARM OS needs to have Secure Boot disabled and configured for HTTP PXE boot.

Check current secure boot settings

  1. Log in to the staging server for the site

  2. Set up the DPU IP, password environment variables:

    export DPUBMCIP='BMC IP'
    export BMCPASS='BMC password'
    
  3. Check the current Secure Boot settings:

    curl -k -u root:"$BMCPASS" -X  GET https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBoot
    

    Note: If you do not see the SecureBootCurrentBoot option listed, you should install DOCA version 2.5.0

    If you see the following output, secure boot is enabled and it needs to be disabled:

    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot",
      "@odata.type": "#SecureBoot.v1_1_0.SecureBoot",
      "Description": "The UEFI Secure Boot associated with this system.",
      "Id": "SecureBoot",
      "Name": "UEFI Secure Boot",
      "SecureBootCurrentBoot": "Enabled",
      "SecureBootDatabases": {
        "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot/SecureBootDatabases"
      },
      "SecureBootEnable": true,
      "SecureBootMode": "UserMode"
    }
    

    If you see "SecureBootCurrentBoot": "Disabled", no action is required. You should attempt to boot the DPU ARM OS over the network:

    {
      "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot",
      "@odata.type": "#SecureBoot.v1_1_0.SecureBoot",
      "Description": "The UEFI Secure Boot associated with this system.",
      "Id": "SecureBoot",
      "Name": "UEFI Secure Boot",
      "SecureBootCurrentBoot": "Disabled",
      "SecureBootDatabases": {
        "@odata.id": "/redfish/v1/Systems/Bluefield/SecureBoot/SecureBootDatabases"
      },
      "SecureBootEnable": true,
      "SecureBootMode": "UserMode"
    }
    

Disable Secure Boot

To disable if Secure Boot if it is enabled:

  1. Run the command to disable Secure Boot:

    curl -k -u root:"$BMCPASS" -X  PATCH -H 'Content-Type: application/json' https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBoot -d '{"SecureBootEnable":false}'
    
  2. Restart the DPU ARM OS:

    curl -k -u root:"$BMCPASS" -X POST -H 'Content-Type: application/json' https://$DPUBMCIP/redfish/v1/Systems/Bluefield/Actions/ComputerSystem.Reset -d '{"ResetType" : "GracefulRestart"}'
    
  3. Wait for the DPU ARM OS to boot and check if Secure Boot is enabled now:

    curl -k -u root:"$BMCPASS" -X  GET https://$DPUBMCIP/redfish/v1/Systems/Bluefield/SecureBoot
    

    Note: You may need to run this step several times to disable secure boot. It may take up to 3 cycles of this for the setting to stick

If the "SecureBootCurrentBoot" setting is not shown, attempt to install DOCA 2.5.0:

  1. Download the BFB image on the staging server:

    mkdir DOCA
    cd DOCA
    wget https://image.azure.nvmetal.net/mirror/forge/DOCA_2.5.0_BSP_4.5.0_Ubuntu_22.04-1.23-10.prod.bfb --no-check-certificate
    
  2. Install the BFB image to the DPU ARM OS via the DPU BMC from the server with the BFB image:

    export DPUBMCIP='BMC IP'
    export BMCPASS='BMC password'
    sshpass -p $BMCPASS scp -o StrictHostKeyChecking=no DOCA_2.5.0_BSP_4.5.0_Ubuntu_22.04-1.23-10.prod.bfb root@$DPUBMCIP:/dev/rshim0/boot
    
  3. Log on to the DPU BMC and reboot the DPU ARM OS:

    echo SW_RESET 1 > /dev/rshim0/misc
    
  4. After the DPU ARM OS boots, log into the DPU ARM OS using the default password

  5. Switch to root and set the default username passwod back to the default

  6. Ensure that the DOCA firmware is up to date:

    sudo -i
    bfvcheck
    /opt/mellanox/mlnx-fw-updater/mlnx_fw_updater.pl
    
  7. Check that the DPU ARM OS is configured for HTTPs boot. Log into the DPU ARM OS and switch to root,

  8. List the current boot order:

    efibootmgr
    
  9. If the boot order is set to something similar the following, no action is needed and you should reboot the DPU ARM OS:

    BootCurrent: 0040
    Timeout: 3 seconds
    BootOrder: 0000,0040,0001,0002,0003
    Boot0000* NET-OOB-IPV4-HTTP
    Boot0001* NET-OOB.4040-IPV4
    Boot0002* UiApp
    Boot0003* EFI Internal Shell
    Boot0040* ubuntu0
    
  10. To set the correct boot order, create the /etc/bf.cfg file with the following contents:

    echo "BOOT0=NET-OOB-IPV4-HTTP
    BOOT1=DISK
    BOOT2=NET-OOB.4040-IPV4" >> /etc/bf.cfg
    
  11. Run the bfcfg command to update the boot order:

    bfcfg
    
  12. Verify that the boot order is no set to NET-OOB-IPV4-HTTP as default:

    efibootmgr
    
  13. Reboot the DPU ARM OS from the RSHIM console and monitor the reboot/provisioning process

    Note: If you see an error similar to the following during PXE boot, verify that Secure Boot is disabled correctly:

    EFI stub: Booting Linux Kernel...
    EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary
    EFI stub: UEFI Secure Boot is enabled.
    EFI stub: Using DTB from configuration table