Infiniband NIC and port selection
NCX Infra Controller (NICo) supports multiple Infiniband enabled Network Interface Cards (NICs). Each of those NICs might feature 1-2 physical ports, where each port allows to connect the NIC to an Infiniband switch that is part of a certain Infiniband fabric.
This document describes how NICo enumerates available NICs and how it makes them available for selection by a tenant during instance creation.
Requirements
- Hosts with the identical hardware configuration should be reported by NICo as having the exact same machine capabilities. E.g. a Machine having 2 Infiniband NICs that each have 2 ports that are connected to different Infiniband fabrics (4 fabrics in total), should be exactly reported as such.
- If NICo tenants configure multiple hosts of the same instance type with the same infiniband configuration and run the same operating system, they should find exactly the exact same device names on the host. This allows them to e.g. statically use certain Infiniband devices in applications and containers without a need for complex run-time enumeration on the tenant side. E.g. a tenant should be able to rely on the devices
ibp202s0f0andibp202s0f1always being available and connected their desired configuration.
Recommendation
Each port of all supported Infiniband NICs is reported as a separate PCI device. This makes those ports individually controllable and thereby mostly indistinguishable from a different physical NIC. E.g. an infiniband capable ConnectX-6 NIC shows up on a Linux host as the following 2 devices:
ubuntu@alpha:~$ lspci -v | grep Mellanox
ca:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Subsystem: Mellanox Technologies MT28908 Family [ConnectX-6]
ca:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
Subsystem: Mellanox Technologies MT28908 Family [ConnectX-6]
Both show up as 2 independent infiniband devices:
ls /sys/class/infiniband
ibp202s0f0 ibp202s0f1
This setup is mostly equivalent to a setup with 2 single-port Infiniband NICs. Therefore we seem to have 2 options for presenting multi-port NICs to NICo users:
- Preferred: Present each physical port of a NIC as a separate Infiniband NIC. The combination of a NIC & port is referred to as
device. - Present a multi-port NIC as single NIC with multiple ports.
Option 1) is preferred because it simplifies the NICo data model and user experience: Users don't have to worry about 2 dimensions (NIC and port) when selecting an interface they want to configure - they only have to select a device. The fact that this interface is really a part of a hardware component that features 2 interfaces does not matter for the user workflows, where they want to use the infiniband device to send or receive data.
Various NICo user APIs can therefore by simplified to a point where no port information is required to be entered or shown. E.g. during Instance creation, the infiniband interface network configuration object only requires to pass a network device ID and no longer a port. In a similar fashion, the NICo internal data models for storing hardware information about infiniband devices can be simplified by dropping port data.
How are the devices still related?
While the devices for the 2 ports seem mostly independent, there are still a few areas where they behave different than 2 independent cards:
-
Both devices report the same serial number.
-
The Mellanox firmware tools (
mlxconfig,mst) show only a single device. E.g.MST devices: ------------ /dev/mst/mt4123_pciconf0 - PCI configuration cycles access. domain:bus:dev.fn=0000:ca:00.0 addr.reg=88 data.reg=92 cr_bar.gw_offset=-1 Chip revision is: 00This breaks the illusion of 2 independent devices. Since the tenant can install and use those tools without the availability of a NIC firmware lockdown, they are be able to inspect these properties. There however doesn't seem to be an obvious problem with it.
-
Due to 2), the port configurations for both ports are performed by manipulating a single device object in the Mellanox Firmware tools. E.g. both of the following commands
mlxconfig -d /dev/mst/mt4123_pciconf0 set LINK_TYPE_P1=2 LINK_TYPE_P2=2 mlxconfig -d /dev/mst/mt4123_pciconf0.1 set LINK_TYPE_P1=2 LINK_TYPE_P2=2reconfigure both ports of a physical card from ethernet to infiniband, independent of whether the target device is the first port (
/dev/mst/mt4123_pciconf0or 2nd port/dev/mst/mt4123_pciconf0.1).The same applies also for settings like
NUM_OF_VFSandSRIOV_EN.
None of those reasons seem blockers for representing the ports as separate devices for NICo users: Since NICo configures the device for tenants, they do not need to worry about the physical properties and can just use the independent devices.
Required changes
NICo machine hardware enumeration
When NICo discovers a machine that is intended to be managed by the NICo site controller, it enumerates its hardware details using the forge-scout tool.
The tool reports all discovered hardware information (e.g. the number and type of CPUs, GPUs, and network interfaces), and this information gets persisted in the NICo database.
The reported information includes the list of Infiniband network interfaces. The site controller needs the information to decide whether a certain Infiniband configuration is valid for a Machine.
The NICo DiscoveryData model for Infiniband that is defined as follows almost supports the preferred model:
message InfinibandInterface {
PciDeviceProperties pci_properties = 1;
string guid = 2;
}
message PciDeviceProperties{
string vendor = 1;
string device = 2;
string path = 3;
sint32 numa_node = 4;
optional string description = 5;
}
In this model, every port of an Infiniband NIC already shows up as a separate network device. E.g. a dual port ConnectX-6 NIC gets reported as:
[
{
"guid": "1234",
"pci_properties": {
"path": "/devices/pci0000:c9/0000:c9:02.0/0000:ca:00.0/net/ibp202s0f0",
"device": "0x101b",
"vendor": "0x15b3",
"numa_node": 1,
"description": "MT28908 Family [ConnectX-6]"
}
},
{
"guid": "5678",
"pci_properties": {
"path": "/devices/pci0000:c9/0000:c9:02.0/0000:ca:00.1/net/ibp202s0f1",
"device": "0x101b",
"vendor": "0x15b3",
"numa_node": 1,
"description": "MT28908 Family [ConnectX-6]"
}
}
]
There however seem to be aspects that we can improve on:
- The device and vendor names are passed as identifiers. If Tenants would want to
use the same information to configure infiniband on an instance, the API calls
to do that would contain the same non-descriptive data: Configure the first
Infiniband interface of type
vendor: 0x15b3anddevice: 0x101b. If we would use those fields to directly report the stringified versions, both the hardware report and the interface selection become more obvious to the user. We could also transmit both the IDs and the names. But as long as the IDs are not referenced in any other NICo APIs they do not seem too useful. - The device path is very OS and driver specific. A different path is reported
depending on which of the various Mellanox drivers the NICo discovery image uses.
We are be able to have more stable information by just persisting the PCI slot - either
in the existing
pathfield or a newslotfield. - For multi-fabric support, we would include the identifier of the fabric that the device is connected to. This field can be empty in the MVP which supports only a single fabric. An empty field would always reference the default Infiniband fabric.
- The
deviceis referred to asinterfacein the discovery data API, which is inconsistent with the remaining terminology. We can renameInfinibandInterfacetoInfinibandDevice, andinfiniband_interfacestoinfiniband_devices.
With these changes, the submitted discovery information for the dual port NIC is:
[
{
"guid": "1234",
"fabric": "IbFabric1",
"pci_properties": {
"slot": "0000:ca:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]",
"numa_node": 1,
"description": "TBD (not strictly required)"
}
},
{
"guid": "5678",
"fabric": "IbFabric2",
"pci_properties": {
"slot": "0000:ca:00.1",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]",
"numa_node": 1,
"description": "TBD (not strictly required)"
}
}
]
Instance Type hardware capabilities
The NICo cloud backend currently displays Machine hardware details with slightly less granularity than the site APIs. It uses a "Machine Capability" model that tries to model how many components of a particular type a Machine includes. This model reduces the amount of data that needs to be transferred between the Rest API backend and NICo users since it doesn't need to explain every individual component in detail. It also has the advantage that "machine capabilities" can describe groups of similar machines ("instance types") instead of just a single machine. Each machine the that adheres to an instance type shares the same capabilities.
To support Infiniband, we can extend the existing capabilities model of the NICo REST API backend to cover infiniband:
- Each Infiniband
devicewill be represented by a capability that describes the device. - The
typefield that is used for Infiniband devices would beInfiniband. - The
namefield is the device name. The vendor can optionally be stored a separatevendorfield. Alternatively thenamefield could store the concatenation ofvendorand the devicename. However since some APIs might just require the name, keeping the information separate seems clearer. - Every physical port of an Infiniband NIC would be shown as one separate
device (
count: 1). - For multi-fabric support, each entry would also be annotated with the
fabricthat the port is connected to. - Virtual Functions (VF)s are not presented in this list of hardware capabilities, since their existence can be controlled by configuring the associated Physical Function (PF).
- Hardware details like PCI slots and hardware GUIDs are not shown in this model. Since they could be different from Machine to Machine, they they can not be used in the data model that is shared across a range of Machines.
[
{
"type": "Infiniband",
"name": "MT28908 Family [ConnectX-6]",
"vendor": "Mellanox Technologies",
"count": 1,
"fabric": "IbFabric1",
},
{
"type": "Infiniband",
"name": "MT28908 Family [ConnectX-6]",
"vendor": "Mellanox Technologies",
"count": 1,
"fabric": "IbFabric2",
}
]
If both ports of the dual port NIC would be connected to the same fabric, the NIC would be represented as a single entry:
[
{
"type": "Infiniband",
"name": "MT28908 Family [ConnectX-6]",
"vendor": "Mellanox Technologies",
"count": 2,
"fabric": "IbFabric1",
}
]
Alternative: If we would merge the device vendor and name fields, the entry would become:
[
{
"type": "Infiniband",
"name": "Mellanox Technologies MT28908 Family [ConnectX-6]",
"count": 2,
"fabric": "IbFabric1",
}
]
Instance creation APIs
When tenants create instances, they need to pass configuration that describes how Infiniband interfaces on the new instance get configured.
For instance types that feature multiple devices, the tenant needs to select which device to utilize. This is especially important in cases where the ports of NICs are connected to different fabrics.
An important aspect of instance configuration APIs is that they are decoupled from the actual hardware. This allows configurations to be shared between all instances of the same instance type. And it allows hardware (like an actual NIC) to be replaced at runtime without changing the configuration objects. Therefore the tenant facing configurations do not contain machine-specific identifiers like a serial-number, MAC address or GUID on it. The tenant instead selects the device via attributes that are common between all machines of the same instance type.
Due to these constraints, we allow the tenant to select a device via
the following configuration object of type InstanceInfinibandConfig:
{
"ib_interfaces": [{
// The first 3 parameters select the physical PCI device
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
// Specifies that the n-th instance of the device will be used by this interface.
// In this example the first ConnectX-6 NIC&port that utilizes
// fabric "IbFabric1" will be configured.
"device_instance": 0,
// Select the PF or a specific VF. If a VF is required, the parameter
// `virtual_function_id` also needs to be supplied
"function_type": "PhysicalFunction",
// Configures the partition this interface gets attached to
"ib_partition_id": "some_partition_identifier",
}, {
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
"device_instance": 1,
"function_type": "VirtualFunction",
"virtual_function_id": 0,
"ib_partition_id": "some_other_partition_identifier",
}]
}
In this model, the device field references a particular Infiniband PCI device that
is reported in the name field of the Infiniband capability. It is used along with the fabric
attribute to select a device combination that is suitable for the purpose of
the tenant.
A capability that describes that a host supports multiple Infiniband devices
of the same model, attached to the same fabric (e.g. via count: 2) requires the
tenant needs to select via device_instance which particular instance of the device needs
to be configured.
The parameters device, fabric and device_instance always select the
physical PCI device (PhysicalFunction). A tenant uses the 2 additional parameters
function_type and virtual_function_id to configure a device that makes use of
a VirtualFunction on top of the selected PhysicalFunction.
Device vendor
The API described above fully omits the device vendor as a selection criteria.
This would make selection ambiguous in case a Machine would feature devices with the
same name but produced by different vendors.
Given all known devices that NICo will support initially are produced by Mellanox/NVIDIA,
this is however not an issue in the foreseeable future.
In case such a setup ever needs to be supported, an optional device_vendor field
could be added for each entry of InstanceInfinibandConfig to disambiguate the
target device in case of conflicts:
{
"ib_interfaces": [{
"device": "Ambiguous Device",
"vendor": "VendorA",
"fabric": "IbFabric1",
"device_instance": 0,
"function_type": "PhysicalFunction",
"virtual_function_id": 0,
"ib_partition_id": "some_partition_identifier",
}, {
"device": "Ambiguous Device",
"vendor": "VendorB",
"fabric": "IbFabric1",
"device_instance": 0,
"function_type": "PhysicalFunction",
"virtual_function_id": 0,
"ib_partition_id": "some_other_partition_identifier",
}]
}
The Web UI can combine all the necessary information into a single combo-box. E.g. it could show a combo box with the following content:
+-----------------------------------------------------------------------+
| Select Device |
+-----------------------------------------------------------------------+
| [IbFabric1]: Mellanox Technologies MT28908 Family [ConnectX-6] - Nr 0 |
| [IbFabric1]: Mellanox Technologies MT28908 Family [ConnectX-6] - Nr 1 |
+-----------------------------------------------------------------------+
This single selector would provide all the information that all layers need to configure the interface according to user requirements.
Mapping from Tenant Configuration to actual hardware interfaces
If a tenant selects a network interface, we need to be able to uniquely map the interface to a specific hardware interface.
E.g. this instance configuration request:
{
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
"device_instance": 1,
}
needs to map to the following hardware interface information:
{
"guid": "1234",
"fabric": "IbFabric1",
"pci_properties": {
"slot": "0000:ca:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]",
"numa_node": 1,
"description": "TBD (not strictly required)"
}
}
The fabric is directly copied, and the model fields map
to the device fields. The vendor field can be resolved by looking for any
device with the specified device name.
Thereby the only challenge is how to map instance in an non ambiguous fashion.
We can achieve this by sorting the interfaces based on the PCI slot,
and pick the N-th slot that satisfies the criteria.
Example 2:
Assuming the following hardware information is available:
[{
"guid": "1234",
"fabric": "IbFabric1",
"pci_properties": {
"slot": "0000:cb:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]"
}
},{
"guid": "2345",
"fabric": "IbFabric2",
"pci_properties": {
"slot": "0000:cd:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]"
}
},{
"guid": "3456",
"fabric": "IbFabric1",
"pci_properties": {
"slot": "0000:ea:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]"
}
},{
"guid": "4567",
"fabric": "IbFabric2",
"pci_properties": {
"slot": "0000:eb:00.0",
"vendor": "Mellanox Technologies",
"device": "MT28908 Family [ConnectX-6]"
}
}]
In this example a selection of
{device: "Mellanox ... MT28908 ...", fabric: "IbFabric1", device_instance: 0}would select the interface with GUID1234.{device: "Mellanox ... MT28908 ...", fabric: "IbFabric1", device_instance: 1}would select the interface with GUID3456.{device: "Mellanox ... MT28908 ...", fabric: "IbFabric2", device_instance: 0}would select the interface with GUID2345.{device: "Mellanox ... MT28908 ...", fabric: "IbFabric2", device_instance: 1}would select the interface with GUID4567.
An alternative seems to be to sort the interfaces by hardware guid instead of
PCI slot. The downside of this mapping is that it won't be stable
across machines of the same instance type. E.g. the selection in our example
might sometimes select a device in slot 4 and sometimes a device in slot 5 in case the
GUIDs are different. Since the PCI slots are assumed to be deterministic
for Machines with the same hardware configuration, tenants can assume their selection
always affects the exact same piece of hardware.
Forge Metadata Service (FMDS)
This will be renamed to something else (likely just NICo Metadata Service as we move from the old code name
The Forge Metadata Service (FMDS) provides the Tenant's software running on instance the capability to identify the infiniband configuration at runtime. It also provides the ability to execute a configuration script which configures the local Infiniband interfaces for the operating mode that the Tenant desired for this instance. This script needs to configure all network interfaces on the host. This includes
- setting the correct number of VFs per physical device
- writing GUIDs that NICo allocated for VF interfaces to the locations the OS expects them it
Applying these settings configure the interfaces in software in a way that allows them to send their traffic successfully to the connected Infiniband switches.
To perform this job, FMDS returns the applied instance configuration -
which is the desired InstanceInfinibandConfig plus the configuration data that
Forge allocates on behalf the tenant. This would be mostly the GUIDs.
Putting it together, the tenant machine would retrieve the following data via FMDS, in a format that is still TBD:
{
"config": {
"infiniband": {
"ib_interfaces": [{
// Selects the device (NIC and Port)
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
"device_instance": 0,
// Select the PF or a specific VF
"function_type": "VirtualFunction",
"virtual_function_id": 0,
// Configures the partition this interface gets attached to
"ib_partition_id": "some_partition_identifier",
}]
}
},
"status": {
"infiniband": {
"ib_interfaces": [{
"guid": "1234",
"lid": 123,
"addresses": ["5.6.7.8", "::8:1:3:4:5"]
}]
}
}
}
The FMDS client needs to perform the mapping from configuration
parameters to the actual Linux devicename (in /sys/class/infiniband) to apply
the necessary configuration. This requires the same knowledege about
the unique mapping of the configuration to the actual hardware that is residing
in NICo. A challenge here is however that the client running
on a tenants host is not able to resolve the fabric per interface. Since
the fabric is one part of the mapping in a multi-fabric context, the mapping would
no longer be unambiguous. An alternative to this is to extend
status.infiniband.ib_interfaces in a way that allows the software on the tenant
host to easier lookup the necessary device. E.g. we would return the hardware
guid of the associated physical function in every interface. Along:
{
"status": {
"infiniband": {
"ib_interfaces": [{
"pf_guid": "1234",
"guid": "1234",
"lid": 123,
"addresses": ["5.6.7.8", "::8:1:3:4:5"]
}, {
"pf_guid": "1234",
"guid": "3457",
"lid": 124,
"addresses": ["5.6.7.9", "::8:1:3:4:56"]
}]
}
}
}
Alternatives considered
Interface configuration via unique PCI address (device_slot)
The APIs described above make it slightly ambigiuos which device (in terms of
PCI slot) a tenant would use for an interface. They tenant specifies the following
in an instance creation request
{
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
"device_instance": 2,
"ib_partition_id": "partition_a"
}
and the system would look up what PCI address device_instance: 2 refers to.
This mapping might not be obvious in a system which features multiple NICs with
one or multiple ports, and each of them connected to a mix of fabrics.
E.g. a tenant could be surprised that device_instance can have the
same value for 2 devices that utilize a different fabric, since the index is
per device & fabric combination. E.g. the following configuration is valid:
[{
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric1",
"device_instance": 1,
"ib_partition_id": "Partition_A"
},
{
"device": "MT28908 Family [ConnectX-6]",
"fabric": "IbFabric2",
"device_instance": 1,
"ib_partition_id": "Partition_B"
}]
It would select the 2nd device of type ConnectX-6 that is connected to IbFabric1
and configure it to use partition Partition_A. Whereas the 2nd device of type
ConnectX-6 that is connected to IbFabric2`` will use partition Partition_B`.
To avoid this concern, we can move towards an API which uses the unique PCI address/slot for instance creation. In this model, a tenant would configure the instance with the following request
{
"ib_interfaces": [{
// This single parameters selects the device (NIC, Port and thereby Fabric)
"device_slot": "0000:ca:00.0",
// Select the PF or a specific VF. If a VF is required, the parameter
// `virtual_function_id` also needs to be supplied
"function_type": "PhysicalFunction",
// Configures the partition this interface gets attached to
"ib_partition_id": "some_partition_identifier",
}, {
"device_slot": "0000:ca:00.1",
"function_type": "VirtualFunction",
"virtual_function_id": 0,
"ib_partition_id": "some_other_partition_identifier",
}]
}
The hardware inventory data model already provides the slot address. Therefore
no additional changes are required here.
However the machine capability model needs to be extended to include the slot
information, since it is used by the NICo Admin UI to explain the tenant what devices
can be configured. E.g. the reported machine capability data could be:
[
{
"type": "Infiniband",
"name": "MT28908 Family [ConnectX-6]",
"vendor": "Mellanox Technologies",
"count": 1,
"fabric": "IbFabric1",
"slot": "0000:ca:00.0"
},
{
"type": "Infiniband",
"name": "MT28908 Family [ConnectX-6]",
"vendor": "Mellanox Technologies",
"count": 1,
"fabric": "IbFabric2",
"slot": "0000:ca:00.1"
}
]
Since the slot is unique per device, the count field could never be anything
different than 1 for Infiniband capabilities.
Downsides of the device_slot based API
The device_slot based API is not preferred, because it makes it harder for API
users to spin up an instance without an excessive amount of "prior knowledge".
In the recommended model tenants that require to configure a single Infiniband
Interface will likely just need to specify the device name which is well known
(e.g. MT28908 Family [ConnectX-6]). The fabric field might not need to be specified
since it would be the site default, and the device_instance could simply be 0.
This simplicity would remain even if machine contains multiple devices that are connected to the same fabric, and where the tenant wants to configure all of them.
The advantages of the device_slot based APIs would only show up in complex
deployments with multiple NICs and multiple Fabrics.
Another downside is that the device_slot based API strictly requires the
PCI slot addresses to be consistent between all machines of a certain instance type.
The preferred model can support different PCI slot addresses to the extent that
instance creation and configuration would still work as expected.
Other considerations
Terminology
A variety of different terms had been used to reference "things to send/receive infiniband traffic":
- Network Interface Cards (NICs)
- Network Adapters
- Host Channel Adapters (HCAs)
- Devices
- Interfaces
Each of those terms is sometimes used to reference to a full Infiniband card that might provide more than 1 port, to just a single port on the card, or even to a purely virtual output that is provided by the card (a VF).
To avoid confusion, The APIs presented in this document are consistently using the following terms with meanings defined as follows:
Devices
- A
deviceis a physical PCI device which can be used to send and receive Infiniband traffic. - The operating system of a Tenants host shows each device separately. E.g.
on Linux, each
deviceshows up under/sys/class/infiniband/. - A Network Interface Card (NIC) can provide 1 or more
devices. - The "Physical Function" (PF) of each PCI device leads to a
devicebeing made available. Besides that the usage of "Virtual Functions" (VFs) allows to configure additionaldevices that share the same hardware.
Interfaces
An interface represents a device that is configured towards a certain purpose.
For example a tenant can configure the first device of a certain type on their
host to be connected to Partition A, and the second device to Partition B.
Therefore, BB refers to interfaces when in instance configuration APIs and
when providing status information about running instances.
Open questions
- Should NICo documentation settle on a specific term to reference a full NIC?
E.g.
NICorAdapter? It might be necessary in order to explain workflows for tools which do only show the complete NIC and not individual devices (e.g.mlxconfig)
Numa Node awareness
We discussed a bit on whether the NUMA node that a device is connected to should be exposed to the user, or whether a tenant should even be able to select a device by NUMA node. This would help the tenant to achieve better locality between the device and a connected GPU for some applications.
While this seems like an interesting feature, it would also complicate the APIs even more by introducing yet another selector.
Even without introducing NUMA awareness on the API layer, tenants should be
able to achieve the same goal by exploiting the fact that the device mapping is
equivalent for all machines of an instance type: The Tenant can create a
test instance, and determine based on introspection of this particular instance
whether they have a suitable device configuration. They can modify the interface
selection (via instance) until they achieve their ideally desired configuration.
Once they have found the desired configuration, they would be able to carry it
over to other instances using the exact same configuration.