Skip to content

Socket and RDMA Benchmarking

Use this page when the peer protocol is TCP, UDP, or RoCE/RDMA. These benchmarks use the Linux networking stack for TCP/UDP and RDMA verbs for RoCE, so the same client/server namespace shape is useful for proving that traffic leaves the host through the expected NIC path.

Make sure to build DAQIRI with the socket and RDMA backends first.

Backend choices

Protocol YAML selector Benchmark executable Typical reason to use it
TCP stream_type: "socket", protocol: "tcp" daqiri_bench_socket Baseline against normal Linux streams or test a TCP-speaking peer.
UDP stream_type: "socket", protocol: "udp" daqiri_bench_socket Datagram baseline against Linux networking. UDP payloads must be at most 65507 bytes.
RoCE/RDMA stream_type: "socket", protocol: "roce" daqiri_bench_rdma Compare DAQIRI RDMA verbs against tools such as ib_send_bw or ib_write_bw.

Build and launch a test shell

Build the socket and RDMA benchmarks inside the DAQIRI container:

docker run --rm --privileged --network=host --gpus all --ipc=host \
  --user "$(id -u):$(id -g)" \
  -v /dev/hugepages:/dev/hugepages \
  -v "$PWD:/work" \
  -w /work daqiri:local \
  bash -lc 'cmake -S . -B build-socket-rdma \
    -DBUILD_SHARED_LIBS=ON \
    -DDAQIRI_BUILD_PYTHON=OFF \
    -DDAQIRI_MGR="dpdk socket rdma" &&
    cmake --build build-socket-rdma \
      --target daqiri_bench_socket daqiri_bench_rdma -j"$(nproc)"'

Run the benchmark setup commands as root. The easiest path is a privileged, host-networked DAQIRI container:

docker run --rm -it --privileged --network=host --pid=host --ipc=host \
  --gpus all \
  -v "$PWD:/work" \
  -v /tmp:/tmp \
  -w /work daqiri:local bash

Install network tools inside the container if needed:

apt-get update
apt-get install -y iproute2 iputils-ping ethtool iperf3 rdma-core ibverbs-utils

Create isolated namespaces

Choose one transmit-facing interface and one receive-facing interface. The example below uses the Spark pair that was verified to increment physical counters on the tested system; adjust names, IPs, and MAC addresses on other machines.

CLIENT_NS=dq_wire_client
SERVER_NS=dq_wire_server

CLIENT_IF=enp1s0f0np0
SERVER_IF=enp1s0f1np1

CLIENT_IP=10.250.0.1
SERVER_IP=10.250.0.2

CLIENT_MAC=4c:bb:47:2a:ea:ed
SERVER_MAC=4c:bb:47:2a:ea:ee

MTU=9082

Create namespaces and pin routes to the physical interfaces:

ip netns delete "$CLIENT_NS" >/dev/null 2>&1 || true
ip netns delete "$SERVER_NS" >/dev/null 2>&1 || true

ip addr flush dev "$CLIENT_IF" || true
ip addr flush dev "$SERVER_IF" || true

ip netns add "$CLIENT_NS"
ip netns add "$SERVER_NS"

ip link set "$CLIENT_IF" netns "$CLIENT_NS"
ip link set "$SERVER_IF" netns "$SERVER_NS"

ip -n "$CLIENT_NS" addr add "$CLIENT_IP/24" dev "$CLIENT_IF"
ip -n "$SERVER_NS" addr add "$SERVER_IP/24" dev "$SERVER_IF"

ip -n "$CLIENT_NS" link set lo up
ip -n "$SERVER_NS" link set lo up
ip -n "$CLIENT_NS" link set "$CLIENT_IF" mtu "$MTU" up
ip -n "$SERVER_NS" link set "$SERVER_IF" mtu "$MTU" up

ip -n "$CLIENT_NS" route add "$SERVER_IP/32" dev "$CLIENT_IF"
ip -n "$SERVER_NS" route add "$CLIENT_IP/32" dev "$SERVER_IF"

ip -n "$CLIENT_NS" neigh replace "$SERVER_IP" \
  lladdr "$SERVER_MAC" dev "$CLIENT_IF" nud permanent
ip -n "$SERVER_NS" neigh replace "$CLIENT_IP" \
  lladdr "$CLIENT_MAC" dev "$SERVER_IF" nud permanent

Verify the route and a short control packet:

ip -n "$CLIENT_NS" route get "$SERVER_IP" from "$CLIENT_IP"
ip -n "$SERVER_NS" route get "$CLIENT_IP" from "$SERVER_IP"
ip netns exec "$CLIENT_NS" ping -c 1 -W 1 "$SERVER_IP"

The route output should name the namespace interface, not lo.

RDMA device visibility

On most RoCE setups, the RDMA device follows the netdev/GID association used by the namespace. If ibv_devinfo or rdma link show inside a namespace cannot see the expected device, move the matching RDMA device into the namespace with rdma dev set <rdma_device> netns <namespace>, or run the RDMA benchmark in the host namespace and still verify the same physical counters.

Prove the pair hits the wire

Capture directional PHY counters before and after a short transfer. For one-way client-to-server traffic:

ip netns exec "$CLIENT_NS" ethtool -S "$CLIENT_IF" | \
  grep -E 'tx_packets_phy|tx_bytes_phy|tx_vport_unicast'
ip netns exec "$SERVER_NS" ethtool -S "$SERVER_IF" | \
  grep -E 'rx_packets_phy|rx_bytes_phy|rx_vport_unicast'

Use iperf3 as a quick proof before running DAQIRI:

ip netns exec "$SERVER_NS" iperf3 -s -B "$SERVER_IP" -1 &
sleep 1
ip netns exec "$CLIENT_NS" iperf3 -c "$SERVER_IP" -B "$CLIENT_IP" -t 2 -P 1
wait

Then check the counters again. Treat the result as on-wire only when the client tx_packets_phy and server rx_packets_phy counters increase by matching packet counts. If only vport counters move, pick a different port pair.

Run the Linux socket benchmark

The shipped configs run both endpoints on 127.0.0.1 and are useful for a smoke test:

./build-socket-rdma/examples/daqiri_bench_socket \
  examples/daqiri_bench_socket_udp_tx_rx.yaml \
  --seconds 10 --mode both

./build-socket-rdma/examples/daqiri_bench_socket \
  examples/daqiri_bench_socket_tcp_tx_rx.yaml \
  --seconds 10 --mode both

For an on-wire namespace test, use separate server and client YAML files. The important fields are the protocol, namespace IPs, server port, max_payload_size, memory-region buf_size, and benchmark message_size.

Server-side UDP template:

%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "socket"
    protocol: "udp"
    master_core: 3
    debug: false
    log_level: "info"
    memory_regions:
    - name: "DATA_SOCKET_SERVER"
      kind: "host"
      affinity: 0
      num_bufs: 1024
      buf_size: 65507
    interfaces:
    - name: udp_server
      address: 10.250.0.2
      socket_config:
        mode: server
        local_ip: 10.250.0.2
        local_port: 5021
        max_payload_size: 65535
      rx:
        queues:
        - name: "RX_Queue"
          id: 0
          cpu_core: 8
          batch_size: 1
          memory_regions: ["DATA_SOCKET_SERVER"]
      tx:
        queues:
        - name: "TX_Queue"
          id: 0
          cpu_core: 7
          batch_size: 1
          memory_regions: ["DATA_SOCKET_SERVER"]

socket_bench_server:
  server: true
  send: false
  receive: true
  iterations: 1000000000
  message_size: 65507
  server_address: 10.250.0.2
  client_address: 10.250.0.1
  server_port: 5021

Client-side UDP template:

%YAML 1.2
---
daqiri:
  cfg:
    version: 1
    stream_type: "socket"
    protocol: "udp"
    master_core: 3
    debug: false
    log_level: "info"
    memory_regions:
    - name: "DATA_SOCKET_CLIENT"
      kind: "host"
      affinity: 0
      num_bufs: 1024
      buf_size: 65507
    interfaces:
    - name: udp_client
      address: 10.250.0.1
      socket_config:
        mode: client
        local_ip: 10.250.0.1
        local_port: 5121
        remote_ip: 10.250.0.2
        remote_port: 5021
        max_payload_size: 65535
      rx:
        queues:
        - name: "RX_Queue"
          id: 0
          cpu_core: 8
          batch_size: 1
          memory_regions: ["DATA_SOCKET_CLIENT"]
      tx:
        queues:
        - name: "TX_Queue"
          id: 0
          cpu_core: 7
          batch_size: 1
          memory_regions: ["DATA_SOCKET_CLIENT"]

socket_bench_client:
  server: false
  send: true
  receive: false
  iterations: 1000000000
  message_size: 65507
  server_address: 10.250.0.2
  client_address: 10.250.0.1
  server_port: 5021

For TCP, change protocol: "udp" to protocol: "tcp" in both files. For UDP, keep message_size at or below 65507.

Run the server and client in their namespaces:

export LD_LIBRARY_PATH=/work/build-socket-rdma/src:${LD_LIBRARY_PATH:-}
BIN=/work/build-socket-rdma/examples/daqiri_bench_socket

ip netns exec "$SERVER_NS" env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  "$BIN" /tmp/socket-server.yaml --seconds 11 --mode server &

sleep 1

ip netns exec "$CLIENT_NS" env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  "$BIN" /tmp/socket-client.yaml --seconds 10 --mode client

wait

For a four-process run, create four server/client YAML pairs with unique server ports such as 5021, 5022, 5023, and 5024, and unique client local ports such as 5121, 5122, 5123, and 5124.

Run the RDMA RoCE benchmark

Start from examples/daqiri_bench_rdma_tx_rx.yaml or examples/daqiri_bench_rdma_tx_rx_spark.yaml. The full config can run both endpoints in one process:

./build-socket-rdma/examples/daqiri_bench_rdma \
  examples/daqiri_bench_rdma_tx_rx_spark.yaml \
  --seconds 10 --mode both

For namespace testing, split the file by role just as in the Linux socket test:

  • The server YAML keeps the server memory regions, the server interface with socket_config.mode: server, and rdma_bench_server.
  • The client YAML keeps the client memory regions, the client interface with socket_config.mode: client, and rdma_bench_client.
  • Both files use stream_type: "socket" and protocol: "roce".
  • rdma_bench_client.client_address should be the client namespace IP.

Run the split RDMA test with the same namespace pair:

export LD_LIBRARY_PATH=/work/build-socket-rdma/src:${LD_LIBRARY_PATH:-}
BIN=/work/build-socket-rdma/examples/daqiri_bench_rdma

ip netns exec "$SERVER_NS" env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  "$BIN" /tmp/rdma-server.yaml --seconds 11 --mode server &

sleep 1

ip netns exec "$CLIENT_NS" env LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  "$BIN" /tmp/rdma-client.yaml --seconds 10 --mode client

wait

Use ib_send_bw or ib_write_bw in the same namespaces as a comparison baseline, and monitor mlnx_perf or ethtool -S on the same directional interfaces.

Example Spark socket results

The following DAQIRI socket matrix was run on the verified physical path enp1s0f0np0 -> enp1s0f1np1 with four client/server process pairs:

Protocol Message size App TX App RX Loss Client tx_packets_phy Server rx_packets_phy
TCP 1000 10.93 Gb/s 10.93 Gb/s 0.00% 1,513,047 1,513,047
TCP 8000 11.20 Gb/s 11.20 Gb/s 0.00% 1,550,052 1,550,052
TCP 1 MiB 11.67 Gb/s 11.67 Gb/s 0.00% 1,615,399 1,615,399
UDP 1000 12.28 Gb/s 11.68 Gb/s 4.88% 15,350,463 15,350,463
UDP 8000 12.93 Gb/s 10.10 Gb/s 21.91% 2,020,461 2,020,461
UDP 65507 12.84 Gb/s 12.41 Gb/s 3.34% 1,960,392 1,960,392

UDP 1 MiB is intentionally skipped because Linux UDP payloads above 65507 bytes require fragmentation or segmentation behavior outside the benchmark's supported payload model.

Restore host networking

After tests, move interfaces back to the host and restore the usual IPs. Adjust names and addresses for the target machine:

for ns in "$CLIENT_NS" "$SERVER_NS"; do
  ip netns exec "$ns" ip link set "$CLIENT_IF" netns 1 >/dev/null 2>&1 || true
  ip netns exec "$ns" ip link set "$SERVER_IF" netns 1 >/dev/null 2>&1 || true
done

ip netns delete "$CLIENT_NS" >/dev/null 2>&1 || true
ip netns delete "$SERVER_NS" >/dev/null 2>&1 || true

for ifc in enp1s0f0np0 enp1s0f1np1 enP2p1s0f0np0 enP2p1s0f1np1; do
  ip addr flush dev "$ifc" >/dev/null 2>&1 || true
  ip link set dev "$ifc" mtu 9082 up >/dev/null 2>&1 || true
done

Loopback disable knobs

If namespace isolation still increments only vport counters, check whether the platform exposes loopback control:

ethtool --show-priv-flags <interface>
ethtool --set-priv-flags <interface> local_lb off

mlxconfig -d <device> q | grep FORCE_LOOPBACK_DISABLE
mlxconfig -d <device> set FORCE_LOOPBACK_DISABLE=1

Treat firmware settings as maintenance-window changes: query first, set only with the proper Mellanox tooling available, then reset or reboot as required and rerun the same rx_packets_phy proof.


Previous: Benchmarking
Next: Raw Ethernet Benchmarking