Understanding the Configuration File¶
Choosing an example config¶
Choosing the appropriate DAQIRI backend for your setup¶
DAQIRI ships three backends, selected at build time via DAQIRI_MGR (the default build enables all three). Which backend's example YAML you start from depends on your hardware and topology:
- DPDK raw — kernel-bypass raw Ethernet with GPUDirect zero-copy. Highest performance. Requires a Mellanox/ConnectX-class NVIDIA NIC;
tx_portandrx_portcan share one physical NIC for a single-host closed-loop bench, or be split across two hosts. - RDMA / RoCE — low-latency verbs over an RDMA-capable fabric. The natural choice when you have NVIDIA NICs at both endpoints of a host-to-host link.
- Kernel TCP/UDP sockets — no NIC, no privileges, no special CMake flags. Useful as a comparison baseline against DPDK and RDMA, or as a path to first results when no NVIDIA NIC is available.
If you don't have any NIC at all, the *_sw_loopback* variants of the DPDK configs need no hardware — useful for first-time build verification.
With a backend in mind, read down the questions below and stop at the first one that matches what you're trying to do. Each section names the YAML, the binary that consumes it, and any platform-specific notes.
1. I want to measure baseline throughput
Pick the backend that matches your stack (see the backend overview above), then the hardware or protocol variant.
DPDK raw — runs on daqiri_bench_raw_gpudirect.
- Generic discrete GPU (template — replace
<placeholders>) —daqiri_bench_raw_tx_rx.yaml. This is the file annotated line-by-line in the walkthrough below. - DGX Spark / GB10 (prefilled) —
daqiri_bench_raw_tx_rx_spark.yaml.kind: host_pinnedfor the integrated GPU; cores, PCIe addresses, and IPs are prefilled. See the Spark profile callout for run details. - No physical NIC available —
daqiri_bench_raw_sw_loopback.yaml.loopback: "sw", no NIC required. Useful for first-time build verification, not representative of production performance.
RDMA / RoCE — runs on daqiri_bench_rdma (use --mode {tx,rx,both}). Configs use kind: host_pinned regardless of platform.
- Generic (template — replace IPs) —
daqiri_bench_rdma_tx_rx.yaml. - DGX Spark (prefilled) —
daqiri_bench_rdma_tx_rx_spark.yaml. See the Spark profile callout for run details.
Kernel TCP/UDP sockets — runs on daqiri_bench_socket. Both bind to 127.0.0.1.
2. I have out-of-order UDP packets that need to be reordered on the GPU
DAQIRI's flagship pipeline: a CUDA kernel reads a sequence number from each packet's header and places packets at the correct offset in a GPU buffer, so a downstream consumer sees a fully ordered stream without a CPU touch. Configs run on daqiri_bench_raw_reorder_seq unless 2.4 applies. Sub-questions:
2.1 Which algorithm matches how your packets encode batches?
- "My protocol sends a fixed N packets per logical batch; the seqno identifies position within the batch" —
seq_packets_per_batch. - "My protocol identifies the batch index in the seqno; packets-per-batch is fixed at the protocol level" —
seq_batch_number.
2.2 Where should the reorder run?
- GPU kernel (default, recommended) —
reorder_type: "gpu". - CPU (throughput-bounded; comparison/baseline path) —
reorder_type: "cpu".
2.3 Self-contained, or do you have a TX peer?
- TX+RX — closed-loop in one process.
- RX-only — you'll generate traffic separately. A standalone run of any
raw_rx_*config exits cleanly with0packets if no traffic arrives — that is not a bug; you need a TX peer.
2.4 Do you also need an in-kernel payload type conversion?
- No — pick a leaf from the table below.
- Yes —
daqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml(runs ondaqiri_bench_raw_reorder_quantize, notdaqiri_bench_raw_reorder_seq). Combinesseq_batch_numberreorder with an in-kernel payload type conversion; thedata_typesblock sets the input and output types (the example uses int4 → fp32). Pick this when wire format and compute format differ.
Concrete leaves (without conversion):
| YAML | Algorithm | Kernel | Direction |
|---|---|---|---|
daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml |
seq_packets_per_batch (1024) |
GPU | TX+RX |
daqiri_bench_raw_tx_rx_reorder_seq_1024_cpu.yaml |
seq_packets_per_batch (1024) |
CPU | TX+RX |
daqiri_bench_raw_rx_reorder_seq_ppb.yaml |
seq_packets_per_batch (128) |
GPU | RX-only |
daqiri_bench_raw_rx_reorder_seq_batch.yaml |
seq_batch_number |
GPU | RX-only |
daqiri_bench_raw_sw_loopback_reorder_seq_1024.yaml |
seq_packets_per_batch (1024) |
CPU | TX+RX, no NIC |
Requires: DPDK build + Mellanox-class NIC (or the SW-loopback variant for first-time validation).
A diff-style walkthrough of daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml appears below.
3. I need to parse small per-packet metadata on the CPU while keeping payload on the GPU
daqiri_bench_raw_tx_rx_hds.yaml(runs ondaqiri_bench_raw_hds).
Header-data split: segment 0 (CPU) holds the header, segment 1 (GPU) holds the payload via GPUDirect zero-copy. Pick this when the CPU needs to read small per-packet fields without ever touching the payload.
Requires: DPDK build + Mellanox-class NIC.
A diff-style walkthrough of this config appears below.
4. I need flow-based load balancing across multiple RX queues
daqiri_bench_raw_rx_multi_q.yaml(runs ondaqiri_bench_raw_gpudirect).
RX-only by design — drive traffic from a separate peer. Demonstrates flow-rule-based routing across multiple RX queues, each pinned to its own CPU core.
Requires: DPDK build + Mellanox-class NIC + a separate TX traffic source.
5. I need to record packet data to disk
Sub-question: which output format?
5.1 Wireshark- / tcpdump-compatible PCAP — runs on daqiri_example_pcap_writer. Default; works on any filesystem. Run shape: daqiri_example_pcap_writer <yaml> <output.pcap> [--tx] (omit --tx for an RX-only tcpdump-style capture).
- Hardware loopback —
daqiri_example_pcap_writer_tx_rx.yaml. - No physical NIC available —
daqiri_example_pcap_writer_sw_loopback.yaml.
Requires: DPDK build. No special CMake flag.
5.2 Zero-copy GPU → NVMe writes (advanced) — runs on daqiri_example_gds_write. Pick this only if the GPU-to-disk zero-copy path is the specific subject of investigation; otherwise pick PCAP (5.1).
- Hardware loopback —
daqiri_example_gds_write_tx_rx.yaml. - No physical NIC available —
daqiri_example_gds_write_sw_loopback.yaml.
Requires: built with -DDAQIRI_ENABLE_GDS=ON, NVMe-backed storage, working cuFile / nvidia_fs stack, gdscheck.py -p reports NVMe : Supported.
Annotated walkthrough¶
This section walks through three YAML configurations: the base TX+RX template, followed by diff-style snippets for header-data split (HDS) and GPU packet reordering. Click on the icons to expand explanations for each annotated line.
Annotations are prefixed with a category icon when applicable:
- System-specific — must be changed for your hardware
- Payload-dependent — adjust based on your application's packet format and throughput needs
In each code block, the lines you're most likely to tune are highlighted: system-specific addresses, cores, and MAC/IPs in the base walkthrough; feature-defining values (split boundaries, batch sizes, sequence-number positions) in the HDS and reorder diff snippets below.
Base TX+RX config¶
The annotated example below is daqiri_bench_raw_tx_rx.yaml.
daqiri: # (1)!
cfg:
version: 1
stream_type: "raw" # (2)!
master_core: 3 # (3)!
debug: false
log_level: "info"
loopback: "" # (4)!
memory_regions: # (5)!
- name: "Data_TX_GPU"
kind: "device" # (6)!
affinity: 0 # (7)!
num_bufs: 51200 # (8)!
buf_size: 8064 # (9)!
- name: "Data_RX_GPU"
kind: "device"
affinity: 0
num_bufs: 51200
buf_size: 8064
interfaces: # (10)!
- name: "tx_port"
address: <0000:00:00.0> # (11)!
tx: # (12)!
queues: # (13)!
- name: "tx_q_0"
id: 0
batch_size: 10240 # (14)!
cpu_core: 11 # (15)!
memory_regions: # (16)!
- "Data_TX_GPU"
offloads: # (17)!
- "tx_eth_src"
- name: "rx_port"
address: <0000:00:00.0> # (18)!
rx:
flow_isolation: true # (19)!
queues:
- name: "rq_q_0"
id: 0
cpu_core: 9
batch_size: 10240
memory_regions:
- "Data_RX_GPU"
flows: # (20)!
- name: "flow_0"
id: 0 # (21)!
action: # (22)!
type: queue
id: 0
match: # (23)!
udp_src: 4096
udp_dst: 4096
bench_rx: # (24)!
interface_name: "rx_port"
bench_tx: # (25)!
interface_name: "tx_port"
batch_size: 10240
payload_size: 8000
header_size: 64
eth_dst_addr: <00:00:00:00:00:00>
ip_src_addr: <1.2.3.4>
ip_dst_addr: <5.6.7.8>
udp_src_port: 4096
udp_dst_port: 4096
- The
daqirisection configures the DAQIRI library, which is responsible for setting up the NIC. It is passed todaqiri_init(...)during application startup. Within this section,name:fields on interfaces, queues, flows, and memory regions are used only for logging — pick any descriptive string. stream_type·string· required — High-level transport family selected for this config. Supported:"raw"(DPDK raw Ethernet, used here),"socket"(kernel sockets and RDMA/RoCE; the specific protocol is then set via a separateprotocol:field). The actual backend implementation is chosen at build time viaDAQIRI_MGR—stream_typeonly picks among the backends you built.-
master_core·integer (CPU core ID)· required — Core used for DAQIRI setup. Does not need to be isolated; recommended to differ from thecpu_corefields below that poll the NIC. loopback·string· default:""— Loopback mode. Supported:""(no loopback; use the physical NIC),"sw"(software loopback — no NIC required, used by the*_sw_loopback*configs for first-time build verification).- The
memory_regionssection lists where the NIC will write/read data from/to when bypassing the OS kernel. Tip: when using GPU buffer regions, keeping the sum of their buffer sizes below 80% of your BAR1 size is generally a good rule of thumb. -
kind·string· required — Type of memory backing the region. Supported:device(GPU VRAM via GPUDirect — preferred on discrete GPUs),host_pinned(CPU pinned memory — required on integrated GPUs like NVIDIA GB10/DGX Spark where peer-DMA isn't available),huge(hugepages, CPU),host(CPU unpinned). See the memory regions reference. Choose based on whether packets are processed on the GPU or CPU and on the GPU class. -
affinity·integer (GPU ID / NUMA node)· required — GPU device ID whenkind: deviceorkind: host_pinned; NUMA node ID for CPU memory regions (huge,host). -
num_bufs·integer· required — Number of buffers in the region. Higher gives more time to process packets but uses more BAR1 space; too low risks NIC drops (RX) or buffering latency (TX). A good starting point is 3×–5× the queuebatch_size. For the DPDK backend,num_bufsbelow 1.5× the NIC ring size deadlocks the worker;daqiri_initauto-bumps such regions to 3× the ring (24576 with the default 8192) and logs aWARN. -
buf_size·integer (bytes)· required — Size of each buffer in the region. Should equal your maximum packet size, or smaller when chaining regions per packet (e.g. header-data split — see the HDS walkthrough below). - The
interfacessection lists the NIC interfaces that will be configured for the application. -
address·string (PCIe BDF)· required — PCIe bus address of this interface. Must be changed for your system. Bothtx_portandrx_portmay point to the same physical NIC for single-port closed-loop benches. - Each interface declares a
tx(transmitting) and/orrx(receiving) section. Include only the side you're using on that port, or both if a single port carries traffic in both directions. - The
queuessection lists per-direction queues. Queues are a core NIC concept: they handle the actual reception or transmission of packets. RX queues buffer incoming packets until the application processes them; TX queues hold outgoing packets waiting to be sent. The simplest setup uses one RX and one TX queue; using more queues allows parallel streams (each queue can be pinned to its own CPU core and memory region). -
batch_size·integer (packets)· required — Packets per burst. The RX path delivers packets to the application in batches of this size; the TX path should not send more packets than this per call. -
cpu_core·integer (CPU core ID)· required — Core that this queue uses to poll the NIC. Ideally one isolated core per queue. Must match your system's available cores. - The list of memory regions where this queue will write/read packets. Order matters: the first region is used until one buffer fills (
buf_size), then the next region is used, and so on until the packet is fully written/read. The HDS walkthrough below shows a chained example. offloads·list of string· TX queues only — Optional tasks offloaded to the NIC. Supported:tx_eth_src(the NIC inserts the Ethernet source MAC into outgoing headers). Note: IP, UDP, and Ethernet checksums/CRC are always done by the NIC and are not optional.-
address·string (PCIe BDF)· required — PCIe bus address of the RX interface. May share the BDF withtx_portfor single-port closed-loop benches, or be a different NIC for two-port setups. Must be changed for your system. flow_isolation·boolean· default:false— Whentrue, packets that don't match this interface's MAC or any rule underflowsare delegated back to the Linux kernel (no kernel bypass). Useful for letting the interface still handle ARP, ICMP, etc. while DAQIRI takes the application packets. Whenfalse, every packet hitting the interface must be processed (or dropped) by your application.- The list of flows. Flows route packets to a queue based on packet fields. If
flowsis missing, all packets are routed to the first queue. id·integer· required — Tag attached to packets that match this flow. Useful when multiple flows route to a single queue and the application needs to distinguish which rule matched.- What to do with packets that match this flow. The only currently supported action is
type: queue(send the packet to the queue with the givenid). - List of rules to match packets against. All rules must hold for a packet to match the flow. Currently supported keys:
udp_src/udp_dst(UDP source/destination port numbers, integer),ipv4_len(full IPv4 packet length in bytes, integer). Adjust to match your incoming traffic. - The
bench_rxsection is specific to the benchmark application. In this base config it only names the RX interface — the heavy lifting (memory regions, queues, flows) is in thedaqirisection above. Other DAQIRI binaries (e.g. the reorder-quantize bench) may add fields here; see those configs for details. - The
bench_txsection configures the TX side of the benchmark: packet sizes and the Ethernet/IP/UDP header fields embedded in outgoing packets. Theeth_dst_addris required whenflow_isolationis enabled on the RX interface. Theip_src_addr/ip_dst_addrare only needed when traffic is routed across subnets — for a direct cable loopback, any value works. Thepayload_size,header_size, and UDP ports should match your application's packet format.
Header-data split (HDS)¶
For applications that parse small per-packet fields on the CPU while keeping the payload on the GPU, DAQIRI supports header-data split (HDS): the NIC writes the header bytes to a CPU buffer (segment 0) and the payload to a GPU buffer (segment 1) using GPUDirect zero-copy. The packet is split at a fixed byte boundary defined by the buf_size of the first region in the queue's memory_regions: list.
The canonical HDS config is daqiri_bench_raw_tx_rx_hds.yaml. It builds on the base TX+RX config above; only the deltas are shown here.
New CPU memory regions, one per direction. Headers land here.
memory_regions:
- name: "Data_TX_CPU" # (1)!
kind: "huge"
affinity: 0
num_bufs: 51200
buf_size: 64 # (2)!
- name: "Data_TX_GPU"
kind: "device"
affinity: 0
num_bufs: 51200
buf_size: 1064
- name: "Data_RX_CPU"
kind: "huge"
affinity: 0
num_bufs: 51200
buf_size: 64
- name: "Data_RX_GPU"
kind: "device"
affinity: 0
num_bufs: 51200
buf_size: 1000 # (3)!
- New region for headers.
kind: "huge"puts buffers in CPU hugepages so the CPU can read them without touching GPU memory. PairData_TX_CPUandData_RX_CPUwith their GPU siblings to chain regions per packet. -
buf_size·integer (bytes)· required — In HDS, the first region'sbuf_sizeis the split boundary: bytes 0 tobuf_size − 1go to the CPU region, the remainder spills into the next region. Size this to match your header length exactly (64 bytes for a typical Eth+IPv4+UDP header). -
buf_size·integer (bytes)· required — Size of each GPU buffer in HDS mode: payload-only (no longer the full packet size). The packet first fills 64 bytes inData_RX_CPU, then the remaining 1000 bytes spill intoData_RX_GPU.
Chained memory_regions: per queue. Order matters: header region first, payload region second. The NIC walks the list in order, filling each region's buf_size before moving on.
tx:
queues:
- memory_regions: # (1)!
- "Data_TX_CPU"
- "Data_TX_GPU"
rx:
queues:
- memory_regions:
- "Data_RX_CPU"
- "Data_RX_GPU"
- Header region listed first. For each RX packet, 64 bytes land in
Data_RX_CPU, then the next 1000 bytes land inData_RX_GPU.
Pin the packet length in the flow rule. HDS triggers cleanly only when the full packet length is known up front, so the flow match adds ipv4_len:
-
ipv4_len·integer (bytes)· optional in the base config; recommended for HDS — Match on the full IPv4 packet length. Required for HDS so the NIC can split deterministically. Set to the fixed packet length you expect.
Match bench_tx.payload_size to the GPU region. The TX side generates 1000-byte payloads to match Data_RX_GPU.buf_size:
The HDS bench runs on daqiri_bench_raw_hds:
Packet reordering on the GPU¶
For UDP workloads where packets arrive out-of-order and need to be placed at their correct offset in a GPU buffer before the application sees them, DAQIRI provides a GPU reorder kernel: the kernel reads a sequence number from each packet and writes the packet into a dedicated landing region at the correct slot. The downstream consumer reads a fully ordered batch with no CPU touch.
The canonical reorder config is daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml (seq_packets_per_batch algorithm, GPU kernel, closed-loop TX+RX). It builds on the base TX+RX config above; only the deltas are shown here.
New Reorder_RX_GPU memory region. A large, dedicated GPU region that holds one fully reordered batch.
memory_regions:
- name: "Data_TX_CPU"
kind: "huge"
num_bufs: 16384
buf_size: 2048
- name: "Data_RX_GPU"
kind: "device"
num_bufs: 16384 # (1)!
buf_size: 2048
- name: "Reorder_RX_GPU" # (2)!
kind: "device"
affinity: 0
num_bufs: 128
# (source buf_size - payload_byte_offset) * packets_per_batch
# = (2048 - 64) * 1024 = 2,031,616 bytes per reordered batch.
buf_size: 2031616 # (3)!
-
num_bufs·integer· required — Shrunk from the base config's 51200 to 16384. Reorder works at smaller batches with smaller per-packet buffers (buf_size: 2048here vs. 8064 in the base), so the buffer pool is correspondingly smaller. - New region for the kernel output. Each buffer holds one fully reordered batch of
packets_per_batchpackets, payload-only (the header is stripped viapayload_byte_offset). -
buf_size·integer (bytes)· required — Sized as(source buf_size − payload_byte_offset) × packets_per_batch. For this config:(2048 − 64) × 1024 = 2,031,616bytes. Re-derive when you change any of the three inputs.
Match queue batch_size to packets_per_batch. The reorder kernel processes exactly one batch per invocation; the queue must hand it that many packets at once.
-
batch_size·integer (packets)· required — Must equalpackets_per_batchin the reorder config below. -
timeout_us·integer (microseconds)· default: none (waits forever) — Maximum time the queue waits for a partial batch to fill before flushing. Without it, a stalled flow can stall the reorder kernel indefinitely.
Flow id tags packets for the reorder config. The reorder block selects packets to reorder by flow ID.
id·integer· required — Flow tag attached to matching packets. Set to a non-zero value here so thereorder_configs:block below can reference it viaflow_ids:to select which packets to reorder.
The reorder_configs: block. The core of the feature — sits inside the rx: section alongside queues and flows.
reorder_configs:
- name: "rx_reorder_seq_1024"
reorder_type: "gpu" # (1)!
memory_region: "Reorder_RX_GPU" # (2)!
payload_byte_offset: 64 # (3)!
flow_ids:
- 201 # (4)!
method:
seq_packets_per_batch: # (5)!
sequence_number:
bit_offset: 512 # (6)!
bit_width: 32
packets_per_batch: 1024 # (7)!
reorder_type·string· required — Where the kernel runs. Supported:"gpu"(CUDA kernel, recommended),"cpu"(throughput-bounded; comparison baseline — seedaqiri_bench_raw_tx_rx_reorder_seq_1024_cpu.yaml).memory_region·string· required — Name of the landing region for reordered output. Must match a region defined in the top-levelmemory_regions:(here,Reorder_RX_GPU).-
payload_byte_offset·integer (bytes)· required — Number of leading bytes (typically the header) to skip when copying packets into the reorder region. The kernel copies from this offset to the end of the source buffer. - List of flow IDs whose packets feed this reorder config. Must match the
idfield of one or moreflows:entries above. method— Algorithm choice.seq_packets_per_batch(used here) groups a fixed number of packets per batch, identified by a sequence number within the batch. The alternativeseq_batch_numberencodes the batch index directly in the seqno — seedaqiri_bench_raw_tx_rx_reorder_quantize_seq_batch.yaml.-
bit_offset/bit_width·integer (bits)· required — Location and size of the sequence number within the packet. Here, the seqno starts at byte 64 (bit_offset: 512= 64 × 8) and is 32 bits wide — matching auint32at the start of the UDP payload. -
packets_per_batch·integer (packets)· required — Number of packets the kernel groups per reordered batch. Must equal the queuebatch_sizeabove.
TX-side seqno injection. The benchmark TX path writes a monotonic big-endian uint32 into the configured payload offset.
bench_tx:
batch_size: 1024
payload_size: 1000
header_size: 64
udp_src_port: 5000
udp_dst_port: 5000
sequence_number_offset: 0 # (1)!
sequence_number_start: 0
-
sequence_number_offset·integer (bytes)— Byte offset into the UDP payload where the TX path writes the monotonic seqno. Must align withsequence_number.bit_offsetin the RX reorder config (after subtracting the header size). Here the seqno is at the very start of the payload.
The reorder bench runs on daqiri_bench_raw_reorder_seq:
./build/examples/daqiri_bench_raw_reorder_seq ./build/examples/daqiri_bench_raw_tx_rx_reorder_seq_1024.yaml --seconds 10
Other reorder variants are listed under question 2 of the decision tree above: the CPU-kernel variant, the RX-only variants, and the seq_batch_number algorithm with in-kernel int4 → fp32 type conversion (runs on daqiri_bench_raw_reorder_quantize).
Previous: Benchmarking Examples