NVIDIA Open Source · Data Acquisition

DAQIRI: Command the Data Deluge at the Source

DAQIRI (Data Acquisition for Integrated Real-time Instruments) connects high-bandwidth streaming sensors directly to the NVIDIA compute ecosystem. By abstracting zero-copy data movement from sensor to GPU, DAQIRI puts scalable, real-time AI, signal processing, and scientific computing within reach of the next generation of instruments.

⚠️

The library is undergoing large improvements as we aim to better support it as an NVIDIA product. API breakages may be more frequent until version 1.0.

Sensor Bandwidth: Gbps – Tbps+
Sensor → GPU: Zero-Copy
Protocol: UDP, RoCE
Language: C++
Scalable: Multi-Sensor
Time to Deployment: Minutes
License: Apache 2.0
Figure: DAQIRI, sensor connected to GPU infrastructure

Closing the Gap Between Sensor and GPU

Scientific and industrial instruments generate data that is richest at the source — before it is filtered, decimated, or summarized. DAQIRI places NVIDIA GPU hardware directly in that data path, forging a tight bond between upstream sensors, their data converters, and the NVIDIA compute ecosystem. The result is a new foundation for developers: the ability to work with instrument data in its rawest form, at wire speed, and to build a new class of autonomous experiments where AI can observe phenomena directly at the source, augment human analysis, and steer experiments in real time. Streaming Ethernet data in, GPU tensor out.

AI Native DAQ Architecture

Scalable, High Throughput

Sustains hundreds of gigabits per second given suitable hardware and CPU/NUMA tuning. Direct access to NIC ring buffers bypasses the kernel network stack, so the added receive latency is dominated by PCIe transit time.
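To put "hundreds of gigabits per second" in per-packet terms, the sketch below estimates the packet rate and the average time budget per packet for a given link speed and frame size. It is a back-of-the-envelope calculation, not a DAQIRI API: the function names are hypothetical, and the 24 bytes of per-frame wire overhead assume standard Ethernet framing.

```cpp
#include <cstdint>

// Standard Ethernet per-frame overhead on the wire, in bytes:
// 7 preamble + 1 start-of-frame delimiter + 4 FCS + 12 inter-frame gap.
constexpr double kWireOverheadBytes = 7 + 1 + 4 + 12;

// Packets per second a link can carry at line rate.
// frame_bytes counts Ethernet header through payload, excluding the FCS.
constexpr double packets_per_second(double link_gbps,
                                    std::uint32_t frame_bytes) {
  return (link_gbps * 1e9) / ((frame_bytes + kWireOverheadBytes) * 8.0);
}

// Average time available to handle one packet, in nanoseconds.
constexpr double ns_per_packet(double link_gbps, std::uint32_t frame_bytes) {
  return 1e9 / packets_per_second(link_gbps, frame_bytes);
}
```

For example, `ns_per_packet(100.0, 1064)` comes to about 87 ns, which corresponds to roughly 11.5 million packets per second on a 100 Gbps link. Budgets this small are why per-packet work is batched and kept off the kernel network stack.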

🚀

GPUDirect Zero-Copy

Two GPU receive modes: Header-data split (headers to CPU, payload to GPU — recommended) and Batched GPU (entire packets to GPU for maximum bandwidth).

🔀

Hardware Flow Steering

Route packets to specific queues by UDP port, IPv4 payload length, or custom flex items — all in NIC silicon, before any software runs.

🔗

RDMA over Converged Ethernet

Run RDMA READ, WRITE, and SEND over standard Ethernet via RoCE — no specialized InfiniBand fabric required. The same libibverbs API also supports InfiniBand for environments where it is available.

📄

YAML-Driven Configuration

Define memory regions, NIC interfaces, TX/RX queues, and flow rules in a single YAML file — or build the same config in C++ code. Switch backends, memory kinds, and buffer sizes without recompiling.

📦

Containerized Deployment

A ready-to-run container bundles all userspace dependencies including a dmabuf-patched DPDK — no host-side dependency setup, no peermem kernel module. From docker pull to running benchmarks in minutes.

Build & Run in Minutes

Requires a ConnectX-6 Dx+ NIC, Linux (kernel 5.4+), and the CUDA Toolkit.

Full Guide →
1

Install Prerequisites

Install the mlx5/InfiniBand drivers with peermem support (inbox on Ubuntu with kernel ≥5.4 and <6.8, or OFED from DOCA-Host 2.8+), then install the CUDA Toolkit.

2

Build from Source

Select backends with DAQIRI_MGR, a space-separated list. The commands on this page build the dpdk, socket, and rdma backends.

# Configure, build, install
cmake -S . -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DDAQIRI_BUILD_PYTHON=OFF \
  -DDAQIRI_MGR="dpdk socket rdma"
cmake --build build -j
cmake --install build --prefix /opt/daqiri
3

Or Build the Container

The Dockerfile builds DPDK from source with dmabuf patches — no peermem needed inside the container. Set BASE_IMAGE=torch to build on top of NGC PyTorch for Torch / TensorRT inference workflows.

BASE_TARGET=dpdk \
  DAQIRI_MGR="dpdk socket rdma" \
  scripts/build-container.sh
4

Tune the System

Isolate CPU cores, enable hugepages, configure NUMA affinity. Run the diagnostic script:

python3 python/tune_system.py
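If you want to check hugepage availability from your own application at startup, one option is to read the counters the Linux kernel exposes in /proc/meminfo. The sketch below is an illustration only and is independent of python/tune_system.py; `HugepageInfo` and `parse_hugepages` are hypothetical names, not part of DAQIRI.

```cpp
#include <istream>
#include <sstream>
#include <string>

struct HugepageInfo {
  long total = -1;    // HugePages_Total
  long free = -1;     // HugePages_Free
  long size_kb = -1;  // Hugepagesize, in kB
};

// Parse the hugepage counters from text in /proc/meminfo format.
// Fields that are not present keep their default of -1.
inline HugepageInfo parse_hugepages(std::istream& in) {
  HugepageInfo info;
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string key;
    long value = 0;
    if (!(fields >> key >> value)) continue;  // skip non-numeric lines
    if (key == "HugePages_Total:") info.total = value;
    else if (key == "HugePages_Free:") info.free = value;
    else if (key == "Hugepagesize:") info.size_kb = value;
  }
  return info;
}
```

In an application, pass a `std::ifstream` opened on /proc/meminfo and compare `free * size_kb` against the hugepage memory your configuration needs.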
5

Run a Benchmark

Edit the YAML to match your hardware (PCIe BDF, CPU cores, IPs), then:

./build/examples/daqiri_bench_raw_gpudirect \
  examples/daqiri_bench_raw_tx_rx.yaml \
  --seconds 10
Initialize & Receive Packets · C++
#include <daqiri/daqiri.h>

// Init from YAML config
daqiri::daqiri_init("config.yaml");

// Non-blocking burst receive. port_id and queue_id
// identify an RX queue defined in the YAML config.
daqiri::BurstParams* burst = nullptr;
auto s = daqiri::get_rx_burst(
    &burst, port_id, queue_id);

if (s == daqiri::Status::SUCCESS) {
  int n = daqiri::get_num_packets(burst);
  for (int i = 0; i < n; i++) {
    void* p = daqiri::get_packet_ptr(
        burst, i);
    // process p ...
  }
  daqiri::free_all_packets_and_burst_rx(
      burst);
}
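The `// process p ...` step usually begins by reading the packet headers. As an illustration only, the sketch below extracts the UDP ports and payload location from a raw Ethernet/IPv4/UDP frame in CPU-readable memory. `UdpInfo`, `read_be16`, and `parse_udp` are hypothetical helpers, not part of DAQIRI, and the code assumes an untagged Ethernet II frame (no VLAN tag).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

struct UdpInfo {
  std::uint16_t src_port;
  std::uint16_t dst_port;
  std::size_t payload_offset;  // bytes from start of frame to UDP payload
  std::size_t payload_length;  // UDP length field minus the 8-byte header
};

// Read a big-endian (network order) 16-bit value byte by byte,
// which avoids unaligned loads and host-endianness assumptions.
inline std::uint16_t read_be16(const std::uint8_t* p) {
  return static_cast<std::uint16_t>((p[0] << 8) | p[1]);
}

// Returns std::nullopt unless the frame is untagged Ethernet II
// carrying IPv4 + UDP and is long enough to hold all three headers.
inline std::optional<UdpInfo> parse_udp(const void* pkt, std::size_t len) {
  constexpr std::size_t kEthLen = 14, kUdpLen = 8;
  const auto* b = static_cast<const std::uint8_t*>(pkt);
  if (len < kEthLen + 20 + kUdpLen) return std::nullopt;
  if (read_be16(b + 12) != 0x0800) return std::nullopt;  // EtherType: IPv4
  const std::uint8_t* ip = b + kEthLen;
  if ((ip[0] >> 4) != 4) return std::nullopt;            // IP version
  const std::size_t ihl = (ip[0] & 0x0F) * 4u;           // IPv4 header bytes
  if (ihl < 20 || len < kEthLen + ihl + kUdpLen) return std::nullopt;
  if (ip[9] != 17) return std::nullopt;                  // protocol: UDP
  const std::uint8_t* udp = ip + ihl;
  const std::uint16_t udp_len = read_be16(udp + 4);
  if (udp_len < kUdpLen) return std::nullopt;
  return UdpInfo{read_be16(udp), read_be16(udp + 2),
                 kEthLen + ihl + kUdpLen, udp_len - kUdpLen};
}
```

This only applies to packets whose headers are in CPU memory. In header-data split mode, the payload segment is a GPU pointer and must not be dereferenced on the host.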
Header-Data Split (GPU payload) · C++
// burst and n come from a successful get_rx_burst()
// Seg 0 = headers (CPU)
// Seg 1 = payload (GPU)
for (int i = 0; i < n; i++) {
  void* hdr =
    daqiri::get_segment_packet_ptr(
        burst, 0, i);
  void* pay =
    daqiri::get_segment_packet_ptr(
        burst, 1, i); // GPU ptr
  uint32_t hlen =
    daqiri::get_segment_packet_length(
        burst, 0, i);
  uint32_t plen =
    daqiri::get_segment_packet_length(
        burst, 1, i);
  // pay is already on GPU — no copy
}

Examples

Benchmark executables and YAML configs included in the examples/ directory.

Browse examples/ →
Transmit a Packet Burst · C++
auto burst =
  daqiri::create_tx_burst_params();
daqiri::set_header(burst, port_id,
    queue_id, batch_size, num_segs);
daqiri::get_tx_packet_burst(burst);

for (int i = 0; i < batch_size; i++) {
  daqiri::set_eth_header(burst, i, dst_mac);
  daqiri::set_ipv4_header(burst, i,
      ip_len, IPPROTO_UDP, src_ip, dst_ip);
  daqiri::set_udp_header(burst, i,
      udp_len, src_port, dst_port);
  daqiri::set_udp_payload(burst, i,
      payload_ptr, payload_size);
}
daqiri::send_tx_burst(burst);
GPU Packet Processing · C++/CUDA
// Batched GPU: packets arrive in
// CUDA-addressable buffers. Reordering
// is configured with rx.reorder_configs.
__global__ void noop_packet_kernel(void* pkt) {
  (void)pkt;
}

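if (daqiri::get_num_packets(burst) > 0) {
  void* pkt =
    daqiri::get_packet_ptr(burst, 0);
  noop_packet_kernel<<<1, 1, 0, stream>>>(pkt);
  // The launch is asynchronous: wait for the kernel
  // before releasing the buffers it reads.
  cudaStreamSynchronize(stream);
}

daqiri::free_all_packets_and_burst_rx(
  burst);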
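Because consecutive packets of a burst can land in non-contiguous buffers, downstream code often gathers their payloads into one contiguous block before processing. The sketch below is a CPU-side model of that gather step, shown only to make the idea concrete. It is not DAQIRI's reorder implementation, which runs on the GPU, and the `Segment` and `gather` names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// One packet payload somewhere in memory.
struct Segment {
  const std::uint8_t* data;
  std::size_t length;
};

// Copy each payload, in the order given, into one contiguous buffer.
inline std::vector<std::uint8_t> gather(const std::vector<Segment>& segments) {
  std::size_t total = 0;
  for (const auto& s : segments) total += s.length;

  std::vector<std::uint8_t> out(total);
  std::size_t offset = 0;
  for (const auto& s : segments) {
    if (s.length != 0) std::memcpy(out.data() + offset, s.data, s.length);
    offset += s.length;
  }
  return out;
}
```

A GPU version would typically assign one thread block per packet and write to a precomputed offset, so the copies run in parallel rather than in a loop.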
Header-Data Split Config · YAML
daqiri:
  cfg:
    version: 1
    manager: "dpdk"
    master_core: 3
    memory_regions:
    - name: "RX_CPU"
      kind: "huge"
      affinity: 0
      num_bufs: 51200
      buf_size: 64    # headers (~42 B)
    - name: "RX_GPU"
      kind: "device"
      affinity: 0
      num_bufs: 51200
      buf_size: 1000  # payload
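As a sanity check on these buffer sizes, the sketch below works through the arithmetic. It assumes, and this is not stated in the config itself, that each frame is split after its first 64 bytes and that frames are untagged Ethernet II with a 20-byte IPv4 header. The constant names are illustrative and not part of DAQIRI.

```cpp
#include <cstdint>

// Standard header sizes, in bytes (no VLAN tag, no IPv4 options).
constexpr std::uint32_t kEthHeader = 14;
constexpr std::uint32_t kIpv4Header = 20;
constexpr std::uint32_t kUdpHeader = 8;
constexpr std::uint32_t kNetHeaders = kEthHeader + kIpv4Header + kUdpHeader;

// Values taken from the memory_regions in the config above.
constexpr std::uint32_t kHeaderBufSize = 64;     // RX_CPU buf_size
constexpr std::uint32_t kPayloadBufSize = 1000;  // RX_GPU buf_size
constexpr std::uint64_t kNumBufs = 51200;        // num_bufs in both regions

// Bytes left in the CPU segment after the network headers, for example
// for an application-level header (assumed use).
constexpr std::uint32_t kAppHeaderRoom = kHeaderBufSize - kNetHeaders;

// Largest frame one header buffer plus one payload buffer can hold,
// and the IPv4 total-length field of such a frame.
constexpr std::uint32_t kMaxFrame = kHeaderBufSize + kPayloadBufSize;
constexpr std::uint32_t kIpv4TotalLen = kMaxFrame - kEthHeader;

// Total memory reserved by each region.
constexpr std::uint64_t kCpuRegionBytes = kNumBufs * kHeaderBufSize;
constexpr std::uint64_t kGpuRegionBytes = kNumBufs * kPayloadBufSize;

static_assert(kNetHeaders == 42, "Ethernet + IPv4 + UDP");
static_assert(kAppHeaderRoom == 22, "spare bytes in the 64-byte buffer");
static_assert(kIpv4TotalLen == 1050, "IPv4 length of a 1064-byte frame");
static_assert(kCpuRegionBytes == 3276800, "about 3.1 MiB of hugepages");
static_assert(kGpuRegionBytes == 51200000, "about 48.8 MiB of device memory");
```

Under these assumptions a full frame has an IPv4 total length of 1050 bytes, the same value the flow-steering example on this page matches with ipv4_len: 1050.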
Hardware Flow Steering · YAML
rx:
  flow_isolation: true
  queues:
  - name: "rx_q_0"
    id: 0
    cpu_core: 9
    batch_size: 10240
    memory_regions:
      - "RX_CPU"
      - "RX_GPU"
  flows:
  - name: "flow_0"
    id: 0
    action: {type: queue, id: 0}
    match:
      udp_src: 4096
      udp_dst: 4096
      ipv4_len: 1050
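Conceptually, each flow rule is a predicate on packet header fields plus an action. The sketch below is a software model of that matching logic, included only to make the rule's semantics concrete. In DAQIRI the matching happens in NIC hardware, and `PacketFields`, `FlowRule`, `matches`, and `steer` are hypothetical names. The model assumes a rule matches only when every field it specifies matches, and that a packet matching no rule is not delivered to any DAQIRI queue when flow isolation is enabled.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Header fields the example rule inspects.
struct PacketFields {
  std::uint16_t udp_src;
  std::uint16_t udp_dst;
  std::uint16_t ipv4_len;
};

// Software mirror of one entry under rx.flows.
// An unset optional means "match any value".
struct FlowRule {
  std::optional<std::uint16_t> udp_src;
  std::optional<std::uint16_t> udp_dst;
  std::optional<std::uint16_t> ipv4_len;
  int queue_id;  // action: {type: queue, id: ...}
};

inline bool matches(const FlowRule& r, const PacketFields& p) {
  return (!r.udp_src || *r.udp_src == p.udp_src) &&
         (!r.udp_dst || *r.udp_dst == p.udp_dst) &&
         (!r.ipv4_len || *r.ipv4_len == p.ipv4_len);
}

// Returns the queue of the first matching rule, or std::nullopt when
// no rule matches (assumed: the packet reaches no DAQIRI queue).
inline std::optional<int> steer(const std::vector<FlowRule>& rules,
                                const PacketFields& p) {
  for (const auto& r : rules) {
    if (matches(r, p)) return r.queue_id;
  }
  return std::nullopt;
}
```

With a single rule mirroring flow_0 above, a packet with UDP ports 4096/4096 and an IPv4 length of 1050 is steered to queue 0, while a packet that differs in any of the three fields is not.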
DPDK Benchmarks · bash
# Build with examples
cmake -S . -B build \
  -DDAQIRI_BUILD_EXAMPLES=ON \
  -DDAQIRI_MGR="dpdk socket rdma"
cmake --build build -j

# TX/RX throughput (10 s)
./build/examples/daqiri_bench_raw_gpudirect \
  examples/daqiri_bench_raw_tx_rx.yaml \
  --seconds 10

# Header-data split / GPUDirect
./build/examples/daqiri_bench_raw_hds \
  examples/daqiri_bench_raw_tx_rx_hds.yaml \
  --seconds 10
RDMA & Multi-Queue Benchmarks · bash
# RDMA client/server (10 s)
./build/examples/daqiri_bench_rdma \
  examples/daqiri_bench_rdma_tx_rx.yaml \
  --seconds 10 --mode both

# Software loopback (no physical link)
./build/examples/daqiri_bench_raw_gpudirect \
  examples/daqiri_bench_raw_sw_loopback.yaml \
  --seconds 10

# Multi-queue RX
./build/examples/daqiri_bench_raw_gpudirect \
  examples/daqiri_bench_raw_rx_multi_q.yaml \
  --seconds 10

Tutorials

Step-by-step guides from first build to production-grade deployment.

Getting Started →
01
Advanced Networking Background
Kernel-bypass networking and GPUDirect concepts — what they are, why they matter, and how DAQIRI builds on them.
Beginner · ~10 min
02
Requirements & Installation
Hardware (ConnectX-6 Dx+), driver setup (OFED from DOCA-Host 2.8+, or inbox drivers on Ubuntu with kernel 5.4 through 6.7), and CUDA Toolkit installation on Linux 5.4+.
Beginner · ~15 min
03
Building from Source with CMake
Configure DAQIRI_MGR, DAQIRI_BUILD_PYTHON, BUILD_SHARED_LIBS, and DAQIRI_BUILD_EXAMPLES. Build for A100/H100 (CUDA arches 80, 90).
Coming Soon
04
Container Build with Patched DPDK
Build the Docker image with build-container.sh. The container ships a dmabuf-patched DPDK, so peermem is not required.
Coming Soon
05
System Tuning for High-Performance Networking
Isolate CPU cores, configure hugepages, set NUMA affinity, and run python/tune_system.py to diagnose common configuration issues.
Intermediate · ~30 min
06
Benchmarking Examples
Run a TX/RX loopback test to validate your setup, and walk through interpreting throughput results.
Beginner · ~20 min
07
YAML Configuration Deep Dive
Memory regions (huge, device, host_pinned), RX/TX queue setup, flow steering rules, flex items, and RDMA client/server config schemas.
Intermediate · ~40 min
08
GPUDirect: Header-Data Split Pipeline
Configure a two-region memory layout, access CPU headers and GPU payloads per-packet with get_segment_packet_ptr(), and reorder scattered GPU buffers with the built-in CUDA kernel.
Coming Soon
09
RDMA Client/Server Setup
Configure the RDMA backend with RC transport, assign client and server roles across two hosts, and run daqiri_bench_rdma to validate the connection.
Coming Soon
10
Timed TX with ConnectX-7
Enable accurate_send in the TX config and use set_packet_tx_time() for PTP-synchronized, hardware-scheduled packet transmission on ConnectX-7+.
Coming Soon

News

Announcements, publications, and community updates about DAQIRI.

GitHub · 2025
DAQIRI Open-Sourced on GitHub
NVIDIA — Initial public release under Apache 2.0, featuring DPDK and RDMA backends with GPUDirect support for ConnectX-6 Dx and later NICs.
Release Note · 2025
Pre-1.0 API Stability Notice
NVIDIA — DAQIRI is undergoing large improvements to become a fully supported NVIDIA product. API breakages may occur before v1.0. Contributions welcome.
Roadmap · Coming Soon
High-Performance Networking System Tuning Guide
NVIDIA — A comprehensive guide covering CPU isolation, hugepages, and NUMA configuration tuning for DAQIRI workloads will be added to this repository.

Connect Your Sensors to the NVIDIA Ecosystem

Clone the repo, build with CMake, and start streaming sensor data directly into your GPU-accelerated pipeline today.

# Clone and build
git clone https://github.com/NVIDIA/daqiri
cmake -S daqiri -B build \
  -DBUILD_SHARED_LIBS=ON \
  -DDAQIRI_MGR="dpdk socket rdma"
cmake --build build -j