Relay BP Decoding with CUDA-Q Realtime

Note

This guide covers a C++ demonstration that must be built from source. It is not included in any distributed CUDA-Q QEC binaries.

This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP decoder using CUDA-Q’s realtime host dispatch system. The decoder runs as a CPU-launched CUDA graph (HOST_LOOP dispatch path) and can operate in three configurations:

  • CI unit test – standalone executable, no FPGA or network hardware needed

  • Emulated end-to-end test – software FPGA emulator replaces real hardware

  • FPGA end-to-end test – real FPGA connected via ConnectX RDMA/RoCE

Prerequisites

Hardware

Configuration  GPU                           ConnectX NIC               FPGA
-------------  ----------------------------  -------------------------  ------------
CI unit test   Any CUDA-capable GPU          Not required               Not required
Emulated E2E   CUDA GPU with GPUDirect RDMA  Required (loopback cable)  Not required
FPGA E2E       CUDA GPU with GPUDirect RDMA  Required                   Required

Tested platforms: DGX Spark, GB200.

Software

  • CUDA Toolkit: 12.6 or later

  • CUDA-Q SDK: pre-installed (provides libcudaq, libnvqir, nvq++)

  • nv-qldpc-decoder plugin: the proprietary nv-qldpc-decoder shared library (libcudaq-qec-nv-qldpc-decoder.so). Required at runtime for all three configurations.

Source Repositories

Repository               URL                                                        Version
-----------------------  ---------------------------------------------------------  ------------------------------------
cudaqx                   https://github.com/NVIDIA/cudaqx                           main branch (or your feature branch)
cuda-quantum (realtime)  https://github.com/NVIDIA/cuda-quantum                     Branch releases/v0.14.1
holoscan-sensor-bridge   https://github.com/nvidia-holoscan/holoscan-sensor-bridge  Tag 2.6.0-EA2

cuda-quantum provides libcudaq-realtime (the host dispatcher, ring buffer management, and dispatch kernel). holoscan-sensor-bridge provides the Hololink GpuRoceTransceiver library for RDMA transport.

Note

holoscan-sensor-bridge is only needed for the emulated and FPGA end-to-end tests. The CI unit test requires only libcudaq-realtime.

Repository Layout

Key files within cudaqx:

libs/qec/
  unittests/
    realtime/
      qec_graph_decode_test/
        test_realtime_qldpc_graph_decoding.cpp   # CI unit test
      qec_roce_decode_test/
        data/
          config_nv_qldpc_relay.yml              # Relay BP decoder config
          syndromes_nv_qldpc_relay.txt           # 100 test syndrome shots
    utils/
      hololink_qldpc_graph_decoder_bridge.cpp    # Bridge tool (RDMA <-> decoder)
      hololink_qldpc_graph_decoder_test.sh       # Orchestration script
      hololink_fpga_syndrome_playback.cpp        # Playback tool (loads syndromes)

The FPGA emulator is in the cuda-quantum repository:

cuda-quantum/realtime/
  unittests/utils/
    hololink_fpga_emulator.cpp                   # Software FPGA emulator

Building

Using the orchestration script

The orchestration script can build everything automatically:

./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
  --build \
  --hsb-dir /path/to/holoscan-sensor-bridge \
  --cuda-quantum-dir /path/to/cuda-quantum \
  --no-run

CI Unit Test

The CI unit test (test_realtime_qldpc_graph_decoding) exercises the full host dispatch decode path without any network hardware. It:

  1. Loads the Relay BP config and syndrome data from YAML/text files

  2. Creates the decoder via the decoder::get("nv-qldpc-decoder", ...) plugin API

  3. Captures a CUDA graph of the decode pipeline

  4. Wires libcudaq-realtime’s host dispatcher (HOST_LOOP) to a ring buffer

  5. Writes RPC requests into the ring buffer, lets the host dispatcher launch the CUDA graph, and verifies the returned corrections

Running

cd cudaqx/build

# The nv-qldpc-decoder plugin must be discoverable at runtime.
# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so

./libs/qec/unittests/test_realtime_qldpc_graph_decoding

Expected output:

[==========] Running 1 test from 1 test suite.
[----------] 1 test from RealtimeQLDPCGraphDecodingTest
[ RUN      ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
...
[       OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
[==========] 1 test from 1 test suite ran.
[  PASSED  ] 1 test.

Emulated End-to-End Test

The emulated test replaces the physical FPGA with a software emulator. Three processes run concurrently:

  1. Emulator – receives syndromes via the UDP control plane, sends them to the bridge via RDMA, and captures corrections

  2. Bridge – runs the host dispatcher and CUDA graph decode loop on the GPU, receiving syndromes and sending corrections via RDMA

  3. Playback – loads syndrome data into the emulator’s BRAM and triggers playback, then verifies corrections

Requirements

  • ConnectX NIC with a loopback cable connecting both ports (the emulator sends RDMA traffic out one port and the bridge receives on the other)

  • Software dependencies (DOCA, Holoscan SDK, etc.) as described in the cuda-quantum realtime build guide

  • All three tools built (bridge, playback, emulator)

Running

./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
  --emulate \
  --build \
  --setup-network \
  --hsb-dir /path/to/holoscan-sensor-bridge

The --setup-network flag configures the ConnectX interface with the appropriate IP addresses and MTU. It only needs to be run once per boot.

After the initial build and network setup, subsequent runs are faster:

./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate

FPGA End-to-End Test

The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two processes run:

  1. Bridge – same as emulated mode

  2. Playback – loads syndromes into the FPGA’s BRAM and triggers playback, then reads back corrections from the FPGA’s capture RAM to verify them

Requirements

  • FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via direct cable or switch. Bitfiles are available for supported FPGA vendors. See the cuda-quantum realtime user guide for FPGA setup instructions.

  • FPGA IP and bridge IP on the same subnet

  • ConnectX device name (e.g., mlx5_4, mlx5_5)

Running

./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
  --build \
  --setup-network \
  --device mlx5_5 \
  --bridge-ip 192.168.0.1 \
  --fpga-ip 192.168.0.2 \
  --gpu 2 \
  --page-size 512 \
  --hsb-dir /path/to/holoscan-sensor-bridge

Key parameters for FPGA mode:

Parameter    Description
-----------  ---------------------------------------------------------------
--device     ConnectX IB device name (e.g., mlx5_5)
--bridge-ip  IP address assigned to the ConnectX interface
--fpga-ip    FPGA’s IP address
--gpu        GPU device ID (choose NUMA-local GPU for lowest latency)
--page-size  Ring buffer slot size in bytes (use 512 on GB200 for alignment)
--spacing    Inter-shot spacing in microseconds

Note

Set --spacing to at least the observed per-shot decode time. If syndromes arrive faster than the decoder can process them, the input ring buffer fills up and messages are lost.
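
As a rough guide to how quickly an overrun sets in, the sketch below uses a simple steady-rate model: one syndrome arrives every --spacing microseconds, one slot is freed every decode period, and the buffer holds --num-pages slots. The function name shots_until_overflow and the 12 us decode time are illustrative assumptions, not part of the orchestration script and not measured values.

```shell
#!/usr/bin/env bash
# Estimate how many shots can be sent before the input ring buffer overflows.
# Steady-rate model (an assumption): one arrival every spacing_us, one slot
# freed every decode_us, buffer of `slots` entries.
# The backlog grows by (1/spacing - 1/decode) slots per microsecond, so the
# buffer is full after roughly slots * decode / (decode - spacing) shots.
shots_until_overflow() {
  local slots=$1 spacing_us=$2 decode_us=$3
  if [ "$decode_us" -le "$spacing_us" ]; then
    echo "never"   # decoder keeps up; the backlog does not grow
    return 0
  fi
  echo $(( slots * decode_us / (decode_us - spacing_us) ))
}

# Script defaults: 128 slots, 10 us spacing.
# 12 us is a hypothetical decode time chosen only for illustration.
shots_until_overflow 128 10 12   # prints 768
```

Under this model, a decoder that is 2 us slower than the spacing exhausts the default 128 slots after about 768 shots, or roughly 7.7 ms of playback.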

GPU Selection

For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. For example, on a GB200 system where mlx5_5 is on NUMA node 1, use --gpu 2 or --gpu 3. Check NUMA locality with:

cat /sys/class/infiniband/<device>/device/numa_node
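
To compare several NICs at once, the lookup above can be wrapped in a small helper. This is only a sketch: the function names nic_numa_node and list_nic_numa_nodes, and the optional sysfs-root argument (useful for dry runs), are hypothetical and not part of the test scripts.

```shell
#!/usr/bin/env bash
# Print the NUMA node of an InfiniBand/RoCE device by reading sysfs.
# $1: device name (e.g. mlx5_5)
# $2: sysfs root (hypothetical override; defaults to /sys/class/infiniband)
nic_numa_node() {
  local dev=$1 root=${2:-/sys/class/infiniband}
  local path="$root/$dev/device/numa_node"
  if [ ! -r "$path" ]; then
    echo "no NUMA info for $dev at $path" >&2
    return 1
  fi
  cat "$path"
}

# List every device found under the sysfs root together with its NUMA node.
list_nic_numa_nodes() {
  local root=${1:-/sys/class/infiniband} d name
  for d in "$root"/*; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    echo "$name $(nic_numa_node "$name" "$root")"
  done
}
```

Calling list_nic_numa_nodes with no arguments prints one "device node" pair per line. A value of -1 means the kernel reports no NUMA affinity for that device. The node can then be matched against the GPU affinity shown by nvidia-smi topo -m when choosing a --gpu value.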

Network Sanity Check

Before running, verify that the bridge IP is assigned to exactly one interface:

ip addr show | grep -wF 192.168.0.1

If more than one interface shows this IP, remove the duplicate assignment. The resulting routing ambiguity can cause RDMA packets to be dropped silently.
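
For scripted checks, the sketch below parses the one-line-per-address output of ip -o -4 addr show and counts the interfaces that hold exactly the bridge IP (so 192.168.0.10 is not counted as a match for 192.168.0.1). The helper name count_ifaces_with_ip is hypothetical.

```shell
#!/usr/bin/env bash
# Count distinct interfaces that carry exactly the given IPv4 address.
# Reads `ip -o -4 addr show` output on stdin, where each line looks like:
#   2: eth0    inet 192.168.0.1/24 brd 192.168.0.255 scope global eth0 ...
count_ifaces_with_ip() {
  awk -v ip="$1" '
    $3 == "inet" {
      split($4, a, "/")             # strip the /prefix-length
      if (a[1] == ip) seen[$2] = 1  # $2 is the interface name
    }
    END { n = 0; for (i in seen) n++; print n }
  '
}

# Usage (on the bridge host):
#   ip -o -4 addr show | count_ifaces_with_ip 192.168.0.1
```

A result of 1 is the expected state. A result of 0 means the address is not configured (for example, if network setup has not been run since boot), and 2 or more indicates a duplicate assignment.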

Orchestration Script Reference

hololink_qldpc_graph_decoder_test.sh [options]

Modes

Flag       Description
---------  ---------------------------------------
--emulate  Use FPGA emulator (no real FPGA needed)
(default)  FPGA mode (requires real FPGA)

Actions

Flag             Description
---------------  -------------------------------------------
--build          Build all required tools before running
--setup-network  Configure ConnectX network interfaces
--no-run         Skip running the test (useful with --build)

Build Options

Flag                    Default                             Description
----------------------  ----------------------------------  ---------------------------------------
--hsb-dir DIR           /workspaces/holoscan-sensor-bridge  holoscan-sensor-bridge source directory
--cuda-quantum-dir DIR  /workspaces/cuda-quantum            cuda-quantum source directory
--cuda-qx-dir DIR       /workspaces/cudaqx                  cudaqx source directory
--jobs N                nproc                               Parallel build jobs

Network Options

Flag                Default      Description
------------------  -----------  --------------------------
--device DEV        auto-detect  ConnectX IB device name
--bridge-ip ADDR    10.0.0.1     Bridge tool IP address
--emulator-ip ADDR  10.0.0.2     Emulator IP (emulate mode)
--fpga-ip ADDR      192.168.0.2  FPGA IP address
--mtu N             4096         MTU size

Run Options

Flag              Default        Description
----------------  -------------  ----------------------------------
--gpu N           0              GPU device ID
--timeout N       60             Timeout in seconds
--num-shots N     all available  Limit number of syndrome shots
--page-size N     384            Ring buffer slot size in bytes
--num-pages N     128            Number of ring buffer slots
--spacing N       10             Inter-shot spacing in microseconds
--no-verify       (verify)       Skip correction verification
--control-port N  8193           UDP control port for emulator
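
The --page-size and --num-pages values together determine how much memory one ring buffer occupies. As a quick worked example (the helper name ring_buffer_bytes is hypothetical, and the calculation assumes the buffer is simply --num-pages contiguous slots of --page-size bytes with no extra metadata or padding):

```shell
#!/usr/bin/env bash
# Footprint of one ring buffer, assuming num_pages contiguous slots of
# page_size bytes each (an assumption; real allocations may add padding).
ring_buffer_bytes() {
  local page_size=$1 num_pages=$2
  echo $(( page_size * num_pages ))
}

ring_buffer_bytes 384 128   # script defaults: prints 49152
ring_buffer_bytes 512 128   # 512-byte slots as in the GB200 example: prints 65536
```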