Relay BP Decoding with CUDA-Q Realtime
Note
The following information is about a C++ demonstration that must be built from source and is not part of any distributed CUDA-Q QEC binaries.
This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP
decoder using CUDA-Q’s realtime host dispatch system. The decoder runs as a
CPU-launched CUDA graph (HOST_LOOP dispatch path) and can operate in three
configurations:
CI unit test – standalone executable, no FPGA or network hardware needed
Emulated end-to-end test – software FPGA emulator replaces real hardware
FPGA end-to-end test – real FPGA connected via ConnectX RDMA/RoCE
Prerequisites
Hardware
| Configuration | GPU | ConnectX NIC | FPGA |
|---|---|---|---|
| CI unit test | Any CUDA-capable GPU | Not required | Not required |
| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required |
| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required |
Tested platforms: DGX Spark, GB200.
Software
CUDA Toolkit: 12.6 or later
CUDA-Q SDK: pre-installed (provides libcudaq, libnvqir, nvq++)
nv-qldpc-decoder plugin: the proprietary nv-qldpc-decoder shared library (libcudaq-qec-nv-qldpc-decoder.so). Required at runtime for all three configurations.
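The CUDA Toolkit minimum above can be checked before starting a build. The following is a minimal sketch, not part of cudaqx: `version_ge` and `cuda_release` are hypothetical helper names, and the comparison assumes GNU `sort -V` is available.

```shell
# version_ge A B: succeed when dotted version A is >= B.
# Assumes GNU coreutils `sort -V` (version sort).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# cuda_release: read `nvcc --version` style text on stdin and print the
# "release X.Y" number (for example 12.6).
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n1
}

# Run the check only when nvcc is on PATH.
if command -v nvcc >/dev/null 2>&1; then
    found="$(nvcc --version | cuda_release)"
    if version_ge "$found" 12.6; then
        echo "CUDA $found: OK"
    else
        echo "CUDA $found is older than 12.6" >&2
    fi
fi
```

The same helper could cover the cmake >= 3.30.4 requirement mentioned in the build steps, for example `version_ge "$(cmake --version | sed -n '1s/.*version //p')" 3.30.4`.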
Source Repositories
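| Repository | URL | Version |
|---|---|---|
| cudaqx | | |
| cuda-quantum (realtime) | https://github.com/NVIDIA/cuda-quantum.git | Branch releases/v0.14.1 |
| holoscan-sensor-bridge | https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git | Tag 2.6.0-EA2 |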
cuda-quantum provides libcudaq-realtime (the host dispatcher, ring
buffer management, and dispatch kernel). holoscan-sensor-bridge provides
the Hololink GpuRoceTransceiver library for RDMA transport.
Note
holoscan-sensor-bridge is only needed for the emulated and FPGA
end-to-end tests. The CI unit test requires only libcudaq-realtime.
Repository Layout
Key files within cudaqx:
libs/qec/
  unittests/
    realtime/
      qec_graph_decode_test/
        test_realtime_qldpc_graph_decoding.cpp      # CI unit test
      qec_roce_decode_test/
        data/
          config_nv_qldpc_relay.yml                 # Relay BP decoder config
          syndromes_nv_qldpc_relay.txt              # 100 test syndrome shots
    utils/
      hololink_qldpc_graph_decoder_bridge.cpp       # Bridge tool (RDMA <-> decoder)
      hololink_qldpc_graph_decoder_test.sh          # Orchestration script
      hololink_fpga_syndrome_playback.cpp           # Playback tool (loads syndromes)
The FPGA emulator is in the cuda-quantum repository:
cuda-quantum/realtime/
  unittests/utils/
    hololink_fpga_emulator.cpp                      # Software FPGA emulator
Building
CI unit test only (no Hololink tools)
If you only need to run the CI unit test, you can build without
holoscan-sensor-bridge:
# 1. Build libcudaq-realtime
git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
cd cudaq-realtime-src
git checkout releases/v0.14.1
cd realtime && mkdir -p build && cd build
cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
ninja && ninja install
cd ../../..
# 2. Build cudaqx with the nv-qldpc-decoder test
cmake -S cudaqx -B cudaqx/build \
-DCMAKE_BUILD_TYPE=Release \
-DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
-DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
-DCUDAQX_ENABLE_LIBS="qec" \
-DCUDAQX_INCLUDE_TESTS=ON
cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding
Full build (CI test + Hololink bridge/playback tools)
To also build the bridge and playback tools for emulated or FPGA testing:
# 1. Clone cuda-quantum (realtime)
git clone --filter=blob:none --no-checkout \
https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
cd cudaq-realtime-src
git sparse-checkout init --cone
git sparse-checkout set realtime
git checkout releases/v0.14.1
cd ..
# 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
# Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger).
# If your system cmake is older: pip install cmake
git clone --branch 2.6.0-EA2 \
https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
cd holoscan-sensor-bridge
# Strip operators we don't need to avoid configure failures from missing deps
sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d;
/add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d;
/add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d;
/add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d;
/add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d;
/add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d;
/add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \
src/hololink/operators/CMakeLists.txt
mkdir -p build && cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
-DHOLOLINK_BUILD_PYTHON=OFF \
-DHOLOLINK_BUILD_TESTS=OFF \
-DHOLOLINK_BUILD_TOOLS=OFF \
-DHOLOLINK_BUILD_EXAMPLES=OFF \
-DHOLOLINK_BUILD_EMULATOR=OFF ..
cmake --build . --target gpu_roce_transceiver hololink_core
cd ../..
# 3. Build libcudaq-realtime with Hololink tools enabled
# This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge
# tool) as well as the FPGA emulator.
cd cudaq-realtime-src/realtime && mkdir -p build && cd build
cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
-DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
-DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
-DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
..
ninja && ninja install
cd ../../..
# 4. Build cudaqx with Hololink tools enabled
cmake -S cudaqx -B cudaqx/build \
-DCMAKE_BUILD_TYPE=Release \
-DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
-DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
-DCUDAQX_ENABLE_LIBS="qec" \
-DCUDAQX_INCLUDE_TESTS=ON \
-DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
-DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
-DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
cmake --build cudaqx/build --target \
test_realtime_qldpc_graph_decoding \
hololink_qldpc_graph_decoder_bridge \
hololink_fpga_syndrome_playback
Using the orchestration script
The orchestration script can build everything automatically:
./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
--build \
--hsb-dir /path/to/holoscan-sensor-bridge \
--cuda-quantum-dir /path/to/cuda-quantum \
--no-run
CI Unit Test
The CI unit test (test_realtime_qldpc_graph_decoding) exercises the full
host dispatch decode path without any network hardware. It:
Loads the Relay BP config and syndrome data from YAML/text files
Creates the decoder via the decoder::get("nv-qldpc-decoder", ...) plugin API
Captures a CUDA graph of the decode pipeline
Wires libcudaq-realtime’s host dispatcher (HOST_LOOP) to a ring buffer
Writes RPC requests into the ring buffer; the host dispatcher launches the CUDA graph, and the test verifies the corrections
Running
cd cudaqx/build
# The nv-qldpc-decoder plugin must be discoverable at runtime.
# Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so
./libs/qec/unittests/test_realtime_qldpc_graph_decoding
Expected output:
[==========] Running 1 test from 1 test suite.
[----------] 1 test from RealtimeQLDPCGraphDecodingTest
[ RUN ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
...
[ OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
[==========] 1 test from 1 test suite ran.
[ PASSED ] 1 test.
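Because the test needs the plugin at runtime, it may help to confirm the path before launching. The sketch below is an assumption-laden convenience, not part of cudaqx: `check_decoder_plugin` is a hypothetical helper, and it treats QEC_EXTERNAL_DECODERS as a single file path, as in the export shown above.

```shell
# check_decoder_plugin [PATH]: hypothetical pre-flight helper.
# Falls back to $QEC_EXTERNAL_DECODERS when no argument is given.
check_decoder_plugin() {
    plugin="${1:-${QEC_EXTERNAL_DECODERS:-}}"
    if [ -z "$plugin" ]; then
        # Nothing to check: the plugin must then be in the default search path.
        echo "QEC_EXTERNAL_DECODERS is unset; relying on the default search path" >&2
        return 0
    fi
    if [ -f "$plugin" ] && [ -r "$plugin" ]; then
        echo "using decoder plugin: $plugin"
        return 0
    fi
    echo "decoder plugin not readable: $plugin" >&2
    return 1
}
```

One way to use it is `check_decoder_plugin && ./libs/qec/unittests/test_realtime_qldpc_graph_decoding`, so the test only starts when the configured path points at a readable file.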
Emulated End-to-End Test
The emulated test replaces the physical FPGA with a software emulator. Three processes run concurrently:
Emulator – receives syndromes via the UDP control plane, sends them to the bridge via RDMA, and captures corrections
Bridge – runs the host dispatcher and CUDA graph decode loop on the GPU, receiving syndromes and sending corrections via RDMA
Playback – loads syndrome data into the emulator’s BRAM and triggers playback, then verifies corrections
Requirements
ConnectX NIC with a loopback cable connecting both ports (the emulator sends RDMA traffic out one port and the bridge receives on the other)
Software dependencies (DOCA, Holoscan SDK, etc.) as described in the cuda-quantum realtime build guide
All three tools built (bridge, playback, emulator)
Running
./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
--emulate \
--build \
--setup-network \
--hsb-dir /path/to/holoscan-sensor-bridge
The --setup-network flag configures the ConnectX interface with the
appropriate IP addresses and MTU. It only needs to be run once per boot.
After the initial build and network setup, subsequent runs are faster:
./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate
FPGA End-to-End Test
The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two processes run:
Bridge – same as emulated mode
Playback – loads syndromes into the FPGA’s BRAM and triggers playback, then reads back corrections from the FPGA’s capture RAM to verify them
Requirements
FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via direct cable or switch. Bitfiles for supported FPGA vendors are available here. See the cuda-quantum realtime user guide for FPGA setup instructions.
FPGA IP and bridge IP on the same subnet
ConnectX device name (e.g., mlx5_4, mlx5_5)
Running
./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
--build \
--setup-network \
--device mlx5_5 \
--bridge-ip 192.168.0.1 \
--fpga-ip 192.168.0.2 \
--gpu 2 \
--page-size 512 \
--hsb-dir /path/to/holoscan-sensor-bridge
Key parameters for FPGA mode:
| Parameter | Description |
|---|---|
| --device | ConnectX IB device name (e.g., mlx5_5) |
| --bridge-ip | IP address assigned to the ConnectX interface |
| --fpga-ip | FPGA’s IP address |
| --gpu | GPU device ID (choose NUMA-local GPU for lowest latency) |
| --page-size | Ring buffer slot size in bytes (use |
| --spacing | Inter-shot spacing in microseconds |
Note
Set --spacing to at least the observed per-shot decode time to avoid
overrunning the input ring buffer. If syndromes arrive faster than the decoder
can process them, the buffer fills up and messages are lost.
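To see why spacing below the decode time cannot be sustained, the following sketch estimates which shot first finds the ring buffer full. It is a simplified model under stated assumptions, not a description of the actual dispatcher: one shot arrives every SPACING_US, the decoder handles shots serially at DECODE_US each, and a slot is freed only when its decode completes. The function name and the numbers used are hypothetical.

```shell
# shots_before_overrun SLOTS DECODE_US SPACING_US
# Prints the 1-based index of the first shot that arrives while all SLOTS
# are occupied, or "never" when SPACING_US >= DECODE_US.
shots_before_overrun() {
    slots=$1
    decode_us=$2
    spacing_us=$3
    if [ "$spacing_us" -ge "$decode_us" ]; then
        echo never
        return 0
    fi
    n=1
    while :; do
        arrival=$(( (n - 1) * spacing_us ))       # arrival time of shot n
        finished=$(( arrival / decode_us ))       # shots fully decoded by then
        backlog=$(( n - 1 - finished ))           # earlier shots still holding a slot
        if [ "$backlog" -ge "$slots" ]; then
            echo "$n"
            return 0
        fi
        n=$(( n + 1 ))
    done
}
```

Under this model, `shots_before_overrun 4 100 50` prints 8: with 4 slots, a 100 µs decode, and 50 µs spacing, the eighth shot arrives to a full buffer. Adding slots only delays the overrun; it does not prevent it.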
GPU Selection
For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
For example, on a GB200 system where mlx5_5 is on NUMA node 1,
use --gpu 2 or --gpu 3. Check NUMA locality with:
cat /sys/class/infiniband/<device>/device/numa_node
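The sysfs lookup above can be wrapped in a small helper that also handles the case where the kernel reports no NUMA affinity (a value of -1). This is a sketch only: `nic_numa_node` is a hypothetical name, and the optional second argument exists purely so the sysfs root can be overridden.

```shell
# nic_numa_node DEVICE [SYSFS_ROOT]
# Prints the NUMA node of an InfiniBand/RoCE device, or "unknown" when
# sysfs reports a negative value. Returns non-zero if the file is missing.
nic_numa_node() {
    node_file="${2:-/sys}/class/infiniband/$1/device/numa_node"
    if [ ! -r "$node_file" ]; then
        echo "no NUMA information for device: $1" >&2
        return 1
    fi
    node="$(cat "$node_file")"
    if [ "$node" -lt 0 ]; then
        echo "unknown"
    else
        echo "$node"
    fi
}
```

For example, `nic_numa_node mlx5_5` prints the node to compare against the GPU placement; `nvidia-smi topo -m` is one way to see the NUMA affinity of each GPU.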
Network Sanity Check
Before running, verify that the bridge IP is assigned to exactly one interface:
# -w and -F match the address exactly, so 192.168.0.10 is not reported
ip addr show | grep -wF 192.168.0.1
If multiple interfaces show the same IP, remove the duplicate to avoid routing ambiguity that silently drops RDMA packets.
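For scripted checks, the count of matching addresses can be computed directly. The sketch below assumes the one-line-per-address format printed by `ip -o -4 addr show` (interface name in the second field, `inet` in the third, `address/prefix` in the fourth); `count_addrs_with_ip` is a hypothetical helper name.

```shell
# count_addrs_with_ip IP
# Reads `ip -o -4 addr show` output on stdin and prints how many address
# entries are exactly IP (the prefix length is ignored).
count_addrs_with_ip() {
    awk -v ip="$1" '
        $3 == "inet" {
            split($4, parts, "/")      # "192.168.0.1/24" -> "192.168.0.1"
            if (parts[1] == ip) n++
        }
        END { print n + 0 }
    '
}
```

A possible use is `ip -o -4 addr show | count_addrs_with_ip 192.168.0.1`, where any result other than 1 indicates a missing or duplicated bridge IP.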
Orchestration Script Reference
hololink_qldpc_graph_decoder_test.sh [options]
Modes
| Flag | Description |
|---|---|
| --emulate | Use FPGA emulator (no real FPGA needed) |
| (default) | FPGA mode (requires real FPGA) |
Actions
| Flag | Description |
|---|---|
| --build | Build all required tools before running |
| --setup-network | Configure ConnectX network interfaces |
| --no-run | Skip running the test (useful with --build) |
Build Options
| Flag | Default | Description |
|---|---|---|
| --hsb-dir | | holoscan-sensor-bridge source directory |
| --cuda-quantum-dir | | cuda-quantum source directory |
| | | cudaqx source directory |
| | | Parallel build jobs |
Network Options
| Flag | Default | Description |
|---|---|---|
| --device | auto-detect | ConnectX IB device name |
| --bridge-ip | | Bridge tool IP address |
| | | Emulator IP (emulate mode) |
| --fpga-ip | | FPGA IP address |
| | | MTU size |
Run Options
| Flag | Default | Description |
|---|---|---|
| --gpu | | GPU device ID |
| | | Timeout in seconds |
| | all available | Limit number of syndrome shots |
| --page-size | | Ring buffer slot size in bytes |
| | | Number of ring buffer slots |
| --spacing | | Inter-shot spacing in microseconds |
| | (verify) | Skip correction verification |
| | | UDP control port for emulator |