AI Predecoder with CUDA-Q Realtime (with FPGA Data Injection)
Note
The following information is about a C++ demonstration that must be built from source and is not part of any distributed CUDA-Q QEC binaries.
This guide explains how to build, test, and run the AI predecoder + PyMatching pipeline over Hololink RDMA using CUDA-Q’s realtime host dispatch system. The pipeline runs a TensorRT-accelerated neural network (the predecoder) on the GPU to reduce syndrome density, then feeds the residual detectors to a pool of PyMatching MWPM decoders on the CPU. It operates in two configurations:
Emulated end-to-end test – software FPGA emulator replaces real hardware
FPGA end-to-end test – real FPGA connected via ConnectX RDMA/RoCE
For the software-only benchmark (no FPGA or network hardware), see AI Predecoder with CUDA-Q Realtime.
Prerequisites
Hardware
| Configuration | GPU | ConnectX NIC | FPGA |
|---|---|---|---|
| Emulated E2E | CUDA GPU with GPUDirect RDMA | Required (loopback cable) | Not required |
| FPGA E2E | CUDA GPU with GPUDirect RDMA | Required | Required |
Tested platforms: GB200.
Software
CUDA Toolkit: 12.6 or later
TensorRT: 10.x (headers and libraries)
CUDA-Q SDK: pre-installed (provides libcudaq, libnvqir, nvq++)
DOCA: 3.3 or later (for gpu_roce_transceiver RDMA transport)
PyMatching decoder plugin: the cudaq-qec-pymatching shared library (libcudaq-qec-pymatching.so). Built as part of the cudaqx build and required at runtime.
Predecoder ONNX model (e.g. predecoder_memory_d13_T104_X.onnx) placed under libs/qec/lib/realtime/. A cached TensorRT .engine file with the same base name is loaded automatically if present; otherwise the engine is built from the ONNX file on first run (this can take 1–2 minutes for large models).
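Before a first run it can be useful to know whether the engine will be loaded from cache or rebuilt. The following is a minimal sketch, not part of the repository: the helper name engine_status is hypothetical, and it assumes that "same base name" means model.onnx maps to model.engine in the same directory.

```shell
# Report whether a cached TensorRT engine sits next to an ONNX model.
# Assumption: foo.onnx is cached as foo.engine in the same directory.
engine_status() {
    local onnx="$1"
    local engine="${onnx%.onnx}.engine"
    if [ -f "$engine" ]; then
        echo "cached"        # engine file found; no rebuild expected
    elif [ -f "$onnx" ]; then
        echo "will-build"    # only the ONNX file exists; first run builds the engine
    else
        echo "missing"       # neither file is present
    fi
}

# Example call (adjust the path to your checkout):
# engine_status libs/qec/lib/realtime/predecoder_memory_d13_T104_X.onnx
```

A result of will-build means the first run should be expected to spend the extra 1–2 minutes on engine construction.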
Source Repositories
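| Repository | URL | Version |
|---|---|---|
| cudaqx | | |
| cuda-quantum (realtime) | https://github.com/NVIDIA/cuda-quantum.git | Branch releases/v0.14.1 |
| holoscan-sensor-bridge | https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git | Tag 2.6.0-EA2 |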
cuda-quantum provides libcudaq-realtime (the host dispatcher, ring
buffer management, and dispatch kernel). holoscan-sensor-bridge provides
the Hololink GpuRoceTransceiver library for RDMA transport.
Note
The FPGA emulator (hololink_fpga_emulator) is built from the
cuda-quantum repository and is only needed for the emulated test.
Repository Layout
Key files within cudaqx:
libs/qec/
  unittests/
    realtime/
      hololink_predecoder_bridge.cpp             # Bridge tool (RDMA <-> AI predecoder + PyMatching)
      hololink_predecoder_test.sh                # Orchestration script
      predecoder_pipeline_common.h               # Pipeline config and shared utilities
      predecoder_pipeline_common.cpp             # Data loading (detectors, H, O, priors)
      test_realtime_predecoder_w_pymatching.cpp  # Software-only benchmark
    utils/
      hololink_fpga_syndrome_playback.cpp        # Playback tool (loads syndromes into FPGA)
The FPGA emulator is in the cuda-quantum repository:
cuda-quantum/realtime/
  unittests/utils/
    hololink_fpga_emulator.cpp   # Software FPGA emulator
Data Directory Layout
Note
Exercising the following test program requires data files that are generated using the Ising-Decoding repository. Please see the instructions in the Ising-Decoding README for how to generate these files.
The syndrome data directory follows the same format as the software benchmark. See AI Predecoder with CUDA-Q Realtime for the full specification. In summary, it must contain:
detectors.bin – detector samples (binary, int32)
observables.bin – observable ground-truth labels (binary, int32)
The orchestration script automatically converts detectors.bin to the text
format that hololink_fpga_syndrome_playback expects.
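To sanity-check a data file before handing it to the script, the raw int32 values can be dumped with standard tools. This is a sketch only: the helper names peek_int32 and int32_count are hypothetical, and nothing is assumed about headers or row layout (the full specification is in the software-benchmark guide). od reads values in the host's byte order.

```shell
# Print the first N int32 values of a binary file as decimal integers.
peek_int32() {
    local file="$1" count="${2:-16}"
    od -A n -t d4 -v -N "$((count * 4))" "$file" |
        awk '{ for (i = 1; i <= NF; i++) { printf "%s%s", sep, $i; sep = " " } }
             END { print "" }'
}

# Print how many whole int32 values the file holds (file size in bytes / 4).
int32_count() {
    local bytes
    bytes=$(wc -c < "$1")
    echo "$((bytes / 4))"
}

# Example calls:
# peek_int32 /path/to/syndrome_data/detectors.bin 8
# int32_count /path/to/syndrome_data/observables.bin
```

A file whose size is not a multiple of 4 bytes, or whose leading values look implausible, is worth regenerating before debugging the pipeline itself.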
Note
FPGA BRAM constraints: The FPGA BRAM has a fixed depth
(RAM_DEPTH=512 lines of 64 bytes each = 32 KB). For large configs
like d13_r104 (frame size 17,536 bytes = 274 lines per shot), only
1 shot fits in BRAM per playback. The --num-shots flag in the
orchestration script controls how many shots are loaded; the script
applies config-appropriate defaults automatically.
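The arithmetic behind this limit can be reproduced directly. The sketch below uses hypothetical helper names and assumes each shot starts on a fresh 64-byte line and that the whole BRAM is available for syndrome data.

```shell
RAM_DEPTH=512    # BRAM depth in lines (from the note above)
LINE_BYTES=64    # bytes per BRAM line

# Number of BRAM lines one shot occupies (frame size rounded up to whole lines).
lines_per_shot() {
    echo $(( ($1 + LINE_BYTES - 1) / LINE_BYTES ))
}

# Maximum number of whole shots that fit in BRAM for a given frame size in bytes.
max_shots() {
    echo $(( RAM_DEPTH / $(lines_per_shot "$1") ))
}

lines_per_shot 17536   # prints 274
max_shots 17536        # prints 1
```

For d13_r104, 17,536 / 64 = 274 lines, and 512 / 274 rounds down to 1, which matches the single-shot limit stated above.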
Building
Building the FPGA demo requires holoscan-sensor-bridge and
libcudaq-realtime with Hololink tools enabled.
# 1. Clone cuda-quantum (realtime)
git clone --filter=blob:none --no-checkout \
https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
cd cudaq-realtime-src
git sparse-checkout init --cone
git sparse-checkout set realtime
git checkout releases/v0.14.1
cd ..
# 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
# Requires cmake >= 3.30.4
git clone --branch 2.6.0-EA2 \
https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
cd holoscan-sensor-bridge
# Strip operators we don't need to avoid configure failures
sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d;
/add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d;
/add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d;
/add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d;
/add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d;
/add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d;
/add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \
src/hololink/operators/CMakeLists.txt
mkdir -p build && cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
-DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
-DHOLOLINK_BUILD_PYTHON=OFF \
-DHOLOLINK_BUILD_TESTS=OFF \
-DHOLOLINK_BUILD_TOOLS=OFF \
-DHOLOLINK_BUILD_EXAMPLES=OFF \
-DHOLOLINK_BUILD_EMULATOR=OFF ..
cmake --build . --target gpu_roce_transceiver hololink_core
cd ../..
# 3. Build libcudaq-realtime with Hololink tools enabled
cd cudaq-realtime-src/realtime && mkdir -p build && cd build
cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
-DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
-DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
-DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
..
ninja && ninja install
cd ../../..
# 4. Build cudaqx with Hololink tools enabled
cmake -S cudaqx -B cudaqx/build \
-DCMAKE_BUILD_TYPE=Release \
-DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
-DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
-DCUDAQ_QEC_BUILD_TRT_DECODER=ON \
-DCUDAQX_ENABLE_LIBS="qec" \
-DCUDAQX_INCLUDE_TESTS=ON \
-DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
-DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
-DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
cmake --build cudaqx/build --target \
hololink_predecoder_bridge \
hololink_fpga_syndrome_playback \
cudaq-qec-pymatching
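Step 2 above requires cmake >= 3.30.4, and the prerequisites list minimum versions for the CUDA Toolkit and DOCA. A small version comparison can catch a mismatch before a long build. This is a sketch with a hypothetical helper name, and it relies on the -V (version sort) option of GNU sort.

```shell
# Succeeds (exit 0) when version $1 is greater than or equal to version $2.
# The smaller of the two versions sorts first; $1 >= $2 exactly when that is $2.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Extract the version from the first line of `cmake --version`
# (which reads "cmake version X.Y.Z").
cmake_version() {
    cmake --version | awk 'NR == 1 { print $3 }'
}

# Report on the installed cmake, if there is one.
if command -v cmake >/dev/null 2>&1; then
    if version_ge "$(cmake_version)" 3.30.4; then
        echo "cmake OK"
    else
        echo "cmake older than 3.30.4"
    fi
fi
```

The same version_ge helper can be pointed at other tools, for example comparing a CUDA Toolkit version string against 12.6.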
Emulated End-to-End Test
The emulated test replaces the physical FPGA with a software emulator. Three processes run concurrently:
Emulator – receives syndromes via the UDP control plane, sends them to the bridge via RDMA, and captures corrections
Bridge (hololink_predecoder_bridge) – receives RDMA data, runs the AI predecoder (TensorRT CUDA graph) and PyMatching decode via the realtime_pipeline
Playback (hololink_fpga_syndrome_playback) – loads syndrome data into the emulator’s BRAM and triggers playback
Requirements
ConnectX NIC with a loopback cable connecting both ports
Software dependencies (DOCA, Holoscan SDK, etc.) as described in the cuda-quantum realtime build guide
All three tools built (bridge, playback, emulator)
Running
./libs/qec/unittests/realtime/hololink_predecoder_test.sh \
--emulate \
--setup-network \
--cuda-quantum-dir /path/to/cuda-quantum \
--cuda-qx-dir /path/to/cudaqx \
--data-dir /path/to/syndrome_data
The --setup-network flag configures the ConnectX interface with the
appropriate IP addresses and MTU. It is needed only once per boot.
After the initial network setup, omit the flag so that subsequent runs start faster:
./libs/qec/unittests/realtime/hololink_predecoder_test.sh \
--emulate \
--cuda-quantum-dir /path/to/cuda-quantum \
--cuda-qx-dir /path/to/cudaqx \
--data-dir /path/to/syndrome_data
FPGA End-to-End Test
The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two processes run:
Bridge (hololink_predecoder_bridge) – same as emulated mode
Playback (hololink_fpga_syndrome_playback) – loads syndromes into the FPGA’s BRAM and triggers RDMA playback to the bridge
Requirements
FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via direct cable or switch
FPGA IP and bridge IP on the same subnet
ConnectX device name (e.g., mlx5_4)
Running
./libs/qec/unittests/realtime/hololink_predecoder_test.sh \
--cuda-quantum-dir /path/to/cuda-quantum \
--cuda-qx-dir /path/to/cudaqx \
--data-dir /path/to/syndrome_data \
--device mlx5_4 \
--bridge-ip 192.168.0.1 \
--fpga-ip 192.168.0.2 \
--gpu 2 \
--config d13_r104 \
--timeout 60
Expected output:
========================================
Hololink Predecoder + PyMatching Bridge Test
========================================
Mode: Real FPGA (2-tool)
Config: d13_r104
...
[RDMA+TRT] Shot 0: received 17472 detectors (input_nonzero=939),
predecoder logical_pred=1, residual_nonzero=23
========================================
PREDECODER BRIDGE TEST: PASS
========================================
=== Results ===
Total completed: 1
Avg PyMatching decode: 211.0 us (1 samples)
Shot 0: logical_pred=1 total_corrections=64 converged=1
This confirms: FPGA RDMA receipt (939 nonzero detectors), TensorRT inference (reduced to 23 residuals), and PyMatching decode (64 corrections, converged).
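When runs are scripted, the same three checks can be pulled out of a saved log. The sketch below assumes the script output was captured to a file (for example by piping it through tee run.log) and that the field names match the expected output shown above. The helper name summarize_bridge_log is hypothetical.

```shell
# Summarize a captured bridge log: require the PASS marker, then report
# the first input_nonzero and residual_nonzero values found in the file.
summarize_bridge_log() {
    local log="$1" in_nz res_nz
    if ! grep -q 'PREDECODER BRIDGE TEST: PASS' "$log"; then
        echo "no PASS marker found"
        return 1
    fi
    in_nz=$(grep -o 'input_nonzero=[0-9]*' "$log" | head -n 1 | cut -d= -f2)
    res_nz=$(grep -o 'residual_nonzero=[0-9]*' "$log" | head -n 1 | cut -d= -f2)
    echo "input_nonzero=$in_nz residual_nonzero=$res_nz"
}

# Example call:
# summarize_bridge_log run.log
```

For the output above this reports 939 input detectors reduced to 23 residuals, roughly a 40-fold reduction in syndrome density before PyMatching runs.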
Key parameters for FPGA mode:
| Parameter | Description |
|---|---|
| --device | ConnectX IB device name (e.g., mlx5_4) |
| --bridge-ip | IP address assigned to the ConnectX interface |
| --fpga-ip | FPGA’s IP address |
| --gpu | GPU device ID (choose NUMA-local GPU for lowest latency) |
| --config | Pipeline configuration (e.g., d13_r104) |
| --data-dir | Path to syndrome data directory |
| | Ring buffer slot size in bytes (auto-set per config by default) |
| --num-shots | Number of syndrome shots to play back (limited by FPGA BRAM) |
Note
For d13_r104 (frame size 17,536 bytes), the default page size is 32,768 bytes and the maximum number of shots per playback is 1 due to FPGA BRAM constraints. Smaller configs (e.g. d7) can fit more shots.
GPU Selection
For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
For example, on a GB200 system where mlx5_4 is on NUMA node 1,
use --gpu 2 or --gpu 3. Check NUMA locality with:
cat /sys/class/infiniband/<device>/device/numa_node
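The same check can be extended to compare the NIC against a candidate GPU. This is a sketch with hypothetical helper names. It assumes the GPU's PCI bus ID is known, for example from nvidia-smi --query-gpu=index,pci.bus_id --format=csv, which prints an 8-digit PCI domain where sysfs uses 4 digits.

```shell
# Root of sysfs; overridable so the helpers can be tried against a fake tree.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# NUMA node of an InfiniBand device such as mlx5_4.
ib_numa_node() {
    cat "$SYSFS_ROOT/class/infiniband/$1/device/numa_node"
}

# Convert an nvidia-smi style bus ID (00000000:01:00.0) to sysfs form (0000:01:00.0).
normalize_bus_id() {
    case "$1" in
        ????????:*) printf '%s\n' "$1" | cut -c5- | tr 'A-F' 'a-f' ;;
        *)          printf '%s\n' "$1" | tr 'A-F' 'a-f' ;;
    esac
}

# NUMA node of a PCI device (for example a GPU) given its bus ID.
pci_numa_node() {
    cat "$SYSFS_ROOT/bus/pci/devices/$(normalize_bus_id "$1")/numa_node"
}

# Succeeds when the NIC and the PCI device report the same NUMA node.
same_numa_node() {
    [ "$(ib_numa_node "$1")" = "$(pci_numa_node "$2")" ]
}

# Example call (the bus ID shown is a placeholder):
# same_numa_node mlx5_4 00000009:01:00.0 && echo "NUMA-local"
```

A GPU for which same_numa_node succeeds is a reasonable choice for --gpu.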
Network Sanity Check
Before running, verify that the bridge IP is assigned to exactly one interface:
ip addr show | grep -F 'inet 192.168.0.1/'
If multiple interfaces show the same IP, remove the duplicate to avoid routing ambiguity that silently drops RDMA packets.
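For scripted checks, an exact-match variant avoids false positives from neighbouring addresses such as 192.168.0.10. The sketch below uses a hypothetical helper name and parses the one-line-per-address output of ip -o -4 addr show, where the second field is the interface name and the fourth is address/prefix.

```shell
# Read `ip -o -4 addr show` output on stdin and print the name of every
# interface that holds exactly the given IPv4 address.
ifaces_with_ip() {
    awk -v ip="$1" '$3 == "inet" { split($4, a, "/"); if (a[1] == ip) print $2 }'
}

# Example call: list the owners of the bridge IP, then count them.
# ip -o -4 addr show | ifaces_with_ip 192.168.0.1
# ip -o -4 addr show | ifaces_with_ip 192.168.0.1 | wc -l
```

A count of 1 is the expected result; 0 means the bridge IP is not assigned yet, and anything above 1 indicates the duplicate described above.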
Changing the Predecoder Model
The ONNX model file for each configuration is set in the PipelineConfig
factory methods in
libs/qec/unittests/realtime/predecoder_pipeline_common.h. To use a
different model, edit the onnx_filename field and rebuild:
static PipelineConfig d13_r104() {
  return {
      "d13_r104_X", 13, 104,
      "predecoder_memory_model_4_d13_T104_X.onnx", // changed model
      8, 8, 16};
}
Then rebuild:
cmake --build build --target hololink_predecoder_bridge
Orchestration Script Reference
hololink_predecoder_test.sh [options]
Modes
| Flag | Description |
|---|---|
| --emulate | Use FPGA emulator (no real FPGA needed) |
| (default) | FPGA mode (requires real FPGA) |
Actions
| Flag | Description |
|---|---|
| --setup-network | Configure ConnectX network interfaces |
| | Skip running the test |
Directory Options
| Flag | Default | Description |
|---|---|---|
| --cuda-quantum-dir | | cuda-quantum source directory |
| --cuda-qx-dir | | cudaqx source directory |
| --data-dir | Per-config default | Syndrome data directory (expects |
Network Options
| Flag | Default | Description |
|---|---|---|
| --device | auto-detect | ConnectX IB device name |
| --bridge-ip | | Bridge tool IP address |
| | | Emulator IP (emulate mode only) |
| --fpga-ip | | FPGA IP address |
| | | MTU size |
Run Options
| Flag | Default | Description |
|---|---|---|
| --config | | Pipeline config ( |
| --gpu | | GPU device ID |
| --timeout | | Timeout in seconds |
| --num-shots | Per-config | Number of syndrome shots (limited by FPGA BRAM) |
| | Per-config | Ring buffer slot size in bytes |
| | | Number of ring buffer slots |
| | (unset) | Inter-shot spacing in microseconds |
| | | UDP control port for emulator |