AI Predecoder with CUDA-Q Realtime (with FPGA Data Injection) ============================================================= .. note:: The following information is about a C++ demonstration that must be built from source and is not part of any distributed CUDA-Q QEC binaries. This guide explains how to build, test, and run the AI predecoder + PyMatching pipeline over Hololink RDMA using CUDA-Q's realtime host dispatch system. The pipeline runs a TensorRT-accelerated neural network (the *predecoder*) on the GPU to reduce syndrome density, then feeds the residual detectors to a pool of PyMatching MWPM decoders on the CPU. It operates in two configurations: - **Emulated end-to-end test** -- software FPGA emulator replaces real hardware - **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE For the software-only benchmark (no FPGA or network hardware), see :doc:`/examples_rst/qec/realtime_predecoder_pymatching`. Prerequisites ------------- Hardware ^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 20 25 25 20 * - Configuration - GPU - ConnectX NIC - FPGA * - Emulated E2E - CUDA GPU with GPUDirect RDMA - Required (loopback cable) - Not required * - FPGA E2E - CUDA GPU with GPUDirect RDMA - Required - Required Tested platforms: GB200. Software ^^^^^^^^ - **CUDA Toolkit**: 12.6 or later - **TensorRT**: 10.x (headers and libraries) - **CUDA-Q SDK**: pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``) - **DOCA**: 3.3 or later (for ``gpu_roce_transceiver`` RDMA transport) - **PyMatching decoder plugin**: the ``cudaq-qec-pymatching`` shared library (``libcudaq-qec-pymatching.so``). Built as part of the cudaqx build and required at runtime. - **Predecoder ONNX model** (e.g. ``predecoder_memory_d13_T104_X.onnx``) placed under ``libs/qec/lib/realtime/``. A cached TensorRT ``.engine`` file with the same base name is loaded automatically if present; otherwise the engine is built from the ONNX file on first run (this can take 1--2 minutes for large models). Source Repositories ^^^^^^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 30 40 30 * - Repository - URL - Version * - **cudaqx** - https://github.com/NVIDIA/cudaqx - ``main`` branch (or your feature branch) * - **cuda-quantum** (realtime) - https://github.com/NVIDIA/cuda-quantum - Branch ``releases/v0.14.1`` * - **holoscan-sensor-bridge** - https://github.com/nvidia-holoscan/holoscan-sensor-bridge - Tag ``2.6.0-EA2`` ``cuda-quantum`` provides ``libcudaq-realtime`` (the host dispatcher, ring buffer management, and dispatch kernel). ``holoscan-sensor-bridge`` provides the Hololink ``GpuRoceTransceiver`` library for RDMA transport. .. note:: The FPGA emulator (``hololink_fpga_emulator``) is built from the ``cuda-quantum`` repository and is only needed for the emulated test. Repository Layout ----------------- Key files within ``cudaqx``: .. code-block:: text libs/qec/ unittests/ realtime/ hololink_predecoder_bridge.cpp # Bridge tool (RDMA <-> AI predecoder + PyMatching) hololink_predecoder_test.sh # Orchestration script predecoder_pipeline_common.h # Pipeline config and shared utilities predecoder_pipeline_common.cpp # Data loading (detectors, H, O, priors) test_realtime_predecoder_w_pymatching.cpp # Software-only benchmark utils/ hololink_fpga_syndrome_playback.cpp # Playback tool (loads syndromes into FPGA) The FPGA emulator is in the ``cuda-quantum`` repository: .. code-block:: text cuda-quantum/realtime/ unittests/utils/ hololink_fpga_emulator.cpp # Software FPGA emulator Data Directory Layout --------------------- .. note:: Exercising the following test program requires data files that are generated using the Ising-Decoding repository. Please see the instructions in the `Ising-Decoding README `_ for how to generate these files. The syndrome data directory follows the same format as the software benchmark. See :doc:`/examples_rst/qec/realtime_predecoder_pymatching` for the full specification. In summary, it must contain: - ``detectors.bin`` -- detector samples (binary, int32) - ``observables.bin`` -- observable ground-truth labels (binary, int32) The orchestration script automatically converts ``detectors.bin`` to the text format that ``hololink_fpga_syndrome_playback`` expects. .. note:: **FPGA BRAM constraints**: The FPGA BRAM has a fixed depth (``RAM_DEPTH=512`` lines of 64 bytes each = 32 KB). For large configs like d13_r104 (frame size 17,536 bytes = 274 lines per shot), only **1 shot** fits in BRAM per playback. The ``--num-shots`` flag in the orchestration script controls how many shots are loaded; the script applies config-appropriate defaults automatically. Building -------- Building the FPGA demo requires ``holoscan-sensor-bridge`` and ``libcudaq-realtime`` with Hololink tools enabled. .. code-block:: bash # 1. Clone cuda-quantum (realtime) git clone --filter=blob:none --no-checkout \ https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src cd cudaq-realtime-src git sparse-checkout init --cone git sparse-checkout set realtime git checkout releases/v0.14.1 cd .. # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2) # Requires cmake >= 3.30.4 git clone --branch 2.6.0-EA2 \ https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git cd holoscan-sensor-bridge # Strip operators we don't need to avoid configure failures sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d; /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d; /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d; /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d; /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d; /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d; /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \ src/hololink/operators/CMakeLists.txt mkdir -p build && cd build cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \ -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \ -DHOLOLINK_BUILD_PYTHON=OFF \ -DHOLOLINK_BUILD_TESTS=OFF \ -DHOLOLINK_BUILD_TOOLS=OFF \ -DHOLOLINK_BUILD_EXAMPLES=OFF \ -DHOLOLINK_BUILD_EMULATOR=OFF .. cmake --build . --target gpu_roce_transceiver hololink_core cd ../.. # 3. Build libcudaq-realtime with Hololink tools enabled cd cudaq-realtime-src/realtime && mkdir -p build && cd build cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \ -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \ -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \ -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \ .. ninja && ninja install cd ../../.. # 4. Build cudaqx with Hololink tools enabled cmake -S cudaqx -B cudaqx/build \ -DCMAKE_BUILD_TYPE=Release \ -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \ -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \ -DCUDAQ_QEC_BUILD_TRT_DECODER=ON \ -DCUDAQX_ENABLE_LIBS="qec" \ -DCUDAQX_INCLUDE_TESTS=ON \ -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \ -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \ -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build cmake --build cudaqx/build --target \ hololink_predecoder_bridge \ hololink_fpga_syndrome_playback \ cudaq-qec-pymatching Emulated End-to-End Test ------------------------ The emulated test replaces the physical FPGA with a software emulator. Three processes run concurrently: 1. **Emulator** -- receives syndromes via the UDP control plane, sends them to the bridge via RDMA, and captures corrections 2. **Bridge** (``hololink_predecoder_bridge``) -- receives RDMA data, runs the AI predecoder (TensorRT CUDA graph) and PyMatching decode via the ``realtime_pipeline`` 3. **Playback** (``hololink_fpga_syndrome_playback``) -- loads syndrome data into the emulator's BRAM and triggers playback Requirements ^^^^^^^^^^^^ - ConnectX NIC with a loopback cable connecting both ports - Software dependencies (DOCA, Holoscan SDK, etc.) as described in the cuda-quantum realtime build guide - All three tools built (bridge, playback, emulator) Running ^^^^^^^ .. code-block:: bash ./libs/qec/unittests/realtime/hololink_predecoder_test.sh \ --emulate \ --setup-network \ --cuda-quantum-dir /path/to/cuda-quantum \ --cuda-qx-dir /path/to/cudaqx \ --data-dir /path/to/syndrome_data The ``--setup-network`` flag configures the ConnectX interface with the appropriate IP addresses and MTU. It only needs to be run once per boot. After the initial network setup, subsequent runs are faster: .. code-block:: bash ./libs/qec/unittests/realtime/hololink_predecoder_test.sh \ --emulate \ --cuda-quantum-dir /path/to/cuda-quantum \ --cuda-qx-dir /path/to/cudaqx \ --data-dir /path/to/syndrome_data FPGA End-to-End Test -------------------- The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two processes run: 1. **Bridge** (``hololink_predecoder_bridge``) -- same as emulated mode 2. **Playback** (``hololink_fpga_syndrome_playback``) -- loads syndromes into the FPGA's BRAM and triggers RDMA playback to the bridge Requirements ^^^^^^^^^^^^ - FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via direct cable or switch - FPGA IP and bridge IP on the same subnet - ConnectX device name (e.g., ``mlx5_4``) Running ^^^^^^^ .. code-block:: bash ./libs/qec/unittests/realtime/hololink_predecoder_test.sh \ --cuda-quantum-dir /path/to/cuda-quantum \ --cuda-qx-dir /path/to/cudaqx \ --data-dir /path/to/syndrome_data \ --device mlx5_4 \ --bridge-ip 192.168.0.1 \ --fpga-ip 192.168.0.2 \ --gpu 2 \ --config d13_r104 \ --timeout 60 Expected output: .. code-block:: text ======================================== Hololink Predecoder + PyMatching Bridge Test ======================================== Mode: Real FPGA (2-tool) Config: d13_r104 ... [RDMA+TRT] Shot 0: received 17472 detectors (input_nonzero=939), predecoder logical_pred=1, residual_nonzero=23 ======================================== PREDECODER BRIDGE TEST: PASS ======================================== === Results === Total completed: 1 Avg PyMatching decode: 211.0 us (1 samples) Shot 0: logical_pred=1 total_corrections=64 converged=1 This confirms: FPGA RDMA receipt (939 nonzero detectors), TensorRT inference (reduced to 23 residuals), and PyMatching decode (64 corrections, converged). Key parameters for FPGA mode: .. list-table:: :header-rows: 1 :widths: 20 80 * - Parameter - Description * - ``--device`` - ConnectX IB device name (e.g., ``mlx5_4``) * - ``--bridge-ip`` - IP address assigned to the ConnectX interface * - ``--fpga-ip`` - FPGA's IP address * - ``--gpu`` - GPU device ID (choose NUMA-local GPU for lowest latency) * - ``--config`` - Pipeline configuration (e.g., ``d13_r104``) * - ``--data-dir`` - Path to syndrome data directory * - ``--page-size`` - Ring buffer slot size in bytes (auto-set per config by default) * - ``--num-shots`` - Number of syndrome shots to play back (limited by FPGA BRAM) .. note:: For d13_r104 (frame size 17,536 bytes), the default page size is 32,768 bytes and the maximum number of shots per playback is 1 due to FPGA BRAM constraints. Smaller configs (e.g. d7) can fit more shots. GPU Selection ^^^^^^^^^^^^^ For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC. For example, on a GB200 system where ``mlx5_4`` is on NUMA node 1, use ``--gpu 2`` or ``--gpu 3``. Check NUMA locality with: .. code-block:: bash cat /sys/class/infiniband//device/numa_node Network Sanity Check ^^^^^^^^^^^^^^^^^^^^ Before running, verify that the bridge IP is assigned to exactly one interface: .. code-block:: bash ip addr show | grep 192.168.0.1 If multiple interfaces show the same IP, remove the duplicate to avoid routing ambiguity that silently drops RDMA packets. Changing the Predecoder Model ----------------------------- The ONNX model file for each configuration is set in the ``PipelineConfig`` factory methods in ``libs/qec/unittests/realtime/predecoder_pipeline_common.h``. To use a different model, edit the ``onnx_filename`` field and rebuild: .. code-block:: cpp static PipelineConfig d13_r104() { return { "d13_r104_X", 13, 104, "predecoder_memory_model_4_d13_T104_X.onnx", // changed model 8, 8, 16}; } Then rebuild: .. code-block:: bash cmake --build build --target hololink_predecoder_bridge Orchestration Script Reference ------------------------------ .. code-block:: text hololink_predecoder_test.sh [options] Modes ^^^^^ .. list-table:: :header-rows: 1 :widths: 20 80 * - Flag - Description * - ``--emulate`` - Use FPGA emulator (no real FPGA needed) * - *(default)* - FPGA mode (requires real FPGA) Actions ^^^^^^^ .. list-table:: :header-rows: 1 :widths: 20 80 * - Flag - Description * - ``--setup-network`` - Configure ConnectX network interfaces * - ``--no-run`` - Skip running the test Directory Options ^^^^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 25 30 45 * - Flag - Default - Description * - ``--cuda-quantum-dir DIR`` - ``/workspaces/cuda-quantum`` - cuda-quantum source directory * - ``--cuda-qx-dir DIR`` - ``/workspaces/cudaqx`` - cudaqx source directory * - ``--data-dir DIR`` - Per-config default - Syndrome data directory (expects ``detectors.bin``) Network Options ^^^^^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 25 20 55 * - Flag - Default - Description * - ``--device DEV`` - auto-detect - ConnectX IB device name * - ``--bridge-ip ADDR`` - ``10.0.0.1`` - Bridge tool IP address * - ``--emulator-ip ADDR`` - ``10.0.0.2`` - Emulator IP (emulate mode only) * - ``--fpga-ip ADDR`` - ``192.168.0.2`` - FPGA IP address * - ``--mtu N`` - ``4096`` - MTU size Run Options ^^^^^^^^^^^ .. list-table:: :header-rows: 1 :widths: 25 15 60 * - Flag - Default - Description * - ``--config NAME`` - ``d13_r104`` - Pipeline config (``d7``, ``d13``, ``d13_r104``, ``d21``, ``d31``) * - ``--gpu N`` - ``0`` - GPU device ID * - ``--timeout N`` - ``60`` - Timeout in seconds * - ``--num-shots N`` - Per-config - Number of syndrome shots (limited by FPGA BRAM) * - ``--page-size N`` - Per-config - Ring buffer slot size in bytes * - ``--num-pages N`` - ``64`` - Number of ring buffer slots * - ``--spacing N`` - *(unset)* - Inter-shot spacing in microseconds * - ``--control-port N`` - ``8193`` - UDP control port for emulator