Relay BP Decoding with CUDA-Q Realtime
========================================

.. note::

   The following information is about a C++ demonstration that must be built
   from source and is not part of any distributed CUDA-Q QEC binaries.

This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP
decoder using CUDA-Q's realtime host dispatch system. The decoder runs as a
CPU-launched CUDA graph (``HOST_LOOP`` dispatch path) and can operate in three
configurations:

- **CI unit test** -- standalone executable, no FPGA or network hardware needed
- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware
- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE

Prerequisites
-------------

Hardware
^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 25 25 20

   * - Configuration
     - GPU
     - ConnectX NIC
     - FPGA
   * - CI unit test
     - Any CUDA-capable GPU
     - Not required
     - Not required
   * - Emulated E2E
     - CUDA GPU with GPUDirect RDMA
     - Required (loopback cable)
     - Not required
   * - FPGA E2E
     - CUDA GPU with GPUDirect RDMA
     - Required
     - Required

Tested platforms: DGX Spark, GB200.

Software
^^^^^^^^

- **CUDA Toolkit**: 12.6 or later
- **CUDA-Q SDK**: pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``)
- **nv-qldpc-decoder plugin**: the proprietary nv-qldpc-decoder shared library
  (``libcudaq-qec-nv-qldpc-decoder.so``). Required at runtime for all three
  configurations.

Source Repositories
^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 40 30

   * - Repository
     - URL
     - Version
   * - **cudaqx**
     - https://github.com/NVIDIA/cudaqx
     - ``main`` branch (or your feature branch)
   * - **cuda-quantum** (realtime)
     - https://github.com/NVIDIA/cuda-quantum
     - Branch ``releases/v0.14.1``
   * - **holoscan-sensor-bridge**
     - https://github.com/nvidia-holoscan/holoscan-sensor-bridge
     - Tag ``2.6.0-EA2``

``cuda-quantum`` provides ``libcudaq-realtime`` (the host dispatcher, ring
buffer management, and dispatch kernel).
``holoscan-sensor-bridge`` provides the Hololink ``GpuRoceTransceiver`` library
for RDMA transport.

.. note::

   ``holoscan-sensor-bridge`` is only needed for the emulated and FPGA
   end-to-end tests. The CI unit test requires only ``libcudaq-realtime``.

Repository Layout
-----------------

Key files within ``cudaqx``:

.. code-block:: text

   libs/qec/
     unittests/
       realtime/
         qec_graph_decode_test/
           test_realtime_qldpc_graph_decoding.cpp    # CI unit test
         qec_roce_decode_test/
           data/
             config_nv_qldpc_relay.yml               # Relay BP decoder config
             syndromes_nv_qldpc_relay.txt            # 100 test syndrome shots
       utils/
         hololink_qldpc_graph_decoder_bridge.cpp     # Bridge tool (RDMA <-> decoder)
         hololink_qldpc_graph_decoder_test.sh        # Orchestration script
         hololink_fpga_syndrome_playback.cpp         # Playback tool (loads syndromes)

The FPGA emulator is in the ``cuda-quantum`` repository:

.. code-block:: text

   cuda-quantum/realtime/
     unittests/utils/
       hololink_fpga_emulator.cpp                    # Software FPGA emulator

Building
--------

CI unit test only (no Hololink tools)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you only need to run the CI unit test, you can build without
``holoscan-sensor-bridge``:

.. code-block:: bash

   # 1. Build libcudaq-realtime
   git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
   cd cudaq-realtime-src
   git checkout releases/v0.14.1
   cd realtime && mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
   ninja && ninja install
   cd ../../..

   # 2. Build cudaqx with the nv-qldpc-decoder test
   cmake -S cudaqx -B cudaqx/build \
     -DCMAKE_BUILD_TYPE=Release \
     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
     -DCUDAQX_ENABLE_LIBS="qec" \
     -DCUDAQX_INCLUDE_TESTS=ON
   cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding

Full build (CI test + Hololink bridge/playback tools)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To also build the bridge and playback tools for emulated or FPGA testing:

.. code-block:: bash

   # 1. Clone cuda-quantum (realtime)
   git clone --filter=blob:none --no-checkout \
     https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
   cd cudaq-realtime-src
   git sparse-checkout init --cone
   git sparse-checkout set realtime
   git checkout releases/v0.14.1
   cd ..

   # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
   #    Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger).
   #    If your system cmake is older: pip install cmake
   git clone --branch 2.6.0-EA2 \
     https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
   cd holoscan-sensor-bridge

   # Strip operators we don't need to avoid configure failures from missing deps
   sed -i '/add_subdirectory(audio_packetizer)/d;
           /add_subdirectory(compute_crc)/d;
           /add_subdirectory(csi_to_bayer)/d;
           /add_subdirectory(image_processor)/d;
           /add_subdirectory(iq_dec)/d;
           /add_subdirectory(iq_enc)/d;
           /add_subdirectory(linux_coe_receiver)/d;
           /add_subdirectory(linux_receiver)/d;
           /add_subdirectory(packed_format_converter)/d;
           /add_subdirectory(sub_frame_combiner)/d;
           /add_subdirectory(udp_transmitter)/d;
           /add_subdirectory(emulator)/d;
           /add_subdirectory(sig_gen)/d;
           /add_subdirectory(sig_viewer)/d' \
     src/hololink/operators/CMakeLists.txt

   mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
     -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
     -DHOLOLINK_BUILD_PYTHON=OFF \
     -DHOLOLINK_BUILD_TESTS=OFF \
     -DHOLOLINK_BUILD_TOOLS=OFF \
     -DHOLOLINK_BUILD_EXAMPLES=OFF \
     -DHOLOLINK_BUILD_EMULATOR=OFF ..
   cmake --build . --target gpu_roce_transceiver hololink_core
   cd ../..

   # 3. Build libcudaq-realtime with Hololink tools enabled
   #    This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge
   #    tool) as well as the FPGA emulator.
   cd cudaq-realtime-src/realtime && mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
     -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
     ..
   ninja && ninja install
   cd ../../..

   # 4. Build cudaqx with Hololink tools enabled
   cmake -S cudaqx -B cudaqx/build \
     -DCMAKE_BUILD_TYPE=Release \
     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
     -DCUDAQX_ENABLE_LIBS="qec" \
     -DCUDAQX_INCLUDE_TESTS=ON \
     -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
   cmake --build cudaqx/build --target \
     test_realtime_qldpc_graph_decoding \
     hololink_qldpc_graph_decoder_bridge \
     hololink_fpga_syndrome_playback

Using the orchestration script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The orchestration script can build everything automatically:

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --build \
     --hsb-dir /path/to/holoscan-sensor-bridge \
     --cuda-quantum-dir /path/to/cuda-quantum \
     --no-run

CI Unit Test
------------

The CI unit test (``test_realtime_qldpc_graph_decoding``) exercises the full
host dispatch decode path without any network hardware. It:

1. Loads the Relay BP config and syndrome data from YAML/text files
2. Creates the decoder via the ``decoder::get("nv-qldpc-decoder", ...)``
   plugin API
3. Captures a CUDA graph of the decode pipeline
4. Wires ``libcudaq-realtime``'s host dispatcher (HOST_LOOP) to a ring buffer
5. Writes RPC requests into the ring buffer, the host dispatcher launches the
   CUDA graph, and the test verifies corrections

Running
^^^^^^^

.. code-block:: bash

   cd cudaqx/build

   # The nv-qldpc-decoder plugin must be discoverable at runtime.
   # Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
   export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so

   ./libs/qec/unittests/test_realtime_qldpc_graph_decoding

Expected output:

.. code-block:: text

   [==========] Running 1 test from 1 test suite.
   [----------] 1 test from RealtimeQLDPCGraphDecodingTest
   [ RUN      ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
   ...
   [       OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
   [==========] 1 test from 1 test suite ran.
   [  PASSED  ] 1 test.

Emulated End-to-End Test
------------------------

The emulated test replaces the physical FPGA with a software emulator. Three
processes run concurrently:

1. **Emulator** -- receives syndromes via the UDP control plane, sends them to
   the bridge via RDMA, and captures corrections
2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the
   GPU, receiving syndromes and sending corrections via RDMA
3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers
   playback, then verifies corrections

Requirements
^^^^^^^^^^^^

- ConnectX NIC with a loopback cable connecting both ports (the emulator sends
  RDMA traffic out one port and the bridge receives on the other)
- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the
  `cuda-quantum realtime build guide `__
- All three tools built (bridge, playback, emulator)

Running
^^^^^^^

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --emulate \
     --build \
     --setup-network \
     --hsb-dir /path/to/holoscan-sensor-bridge

The ``--setup-network`` flag configures the ConnectX interface with the
appropriate IP addresses and MTU. It only needs to be run once per boot.

After the initial build and network setup, subsequent runs are faster:

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate

FPGA End-to-End Test
--------------------

The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC. Two
processes run:

1. **Bridge** -- same as emulated mode
2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback,
   then reads back corrections from the FPGA's capture RAM to verify them

Requirements
^^^^^^^^^^^^

- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via
  direct cable or switch. Bitfiles for supported FPGA vendors are available
  `here `__. See the `cuda-quantum realtime user guide `__ for FPGA setup
  instructions.
- FPGA IP and bridge IP on the same subnet
- ConnectX device name (e.g., ``mlx5_4``, ``mlx5_5``)

Running
^^^^^^^

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --build \
     --setup-network \
     --device mlx5_5 \
     --bridge-ip 192.168.0.1 \
     --fpga-ip 192.168.0.2 \
     --gpu 2 \
     --page-size 512 \
     --hsb-dir /path/to/holoscan-sensor-bridge

Key parameters for FPGA mode:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Parameter
     - Description
   * - ``--device``
     - ConnectX IB device name (e.g., ``mlx5_5``)
   * - ``--bridge-ip``
     - IP address assigned to the ConnectX interface
   * - ``--fpga-ip``
     - FPGA's IP address
   * - ``--gpu``
     - GPU device ID (choose NUMA-local GPU for lowest latency)
   * - ``--page-size``
     - Ring buffer slot size in bytes (use ``512`` on GB200 for alignment)
   * - ``--spacing``
     - Inter-shot spacing in microseconds

.. note::

   The ``--spacing`` value should be set to at least the per-shot decode time
   to avoid overrunning the input ring buffer. If syndromes arrive faster than
   the decoder can process them, the buffer fills up and messages are lost.
   Use a ``--spacing`` value at or above the observed decode time for
   sustained operation.

GPU Selection
^^^^^^^^^^^^^

For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
For example, on a GB200 system where ``mlx5_5`` is on NUMA node 1, use
``--gpu 2`` or ``--gpu 3``.

Check NUMA locality with:

.. code-block:: bash

   cat /sys/class/infiniband/<device>/device/numa_node

Network Sanity Check
^^^^^^^^^^^^^^^^^^^^

Before running, verify that the bridge IP is assigned to exactly one interface:

.. code-block:: bash

   ip addr show | grep 192.168.0.1

If multiple interfaces show the same IP, remove the duplicate to avoid routing
ambiguity that silently drops RDMA packets.

Orchestration Script Reference
------------------------------

.. code-block:: text

   hololink_qldpc_graph_decoder_test.sh [options]

Modes
^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Flag
     - Description
   * - ``--emulate``
     - Use FPGA emulator (no real FPGA needed)
   * - *(default)*
     - FPGA mode (requires real FPGA)

Actions
^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Flag
     - Description
   * - ``--build``
     - Build all required tools before running
   * - ``--setup-network``
     - Configure ConnectX network interfaces
   * - ``--no-run``
     - Skip running the test (useful with ``--build``)

Build Options
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - Flag
     - Default
     - Description
   * - ``--hsb-dir DIR``
     - ``/workspaces/holoscan-sensor-bridge``
     - holoscan-sensor-bridge source directory
   * - ``--cuda-quantum-dir DIR``
     - ``/workspaces/cuda-quantum``
     - cuda-quantum source directory
   * - ``--cuda-qx-dir DIR``
     - ``/workspaces/cudaqx``
     - cudaqx source directory
   * - ``--jobs N``
     - ``nproc``
     - Parallel build jobs

Network Options
^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Flag
     - Default
     - Description
   * - ``--device DEV``
     - auto-detect
     - ConnectX IB device name
   * - ``--bridge-ip ADDR``
     - ``10.0.0.1``
     - Bridge tool IP address
   * - ``--emulator-ip ADDR``
     - ``10.0.0.2``
     - Emulator IP (emulate mode)
   * - ``--fpga-ip ADDR``
     - ``192.168.0.2``
     - FPGA IP address
   * - ``--mtu N``
     - ``4096``
     - MTU size

Run Options
^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Flag
     - Default
     - Description
   * - ``--gpu N``
     - ``0``
     - GPU device ID
   * - ``--timeout N``
     - ``60``
     - Timeout in seconds
   * - ``--num-shots N``
     - all available
     - Limit number of syndrome shots
   * - ``--page-size N``
     - ``384``
     - Ring buffer slot size in bytes
   * - ``--num-pages N``
     - ``128``
     - Number of ring buffer slots
   * - ``--spacing N``
     - ``10``
     - Inter-shot spacing in microseconds
   * - ``--no-verify``
     - *(verify)*
     - Skip correction verification
   * - ``--control-port N``
     - ``8193``
     - UDP control port for emulator
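
The ``--page-size``, ``--num-pages``, and ``--spacing`` options interact: the
ring buffer holds ``--num-pages`` slots of ``--page-size`` bytes each, and it
overruns when shots arrive faster than they are decoded. The following is a
minimal sketch of that arithmetic and is not part of the orchestration script.
The helper names ``ring_bytes`` and ``shots_until_full`` are hypothetical, the
15 microsecond decode time is an illustrative value rather than a measurement,
and the model assumes a steady arrival rate with one slot freed per completed
decode.

.. code-block:: bash

   #!/usr/bin/env bash
   # Hypothetical helpers for reasoning about ring buffer sizing.
   # They only do arithmetic and do not touch the decoder or the network.

   # Total ring buffer size in bytes: slot size times number of slots.
   ring_bytes() {
     local page_size=$1 num_pages=$2
     echo $(( page_size * num_pages ))
   }

   # Approximate number of shots that can be sent before the buffer fills
   # when a shot arrives every spacing_us but takes decode_us to decode.
   # Backlog after k shots is about k * (decode_us - spacing_us) / decode_us,
   # so the buffer is full at k = num_pages * decode_us / (decode_us - spacing_us).
   # Prints "unbounded" when the decoder keeps up with the arrival rate.
   shots_until_full() {
     local num_pages=$1 spacing_us=$2 decode_us=$3
     if (( decode_us <= spacing_us )); then
       echo "unbounded"
       return
     fi
     echo $(( num_pages * decode_us / (decode_us - spacing_us) ))
   }

   # Defaults from the table above: --page-size 384, --num-pages 128.
   ring_bytes 384 128

   # Default --spacing 10 with an assumed (illustrative) 15 us decode time.
   shots_until_full 128 10 15

With the default slot size and slot count, the buffer occupies 49152 bytes.
Under the assumed 15 microsecond decode time and the default 10 microsecond
spacing, this simple model fills the buffer after roughly 384 shots, which is
why ``--spacing`` should be at or above the observed decode time for sustained
runs.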