Relay BP Decoding with CUDA-Q Realtime
========================================

.. note::

  The following information is about a C++ demonstration that must be built
  from source and is not part of any distributed CUDA-Q QEC binaries.

This guide explains how to build, test, and run the nv-qldpc-decoder Relay BP
decoder using CUDA-Q's realtime host dispatch system.  The decoder runs as a
CPU-launched CUDA graph (``HOST_LOOP`` dispatch path) and can operate in three
configurations:

- **CI unit test** -- standalone executable, no FPGA or network hardware needed
- **Emulated end-to-end test** -- software FPGA emulator replaces real hardware
- **FPGA end-to-end test** -- real FPGA connected via ConnectX RDMA/RoCE

Prerequisites
-------------

Hardware
^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 25 25 20

   * - Configuration
     - GPU
     - ConnectX NIC
     - FPGA
   * - CI unit test
     - Any CUDA-capable GPU
     - Not required
     - Not required
   * - Emulated E2E
     - CUDA GPU with GPUDirect RDMA
     - Required (loopback cable)
     - Not required
   * - FPGA E2E
     - CUDA GPU with GPUDirect RDMA
     - Required
     - Required

Tested platforms: DGX Spark, GB200.

Software
^^^^^^^^

- **CUDA Toolkit**: 12.6 or later
- **CUDA-Q SDK**: pre-installed (provides ``libcudaq``, ``libnvqir``, ``nvq++``)
- **nv-qldpc-decoder plugin**: the proprietary nv-qldpc-decoder shared library
  (``libcudaq-qec-nv-qldpc-decoder.so``).  Required at runtime for all
  three configurations.

Source Repositories
^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 30 40 30

   * - Repository
     - URL
     - Version
   * - **cudaqx**
     - https://github.com/NVIDIA/cudaqx
     - ``main`` branch (or your feature branch)
   * - **cuda-quantum** (realtime)
     - https://github.com/NVIDIA/cuda-quantum
     - Branch ``releases/v0.14.1``
   * - **holoscan-sensor-bridge**
     - https://github.com/nvidia-holoscan/holoscan-sensor-bridge
     - Tag ``2.6.0-EA2``

``cuda-quantum`` provides ``libcudaq-realtime`` (the host dispatcher, ring
buffer management, and dispatch kernel).  ``holoscan-sensor-bridge`` provides
the Hololink ``GpuRoceTransceiver`` library for RDMA transport.

.. note::

   ``holoscan-sensor-bridge`` is only needed for the emulated and FPGA
   end-to-end tests.  The CI unit test requires only ``libcudaq-realtime``.

Repository Layout
-----------------

Key files within ``cudaqx``:

.. code-block:: text

   libs/qec/
     unittests/
       realtime/
         qec_graph_decode_test/
           test_realtime_qldpc_graph_decoding.cpp   # CI unit test
         qec_roce_decode_test/
           data/
             config_nv_qldpc_relay.yml              # Relay BP decoder config
             syndromes_nv_qldpc_relay.txt           # 100 test syndrome shots
       utils/
         hololink_qldpc_graph_decoder_bridge.cpp    # Bridge tool (RDMA <-> decoder)
         hololink_qldpc_graph_decoder_test.sh       # Orchestration script
         hololink_fpga_syndrome_playback.cpp        # Playback tool (loads syndromes)

The FPGA emulator is in the ``cuda-quantum`` repository:

.. code-block:: text

   cuda-quantum/realtime/
     unittests/utils/
       hololink_fpga_emulator.cpp                   # Software FPGA emulator

Building
--------

CI unit test only (no Hololink tools)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If you only need to run the CI unit test, you can build without
``holoscan-sensor-bridge``:

.. code-block:: bash

   # 1. Build libcudaq-realtime
   git clone https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
   cd cudaq-realtime-src
   git checkout releases/v0.14.1
   cd realtime && mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime ..
   ninja && ninja install
   cd ../../..

   # 2. Build cudaqx with the nv-qldpc-decoder test
   cmake -S cudaqx -B cudaqx/build \
     -DCMAKE_BUILD_TYPE=Release \
     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
     -DCUDAQX_ENABLE_LIBS="qec" \
     -DCUDAQX_INCLUDE_TESTS=ON
   cmake --build cudaqx/build --target test_realtime_qldpc_graph_decoding

Full build (CI test + Hololink bridge/playback tools)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To also build the bridge and playback tools for emulated or FPGA testing:

.. code-block:: bash

   # 1. Clone cuda-quantum (realtime)
   git clone --filter=blob:none --no-checkout \
     https://github.com/NVIDIA/cuda-quantum.git cudaq-realtime-src
   cd cudaq-realtime-src
   git sparse-checkout init --cone
   git sparse-checkout set realtime
   git checkout releases/v0.14.1
   cd ..

   # 2. Build holoscan-sensor-bridge (tag 2.6.0-EA2)
   #    Requires cmake >= 3.30.4 (HSB -> find_package(holoscan) -> rapids_logger).
   #    If your system cmake is older: pip install cmake
   git clone --branch 2.6.0-EA2 \
     https://github.com/nvidia-holoscan/holoscan-sensor-bridge.git
   cd holoscan-sensor-bridge

   # Strip operators we don't need to avoid configure failures from missing deps
   sed -i '/add_subdirectory(audio_packetizer)/d; /add_subdirectory(compute_crc)/d;
           /add_subdirectory(csi_to_bayer)/d; /add_subdirectory(image_processor)/d;
           /add_subdirectory(iq_dec)/d; /add_subdirectory(iq_enc)/d;
           /add_subdirectory(linux_coe_receiver)/d; /add_subdirectory(linux_receiver)/d;
           /add_subdirectory(packed_format_converter)/d; /add_subdirectory(sub_frame_combiner)/d;
           /add_subdirectory(udp_transmitter)/d; /add_subdirectory(emulator)/d;
           /add_subdirectory(sig_gen)/d; /add_subdirectory(sig_viewer)/d' \
     src/hololink/operators/CMakeLists.txt

   mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_BUILD_TYPE=Release \
     -DHOLOLINK_BUILD_ONLY_NATIVE=OFF \
     -DHOLOLINK_BUILD_PYTHON=OFF \
     -DHOLOLINK_BUILD_TESTS=OFF \
     -DHOLOLINK_BUILD_TOOLS=OFF \
     -DHOLOLINK_BUILD_EXAMPLES=OFF \
     -DHOLOLINK_BUILD_EMULATOR=OFF ..
   cmake --build . --target gpu_roce_transceiver hololink_core
   cd ../..

   # 3. Build libcudaq-realtime with Hololink tools enabled
   #    This produces libcudaq-realtime-bridge-hololink.so (needed by the bridge
   #    tool) as well as the FPGA emulator.
   cd cudaq-realtime-src/realtime && mkdir -p build && cd build
   cmake -G Ninja -DCMAKE_INSTALL_PREFIX=/tmp/cudaq-realtime \
     -DCUDAQ_REALTIME_ENABLE_HOLOLINK_TOOLS=ON \
     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=../../holoscan-sensor-bridge \
     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=../../holoscan-sensor-bridge/build \
     ..
   ninja && ninja install
   cd ../../..

   # 4. Build cudaqx with Hololink tools enabled
   cmake -S cudaqx -B cudaqx/build \
     -DCMAKE_BUILD_TYPE=Release \
     -DCUDAQ_DIR=/path/to/cudaq-install/lib/cmake/cudaq/ \
     -DCUDAQ_REALTIME_ROOT=/tmp/cudaq-realtime \
     -DCUDAQX_ENABLE_LIBS="qec" \
     -DCUDAQX_INCLUDE_TESTS=ON \
     -DCUDAQX_QEC_ENABLE_HOLOLINK_TOOLS=ON \
     -DHOLOSCAN_SENSOR_BRIDGE_SOURCE_DIR=/path/to/holoscan-sensor-bridge \
     -DHOLOSCAN_SENSOR_BRIDGE_BUILD_DIR=/path/to/holoscan-sensor-bridge/build
   cmake --build cudaqx/build --target \
     test_realtime_qldpc_graph_decoding \
     hololink_qldpc_graph_decoder_bridge \
     hololink_fpga_syndrome_playback

Using the orchestration script
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The orchestration script can build everything automatically:

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --build \
     --hsb-dir /path/to/holoscan-sensor-bridge \
     --cuda-quantum-dir /path/to/cuda-quantum \
     --no-run

CI Unit Test
------------

The CI unit test (``test_realtime_qldpc_graph_decoding``) exercises the full
host dispatch decode path without any network hardware.  It:

1. Loads the Relay BP config and syndrome data from YAML/text files
2. Creates the decoder via the ``decoder::get("nv-qldpc-decoder", ...)`` plugin API
3. Captures a CUDA graph of the decode pipeline
4. Wires ``libcudaq-realtime``'s host dispatcher (HOST_LOOP) to a ring buffer
5. Writes RPC requests into the ring buffer, the host dispatcher launches the
   CUDA graph, and the test verifies corrections

Running
^^^^^^^

.. code-block:: bash

   cd cudaqx/build

   # The nv-qldpc-decoder plugin must be discoverable at runtime.
   # Set QEC_EXTERNAL_DECODERS if the plugin is not in the default search path:
   export QEC_EXTERNAL_DECODERS=/path/to/libcudaq-qec-nv-qldpc-decoder.so

   ./libs/qec/unittests/test_realtime_qldpc_graph_decoding

Expected output:

.. code-block:: text

   [==========] Running 1 test from 1 test suite.
   [----------] 1 test from RealtimeQLDPCGraphDecodingTest
   [ RUN      ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots
   ...
   [       OK ] RealtimeQLDPCGraphDecodingTest.DispatchHostLoopAllShots (XXX ms)
   [==========] 1 test from 1 test suite ran.
   [  PASSED  ] 1 test.

Emulated End-to-End Test
------------------------

The emulated test replaces the physical FPGA with a software emulator.  Three
processes run concurrently:

1. **Emulator** -- receives syndromes via the UDP control plane, sends them
   to the bridge via RDMA, and captures corrections
2. **Bridge** -- runs the host dispatcher and CUDA graph decode loop on the GPU,
   receiving syndromes and sending corrections via RDMA
3. **Playback** -- loads syndrome data into the emulator's BRAM and triggers
   playback, then verifies corrections

Requirements
^^^^^^^^^^^^

- ConnectX NIC with a loopback cable connecting both ports (the emulator
  sends RDMA traffic out one port and the bridge receives on the other)
- Software dependencies (DOCA, Holoscan SDK, etc.) as described in the
  `cuda-quantum realtime build guide <https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/building.md>`__
- All three tools built (bridge, playback, emulator)

Running
^^^^^^^

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --emulate \
     --build \
     --setup-network \
     --hsb-dir /path/to/holoscan-sensor-bridge

The ``--setup-network`` flag configures the ConnectX interface with the
appropriate IP addresses and MTU.  It only needs to be run once per boot.

After the initial build and network setup, subsequent runs are faster:

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh --emulate

FPGA End-to-End Test
--------------------

The FPGA test uses a real FPGA connected to the GPU via a ConnectX NIC.  Two
processes run:

1. **Bridge** -- same as emulated mode
2. **Playback** -- loads syndromes into the FPGA's BRAM and triggers playback,
   then reads back corrections from the FPGA's capture RAM to verify them

Requirements
^^^^^^^^^^^^

- FPGA programmed with the HSB IP bitfile, connected to a ConnectX NIC via
  direct cable or switch.  Bitfiles for supported FPGA vendors are available
  `here <https://edge.urm.nvidia.com/artifactory/sw-holoscan-thirdparty-generic-local/QEC/HSB-2.6.0-EA/>`__.
  See the `cuda-quantum realtime user guide <https://github.com/NVIDIA/cuda-quantum/blob/main/realtime/docs/user_guide.md>`__
  for FPGA setup instructions.
- FPGA IP and bridge IP on the same subnet
- ConnectX device name (e.g., ``mlx5_4``, ``mlx5_5``)

Running
^^^^^^^

.. code-block:: bash

   ./libs/qec/unittests/utils/hololink_qldpc_graph_decoder_test.sh \
     --build \
     --setup-network \
     --device mlx5_5 \
     --bridge-ip 192.168.0.1 \
     --fpga-ip 192.168.0.2 \
     --gpu 2 \
     --page-size 512 \
     --hsb-dir /path/to/holoscan-sensor-bridge

Key parameters for FPGA mode:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Parameter
     - Description
   * - ``--device``
     - ConnectX IB device name (e.g., ``mlx5_5``)
   * - ``--bridge-ip``
     - IP address assigned to the ConnectX interface
   * - ``--fpga-ip``
     - FPGA's IP address
   * - ``--gpu``
     - GPU device ID (choose NUMA-local GPU for lowest latency)
   * - ``--page-size``
     - Ring buffer slot size in bytes (use ``512`` on GB200 for alignment)
   * - ``--spacing``
     - Inter-shot spacing in microseconds

.. note::

   The ``--spacing`` value should be set to at least the per-shot decode
   time to avoid overrunning the input ring buffer.  If syndromes arrive faster
   than the decoder can process them, the buffer fills up and messages are lost.
   Use a ``--spacing`` value at or above the observed decode time for sustained
   operation.

GPU Selection
^^^^^^^^^^^^^

For lowest latency, choose a GPU that is NUMA-local to the ConnectX NIC.
For example, on a GB200 system where ``mlx5_5`` is on NUMA node 1,
use ``--gpu 2`` or ``--gpu 3``.  Check NUMA locality with:

.. code-block:: bash

   cat /sys/class/infiniband/<device>/device/numa_node

Network Sanity Check
^^^^^^^^^^^^^^^^^^^^

Before running, verify that the bridge IP is assigned to exactly one interface:

.. code-block:: bash

   ip addr show | grep 192.168.0.1

If multiple interfaces show the same IP, remove the duplicate to avoid
routing ambiguity that silently drops RDMA packets.

Orchestration Script Reference
------------------------------

.. code-block:: text

   hololink_qldpc_graph_decoder_test.sh [options]

Modes
^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Flag
     - Description
   * - ``--emulate``
     - Use FPGA emulator (no real FPGA needed)
   * - *(default)*
     - FPGA mode (requires real FPGA)

Actions
^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Flag
     - Description
   * - ``--build``
     - Build all required tools before running
   * - ``--setup-network``
     - Configure ConnectX network interfaces
   * - ``--no-run``
     - Skip running the test (useful with ``--build``)

Build Options
^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 30 45

   * - Flag
     - Default
     - Description
   * - ``--hsb-dir DIR``
     - ``/workspaces/holoscan-sensor-bridge``
     - holoscan-sensor-bridge source directory
   * - ``--cuda-quantum-dir DIR``
     - ``/workspaces/cuda-quantum``
     - cuda-quantum source directory
   * - ``--cuda-qx-dir DIR``
     - ``/workspaces/cudaqx``
     - cudaqx source directory
   * - ``--jobs N``
     - ``nproc``
     - Parallel build jobs

Network Options
^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 20 55

   * - Flag
     - Default
     - Description
   * - ``--device DEV``
     - auto-detect
     - ConnectX IB device name
   * - ``--bridge-ip ADDR``
     - ``10.0.0.1``
     - Bridge tool IP address
   * - ``--emulator-ip ADDR``
     - ``10.0.0.2``
     - Emulator IP (emulate mode)
   * - ``--fpga-ip ADDR``
     - ``192.168.0.2``
     - FPGA IP address
   * - ``--mtu N``
     - ``4096``
     - MTU size

Run Options
^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Flag
     - Default
     - Description
   * - ``--gpu N``
     - ``0``
     - GPU device ID
   * - ``--timeout N``
     - ``60``
     - Timeout in seconds
   * - ``--num-shots N``
     - all available
     - Limit number of syndrome shots
   * - ``--page-size N``
     - ``384``
     - Ring buffer slot size in bytes
   * - ``--num-pages N``
     - ``128``
     - Number of ring buffer slots
   * - ``--spacing N``
     - ``10``
     - Inter-shot spacing in microseconds
   * - ``--no-verify``
     - *(verify)*
     - Skip correction verification
   * - ``--control-port N``
     - ``8193``
     - UDP control port for emulator