Multi-Processor Platforms

The CUDA-Q machine model describes the devices that make up a quantum-classical compute node. Programmers have access to one or more host CPUs, zero or more NVIDIA GPUs, a classical QPU control space, and the quantum register itself. The specification also notes that the underlying platform may expose multiple QPUs. This is unlikely for physical QPUs in the near term, but GPU-based circuit simulators on NVIDIA multi-GPU architectures already make it possible to program such a multi-QPU architecture today. CUDA-Q lets one query information about the underlying quantum platform via the quantum_platform abstraction. This type exposes a num_qpus() method that returns the number of QPUs available for asynchronous CUDA-Q kernel and cudaq:: function invocations. Each available QPU is assigned a logical index, and programmers can launch asynchronous function invocations that target a specific QPU by that index.

NVIDIA MQPU Platform

In the multi-QPU mode (mqpu option), the NVIDIA target provides a simulated QPU for every available NVIDIA GPU on the underlying system. Each QPU is simulated via a cuStateVec simulator backend as defined by the NVIDIA target. For more information about using multiple GPUs to simulate each virtual QPU, or using a different backend for virtual QPUs, see the Remote MQPU Platform section. This target enables asynchronous parallel execution of quantum kernel tasks.

Here is a simple example demonstrating its usage.

import cudaq

cudaq.set_target("nvidia", option="mqpu")
target = cudaq.get_target()
qpu_count = target.num_qpus()
print("Number of QPUs:", qpu_count)


@cudaq.kernel
def kernel(qubit_count: int):
    qvector = cudaq.qvector(qubit_count)
    # Place qubits in superposition state.
    h(qvector)
    # Measure.
    mz(qvector)


count_futures = []
for qpu in range(qpu_count):
    count_futures.append(cudaq.sample_async(kernel, 5, qpu_id=qpu))

for counts in count_futures:
    print(counts.get())

  auto kernelToBeSampled = [](int runtimeParam) __qpu__ {
    cudaq::qvector q(runtimeParam);
    h(q);
    mz(q);
  };

  // Get the quantum_platform singleton
  auto &platform = cudaq::get_platform();

  // Query the number of QPUs in the system
  auto num_qpus = platform.num_qpus();
  printf("Number of QPUs: %zu\n", num_qpus);
  // We will launch asynchronous sampling tasks
  // and will store the results immediately as a future
  // we can query at some later point
  std::vector<cudaq::async_sample_result> countFutures;
  for (std::size_t i = 0; i < num_qpus; i++) {
    countFutures.emplace_back(
        cudaq::sample_async(i, kernelToBeSampled, 5 /*runtimeParam*/));
  }

  //
  // Go do other work, asynchronous execution of sample tasks on-going
  //

  // Get the results, note future::get() will kick off a wait
  // if the results are not yet available.
  for (auto &counts : countFutures) {
    counts.get().dump();
  }

One can specify the target multi-QPU architecture with the --target flag:

nvq++ sample_async.cpp --target nvidia --target-option mqpu
./a.out

CUDA-Q exposes asynchronous versions of the default cudaq algorithmic primitive functions such as sample, observe, and get_state (e.g., the sample_async function in the above code snippets).

Depending on the number of GPUs available on the system, the nvidia multi-QPU platform will create the same number of virtual QPU instances. For example, on a system with 4 GPUs, the above code will distribute the four sampling tasks among those GPUEmulatedQPU instances.

The results might look like the following 4 different random samplings:

Number of QPUs: 4
{ 10011:28 01100:28 ... }
{ 10011:37 01100:25 ... }
{ 10011:29 01100:25 ... }
{ 10011:33 01100:30 ... }

Note

By default, the nvidia multi-QPU platform will utilize all available GPUs (the number of QPU instances is equal to the number of GPUs). To specify the number of QPUs to be instantiated, one can set the CUDAQ_MQPU_NGPUS environment variable. For example, use export CUDAQ_MQPU_NGPUS=2 to specify that only 2 QPUs (GPUs) are needed.
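When there are more tasks than QPUs (for example after limiting the platform with CUDAQ_MQPU_NGPUS=2), the tasks have to be mapped onto the available logical QPU indices. The following is a minimal sketch of one possible mapping, a plain round-robin assignment. The helper name assign_round_robin is hypothetical and not part of CUDA-Q; the code only computes indices and does not call into the library.

```python
from typing import List


def assign_round_robin(task_count: int, qpu_count: int) -> List[int]:
    """Return the logical QPU index for each task, cycling through the QPUs."""
    if qpu_count <= 0:
        raise ValueError("qpu_count must be positive")
    return [task % qpu_count for task in range(task_count)]


# Six tasks on a platform that exposes two QPUs.
assignment = assign_round_robin(6, 2)
print(assignment)
```

Each resulting index could then be passed as the qpu_id argument of cudaq.sample_async, in the same way the loop in the example above passes qpu.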

Since the underlying GPUEmulatedQPU is a simulator backend, we can also retrieve the state vector from each QPU via cudaq::get_state_async (C++) or cudaq.get_state_async (Python), as shown in the code snippets below.

import cudaq

cudaq.set_target("nvidia", option="mqpu")
target = cudaq.get_target()
qpu_count = target.num_qpus()
print("Number of QPUs:", qpu_count)


@cudaq.kernel
def kernel():
    qvector = cudaq.qvector(5)
    # Place qubits in GHZ State
    h(qvector[0])
    for qubit in range(4):
        x.ctrl(qvector[qubit], qvector[qubit + 1])


state_futures = []
for qpu in range(qpu_count):
    state_futures.append(cudaq.get_state_async(kernel, qpu_id=qpu))

for state in state_futures:
    print(state.get())

  auto kernelToRun = [](int runtimeParam) __qpu__ {
    cudaq::qvector q(runtimeParam);
    h(q[0]);
    for (int i = 0; i < runtimeParam - 1; ++i)
      x<cudaq::ctrl>(q[i], q[i + 1]);
  };

  // Get the quantum_platform singleton
  auto &platform = cudaq::get_platform();

  // Query the number of QPUs in the system
  auto num_qpus = platform.num_qpus();
  printf("Number of QPUs: %zu\n", num_qpus);
  // We will launch asynchronous tasks
  // and will store the results immediately as a future
  // we can query at some later point
  std::vector<cudaq::async_state_result> stateFutures;
  for (std::size_t i = 0; i < num_qpus; i++) {
    stateFutures.emplace_back(
        cudaq::get_state_async(i, kernelToRun, 5 /*runtimeParam*/));
  }

  //
  // Go do other work, asynchronous execution of tasks on-going
  //

  // Get the results, note future::get() will kick off a wait
  // if the results are not yet available.
  for (auto &state : stateFutures) {
    state.get().dump();
  }

One can specify the target multi-QPU architecture with the --target flag:

nvq++ get_state_async.cpp --target nvidia --target-option mqpu
./a.out

Deprecated since version 0.8: The nvidia-mqpu and nvidia-mqpu-fp64 targets, which are equivalent to the multi-QPU options mqpu,fp32 and mqpu,fp64, respectively, of the nvidia target, are deprecated and will be removed in a future release.

Parallel distribution mode

The CUDA-Q nvidia multi-QPU platform supports two modes of parallel distribution of expectation value computation:

  • MPI: distribute the expectation value computations across available MPI ranks and GPUs for each Hamiltonian term.

  • Thread: distribute the expectation value computations among available GPUs via standard C++ threads (each thread handles one GPU).

For instance, if all GPUs are available on a single node, thread-based parallel distribution (cudaq::parallel::thread in C++ or cudaq.parallel.thread in Python) is sufficient. On the other hand, to distribute the tasks across GPUs on multiple nodes, e.g., on a compute cluster, the MPI distribution mode should be used.
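To make the idea of distributing per-term expectation value computations more concrete, here is a minimal sketch that splits a list of Hamiltonian terms into nearly equal chunks, one per worker (an MPI rank or a thread). The function partition_terms and the string labels for the terms are illustrative assumptions; how CUDA-Q actually assigns terms to ranks or GPUs is not specified in this section and may differ.

```python
from typing import List, Sequence


def partition_terms(terms: Sequence[str], worker_count: int) -> List[List[str]]:
    """Split terms into worker_count contiguous chunks of nearly equal size."""
    if worker_count <= 0:
        raise ValueError("worker_count must be positive")
    base, extra = divmod(len(terms), worker_count)
    chunks = []
    start = 0
    for worker in range(worker_count):
        # The first `extra` workers take one additional term.
        size = base + (1 if worker < extra else 0)
        chunks.append(list(terms[start:start + size]))
        start += size
    return chunks


# Labels standing in for the five terms of a small spin Hamiltonian.
terms = ["I", "X0X1", "Y0Y1", "Z0", "Z1"]
for worker, chunk in enumerate(partition_terms(terms, 2)):
    print(worker, chunk)
```

Each worker would evaluate the expectation values of its own chunk, and the partial results would be summed to obtain the total expectation value.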

An example of MPI distribution mode usage in both C++ and Python is given below:

import cudaq
from cudaq import spin

cudaq.mpi.initialize()
cudaq.set_target("nvidia", option="mqpu")


# Define spin ansatz.
@cudaq.kernel
def kernel(angle: float):
    qvector = cudaq.qvector(2)
    x(qvector[0])
    ry(angle, qvector[1])
    x.ctrl(qvector[1], qvector[0])


# Define spin Hamiltonian.
hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) - 2.1433 * spin.y(
    0) * spin.y(1) + .21829 * spin.z(0) - 6.125 * spin.z(1)

exp_val = cudaq.observe(kernel, hamiltonian, 0.59,
                        execution=cudaq.parallel.mpi).expectation()
if cudaq.mpi.rank() == 0:
    print("Expectation value: ", exp_val)

cudaq.mpi.finalize()

mpiexec -np <N> python3 file.py

  cudaq::mpi::initialize();
  using namespace cudaq::spin;
  cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                     .21829 * z(0) - 6.125 * z(1);

  auto ansatz = [](double theta) __qpu__ {
    cudaq::qubit q, r;
    x(q);
    ry(theta, r);
    x<cudaq::ctrl>(r, q);
  };

  double result = cudaq::observe<cudaq::parallel::mpi>(ansatz, h, 0.59);
  if (cudaq::mpi::rank() == 0)
    printf("Expectation value: %lf\n", result);
  cudaq::mpi::finalize();

nvq++ file.cpp --target nvidia --target-option mqpu
mpiexec -np <N> a.out

In the above example, the parallel distribution mode was set to mpi using cudaq::parallel::mpi in C++ or cudaq.parallel.mpi in Python. CUDA-Q provides MPI utility functions to initialize, finalize, or query (rank, size, etc.) the MPI runtime. Finally, the compiled executable (C++) or Python script needs to be launched with an appropriate MPI command, e.g., mpiexec, mpirun, srun, etc.

Remote MQPU Platform

As shown in the above examples, the multi-QPU NVIDIA platform enables multi-QPU distribution whereby each QPU is simulated by a single NVIDIA GPU. To run multi-QPU workloads on different simulator backends, one can use the remote-mqpu platform, which encapsulates simulated QPUs as independent HTTP REST server instances. The following code illustrates how to launch asynchronous sampling tasks on multiple virtual QPUs, each simulated by a tensornet simulator backend.

    # Specified as program input, e.g.
    # ```
    # backend = "tensornet"; servers = "2"
    # ```
    backend = args.backend
    servers = args.servers

    # Define a kernel to be sampled.
    @cudaq.kernel
    def kernel(controls_count: int):
        controls = cudaq.qvector(controls_count)
        targets = cudaq.qvector(2)
        # Place controls in superposition state.
        h(controls)
        for target in range(2):
            x.ctrl(controls, targets[target])
        # Measure.
        mz(controls)
        mz(targets)

    # Set the target to execute on and query the number of QPUs in the system;
    # The number of QPUs is equal to the number of (auto-)launched server instances.
    cudaq.set_target("remote-mqpu",
                     backend=backend,
                     auto_launch=str(servers) if servers.isdigit() else "",
                     url="" if servers.isdigit() else servers)
    qpu_count = cudaq.get_target().num_qpus()
    print("Number of virtual QPUs:", qpu_count)

    # We will launch asynchronous sampling tasks,
    # and will store the results as a future we can query at some later point.
    # Each QPU (indexed by a unique ID) is associated with a remote REST server.
    count_futures = []
    for i in range(qpu_count):

        result = cudaq.sample_async(kernel, i + 1, qpu_id=i)
        count_futures.append(result)
    print("Sampling jobs launched for asynchronous processing.")

    # Go do other work, asynchronous execution of sample tasks on-going.
    # Get the results, note future::get() will kick off a wait
    # if the results are not yet available.
    for idx in range(len(count_futures)):
        counts = count_futures[idx].get()
        print(counts)

  // Define a kernel to be sampled.
  auto [kernel, nrControls] = cudaq::make_kernel<int>();
  auto controls = kernel.qalloc(nrControls);
  auto targets = kernel.qalloc(2);
  kernel.h(controls);
  for (std::size_t tidx = 0; tidx < 2; ++tidx) {
    kernel.x<cudaq::ctrl>(controls, targets[tidx]);
  }
  kernel.mz(controls);
  kernel.mz(targets);

  // Query the number of QPUs in the system;
  // The number of QPUs is equal to the number of (auto-)launched server
  // instances.
  auto &platform = cudaq::get_platform();
  auto num_qpus = platform.num_qpus();
  printf("Number of QPUs: %zu\n", num_qpus);

  // We will launch asynchronous sampling tasks,
  // and will store the results as a future we can query at some later point.
  // Each QPU (indexed by a unique ID) is associated with a remote REST server.
  std::vector<cudaq::async_sample_result> countFutures;
  for (std::size_t i = 0; i < num_qpus; i++) {
    countFutures.emplace_back(cudaq::sample_async(
        /*qpuId=*/i, kernel, /*nrControls=*/i + 1));
  }

  // Go do other work, asynchronous execution of sample tasks on-going
  // Get the results, note future::get() will kick off a wait
  // if the results are not yet available.
  for (auto &counts : countFutures) {
    counts.get().dump();
  }

The code above is saved in sample_async.cpp and compiled with the following command, targeting the remote-mqpu platform:

nvq++ sample_async.cpp -o sample_async.x --target remote-mqpu --remote-mqpu-backend tensornet --remote-mqpu-auto-launch 2
./sample_async.x

In the above code snippets, the remote-mqpu platform was used in the auto-launch mode, whereby a specific number of server instances, i.e., virtual QPUs, are launched on the local machine in the background. The remote QPU daemon service, cudaq-qpud, will also be shut down automatically at the end of the session.

Note

By default, auto-launched daemon services do not support MPI parallelism. Hence, using the nvidia-mgpu backend to simulate each virtual QPU requires manually launching each server instance. The rest of this section explains how to do that.

To customize how many and which GPUs are used for simulating each virtual QPU, one can launch each server manually. For instance, on a machine with 8 NVIDIA GPUs, one may wish to partition those GPUs into 4 virtual QPU instances, each managing 2 GPUs. To do so, first launch a cudaq-qpud server for each virtual QPU:

# Use cudaq-qpud.py wrapper script to automatically find dependencies for the Python wheel configuration.
cudaq_location=`python3 -m pip show cudaq | grep -e 'Location: .*$'`
qpud_py="${cudaq_location#Location: }/bin/cudaq-qpud.py"
CUDA_VISIBLE_DEVICES=0,1 mpiexec -np 2 python3 "$qpud_py" --port <QPU 1 TCP/IP port number>
CUDA_VISIBLE_DEVICES=2,3 mpiexec -np 2 python3 "$qpud_py" --port <QPU 2 TCP/IP port number>
CUDA_VISIBLE_DEVICES=4,5 mpiexec -np 2 python3 "$qpud_py" --port <QPU 3 TCP/IP port number>
CUDA_VISIBLE_DEVICES=6,7 mpiexec -np 2 python3 "$qpud_py" --port <QPU 4 TCP/IP port number>

# It is assumed that your $LD_LIBRARY_PATH is able to find all the necessary dependencies.
CUDA_VISIBLE_DEVICES=0,1 mpiexec -np 2 cudaq-qpud --port <QPU 1 TCP/IP port number>
CUDA_VISIBLE_DEVICES=2,3 mpiexec -np 2 cudaq-qpud --port <QPU 2 TCP/IP port number>
CUDA_VISIBLE_DEVICES=4,5 mpiexec -np 2 cudaq-qpud --port <QPU 3 TCP/IP port number>
CUDA_VISIBLE_DEVICES=6,7 mpiexec -np 2 cudaq-qpud --port <QPU 4 TCP/IP port number>

In the above code snippet, four nvidia-mgpu daemons are started in an MPI context via the mpiexec launcher. This activates the MPI runtime environment required by the nvidia-mgpu backend. Each QPU daemon is assigned a unique TCP/IP port number via the --port command-line option. The CUDA_VISIBLE_DEVICES environment variable restricts the GPU devices that each QPU daemon sees so that it targets specific GPUs.
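The launch lines above follow a regular pattern, so they can be generated rather than typed by hand. The sketch below builds one command string per virtual QPU from the GPU count, the number of GPUs per QPU, and a starting port. The helper qpud_commands and the port 3030 are assumptions made for illustration; the code only produces strings and does not launch anything.

```python
from typing import List


def qpud_commands(gpu_count: int, gpus_per_qpu: int, base_port: int) -> List[str]:
    """Build one cudaq-qpud launch command per virtual QPU."""
    if gpus_per_qpu <= 0 or gpu_count % gpus_per_qpu != 0:
        raise ValueError("gpu_count must be a positive multiple of gpus_per_qpu")
    commands = []
    for qpu in range(gpu_count // gpus_per_qpu):
        first_gpu = qpu * gpus_per_qpu
        # Consecutive GPU IDs assigned to this virtual QPU, e.g. "0,1".
        devices = ",".join(str(gpu) for gpu in range(first_gpu, first_gpu + gpus_per_qpu))
        commands.append(
            f"CUDA_VISIBLE_DEVICES={devices} mpiexec -np {gpus_per_qpu} "
            f"cudaq-qpud --port {base_port + qpu}"
        )
    return commands


# 8 GPUs split into 4 virtual QPUs of 2 GPUs each; 3030 is an arbitrary example port.
for command in qpud_commands(8, 2, 3030):
    print(command)
```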

With these invocations, each virtual QPU is locally addressable at the URL localhost:<port>.

Warning

There is no authentication required to communicate with this server app. Hence, please make sure to either (1) use a non-public TCP/IP port for internal use or (2) use firewalls or other security mechanisms to manage user access.

User code can then target these QPUs for multi-QPU workloads, such as the asynchronous sample or observe calls shown above for the multi-QPU NVIDIA platform.

cudaq.set_target("remote-mqpu", url="localhost:<port1>,localhost:<port2>,localhost:<port3>,localhost:<port4>", backend="nvidia-mgpu")

nvq++ distributed.cpp --target remote-mqpu --remote-mqpu-url localhost:<port1>,localhost:<port2>,localhost:<port3>,localhost:<port4> --remote-mqpu-backend nvidia-mgpu

Each URL is treated as an independent QPU, hence the number of QPUs (num_qpus()) is equal to the number of URLs provided. The multi-node multi-GPU simulator backend (nvidia-mgpu) is requested via the --remote-mqpu-backend command-line option.
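Since the URL option is a comma-separated list with one entry per QPU, it can be assembled from a list of ports. The following sketch assumes all daemons run on localhost; the helper remote_mqpu_url and the port numbers are hypothetical.

```python
from typing import Iterable


def remote_mqpu_url(ports: Iterable[int], host: str = "localhost") -> str:
    """Join host:port pairs into the comma-separated form used for the URL option."""
    return ",".join(f"{host}:{port}" for port in ports)


url = remote_mqpu_url([3030, 3031, 3032, 3033])
print(url)
# One URL per QPU, so the expected QPU count is the number of entries.
print(len(url.split(",")))
```

The resulting string could be passed as the url argument of cudaq.set_target("remote-mqpu", ...) or after --remote-mqpu-url on the nvq++ command line.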

Note

The requested backend (nvidia-mgpu) will be executed inside the context of the QPU daemon service, and thus inherits its GPU resource allocation (two GPUs per backend simulator instance).

Supported Kernel Arguments

The platform serializes kernel invocation to QPU daemons via REST APIs. Please refer to the Open API Docs for the latest API information. Runtime arguments are serialized into a flat memory buffer (args field of the request JSON). For more information about argument type serialization, please see the table below.

When using a remote backend to simulate each virtual QPU, by default, we currently do not support passing complex data structures, such as nested vectors or class objects, or other kernels as arguments to the entry point kernels. These type limitations only apply to the entry-point kernel and not when passing arguments to other quantum kernels.

Support for the full range of argument types within CUDA-Q can be enabled by compiling the code with the --enable-mlir option. This flag forces quantum kernels to be compiled with the CUDA-Q MLIR-based compiler. As a result, runtime arguments can be resolved by the CUDA-Q compiler infrastructure to support a wider range of argument types. However, certain language constructs within quantum kernels may not yet be fully supported.

Kernel argument serialization

| Data type | Example | Serialization |
|---|---|---|
| Trivial type (occupies a contiguous memory area) | int, std::size_t, double, etc. | Byte data (via memcpy) |
| std::vector of trivial type | std::vector<int>, std::vector<double>, etc. | Total vector size in bytes as a 64-bit integer, followed by the serialized data of all vector elements. |
| cudaq::pauli_word | cudaq::pauli_word("IXIZ") | Same as std::vector<char>: total vector size in bytes as a 64-bit integer, followed by the serialized data of all characters. |
| Single-level nested std::vector of supported std::vector types | std::vector<std::vector<int>>, std::vector<cudaq::pauli_word>, etc. | Number of top-level elements (as a 64-bit integer), followed by the sizes in bytes of the element vectors (as a contiguous array of 64-bit integers), then the serialized data of the inner vectors. |
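The layouts in the table can be illustrated with Python's struct module. This is a sketch of the described byte layout under explicit assumptions, not the CUDA-Q implementation: little-endian byte order, unsigned 64-bit size fields, a 4-byte int, and inner vectors written as raw element data without their own size prefix. The function names are hypothetical.

```python
import struct
from typing import List, Sequence


def pack_elements(values: Sequence, fmt: str) -> bytes:
    """Raw element bytes, e.g. fmt='i' for a 4-byte int or 'd' for a double."""
    return b"".join(struct.pack("<" + fmt, value) for value in values)


def pack_vector(values: Sequence, fmt: str) -> bytes:
    """Total size in bytes as a 64-bit integer, then the element data."""
    data = pack_elements(values, fmt)
    return struct.pack("<Q", len(data)) + data


def pack_pauli_word(word: str) -> bytes:
    """Same layout as a vector of char: byte count, then the characters."""
    data = word.encode("ascii")
    return struct.pack("<Q", len(data)) + data


def pack_nested_vector(rows: List[Sequence], fmt: str) -> bytes:
    """Element count, then each inner size in bytes, then the inner data."""
    payloads = [pack_elements(row, fmt) for row in rows]
    header = struct.pack("<Q", len(payloads))
    sizes = b"".join(struct.pack("<Q", len(payload)) for payload in payloads)
    return header + sizes + b"".join(payloads)


buffer = pack_vector([1, 2, 3], "i")
# The first 8 bytes hold the payload size in bytes.
print(struct.unpack("<Q", buffer[:8])[0], len(buffer))
```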

For CUDA-Q kernels that return a value, the remote platform supports returning simple data types of bool, integral (e.g., int or std::size_t), and floating-point types (float or double) when MLIR-based compilation is enabled (--enable-mlir).

Accessing Simulated Quantum State

The remote MQPU platform supports accessing the simulator backend's state vector via the cudaq::get_state (C++) or cudaq.get_state (Python) APIs, similar to local simulator backends.

State data can be retrieved as a full state vector or as the amplitudes of individual basis states. The latter is designed for large quantum states, for which transferring the full state vector would incur significant data transfer overhead.

state = cudaq.get_state(kernel)
amplitudes = state.amplitudes(['0000', '1111'])

auto state = cudaq::get_state(kernel);
auto amplitudes = state.amplitudes({{0, 0, 0, 0}, {1, 1, 1, 1}});

In the above example, the amplitudes of the two requested states are returned.

For C++ quantum kernels [*] compiled with the CUDA-Q MLIR-based compiler and for Python kernels, the state accessor is evaluated in a just-in-time/on-demand manner, and hence can be customized to users' needs.

For instance, in the above amplitude access example, if the state vector is very large, e.g., multi-GPU distributed state vectors or tensor-network encoded quantum states, the full state vector will not be retrieved when get_state is called. Instead, when the amplitudes accessor is called, a specific amplitude calculation request will be sent to the server. Thus, only the amplitudes of those basis states will be computed and returned.
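The on-demand behavior described above can be mimicked with a toy model. In the sketch below, the class DeferredGhzState is a hypothetical stand-in for a server-side state handle: it never materializes the full state vector and computes only the amplitudes that are requested. It is hard-coded to a GHZ state purely for illustration and does not use or reflect the CUDA-Q API.

```python
import math
from typing import List, Sequence


class DeferredGhzState:
    """Toy handle for an n-qubit GHZ state; amplitudes are computed on request."""

    def __init__(self, qubit_count: int):
        self.qubit_count = qubit_count
        self.requests = 0  # number of amplitude requests served

    def amplitudes(self, basis_states: Sequence[str]) -> List[complex]:
        # One request carries all basis states of interest; nothing else is computed.
        self.requests += 1
        return [self._amplitude(bits) for bits in basis_states]

    def _amplitude(self, bits: str) -> complex:
        if len(bits) != self.qubit_count:
            raise ValueError("bit string length must match the qubit count")
        all_zeros = "0" * self.qubit_count
        all_ones = "1" * self.qubit_count
        if bits in (all_zeros, all_ones):
            return complex(1.0 / math.sqrt(2.0))
        return 0j


# A 60-qubit state vector would be far too large to transfer in full,
# but two amplitudes are cheap to request.
state = DeferredGhzState(60)
print(state.amplitudes(["0" * 60, "1" * 60]))
```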

Similarly, for state overlap calculation, if deferred state evaluation is available (Python/MLIR-based compiler) for both of the operand quantum states, a custom overlap calculation request will be constructed and sent to the server. Only the final overlap result will be returned, thereby eliminating back-and-forth state data transfers.