Multiple QPUs

The CUDA-Q machine model elucidates the various devices considered in the broader quantum-classical compute node context. Programmers will have one or many host CPUs, zero or many NVIDIA GPUs, a classical QPU control space, and the quantum register itself. Moreover, the specification notes that the underlying platform may expose multiple QPUs. In the near-term, this will be unlikely with physical QPU instantiations, but the availability of GPU-based circuit simulators on NVIDIA multi-GPU architectures does give one an opportunity to think about programming such a multi-QPU architecture in the near-term. CUDA-Q starts by enabling one to query information about the underlying quantum platform via the quantum_platform abstraction. This type exposes a num_qpus() method that can be used to query the number of available QPUs for asynchronous CUDA-Q kernel and cudaq:: function invocations. Each available QPU is assigned a logical index, and programmers can launch specific asynchronous function invocations targeting a desired QPU.

Simulate Multiple QPUs in Parallel

In the multi-QPU mode (mqpu option), the NVIDIA backend provides a simulated QPU for every available NVIDIA GPU on the underlying system. Each QPU is simulated via a cuStateVec simulator backend as defined by the NVIDIA backend. This target enables asynchronous parallel execution of quantum kernel tasks.

Here is a simple example demonstrating its usage.

import cudaq

cudaq.set_target("nvidia", option="mqpu")
target = cudaq.get_target()
qpu_count = target.num_qpus()
print("Number of QPUs:", qpu_count)


@cudaq.kernel
def kernel(qubit_count: int):
    qvector = cudaq.qvector(qubit_count)
    # Place qubits in superposition state.
    h(qvector)
    # Measure.
    mz(qvector)


count_futures = []
for qpu in range(qpu_count):
    count_futures.append(cudaq.sample_async(kernel, 5, qpu_id=qpu))

for counts in count_futures:
    print(counts.get())
  auto kernelToBeSampled = [](int runtimeParam) __qpu__ {
    cudaq::qvector q(runtimeParam);
    h(q);
    mz(q);
  };

  // Get the quantum_platform singleton
  auto &platform = cudaq::get_platform();

  // Query the number of QPUs in the system
  auto num_qpus = platform.num_qpus();
  printf("Number of QPUs: %zu\n", num_qpus);
  // We will launch asynchronous sampling tasks
  // and will store the results immediately as a future
  // we can query at some later point
  std::vector<cudaq::async_sample_result> countFutures;
  for (std::size_t i = 0; i < num_qpus; i++) {
    countFutures.emplace_back(
        cudaq::sample_async(i, kernelToBeSampled, 5 /*runtimeParam*/));
  }

  //
  // Go do other work, asynchronous execution of sample tasks on-going
  //

  // Get the results, note future::get() will kick off a wait
  // if the results are not yet available.
  for (auto &counts : countFutures) {
    counts.get().dump();
  }

One can specify the target multi-QPU architecture with the --target flag:

nvq++ sample_async.cpp --target nvidia --target-option mqpu
./a.out

CUDA-Q exposes asynchronous versions of the default cudaq algorithmic primitive functions like sample, observe, and get_state (e.g., sample_async function in the above code snippets).

Depending on the number of GPUs available on the system, the nvidia multi-QPU platform will create the same number of virtual QPU instances. For example, on a system with 4 GPUs, the above code will distribute the four sampling tasks among those virtual QPU instances.

The results might look like the following 4 different random samplings:

Number of QPUs: 4
{ 10011:28 01100:28 ... }
{ 10011:37 01100:25 ... }
{ 10011:29 01100:25 ... }
{ 10011:33 01100:30 ... }

Note

By default, the nvidia multi-QPU platform will utilize all available GPUs (number of QPUs instances is equal to the number of GPUs). To specify the number QPUs to be instantiated, one can set the CUDAQ_MQPU_NGPUS environment variable. For example, use export CUDAQ_MQPU_NGPUS=2 to specify that only 2 QPUs (GPUs) are needed.

Since the underlying virtual QPU is a simulator backend, we can also retrieve the state vector from each QPU via the cudaq::get_state_async (C++) or cudaq.get_state_async (Python) as shown in the bellow code snippets.

import cudaq

cudaq.set_target("nvidia", option="mqpu")
target = cudaq.get_target()
qpu_count = target.num_qpus()
print("Number of QPUs:", qpu_count)


@cudaq.kernel
def kernel():
    qvector = cudaq.qvector(5)
    # Place qubits in GHZ State
    h(qvector[0])
    for qubit in range(4):
        x.ctrl(qvector[qubit], qvector[qubit + 1])


state_futures = []
for qpu in range(qpu_count):
    state_futures.append(cudaq.get_state_async(kernel, qpu_id=qpu))

for state in state_futures:
    print(state.get())
  auto kernelToRun = [](int runtimeParam) __qpu__ {
    cudaq::qvector q(runtimeParam);
    h(q[0]);
    for (int i = 0; i < runtimeParam - 1; ++i)
      x<cudaq::ctrl>(q[i], q[i + 1]);
  };

  // Get the quantum_platform singleton
  auto &platform = cudaq::get_platform();

  // Query the number of QPUs in the system
  auto num_qpus = platform.num_qpus();
  printf("Number of QPUs: %zu\n", num_qpus);
  // We will launch asynchronous tasks
  // and will store the results immediately as a future
  // we can query at some later point
  std::vector<cudaq::async_state_result> stateFutures;
  for (std::size_t i = 0; i < num_qpus; i++) {
    stateFutures.emplace_back(
        cudaq::get_state_async(i, kernelToRun, 5 /*runtimeParam*/));
  }

  //
  // Go do other work, asynchronous execution of tasks on-going
  //

  // Get the results, note future::get() will kick off a wait
  // if the results are not yet available.
  for (auto &state : stateFutures) {
    state.get().dump();
  }

One can specify the target multi-QPU architecture with the --target flag:

nvq++ get_state_async.cpp --target nvidia --target-option mqpu
./a.out

See the Hadamard Test notebook for an application that leverages the mqpu backend.

Deprecated since version 0.8: The nvidia-mqpu and nvidia-mqpu-fp64 targets, which are equivalent to the multi-QPU options mqpu,fp32 and mqpu,fp64, respectively, of the nvidia target, are deprecated and will be removed in a future release.

Parallel distribution mode

The CUDA-Q nvidia multi-QPU platform supports two modes of parallel distribution of expectation value computation:

  • MPI: distribute the expectation value computations across available MPI ranks and GPUs for each Hamiltonian term.

  • Thread: distribute the expectation value computations among available GPUs via standard C++ threads (each thread handles one GPU).

For instance, if all GPUs are available on a single node, thread-based parallel distribution (cudaq::parallel::thread in C++ or cudaq.parallel.thread in Python, as shown in the above example) is sufficient. On the other hand, if one wants to distribute the tasks across GPUs on multiple nodes, e.g., on a compute cluster, MPI distribution mode should be used.

An example of MPI distribution mode usage in both C++ and Python is given below:

import cudaq
from cudaq import spin

cudaq.mpi.initialize()
cudaq.set_target("nvidia", option="mqpu")


# Define spin ansatz.
@cudaq.kernel
def kernel(angle: float):
    qvector = cudaq.qvector(2)
    x(qvector[0])
    ry(angle, qvector[1])
    x.ctrl(qvector[1], qvector[0])


# Define spin Hamiltonian.
hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) - 2.1433 * spin.y(
    0) * spin.y(1) + .21829 * spin.z(0) - 6.125 * spin.z(1)

exp_val = cudaq.observe(kernel, hamiltonian, 0.59,
                        execution=cudaq.parallel.mpi).expectation()
if cudaq.mpi.rank() == 0:
    print("Expectation value: ", exp_val)

cudaq.mpi.finalize()
mpiexec -np <N> python3 file.py
  cudaq::mpi::initialize();
  cudaq::spin_op h =
      5.907 - 2.1433 * cudaq::spin_op::x(0) * cudaq::spin_op::x(1) -
      2.1433 * cudaq::spin_op::y(0) * cudaq::spin_op::y(1) +
      .21829 * cudaq::spin_op::z(0) - 6.125 * cudaq::spin_op::z(1);

  auto ansatz = [](double theta) __qpu__ {
    cudaq::qubit q, r;
    x(q);
    ry(theta, r);
    x<cudaq::ctrl>(r, q);
  };

  double result = cudaq::observe<cudaq::parallel::mpi>(ansatz, h, 0.59);
  if (cudaq::mpi::rank() == 0)
    printf("Expectation value: %lf\n", result);
  cudaq::mpi::finalize();
nvq++ file.cpp --target nvidia --target-option mqpu
mpiexec -np <N> a.out

In the above example, the parallel distribution mode was set to mpi using cudaq::parallel::mpi in C++ or cudaq.parallel.mpi in Python. CUDA-Q provides MPI utility functions to initialize, finalize, or query (rank, size, etc.) the MPI runtime. Last but not least, the compiled executable (C++) or Python script needs to be launched with an appropriate MPI command, e.g., mpiexec, mpirun, srun, etc.

Multi-QPU with Multi-Node Multi-GPU Backends

Some simulator backends, such as tensornet and nvidia with the mgpu option, can distribute a single simulation across multiple GPUs on multiple nodes using MPI. This section shows how to run multiple independent simulations in parallel within the same MPI program, where each simulation uses its own dedicated group of MPI ranks and GPUs.

The approach is straightforward:

  1. Partition the global MPI communicator into sub-communicators — one per QPU group.

  2. Pass each sub-communicator to CUDA-Q so the underlying simulator uses only the ranks in that group.

  3. Each QPU group then calls CUDA-Q APIs independently.

Note

This replaces the former remote-mqpu target, which required standing up HTTP servers and had higher overhead. With direct MPI distribution, users leverage the job scheduler’s resource binding (e.g., SLURM’s --ntasks-per-node) and have full control over rank placement.

Once the communicator is set, all standard CUDA-Q execution APIs — sample, observe, run, and get_state — work as usual within each QPU group.

In all examples below, 4 MPI ranks are divided into 2 QPU groups of 2 ranks each (2 GPUs per simulation).

Using the CUDA-Q MPI API

For users new to MPI, CUDA-Q provides its own cudaq::mpi (C++) and cudaq.mpi (Python) helpers that wrap the common MPI operations needed to set up QPU groups.

import cudaq


@cudaq.kernel
def ghz(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        cx(q[i], q[i + 1])


cudaq.mpi.initialize()

world_rank = cudaq.mpi.rank()
world_size = cudaq.mpi.num_ranks()

# Each QPU is backed by `ranks_per_qpu` MPI ranks / GPUs.
ranks_per_qpu = 2
if world_size % ranks_per_qpu != 0:
    if world_rank == 0:
        print(f"World size must be a multiple of {ranks_per_qpu}.")
    cudaq.mpi.finalize()
    raise SystemExit(1)

# Assign each rank to a QPU group and split the communicator accordingly.
# ```
#  MPI_COMM_WORLD
#  +----------+----------+----------+----------+
#  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
#  +----------+----------+----------+----------+
#  |        QPU 0        |        QPU 1        |
#  |    (qpu_id = 0)     |    (qpu_id = 1)     |
#  +---------------------+---------------------+
# ```
qpu_id = world_rank // ranks_per_qpu
qpu_comm = cudaq.mpi.split_communicator(qpu_id)

# Select the target and pass the sub-communicator for this QPU group.
cudaq.set_target("tensornet", comm_handle=qpu_comm)

# Run an independent circuit on each QPU group.
# The `tensornet` backend can handle a large number of qubits via multi-GPU
# tensor network contraction, well beyond what state vector simulators support.
num_qubits = 40 + 5 * qpu_id
result = cudaq.sample(ghz, num_qubits)

if world_rank % ranks_per_qpu == 0:
    print(f"QPU {qpu_id} ({num_qubits} qubits):")
    print(result)

cudaq.mpi.finalize()
mpirun -n 4 python3 sample_cudaq_mpi.py
QPU 0 (40 qubits):
{ 0000000000000000000000000000000000000000:489 1111111111111111111111111111111111111111:511 }
QPU 1 (45 qubits):
{ 000000000000000000000000000000000000000000000:495 111111111111111111111111111111111111111111111:505 }
#include <cudaq.h>

__qpu__ void ghz(int n) {
  cudaq::qvector q(n);
  h(q[0]);
  for (int i = 0; i < n - 1; i++)
    cx(q[i], q[i + 1]);
}

int main() {
  cudaq::mpi::initialize();

  const int world_rank = cudaq::mpi::rank();
  const int world_size = cudaq::mpi::num_ranks();

  // Each `QPU` is backed by `ranks_per_qpu` MPI ranks / GPUs.
  const int ranks_per_qpu = 2;
  if (world_size % ranks_per_qpu != 0) {
    if (world_rank == 0)
      fprintf(stderr, "World size must be a multiple of %d.\n", ranks_per_qpu);
    cudaq::mpi::finalize();
    return 1;
  }

  // Assign each rank to a `QPU` group and split the communicator accordingly.
  // ```
  //  MPI_COMM_WORLD
  //  +----------+----------+----------+----------+
  //  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
  //  +----------+----------+----------+----------+
  //  |        QPU 0        |        QPU 1        |
  //  |    (qpu_id = 0)     |    (qpu_id = 1)     |
  //  +---------------------+---------------------+
  // ```
  const int qpu_id = world_rank / ranks_per_qpu;
  void *qpu_comm = cudaq::mpi::split_communicator(qpu_id);

  // Inform CUDA-Q which sub-communicator this `QPU` group should use.
  cudaq::mpi::set_communicator(qpu_comm);

  // Run an independent circuit on each `QPU` group.
  // The `tensornet` backend can handle a large number of qubits via multi-GPU
  // tensor network contraction, well beyond what state vector simulators
  // support.
  const int num_qubits = 40 + 5 * qpu_id;
  auto result = cudaq::sample(ghz, num_qubits);

  if (world_rank % ranks_per_qpu == 0) {
    printf("QPU %d (%d qubits):\n", qpu_id, num_qubits);
    result.dump();
  }

  cudaq::mpi::finalize();
  return 0;
}
nvq++ --target tensornet -o sample_cudaq_mpi sample_cudaq_mpi.cpp
mpirun -n 4 ./sample_cudaq_mpi
QPU 0 (40 qubits):
{ 0000000000000000000000000000000000000000:483 1111111111111111111111111111111111111111:517 }
QPU 1 (45 qubits):
{ 000000000000000000000000000000000000000000000:501 111111111111111111111111111111111111111111111:499 }

Using Native MPI APIs

Users who already manage MPI in their application (e.g. with <mpi.h> in C++ or mpi4py in Python) can split the communicator using those libraries directly and hand the result to CUDA-Q.

The sub-communicator handle is passed to CUDA-Q via the comm_handle keyword argument of cudaq.set_target().

import sys

import cudaq
from mpi4py import MPI


@cudaq.kernel
def ghz(n: int):
    q = cudaq.qvector(n)
    h(q[0])
    for i in range(n - 1):
        cx(q[i], q[i + 1])


def mpi_comm_handle(comm) -> int:
    """Return a pointer to the MPI_Comm handle as an integer."""
    return MPI._addressof(comm)


world_comm = MPI.COMM_WORLD
world_rank = world_comm.Get_rank()
world_size = world_comm.Get_size()

# Each QPU is backed by `ranks_per_qpu` MPI ranks / GPUs.
ranks_per_qpu = 2
if world_size % ranks_per_qpu != 0:
    if world_rank == 0:
        print(f"World size must be a multiple of {ranks_per_qpu}.",
              file=sys.stderr)
    sys.exit(1)

# Assign each rank to a QPU group and split the communicator accordingly.
# ```
#  MPI_COMM_WORLD
#  +----------+----------+----------+----------+
#  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
#  +----------+----------+----------+----------+
#  |        QPU 0        |        QPU 1        |
#  |    (qpu_id = 0)     |    (qpu_id = 1)     |
#  +---------------------+---------------------+
# ```
qpu_id = world_rank // ranks_per_qpu
qpu_comm = world_comm.Split(color=qpu_id, key=world_rank)

# Pass the sub-communicator handle to CUDA-Q when selecting the target.
cudaq.set_target("tensornet", comm_handle=mpi_comm_handle(qpu_comm))

# Run an independent circuit on each QPU group.
# The `tensornet` backend can handle a large number of qubits via multi-GPU
# tensor network contraction, well beyond what state vector simulators support.
num_qubits = 40 + 5 * qpu_id
result = cudaq.sample(ghz, num_qubits)

if qpu_comm.Get_rank() == 0:
    print(f"QPU {qpu_id} ({num_qubits} qubits):")
    print(result)
# [End Documentation]
mpirun -n 4 python3 sample.py
QPU 0 (40 qubits):
{ 0000000000000000000000000000000000000000:489 1111111111111111111111111111111111111111:511 }
QPU 1 (45 qubits):
{ 000000000000000000000000000000000000000000000:495 111111111111111111111111111111111111111111111:505 }
#include <cudaq.h>
#include <mpi.h>

__qpu__ void ghz(int n) {
  cudaq::qvector q(n);
  h(q[0]);
  for (int i = 0; i < n - 1; i++)
    cx(q[i], q[i + 1]);
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  // Each `QPU` is backed by `ranks_per_qpu` MPI ranks / GPUs.
  const int ranks_per_qpu = 2;
  if (world_size % ranks_per_qpu != 0) {
    if (world_rank == 0)
      fprintf(stderr, "World size must be a multiple of %d.\n", ranks_per_qpu);
    MPI_Finalize();
    return 1;
  }

  // Assign each rank to a `QPU` group and split the communicator accordingly.
  // ```
  //  MPI_COMM_WORLD
  //  +----------+----------+----------+----------+
  //  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
  //  +----------+----------+----------+----------+
  //  |        QPU 0        |        QPU 1        |
  //  |    (qpu_id = 0)     |    (qpu_id = 1)     |
  //  +---------------------+---------------------+
  // ```
  const int qpu_id = world_rank / ranks_per_qpu;
  MPI_Comm qpu_comm;
  MPI_Comm_split(MPI_COMM_WORLD, qpu_id, world_rank, &qpu_comm);

  // Inform CUDA-Q which sub-communicator this `QPU` group should use.
  cudaq::mpi::set_communicator(reinterpret_cast<void *>(&qpu_comm));

  // Run an independent circuit on each `QPU` group.
  // The `tensornet` backend can handle a large number of qubits via multi-GPU
  // tensor network contraction, well beyond what state vector simulators
  // support.
  const int num_qubits = 40 + 5 * qpu_id;
  auto result = cudaq::sample(ghz, num_qubits);

  int qpu_rank;
  MPI_Comm_rank(qpu_comm, &qpu_rank);
  if (qpu_rank == 0) {
    printf("QPU %d (%d qubits):\n", qpu_id, num_qubits);
    result.dump();
  }

  MPI_Comm_free(&qpu_comm);
  MPI_Finalize();
  return 0;
}
nvq++ --target tensornet -o sample sample.cpp \
    -I$(mpicc -showme:incdirs) -L$(mpicc -showme:libdirs) -lmpi
mpirun -n 4 ./sample
QPU 0 (40 qubits):
{ 0000000000000000000000000000000000000000:483 1111111111111111111111111111111111111111:517 }
QPU 1 (45 qubits):
{ 000000000000000000000000000000000000000000000:501 111111111111111111111111111111111111111111111:499 }

Note

When using native MPI APIs, the pointer passed to cudaq::mpi::set_communicator must refer to a live MPI_Comm object. Keep it alive for the duration of all CUDA-Q calls in that QPU group and free it with MPI_Comm_free once simulation is complete.

Note

MPI must be initialized before calling cudaq::mpi::set_communicator, cudaq.mpi.set_communicator, or cudaq.set_target with a comm_handle. CUDA-Q will raise an error if MPI has not been initialized. Refer to Distributed Computing with MPI for instructions on enabling MPI support in CUDA-Q.

Gradient Computation for VQE

A common application of the multi-QPU pattern is computing the energy gradient for the Variational Quantum Eigensolver (VQE). Most gradient rules (e.g. parameter-shift) evaluate each gradient component independently via one or more observe calls per parameter, which maps naturally onto the multi-QPU model: assign one QPU group per parameter and all gradient components are evaluated in parallel. The pattern scales linearly — a circuit with \(N\) variational parameters requires \(N\) QPU groups and \(N \times\) ranks_per_qpu total MPI ranks, with no code changes beyond launching more ranks and providing more initial parameters.

Note

This MPI-based approach offers two advantages over the thread-based multi-QPU mode (Simulate Multiple QPUs in Parallel):

  • Larger problems: increasing ranks_per_qpu assigns more GPUs to each virtual QPU, allowing each observe call to simulate circuits that exceed the memory of a single GPU.

  • Arbitrary cluster scale: QPU groups span across nodes via MPI, so the total number of virtual QPUs is limited only by the cluster size, not by the number of GPUs on a single node.

The examples below use a 40-qubit placeholder ansatz and Hamiltonian to illustrate the distribution pattern. To use this in a real VQE workflow, substitute your application-specific ansatz and physical Hamiltonian (e.g. generated from a chemistry package such as PySCF), wrap the gradient evaluation in an optimization loop (e.g. using cudaq.optimizers), and feed the gathered gradient back to the optimizer at each iteration.

import math

import cudaq


@cudaq.kernel
def ansatz(n: int, theta0: float, theta1: float):
    """Dummy ansatz for demonstration: replace with your VQE circuit.
    Ry(theta0) on even qubits, Ry(theta1) on odd qubits — product state,
    no entanglement. With H = sum_i Z(i) this gives an analytically
    verifiable gradient: grad[k] = -n/2 * sin(theta_k).
    """
    q = cudaq.qvector(n)
    for i in range(n):
        if i % 2 == 0:
            ry(theta0, q[i])
        else:
            ry(theta1, q[i])


cudaq.mpi.initialize()

world_rank = cudaq.mpi.rank()
world_size = cudaq.mpi.num_ranks()

ranks_per_qpu = 2
if world_size % ranks_per_qpu != 0:
    if world_rank == 0:
        print(f"World size must be a multiple of {ranks_per_qpu}.")
    cudaq.mpi.finalize()
    raise SystemExit(1)

num_qpus = world_size // ranks_per_qpu

# Assign each rank to a QPU group and split the communicator.
# QPU group g computes the gradient for parameter theta_g.
#
# This example uses 2 parameters and 2 QPU groups. To scale to N parameters,
# launch with N * `ranks_per_qpu` total MPI ranks and provide N initial `params`.
# Each additional QPU group adds `ranks_per_qpu` ranks and handles one more
# gradient component in parallel — no other code changes required.
#
#  MPI_COMM_WORLD
#  +----------+----------+----------+----------+
#  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
#  +----------+----------+----------+----------+
#  |        QPU 0        |        QPU 1        |
#  | grad[0]=(E+-E-)/2   | grad[1]=(E+-E-)/2   |
#  +---------------------+---------------------+
#
qpu_id = world_rank // ranks_per_qpu
qpu_comm = cudaq.mpi.split_communicator(qpu_id)

# Each QPU group uses `ranks_per_qpu` GPUs for every cudaq.observe() call.
cudaq.set_target("tensornet", comm_handle=qpu_comm)

# Dummy Hamiltonian (sum of Z on all qubits) for demonstration:
# replace with your physical Hamiltonian, e.g. generated from PySCF.
n_qubits = 40
H = cudaq.spin.z(0)
for i in range(1, n_qubits):
    H += cudaq.spin.z(i)
params = [0.5, 0.3]  # theta_0, theta_1
shift = math.pi / 2.0

# Each QPU group applies +/-shift to its assigned parameter and runs two
# observe() calls. Both calls use all `ranks_per_qpu` GPUs via the
# sub-communicator, enabling multi-GPU tensor-network contraction.
#
# In a VQE optimization loop, this block executes every iteration:
# the optimizer supplies updated `params`, each QPU group re-evaluates its
# two shifted energies, and the gathered gradient drives the next step.
p_plus = params.copy()
p_plus[qpu_id] += shift
p_minus = params.copy()
p_minus[qpu_id] -= shift

e_plus = cudaq.observe(ansatz, H, n_qubits, p_plus[0], p_plus[1]).expectation()
e_minus = cudaq.observe(ansatz, H, n_qubits, p_minus[0],
                        p_minus[1]).expectation()
local_grad = (e_plus - e_minus) / 2.0

# Gather local_grad from every world rank. All ranks within a QPU group
# hold the same value (collective result), so sample every `ranks_per_qpu-th`
# entry to reconstruct the full gradient vector.
all_grads = cudaq.mpi.all_gather(world_size, [local_grad])

if world_rank == 0:
    gradient = [all_grads[i * ranks_per_qpu] for i in range(num_qpus)]
    print(f"Gradient: {gradient}")

cudaq.mpi.finalize()
mpirun -n 4 python3 observe_gradient_cudaq_mpi.py
Gradient: [-9.588510772084065, -5.91040413322679]
import math
import sys

import cudaq
from mpi4py import MPI


@cudaq.kernel
def ansatz(n: int, theta0: float, theta1: float):
    """Dummy ansatz for demonstration: replace with your VQE circuit.
    Ry(theta0) on even qubits, Ry(theta1) on odd qubits — product state,
    no entanglement. With H = sum_i Z(i) this gives an analytically
    verifiable gradient: grad[k] = -n/2 * sin(theta_k).
    """
    q = cudaq.qvector(n)
    for i in range(n):
        if i % 2 == 0:
            ry(theta0, q[i])
        else:
            ry(theta1, q[i])


def mpi_comm_handle(comm) -> int:
    """Return a pointer to the MPI_Comm handle as an integer."""
    return MPI._addressof(comm)


world_comm = MPI.COMM_WORLD
world_rank = world_comm.Get_rank()
world_size = world_comm.Get_size()

ranks_per_qpu = 2
if world_size % ranks_per_qpu != 0:
    if world_rank == 0:
        print(f"World size must be a multiple of {ranks_per_qpu}.",
              file=sys.stderr)
    sys.exit(1)

num_qpus = world_size // ranks_per_qpu

# Assign each rank to a QPU group and split the communicator.
# QPU group g computes the gradient for parameter theta_g.
#
# This example uses 2 parameters and 2 QPU groups. To scale to N parameters,
# launch with N * `ranks_per_qpu` total MPI ranks and provide N initial `params`.
# Each additional QPU group adds `ranks_per_qpu` ranks and handles one more
# gradient component in parallel — no other code changes required.
#
#  MPI_COMM_WORLD
#  +----------+----------+----------+----------+
#  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
#  +----------+----------+----------+----------+
#  |        QPU 0        |        QPU 1        |
#  | grad[0]=(E+-E-)/2   | grad[1]=(E+-E-)/2   |
#  +---------------------+---------------------+
#
qpu_id = world_rank // ranks_per_qpu
qpu_comm = world_comm.Split(color=qpu_id, key=world_rank)

# Each QPU group uses `ranks_per_qpu` GPUs for every cudaq.observe() call.
cudaq.set_target("tensornet", comm_handle=mpi_comm_handle(qpu_comm))

# Dummy Hamiltonian (sum of Z on all qubits) for demonstration:
# replace with your physical Hamiltonian, e.g. generated from PySCF.
n_qubits = 40
H = cudaq.spin.z(0)
for i in range(1, n_qubits):
    H += cudaq.spin.z(i)
params = [0.5, 0.3]  # theta_0, theta_1
shift = math.pi / 2.0

# Each QPU group applies +/-shift to its assigned parameter and runs two
# observe() calls. Both calls use all `ranks_per_qpu` GPUs via the
# sub-communicator, enabling multi-GPU tensor-network contraction.
#
# In a VQE optimization loop, this block executes every iteration:
# the optimizer supplies updated `params`, each QPU group re-evaluates its
# two shifted energies, and the gathered gradient drives the next step.
p_plus = params.copy()
p_plus[qpu_id] += shift
p_minus = params.copy()
p_minus[qpu_id] -= shift

e_plus = cudaq.observe(ansatz, H, n_qubits, p_plus[0], p_plus[1]).expectation()
e_minus = cudaq.observe(ansatz, H, n_qubits, p_minus[0],
                        p_minus[1]).expectation()
local_grad = (e_plus - e_minus) / 2.0

# Gather local_grad from every world rank. All ranks within a QPU group
# hold the same value (collective result), so sample every `ranks_per_qpu-th`
# entry to reconstruct the gradient vector.
all_grads = world_comm.allgather(local_grad)

if world_rank == 0:
    gradient = [all_grads[i * ranks_per_qpu] for i in range(num_qpus)]
    print(f"Gradient: {gradient}")

qpu_comm.Free()
mpirun -n 4 python3 observe_gradient.py
Gradient: [-9.588510772084065, -5.91040413322679]
#include <cmath>
#include <cstdio>
#include <cudaq.h>
#include <vector>

// Dummy ansatz for demonstration: replace with your VQE circuit.
// Ry(theta0) on even qubits, Ry(theta1) on odd qubits — product state,
// no entanglement. With H = sum_i Z(i) this gives an analytically
// verifiable gradient: grad[k] = -n/2 * sin(theta_k).
__qpu__ void ansatz(int n, double theta0, double theta1) {
  cudaq::qvector q(n);
  for (int i = 0; i < n; i++) {
    if (i % 2 == 0)
      ry(theta0, q[i]);
    else
      ry(theta1, q[i]);
  }
}

int main() {
  cudaq::mpi::initialize();

  const int world_rank = cudaq::mpi::rank();
  const int world_size = cudaq::mpi::num_ranks();

  const int ranks_per_qpu = 2;
  if (world_size % ranks_per_qpu != 0) {
    if (world_rank == 0)
      fprintf(stderr, "World size must be a multiple of %d.\n", ranks_per_qpu);
    cudaq::mpi::finalize();
    return 1;
  }
  const int num_qpus = world_size / ranks_per_qpu;

  // Assign each rank to a `QPU` group and split the communicator.
  // `QPU` group g computes the gradient for parameter theta_g.
  //
  // This example uses 2 parameters and 2 QPU groups. To scale to N parameters,
  // launch with N * `ranks_per_qpu` total MPI ranks and provide N initial
  // `params`. Each additional QPU group adds `ranks_per_qpu` ranks and handles
  // one more gradient component in parallel — no other code changes required.
  //
  // ```
  //  MPI_COMM_WORLD
  //  +----------+----------+----------+----------+
  //  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
  //  +----------+----------+----------+----------+
  //  |        QPU 0        |        QPU 1        |
  //  | grad[0]=(E+-E-)/2   | grad[1]=(E+-E-)/2   |
  //  +---------------------+---------------------+
  // ```
  const int qpu_id = world_rank / ranks_per_qpu;
  void *qpu_comm = cudaq::mpi::split_communicator(qpu_id);

  // Each `QPU` group uses `ranks_per_qpu` GPUs for every cudaq::observe() call.
  cudaq::mpi::set_communicator(qpu_comm);

  // Dummy Hamiltonian (sum of Z on all qubits) for demonstration:
  // replace with your physical Hamiltonian, e.g. generated from `PySCF`.
  const int n_qubits = 40;
  auto H = cudaq::spin::z(0) + cudaq::spin::z(1);
  for (int i = 2; i < n_qubits; i++)
    H = H + cudaq::spin::z(i);
  const std::vector<double> params = {0.5, 0.3}; // theta_0, theta_1
  const double shift = M_PI / 2.0;

  // Each `QPU` group applies +/-shift to its assigned parameter and runs two
  // observe() calls. Both calls use all `ranks_per_qpu` GPUs via the
  // sub-communicator, enabling multi-GPU tensor-network contraction.
  //
  // In a VQE optimization loop, this block executes every iteration:
  // the optimizer supplies updated `params`, each `QPU` group re-evaluates its
  // two shifted energies, and the gathered gradient drives the next step.
  auto p_plus = params, p_minus = params;
  p_plus[qpu_id] += shift;
  p_minus[qpu_id] -= shift;

  const double e_plus =
      cudaq::observe(ansatz, H, n_qubits, p_plus[0], p_plus[1]).expectation();
  const double e_minus =
      cudaq::observe(ansatz, H, n_qubits, p_minus[0], p_minus[1]).expectation();
  const double local_grad = (e_plus - e_minus) / 2.0;

  // Gather local_grad from every world rank. All ranks within a `QPU` group
  // hold the same value (collective result), so sample every `ranks_per_qpu-th`
  // entry to reconstruct the full gradient vector.
  std::vector<double> all_grads(world_size);
  cudaq::mpi::all_gather(all_grads, std::vector<double>{local_grad});

  if (world_rank == 0) {
    printf("Gradient: [");
    for (int i = 0; i < num_qpus; i++)
      printf("%s%.6f", i > 0 ? ", " : "", all_grads[i * ranks_per_qpu]);
    printf("]\n");
  }

  cudaq::mpi::finalize();
  return 0;
}
nvq++ --target tensornet -o observe_gradient_cudaq_mpi observe_gradient_cudaq_mpi.cpp
mpirun -n 4 ./observe_gradient_cudaq_mpi
Gradient: [-9.588511, -5.910404]
#include <cmath>
#include <cstdio>
#include <cudaq.h>
#include <mpi.h>
#include <vector>

// Dummy ansatz for demonstration: replace with your VQE circuit.
// Ry(theta0) on even qubits, Ry(theta1) on odd qubits — product state,
// no entanglement. With H = sum_i Z(i) this gives an analytically
// verifiable gradient: grad[k] = -n/2 * sin(theta_k).
__qpu__ void ansatz(int n, double theta0, double theta1) {
  cudaq::qvector q(n);
  for (int i = 0; i < n; i++) {
    if (i % 2 == 0)
      ry(theta0, q[i]);
    else
      ry(theta1, q[i]);
  }
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  int world_rank, world_size;
  MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &world_size);

  const int ranks_per_qpu = 2;
  if (world_size % ranks_per_qpu != 0) {
    if (world_rank == 0)
      fprintf(stderr, "World size must be a multiple of %d.\n", ranks_per_qpu);
    MPI_Finalize();
    return 1;
  }
  const int num_qpus = world_size / ranks_per_qpu;

  // Assign each rank to a `QPU` group and split the communicator.
  // `QPU` group g computes the gradient for parameter theta_g.
  //
  // This example uses 2 parameters and 2 QPU groups. To scale to N parameters,
  // launch with N * `ranks_per_qpu` total MPI ranks and provide N initial
  // `params`. Each additional QPU group adds `ranks_per_qpu` ranks and handles
  // one more gradient component in parallel — no other code changes required.
  //
  // ```
  //  MPI_COMM_WORLD
  //  +----------+----------+----------+----------+
  //  |  rank 0  |  rank 1  |  rank 2  |  rank 3  |
  //  +----------+----------+----------+----------+
  //  |        QPU 0        |        QPU 1        |
  //  | grad[0]=(E+-E-)/2   | grad[1]=(E+-E-)/2   |
  //  +---------------------+---------------------+
  // ```
  const int qpu_id = world_rank / ranks_per_qpu;
  MPI_Comm qpu_comm;
  MPI_Comm_split(MPI_COMM_WORLD, qpu_id, world_rank, &qpu_comm);

  // Each `QPU` group uses `ranks_per_qpu` GPUs for every cudaq::observe() call.
  cudaq::mpi::set_communicator(reinterpret_cast<void *>(&qpu_comm));

  // Dummy Hamiltonian (sum of Z on all qubits) for demonstration:
  // replace with your physical Hamiltonian, e.g. generated from `PySCF`.
  const int n_qubits = 40;
  auto H = cudaq::spin::z(0) + cudaq::spin::z(1);
  for (int i = 2; i < n_qubits; i++)
    H = H + cudaq::spin::z(i);
  const std::vector<double> params = {0.5, 0.3}; // theta_0, theta_1
  const double shift = M_PI / 2.0;

  // Each `QPU` group applies +/-shift to its assigned parameter and runs two
  // observe() calls. Both calls use all `ranks_per_qpu` GPUs via the
  // sub-communicator, enabling multi-GPU tensor-network contraction.
  //
  // In a VQE optimization loop, this block executes every iteration:
  // the optimizer supplies updated `params`, each `QPU` group re-evaluates its
  // two shifted energies, and the gathered gradient drives the next step.
  auto p_plus = params, p_minus = params;
  p_plus[qpu_id] += shift;
  p_minus[qpu_id] -= shift;

  const double e_plus =
      cudaq::observe(ansatz, H, n_qubits, p_plus[0], p_plus[1]).expectation();
  const double e_minus =
      cudaq::observe(ansatz, H, n_qubits, p_minus[0], p_minus[1]).expectation();
  const double local_grad = (e_plus - e_minus) / 2.0;

  // Gather local_grad from every world rank. All ranks within a `QPU` group
  // hold the same value (collective result), so sample every `ranks_per_qpu-th`
  // entry to reconstruct the gradient vector.
  std::vector<double> all_grads(world_size);
  MPI_Allgather(&local_grad, 1, MPI_DOUBLE, all_grads.data(), 1, MPI_DOUBLE,
                MPI_COMM_WORLD);

  if (world_rank == 0) {
    printf("Gradient: [");
    for (int i = 0; i < num_qpus; i++)
      printf("%s%.6f", i > 0 ? ", " : "", all_grads[i * ranks_per_qpu]);
    printf("]\n");
  }

  MPI_Comm_free(&qpu_comm);
  MPI_Finalize();
  return 0;
}
nvq++ --target tensornet -o observe_gradient observe_gradient.cpp \
    -I$(mpicc -showme:incdirs) -L$(mpicc -showme:libdirs) -lmpi
mpirun -n 4 ./observe_gradient
Gradient: [-9.588511, -5.910404]