Taking Advantage of the Underlying Quantum Platform¶
The CUDA Quantum machine model describes the various devices that make up the
broader quantum-classical compute node: one or more host CPUs, zero or more
NVIDIA GPUs, a classical QPU control space, and the quantum register itself.
Moreover, the specification notes that the underlying platform may expose
multiple QPUs. While this remains unlikely for physical QPU instantiations in
the near term, the availability of GPU-based circuit simulators on NVIDIA
multi-GPU architectures already gives one an opportunity to program such a
multi-QPU architecture today.
CUDA Quantum starts by enabling one to query information about the underlying
quantum platform via the quantum_platform abstraction. This type exposes a
num_qpus() method that can be used to query the number of QPUs available for
asynchronous CUDA Quantum kernel and cudaq:: function invocations. Each
available QPU is assigned a logical index, and programmers can launch specific
asynchronous function invocations targeting a desired QPU. Here is a simple
example demonstrating this:
auto kernelToBeSampled = [](int runtimeParam) __qpu__ {
  cudaq::qreg q(runtimeParam);
  h(q);
  mz(q);
};
// Get the quantum_platform singleton
auto &platform = cudaq::get_platform();
// Query the number of QPUs in the system
auto num_qpus = platform.num_qpus();
printf("Number of QPUs: %zu\n", num_qpus);
// We will launch asynchronous sampling tasks
// and will store the results immediately as a future
// we can query at some later point
std::vector<cudaq::async_sample_result> countFutures;
for (std::size_t i = 0; i < num_qpus; i++) {
  countFutures.emplace_back(
      cudaq::sample_async(i, kernelToBeSampled, 5 /*runtimeParam*/));
}
//
// Go do other work, asynchronous execution of sample tasks on-going
//
// Get the results, note future::get() will kick off a wait
// if the results are not yet available.
for (auto &counts : countFutures) {
  counts.get().dump();
}
CUDA Quantum exposes asynchronous versions of the default cudaq:: algorithmic
primitive functions, such as sample and observe (e.g., the cudaq::sample_async
function in the above code snippet).
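The observe primitive likewise has an asynchronous counterpart that accepts a
qpu_id argument selecting the virtual QPU to run on. Here is a minimal Python
sketch; the two-qubit kernel and single-term Hamiltonian are illustrative
placeholders, not taken from the example above.
import cudaq
from cudaq import spin

cudaq.set_target("nvidia-mqpu")

# Illustrative two-qubit ansatz.
kernel, angle = cudaq.make_kernel(float)
qubits = kernel.qalloc(2)
kernel.x(qubits[0])
kernel.ry(angle, qubits[1])

# Illustrative single-term Hamiltonian.
hamiltonian = spin.z(0) * spin.z(1)

# Launch the expectation value computation asynchronously on QPU 0.
future = cudaq.observe_async(kernel, hamiltonian, 0.59, qpu_id=0)
# ... do other work while the computation runs ...
print("Expectation value:", future.get().expectation_z())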
One can then specify the target multi-QPU architecture (nvidia-mqpu) with the
--target flag:
nvq++ sample_async.cpp -target nvidia-mqpu
./a.out
Depending on the number of GPUs available on the system, the nvidia-mqpu
platform will create the same number of virtual QPU instances. For example, on
a system with 4 GPUs, the above code will distribute the four sampling tasks
among those GPUEmulatedQPU instances. The results might look like the
following (4 different random samplings):
Number of QPUs: 4
{ 10011:28 01100:28 ... }
{ 10011:37 01100:25 ... }
{ 10011:29 01100:25 ... }
{ 10011:33 01100:30 ... }
Note
By default, the nvidia-mqpu platform will utilize all available GPUs (the
number of QPU instances is equal to the number of GPUs). To specify the number
of QPUs to be instantiated, one can set the CUDAQ_MQPU_NGPUS environment
variable. For example, export CUDAQ_MQPU_NGPUS=2 to specify that only 2 QPUs
(GPUs) are needed.
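As a quick sanity check, one can confirm the resulting QPU count from Python.
This sketch assumes the environment variable takes effect when set from within
the process before the target is activated; otherwise, export it in the shell
before launching.
import os
import cudaq

# Assumption: CUDAQ_MQPU_NGPUS is read when the target is activated.
os.environ["CUDAQ_MQPU_NGPUS"] = "2"
cudaq.set_target("nvidia-mqpu")
print("Number of QPUs:", cudaq.get_target().num_qpus())  # expect 2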
An equivalent example in Python is as follows.
import cudaq
cudaq.set_target("nvidia-mqpu")
target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)
kernel, runtime_param = cudaq.make_kernel(int)
qubits = kernel.qalloc(runtime_param)
# Place qubits in superposition state.
kernel.h(qubits)
# Measure.
kernel.mz(qubits)
count_futures = []
for qpu in range(num_qpus):
    count_futures.append(cudaq.sample_async(kernel, 5, qpu_id=qpu))
for counts in count_futures:
    print(counts.get())
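Each future's get() call returns an ordinary sample result, so the counts can
be post-processed like any synchronous result, e.g., via the standard
most_probable() and count() methods:
# Inspect one of the results from the loop above.
result = count_futures[0].get()
print("Most probable bitstring:", result.most_probable())
print("Shots for '00000':", result.count("00000"))  # 0 if never observed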
Asynchronous expectation value computations¶
One typical use case of the nvidia-mqpu platform is to distribute the
expectation value computations of a multi-term Hamiltonian across multiple
virtual QPUs (GPUEmulatedQPU). Here is an example.
using namespace cudaq::spin;
cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                   .21829 * z(0) - 6.125 * z(1);
// Get the quantum_platform singleton
auto &platform = cudaq::get_platform();
// Query the number of QPUs in the system
auto num_qpus = platform.num_qpus();
printf("Number of QPUs: %zu\n", num_qpus);
auto ansatz = [](double theta) __qpu__ {
  cudaq::qubit q, r;
  x(q);
  ry(theta, r);
  x<cudaq::ctrl>(r, q);
};
double result = cudaq::observe<cudaq::parallel::thread>(ansatz, h, 0.59);
printf("Expectation value: %lf\n", result);
One can then target the nvidia-mqpu platform by:
nvq++ observe_mqpu.cpp -target nvidia-mqpu
./a.out
Equivalently, in Python:
import cudaq
from cudaq import spin
cudaq.set_target("nvidia-mqpu")
target = cudaq.get_target()
num_qpus = target.num_qpus()
print("Number of QPUs:", num_qpus)
# Define spin ansatz.
kernel, theta = cudaq.make_kernel(float)
qreg = kernel.qalloc(2)
kernel.x(qreg[0])
kernel.ry(theta, qreg[1])
kernel.cx(qreg[1], qreg[0])
# Define spin Hamiltonian.
hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) - 2.1433 * spin.y(
    0) * spin.y(1) + .21829 * spin.z(0) - 6.125 * spin.z(1)
exp_val = cudaq.observe(kernel,
                        hamiltonian,
                        0.59,
                        execution=cudaq.parallel.thread).expectation_z()
print("Expectation value: ", exp_val)
In the above code snippet, since the Hamiltonian contains four non-identity
terms (X0X1, Y0Y1, Z0, and Z1), four quantum circuits need to be executed to
compute its expectation value given the quantum state prepared by the ansatz
kernel: each term requires its own measurement circuit. When the nvidia-mqpu
platform is selected, these circuits will be distributed across all available
QPUs. The final expectation value result is computed from all QPU execution
results.
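This term-wise distribution is handled automatically by observe. Independent
jobs can also be distributed by hand with the asynchronous primitives, e.g., a
hypothetical parameter sweep that reuses the kernel, hamiltonian, and num_qpus
variables from the Python example above:
import numpy as np

# Hypothetical sweep: one theta value per virtual QPU.
thetas = np.linspace(0.0, np.pi, num_qpus)
futures = [
    cudaq.observe_async(kernel, hamiltonian, float(theta), qpu_id=i)
    for i, theta in enumerate(thetas)
]
for theta, future in zip(thetas, futures):
    print(f"theta = {theta:.3f}, <H> = {future.get().expectation_z():.6f}")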
Parallel distribution mode¶
The CUDA Quantum nvidia-mqpu platform supports two modes of parallel
distribution of expectation value computation:
MPI: distribute the expectation value computations across available MPI ranks and GPUs for each Hamiltonian term.
Thread: distribute the expectation value computations among available GPUs via standard C++ threads (each thread handles one GPU).
For instance, if all GPUs are available on a single node, thread-based
parallel distribution (cudaq::parallel::thread in C++ or cudaq.parallel.thread
in Python, as shown in the above example) is sufficient. On the other hand, if
one wants to distribute the tasks across GPUs on multiple nodes, e.g., on a
compute cluster, the MPI distribution mode should be used.
An example of MPI distribution mode usage is as follows:
C++¶
cudaq::mpi::initialize();
using namespace cudaq::spin;
cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                   .21829 * z(0) - 6.125 * z(1);
auto ansatz = [](double theta) __qpu__ {
  cudaq::qubit q, r;
  x(q);
  ry(theta, r);
  x<cudaq::ctrl>(r, q);
};
double result = cudaq::observe<cudaq::parallel::mpi>(ansatz, h, 0.59);
if (cudaq::mpi::rank() == 0)
  printf("Expectation value: %lf\n", result);
cudaq::mpi::finalize();
nvq++ observe_mqpu_mpi.cpp -target nvidia-mqpu
mpirun -np <N> a.out
Python¶
import cudaq
from cudaq import spin
cudaq.mpi.initialize()
cudaq.set_target("nvidia-mqpu")
# Define spin ansatz.
kernel, theta = cudaq.make_kernel(float)
qreg = kernel.qalloc(2)
kernel.x(qreg[0])
kernel.ry(theta, qreg[1])
kernel.cx(qreg[1], qreg[0])
# Define spin Hamiltonian.
hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) - 2.1433 * spin.y(
    0) * spin.y(1) + .21829 * spin.z(0) - 6.125 * spin.z(1)
exp_val = cudaq.observe(kernel, hamiltonian, 0.59,
                        execution=cudaq.parallel.mpi).expectation_z()
if cudaq.mpi.rank() == 0:
    print("Expectation value: ", exp_val)
cudaq.mpi.finalize()
mpirun -np <N> python3 observe_mpi.py
In the above examples, the parallel distribution mode was set to mpi using
cudaq::parallel::mpi in C++ or cudaq.parallel.mpi in Python. CUDA Quantum
provides MPI utility functions to initialize, finalize, or query (rank, size,
etc.) the MPI runtime. Last but not least, the compiled executable (C++) or
Python script needs to be launched with an appropriate MPI command, e.g.,
mpirun, mpiexec, srun, etc.
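For completeness, here is a minimal Python sketch of those MPI utilities;
rank() and num_ranks() are the rank/size query functions.
import cudaq

cudaq.mpi.initialize()
# Each rank reports its index and the total number of ranks.
print(f"Rank {cudaq.mpi.rank()} of {cudaq.mpi.num_ranks()}")
cudaq.mpi.finalize()
Launch it with an MPI command as above, e.g., mpirun -np 4 python3 mpi_check.py
(the script name is illustrative).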