State Vector Simulators¶
CPU¶
The qpp-cpu
backend provides a state vector simulator based on the CPU-only, OpenMP-threaded Q++ library.
This backend is good for basic testing and experimentation with just a few qubits, but it performs poorly for all but the smallest simulations. It is the default target when running on CPU-only systems.
To execute a program on the qpp-cpu target even if a GPU-accelerated backend is available, use the following commands:
python3 program.py [...] --target qpp-cpu
The target can also be defined in the application code by calling
cudaq.set_target('qpp-cpu')
If a target is set in the application code, this target will override the --target
command line flag given during program invocation.
nvq++ --target qpp-cpu program.cpp [...] -o program.x
./program.x
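Conceptually, a state vector simulator stores all 2^n complex amplitudes of the register and applies each gate as a matrix-vector contraction. The following minimal numpy sketch illustrates that idea (an illustration only, not the Q++ implementation; the helper name is hypothetical):

```python
import numpy as np

def apply_single_qubit_gate(state, gate, target, num_qubits):
    """Apply a 2x2 gate to the `target` qubit of an n-qubit state vector."""
    # Reshape the 2^n vector into n axes of size 2, contract the gate
    # with the target axis, then restore the flat layout.
    psi = state.reshape([2] * num_qubits)
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

state = np.zeros(4, dtype=np.complex128)
state[0] = 1.0  # |00>
state = apply_single_qubit_gate(state, H, 0, 2)
print(state)  # equal amplitudes on |00> and |10>
```

Every gate touches the full 2^n vector, which is why CPU-only simulation becomes impractical beyond a modest qubit count.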
Single-GPU¶
The nvidia
backend provides a state vector simulator accelerated with
the cuStateVec
library. The cuStateVec documentation provides a detailed explanation of how the simulations are performed on the GPU.
The nvidia
target supports multiple configurable options, including the specification of floating point precision.
To execute a program on the nvidia
backend, use the following commands:
Single Precision (Default):
python3 program.py [...] --target nvidia --target-option fp32
Double Precision:
python3 program.py [...] --target nvidia --target-option fp64
The target can also be defined in the application code by calling
cudaq.set_target('nvidia', option = 'fp64')
If a target is set in the application code, this target will override the --target
command line flag given during program invocation.
Single Precision (Default):
nvq++ --target nvidia --target-option fp32 program.cpp [...] -o program.x
./program.x
Double Precision:
nvq++ --target nvidia --target-option fp64 program.cpp [...] -o program.x
./program.x
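The precision option directly sets the memory footprint: each amplitude is one complex number, 8 bytes in fp32 (complex64) and 16 bytes in fp64 (complex128), and an n-qubit state vector holds 2^n amplitudes. A quick back-of-the-envelope helper (the function name is ours, for illustration only):

```python
def state_vector_bytes(num_qubits: int, precision: str = "fp32") -> int:
    """Memory needed to hold a 2^n-amplitude state vector."""
    bytes_per_amplitude = {"fp32": 8, "fp64": 16}[precision]
    return (2 ** num_qubits) * bytes_per_amplitude

# 30 qubits: 8 GiB in single precision, 16 GiB in double precision.
print(state_vector_bytes(30, "fp32") / 2**30)  # 8.0
print(state_vector_bytes(30, "fp64") / 2**30)  # 16.0
```

This is why double precision halves the largest qubit count that fits on a given GPU.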
Note
This backend requires an NVIDIA GPU and CUDA runtime libraries. If you do not have these dependencies installed, you may encounter an error stating Invalid simulator requested
. See the section Dependencies and Compatibility for more information about how to install dependencies.
In the single-GPU mode, the nvidia
backend provides the following
environment variable options. Any environment variables must be set prior to
setting the target. It is worth drawing attention to gate fusion, a powerful tool for improving simulation performance, which is discussed in greater detail below.
| Option | Value | Description |
|---|---|---|
| `CUDAQ_FUSION_MAX_QUBITS` | positive integer | The max number of qubits used for gate fusion. The default value is 4. |
| `CUDAQ_FUSION_DIAGONAL_GATE_MAX_QUBITS` | integer greater than or equal to -1 | The max number of qubits used for diagonal gate fusion. |
| `CUDAQ_FUSION_NUM_HOST_THREADS` | positive integer | Number of CPU threads used for circuit processing. |
| `CUDAQ_MAX_CPU_MEMORY_GB` | non-negative integer, or `NONE` | CPU memory size (in GB) allowed for state-vector migration. |
| `CUDAQ_MAX_GPU_MEMORY_GB` | positive integer, or `NONE` | GPU memory (in GB) allowed for on-device state-vector allocation. As the state-vector size exceeds this limit, host memory will be utilized for migration. |
Deprecated since version 0.8: The nvidia-fp64
target, which is equivalent to setting the fp64
option on the nvidia
target,
is deprecated and will be removed in a future release.
Multi-node multi-GPU¶
The nvidia
backend also provides a state vector simulator accelerated with
the cuStateVec
library, with support for multi-node, multi-GPU distribution of the
state vector.
This backend is necessary to scale applications that require a state vector that cannot fit in the memory of a single GPU.
The multi-node multi-GPU simulator expects to run within an MPI context.
To execute a program on the multi-node multi-GPU NVIDIA target, use the following commands
(adjust the value of the -np
flag as needed to reflect the available GPU resources on your system).
See the Divisive Clustering application for an example of how this backend can be used in practice.
Double precision simulation:
mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp64,mgpu
Single precision simulation:
mpiexec -np 2 python3 program.py [...] --target nvidia --target-option fp32,mgpu
Note
If you installed CUDA-Q via pip
, you will need to install the necessary MPI dependencies separately;
please follow the instructions for installing dependencies in the Project Description.
In addition to using MPI in the simulator, you can use it in your application code by installing mpi4py, and invoking the program with the command
mpiexec -np 2 python3 -m mpi4py program.py [...] --target nvidia --target-option fp64,mgpu
The target can also be defined in the application code by calling
cudaq.set_target('nvidia', option='mgpu,fp64')
If a target is set in the application code, this target will override the --target
command line flag given during program invocation.
Note
The order of the option settings is interchangeable. For example,
cudaq.set_target('nvidia', option='mgpu,fp64')
is equivalent to cudaq.set_target('nvidia', option='fp64,mgpu')
. The
nvidia
target has single precision as the default setting. Thus, using option='mgpu'
implies option='mgpu,fp32'
.
Double precision simulation:
nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...] -o program.x
mpiexec -np 2 ./program.x
Single precision simulation:
nvq++ --target nvidia --target-option mgpu,fp32 program.cpp [...] -o program.x
mpiexec -np 2 ./program.x
Note
This backend requires an NVIDIA GPU, CUDA runtime libraries, and an MPI installation. If you do not have these dependencies installed, you may encounter either an error stating invalid simulator requested
(missing CUDA libraries), or an error along the lines of failed to launch kernel
(missing MPI installation). See the section Dependencies and Compatibility for more information about how to install dependencies.
The number of processes and nodes should always be a power of 2.
Host-device state vector migration is also supported in the multi-node multi-GPU configuration.
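To see why the process count must be a power of 2: distributing 2^n amplitudes evenly across P processes gives each process a contiguous slice of 2^(n - log2 P) amplitudes, which only divides cleanly when P is a power of 2. A hypothetical sizing helper (the function is ours, for illustration only):

```python
import math

def amplitudes_per_process(num_qubits: int, num_processes: int) -> int:
    """Amplitudes held by each MPI process when a 2^n state vector is
    distributed evenly across a power-of-2 number of processes."""
    if num_processes < 1 or num_processes & (num_processes - 1) != 0:
        raise ValueError("process count must be a power of 2")
    local_qubits = num_qubits - int(math.log2(num_processes))
    return 2 ** local_qubits

# 33 qubits across 4 processes: each process holds 2^31 amplitudes,
# i.e. 32 GiB per process in double precision (16 bytes/amplitude).
print(amplitudes_per_process(33, 4))  # 2147483648
```

In effect, each doubling of the process count offloads one qubit's worth of amplitudes per process.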
In addition to the environment variable options supported in the single-GPU mode,
the nvidia
backend provides the following environment variable options specifically for
the multi-node multi-GPU configuration. Any environment variables must be set
prior to setting the target.
| Option | Value | Description |
|---|---|---|
| `CUDAQ_MGPU_LIB_MPI` | string | The shared library name for inter-process communication. |
| `CUDAQ_MGPU_COMM_PLUGIN_TYPE` | string | Selects the MPI communicator plugin used for inter-process communication. |
| `CUDAQ_MGPU_NQUBITS_THRESH` | positive integer | The qubit count threshold where state vector distribution is activated. Below this threshold, simulation is performed as independent (non-distributed) tasks across all MPI processes for optimal performance. Default is 25. |
| `CUDAQ_MGPU_FUSE` | positive integer | The max number of qubits used for gate fusion. The default value is 6. |
| `CUDAQ_MGPU_P2P_DEVICE_BITS` | positive integer | Specify the number of GPUs that can communicate by using GPUDirect P2P. Default value is 0 (P2P communication is disabled). |
| `CUDAQ_GPU_FABRIC` | string | Automatically set the number of P2P device bits based on the total number of processes when multi-node NVLink (`MNNVL`) is selected. |
| `CUDAQ_GLOBAL_INDEX_BITS` | comma-separated list of positive integers | Specify the inter-node network structure (faster to slower). |
| `CUDAQ_HOST_DEVICE_MIGRATION_LEVEL` | positive integer | Specify host-device memory migration w.r.t. the network structure. If provided, this setting determines the position to insert the migration index bits. |
Deprecated since version 0.8: The nvidia-mgpu
backend, which is equivalent to the multi-node multi-GPU double-precision option (mgpu,fp64
) of the nvidia
target, is deprecated and will be removed in a future release.
The above configuration options of the nvidia
backend
can be tuned to reduce your simulation runtimes. One such
performance improvement is to fuse multiple gates together at runtime. For
example, x(qubit0)
and x(qubit1)
can be fused together into a
single 4x4 matrix operation on the state vector rather than two separate 2x2
matrix operations. This fusion reduces memory bandwidth on
the GPU because the state vector is transferred into and out of memory fewer
times. By default, up to 4 gates are fused together for single-GPU simulations,
and up to 6 gates are fused together for multi-GPU simulations. The number of
gates fused can significantly affect the performance of some circuits, so users
can override the default fusion level by setting the CUDAQ_MGPU_FUSE
environment variable to another integer value as shown below.
CUDAQ_MGPU_FUSE=5 mpiexec -np 2 python3 program.py [...] --target nvidia --target-option mgpu,fp64
nvq++ --target nvidia --target-option mgpu,fp64 program.cpp [...] -o program.x
CUDAQ_MGPU_FUSE=5 mpiexec -np 2 ./program.x
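To see why fusion helps, note that applying X on qubit 0 and then X on qubit 1 sweeps over the full state vector twice, whereas the fused 4x4 operator kron(X, X) sweeps over it once with an identical result. A small numpy check of that equivalence (a sketch of the idea, not the cuStateVec fusion machinery):

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=np.complex128)
I = np.eye(2, dtype=np.complex128)

# A random 2-qubit state to act on.
rng = np.random.default_rng(0)
state = rng.standard_normal(4) + 1j * rng.standard_normal(4)

# Two passes over the state vector: X on qubit 0, then X on qubit 1.
unfused = np.kron(I, X) @ (np.kron(X, I) @ state)

# One pass with the fused 4x4 operator.
fused = np.kron(X, X) @ state

print(np.allclose(unfused, fused))  # True
```

On a GPU the saving is in memory traffic rather than arithmetic: one read/write sweep of the 2^n-amplitude vector instead of one per gate.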