Multi-GPU Workflows
===================

There are many backends available with CUDA-Q which enable seamless switching between GPUs, QPUs and CPUs, and which also allow for workflows involving multiple architectures working in tandem.

Available Targets
-------------------

- **`qpp-cpu`**: The QPP-based CPU backend which is multithreaded to maximize the usage of available cores on your system.
- **`nvidia`**: GPU-accelerated state-vector backend which accelerates quantum circuit simulation on NVIDIA GPUs powered by cuQuantum.
- **`nvidia-mgpu`**: Allows for scaling circuit simulation on multiple GPUs.
- **`nvidia-mqpu`**: Enables users to program workflows utilizing multiple virtual quantum processors in parallel, where each QPU is simulated by the `nvidia` backend.
- **`remote-mqpu`**: Enables users to program workflows utilizing multiple virtual quantum processors in parallel, where the backend used to simulate each QPU is configurable.

Please see :doc:`../backends/backends` for a full list of all available backends.

Below we explore how to effectively utilize multiple CUDA-Q targets with the same GHZ state preparation code.

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
    :language: python
    :start-after: [Begin state]
    :end-before: [End state]

You can execute the code by running a state-vector simulator on your CPU:

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
    :language: python
    :start-after: [Begin CPU]
    :end-before: [End CPU]

.. parsed-literal::

    { 00:475 11:525 }

You will notice a speedup of up to **2500x** in executing the circuit below on NVIDIA GPUs vs CPUs:

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/multiple_targets.py
    :language: python
    :start-after: [Begin GPU]
    :end-before: [End GPU]

.. parsed-literal::

    { 0000000000000000000000000:510 1111111111111111111111111:490 }

If we incrementally increase the qubit count, we reach a limit where the memory required is beyond the capabilities of a single GPU: an :math:`n` qubit quantum state has :math:`2^n` complex amplitudes, each of which requires 8 bytes of memory to store. Hence the total memory required to store an :math:`n` qubit quantum state is :math:`8` bytes :math:`\times 2^n`. For :math:`n = 30` qubits, this is roughly :math:`8` GB, but for :math:`n = 40` it grows to roughly 8700 GB.

Parallelization across Multiple Processors
---------------------------------------------

The ``nvidia-mgpu`` target allows memory from additional GPUs to be pooled, enabling qubit counts to be scaled. Execution on the ``nvidia-mgpu`` backend is enabled via ``mpirun``. Users need to create a ``.py`` file with their code and run the command below in a terminal:

``mpirun -np 4 python3 test.py``

where 4 is the number of GPUs one has access to and ``test`` is the chosen file name.

The ``nvidia-mqpu`` target uses a state-vector simulator to simulate execution on each virtual QPU. The ``remote-mqpu`` platform allows you to freely configure which backend is used for each platform QPU. For more information about the different platform targets, please take a look at :doc:`../backends/platform`.
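As a rough sketch of what such a script might look like (this is not the snippet referenced above; the kernel name, qubit count, and the exact target spelling are illustrative assumptions, and some CUDA-Q versions may not require explicit MPI initialization), the following selects the pooled multi-GPU backend and samples a GHZ state too large for a single GPU:

.. code-block:: python

    import cudaq

    # Sketch only: select the pooled multi-GPU backend. Depending on the
    # CUDA-Q version, the target may instead be spelled
    # cudaq.set_target("nvidia", option="mgpu").
    cudaq.set_target("nvidia-mgpu")
    cudaq.mpi.initialize()  # may be optional; mpirun launches the ranks

    @cudaq.kernel
    def ghz(qubit_count: int):
        qubits = cudaq.qvector(qubit_count)
        h(qubits[0])
        for i in range(1, qubit_count):
            x.ctrl(qubits[0], qubits[i])
        mz(qubits)

    # 34 qubits (illustrative) needs 8 bytes x 2^34 = 128 GB of state-vector
    # memory, beyond a single GPU but feasible once pooled across several.
    counts = cudaq.sample(ghz, 34, shots_count=1000)
    if cudaq.mpi.rank() == 0:
        print(counts)

    cudaq.mpi.finalize()

Saved as ``test.py``, this would be launched with ``mpirun -np 4 python3 test.py`` exactly as described above.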
Batching Hamiltonian Terms
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expectation value computations of multi-term Hamiltonians can be asynchronously processed via the ``mqpu`` platform.

.. image:: ../../applications/python/images/hsplit.png

For workflows involving multiple GPUs, save the code below in a ``filename.py`` file and execute via:

``mpirun -np n python3 filename.py``

where ``n`` is an integer specifying the number of GPUs you have access to.

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/hamiltonian_batching.py
    :language: python
    :start-after: [Begin Docs]
    :end-before: [End Docs]

.. parsed-literal::

    mpi is initialized? True
    rank 0 num_ranks 1

Circuit Batching
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parameterized circuits with different parameters can be executed asynchronously via the ``mqpu`` platform.

.. image:: ../../applications/python/images/circsplit.png

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
    :start-after: [Begin prepare]
    :end-before: [End prepare]

Let's time the execution on a single GPU.

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
    :start-after: [Begin single]
    :end-before: [End single]

.. parsed-literal::

    31.7 s ± 990 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now let's time the multi-GPU run.

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
    :start-after: [Begin split]
    :end-before: [End split]

.. parsed-literal::

    We have 10000 parameters which we would like to execute
    We split this into 4 batches of 2500 , 2500 , 2500 , 2500

.. literalinclude:: ../../snippets/python/using/examples/multi_gpu_workflows/circuit_batching.py
    :language: python
    :start-after: [Begin multiple]
    :end-before: [End multiple]

.. parsed-literal::

    85.3 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
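For reference, below is a minimal sketch of the asynchronous dispatch pattern that circuit batching relies on. It is not the exact code in the referenced snippet: the ansatz, the Hamiltonian, and the batch handling are illustrative assumptions, and it assumes the ``nvidia-mqpu`` target is available on your system.

.. code-block:: python

    import cudaq
    from cudaq import spin
    import numpy as np

    # Sketch: dispatch parameter batches across virtual QPUs (mqpu platform).
    cudaq.set_target("nvidia-mqpu")
    num_qpus = cudaq.get_target().num_qpus()

    @cudaq.kernel
    def ansatz(theta: float):
        q = cudaq.qvector(2)
        x(q[0])
        ry(theta, q[1])
        x.ctrl(q[1], q[0])

    # Illustrative two-qubit Hamiltonian.
    hamiltonian = 5.907 - 2.1433 * spin.x(0) * spin.x(1) \
        - 2.1433 * spin.y(0) * spin.y(1) + 0.21829 * spin.z(0) \
        - 6.125 * spin.z(1)

    # Split the parameter sweep into one batch per virtual QPU.
    parameters = np.linspace(-np.pi, np.pi, 10000)
    batches = np.array_split(parameters, num_qpus)

    # Submit each batch asynchronously to its own QPU, then gather the futures.
    futures = []
    for qpu, batch in enumerate(batches):
        futures.append([
            cudaq.observe_async(ansatz, hamiltonian, float(theta), qpu_id=qpu)
            for theta in batch
        ])

    expectation_values = [f.get().expectation() for batch in futures for f in batch]

Each virtual QPU works through its own batch while the host thread only submits work and collects the results, which is what allows the parameter sweep above to be processed concurrently rather than sequentially.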