{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Multi-GPU Workflows\n", "\n", "There are many backends available with CUDA-Q which enable seamless switching between GPUs, QPUs and CPUs and also allow for workflows involving multiple architectures working in tandem.\n", "\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import cudaq\n", "\n", "targets = cudaq.get_targets()\n", "\n", "# for target in targets:\n", "# print(target)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Available Targets\n", "\n", "- **qpp-cpu**: The qpp based CPU backend which is multithreaded to maximize the usage of available cores on your system.\n", "\n", "- **nvidia**: GPU based backend which accelerates quantum circuit simulation on NVIDIA GPUs powered by cuQuantum.\n", "\n", "- **nvidia-mqpu**: Enables users to program workflows utilizing multiple quantum processors enabled today by GPU emulation. \n", "\n", "- **nvidia-mgpu**: Allows for scaling circuit simulation beyond what is feasible with any QPU today. \n", "\n", "- **density-matrix-cpu**: Noisy simulations via density matrix calculations. CPU only for now with GPU support coming soon. \n", "\n", "Below we explore how to effectively utilize multi-GPU targets." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def ghz_state(qubit_count, target):\n", " \"\"\"A function that will generate a variable sized GHZ state (`qubit_count`).\"\"\"\n", " cudaq.set_target(target)\n", "\n", " kernel = cudaq.make_kernel()\n", "\n", " qubits = kernel.qalloc(qubit_count)\n", "\n", " kernel.h(qubits[0])\n", "\n", " for i in range(1, qubit_count):\n", " kernel.cx(qubits[0], qubits[i])\n", "\n", " kernel.mz(qubits)\n", "\n", " result = cudaq.sample(kernel, shots_count=1000)\n", "\n", " return result" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### QPP-based CPU Backend" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{ 00:475 11:525 }\n" ] } ], "source": [ "cpu_result = ghz_state(qubit_count=2, target=\"qpp-cpu\")\n", "\n", "cpu_result.dump()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Acceleration via NVIDIA GPUs\n", "\n", "Users will notice a **200x speedup** in executing the circuit below on NVIDIA GPUs vs CPUs." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{ 0000000000000000000000000:510 1111111111111111111111111:490 }\n" ] } ], "source": [ "gpu_result = ghz_state(qubit_count=25, target=\"nvidia\")\n", "\n", "gpu_result.dump()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple NVIDIA GPUs\n", "\n", "A $n$ qubit quantum state has $2^n$ complex amplitudes, each of which require 8 bytes of memory to store. Hence the total memory required to store a $n$ qubit quantum state is $8$ bytes $\\times 2^n$. For $n = 30$ qubits, this is roughly $8$ GB but for $n = 40$, this exponentially increases to 8700 GB. 
\n", "\n", "If one incrementally increases the qubit count in their circuit, we reach a limit where the memory required is beyond the capabilities of a single GPU. The `nvidia-mgpu` target allows for memory from additional GPUs to be pooled enabling qubit counts to be scaled. \n", "\n", "Execution on the `nvidia-mgpu` backed is enabled via `mpirun`. Users need to create a `.py` file with their code and run the command below in terminal:\n", "\n", "`mpirun -np 4 python3 test.py`\n", "\n", "where 4 is the number of GPUs one has access to and `test` is the file name chosen.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Multiple QPU's\n", "\n", "The `nvidia-mqpu` backend allows for future workflows made possible via GPU simulation today. \n", "\n", "\n", "\n", "### Asynchronous data collection via batching Hamiltonian terms\n", "\n", "Expectation value computations of multi-term hamiltonians can be asynchronously processed via the `mqpu` platform.\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "\"Alt\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "For workflows involving multiple GPUs, save the code below in a `filename.py` file and execute via: `mpirun -np n python3 filename.py` where `n` is an integer specifying the number of GPUs you have access to.\n" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "mpi is initialized? True\n", "rank 0 num_ranks 1\n" ] } ], "source": [ "import cudaq\n", "from cudaq import spin\n", "\n", "cudaq.set_target(\"nvidia-mqpu\")\n", "\n", "cudaq.mpi.initialize()\n", "num_ranks = cudaq.mpi.num_ranks()\n", "rank = cudaq.mpi.rank()\n", "\n", "print('mpi is initialized? 
', cudaq.mpi.is_initialized())\n", "print('rank', rank, 'num_ranks', num_ranks)\n", "\n", "qubit_count = 15\n", "term_count = 100000\n", "\n", "kernel = cudaq.make_kernel()\n", "qubits = kernel.qalloc(qubit_count)\n", "kernel.h(qubits[0])\n", "for i in range(1, qubit_count):\n", "    kernel.cx(qubits[0], qubits[i])\n", "\n", "# We create a random Hamiltonian.\n", "hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)\n", "\n", "# The observe call allows us to calculate the expectation value of the Hamiltonian with respect to a specified kernel.\n", "\n", "# Single node, single GPU.\n", "result = cudaq.observe(kernel, hamiltonian)\n", "result.expectation()\n", "\n", "# If we have multiple GPUs/QPUs available, we can parallelize the workflow by adding an argument to the observe call.\n", "\n", "# Single node, multi-GPU.\n", "result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.thread)\n", "result.expectation()\n", "\n", "# Multi-node, multi-GPU.\n", "result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.mpi)\n", "result.expectation()\n", "\n", "cudaq.mpi.finalize()" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "### Asynchronous data collection via circuit batching\n", "\n", "Parameterized circuits with different parameter values can be executed asynchronously via the `mqpu` platform.\n", "\n", "*(Figure: batches of parameterized circuits distributed across multiple virtual QPUs.)*\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "import cudaq\n", "from cudaq import spin\n", "import numpy as np\n", "\n", "np.random.seed(1)\n", "\n", "cudaq.set_target(\"nvidia-mqpu\")\n", "\n", "qubit_count = 5\n", "sample_count = 10000\n", "h = spin.z(0)\n", "parameter_count = qubit_count\n", "\n", "# Below we run a circuit for 10000 different input parameters.\n", "parameters = np.random.default_rng(13).uniform(low=0,\n", "                                               high=1,\n", "                                               size=(sample_count,\n", "                                                     parameter_count))\n", "\n", "kernel, params = 
cudaq.make_kernel(list)\n", "\n", "qubits = kernel.qalloc(qubit_count)\n", "qubits_list = list(range(qubit_count))\n", "\n", "for i in range(qubit_count):\n", "    kernel.rx(params[i], qubits[i])" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "31.7 s ± 990 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "source": [ "%timeit result = cudaq.observe(kernel, h, parameters)  # Single GPU result." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "We have 10000 parameters which we would like to execute\n", "We split this into 4 batches of 2500 , 2500 , 2500 , 2500\n" ] } ], "source": [ "print('We have', parameters.shape[0],\n", "      'parameters which we would like to execute')\n", "\n", "xi = np.split(\n", "    parameters,\n", "    4)  # We split our parameters into 4 arrays since we have 4 GPUs available.\n", "\n", "print('We split this into', len(xi), 'batches of', xi[0].shape[0], ',',\n", "      xi[1].shape[0], ',', xi[2].shape[0], ',', xi[3].shape[0])" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "85.3 ms ± 2.36 ms per loop (mean ± std. dev. 
of 7 runs, 10 loops each)\n" ] } ], "source": [ "%%timeit\n", "\n", "# Timing the execution on a single GPU vs 4 GPUs:\n", "# one can see up to a 4x performance improvement if 4 GPUs are available.\n", "\n", "asyncresults = []\n", "num_gpus = cudaq.num_available_gpus()\n", "\n", "for i in range(len(xi)):\n", "    for j in range(xi[i].shape[0]):\n", "        qpu_id = i * num_gpus // len(xi)\n", "        asyncresults.append(\n", "            cudaq.observe_async(kernel, h, xi[i][j, :], qpu_id=qpu_id))\n", "\n", "result = [res.get() for res in asyncresults]" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" }, "orig_nbformat": 4, "vscode": { "interpreter": { "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6" } } }, "nbformat": 4, "nbformat_minor": 2 }