{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Multi-GPU Workflows\n",
    "\n",
    "There are many backends available with CUDA Quantum which enable seamless switching between GPUs, QPUs and CPUs and also allow for workflows involing multiple architectures working in tandem.\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudaq\n",
    "\n",
    "targets = cudaq.get_targets()\n",
    "\n",
    "# for target in targets:\n",
    "#     print(target)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Available Targets\n",
    "\n",
    "- **qpp-cpu**: The qpp based CPU backend which is multithreaded to maximize the usage of available cores on your system.\n",
    "\n",
    "- **nvidia**: GPU based backend which accelerates quantum circuit simulation on NVIDIA GPUs powered by cuQuantum.\n",
    "\n",
    "- **nvidia-mqpu**: Enables users to program workflows utilizing multiple quantum processors enabled today by GPU emulation. \n",
    "\n",
    "- **nvidia-mgpu**: Allows for scaling circuit simulation beyond what is feasible with any QPU today. \n",
    "\n",
    "- **density-matrix-cpu**: Noisy simulations via density matrix calculations. CPU only for now with GPU support coming soon. \n",
    "\n",
    "Below we explore how to effectively utilize multi-GPU targets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "def ghz_state(qubit_count, target):\n",
    "    \"\"\"A function that will generate a variable sized GHZ state (`qubit_count`).\"\"\"\n",
    "    cudaq.set_target(target)\n",
    "\n",
    "    kernel = cudaq.make_kernel()\n",
    "\n",
    "    qubits = kernel.qalloc(qubit_count)\n",
    "\n",
    "    kernel.h(qubits[0])\n",
    "\n",
    "    for i in range(1, qubit_count):\n",
    "        kernel.cx(qubits[0], qubits[i])\n",
    "\n",
    "    kernel.mz(qubits)\n",
    "\n",
    "    result = cudaq.sample(kernel, shots_count=1000)\n",
    "\n",
    "    return result"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### QPP-based CPU Backend"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{ 00:518 11:482 }\n"
     ]
    }
   ],
   "source": [
    "cpu_result = ghz_state(qubit_count=2, target=\"qpp-cpu\")\n",
    "\n",
    "cpu_result.dump()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Acceleration via NVIDIA GPUs\n",
    "\n",
    "Users will notice a **200x speedup** in executing the circuit below on NVIDIA GPUs vs CPUs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{ 0000000000000000000000000:477 1111111111111111111111111:523 }\n"
     ]
    }
   ],
   "source": [
    "gpu_result = ghz_state(qubit_count=25, target=\"nvidia\")\n",
    "\n",
    "gpu_result.dump()"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple NVIDIA GPUs\n",
    "\n",
    "A $n$ qubit quantum state has $2^n$ complex amplitudes, each of which require 8 bytes of memory to store. Hence the total memory required to store a $n$ qubit quantum state is $8$ bytes $\\times 2^n$. For $n = 30$ qubits, this is roughly $8$ GB but for $n = 40$, this exponentially increases to 8700 GB. \n",
    "\n",
    "If one incrementally increases the qubit count in their circuit, we reach a limit where the memory required is beyond the capabilities of a single GPU. The `nvidia-mgpu` target allows for memory from additional GPUs to be pooled enabling qubit counts to be scaled.  \n",
    "\n",
    "Execution on the `nvidia-mgpu` backed is enabled via `mpirun`. Users need to create a `.py` file with their code and run the command below in terminal:\n",
    "\n",
    "`mpirun -np 4 python3 test.py`\n",
    "\n",
    "where 4 is the number of GPUs one has access to and `test` is the file name chosen.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Multiple QPU's\n",
    "\n",
    "The `nvidia-mqpu` backend allows for future workflows made possible via GPU simulation today. \n",
    "\n",
    "\n",
    "\\\n",
    "### Asynchronous data collection via batching Hamiltonian terms\n",
    "\n",
    "Expectation value computations of multi-term hamiltonians can be asynchronously processed via the `mqpu` platform.\n"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img src=\"images/hsplit.png\" alt=\"Alt Text\" width=\"500\" height=\"200\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "cudaq.set_target(\"nvidia-mqpu\")\n",
    "\n",
    "qubit_count = 15\n",
    "term_count = 100000\n",
    "\n",
    "kernel = cudaq.make_kernel()\n",
    "\n",
    "qubits = kernel.qalloc(qubit_count)\n",
    "\n",
    "kernel.h(qubits[0])\n",
    "\n",
    "for i in range(1, qubit_count):\n",
    "    kernel.cx(qubits[0], qubits[i])\n",
    "\n",
    "# We create a random hamiltonian with 10e5 terms\n",
    "hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)\n",
    "\n",
    "# The observe calls allows us to calculate the expectation value of the Hamiltonian, batches the terms, and distributes them over the multiple QPU's/GPUs.\n",
    "\n",
    "# expectation = cudaq.observe(kernel, hamiltonian)  # Single node, single GPU.\n",
    "\n",
    "expectation = cudaq.observe(\n",
    "    kernel, hamiltonian,\n",
    "    execution=cudaq.parallel.thread)  # Single node, multi-GPU.\n",
    "\n",
    "# expectation = cudaq.observe(kernel, hamiltonian, execution= cudaq.parallel.mpi) # Multi-node, multi-GPU."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font size=\"4\"> Asynchronous data collection via circuit batching</font>\n",
    "\n",
    "Execution of parameterized circuits with different parameters can be executed asynchronously via the `mqpu` platform.\n",
    "\n",
    "<img src=\"images/circsplit.png\" alt=\"Alt Text\" width=\"500\" height=\"200\">\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "import cudaq\n",
    "from cudaq import spin\n",
    "import numpy as np\n",
    "\n",
    "np.random.seed(1)\n",
    "\n",
    "cudaq.set_target(\"nvidia-mqpu\")\n",
    "\n",
    "qubit_count = 5\n",
    "sample_count = 10000\n",
    "h = spin.z(0)\n",
    "parameter_count = qubit_count\n",
    "\n",
    "# Below we run a circuit for 10000 different input parameters.\n",
    "parameters = np.random.default_rng(13).uniform(low=0,\n",
    "                                               high=1,\n",
    "                                               size=(sample_count,\n",
    "                                                     parameter_count))\n",
    "\n",
    "kernel, params = cudaq.make_kernel(list)\n",
    "\n",
    "qubits = kernel.qalloc(qubit_count)\n",
    "qubits_list = list(range(qubit_count))\n",
    "\n",
    "for i in range(qubit_count):\n",
    "    kernel.rx(params[i], qubits[i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "29.7 s ± 548 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%timeit result = cudaq.observe(kernel, h, parameters)   # Single GPU result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "We have 10000 parameters which we would like to execute\n",
      "We split this into 4 batches of 2500 , 2500 , 2500 , 2500\n"
     ]
    }
   ],
   "source": [
    "print('We have', parameters.shape[0],\n",
    "      'parameters which we would like to execute')\n",
    "\n",
    "xi = np.split(\n",
    "    parameters,\n",
    "    4)  # We split our parameters into 4 arrays since we have 4 GPUs available.\n",
    "\n",
    "print('We split this into', len(xi), 'batches of', xi[0].shape[0], ',',\n",
    "      xi[1].shape[0], ',', xi[2].shape[0], ',', xi[3].shape[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "939 ms ± 3.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "source": [
    "%%timeit\n",
    "\n",
    "# Timing the execution on a single GPU vs 4 GPUs, users will see a 4x performance improvement\n",
    "\n",
    "asyncresults = []\n",
    "\n",
    "for i in range(len(xi)):\n",
    "    for j in range(xi[i].shape[0]):\n",
    "        asyncresults.append(\n",
    "            cudaq.observe_async(kernel, h, xi[i][j, :], qpu_id=i))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0]"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}