{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Multi-GPU Workflows\n",
"\n",
"There are many backends available with CUDA-Q which enable seamless switching between GPUs, QPUs and CPUs and also allow for workflows involving multiple architectures working in tandem.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import cudaq\n",
"\n",
"targets = cudaq.get_targets()\n",
"\n",
"# for target in targets:\n",
"# print(target)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Available Targets\n",
"\n",
"- **qpp-cpu**: The qpp based CPU backend which is multithreaded to maximize the usage of available cores on your system.\n",
"\n",
"- **nvidia**: GPU based backend which accelerates quantum circuit simulation on NVIDIA GPUs powered by cuQuantum.\n",
"\n",
"- **nvidia-mqpu**: Enables users to program workflows utilizing multiple quantum processors enabled today by GPU emulation. \n",
"\n",
"- **nvidia-mgpu**: Allows for scaling circuit simulation beyond what is feasible with any QPU today. \n",
"\n",
"- **density-matrix-cpu**: Noisy simulations via density matrix calculations. CPU only for now with GPU support coming soon. \n",
"\n",
"Below we explore how to effectively utilize multi-GPU targets."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def ghz_state(qubit_count, target):\n",
" \"\"\"A function that will generate a variable sized GHZ state (`qubit_count`).\"\"\"\n",
" cudaq.set_target(target)\n",
"\n",
" kernel = cudaq.make_kernel()\n",
"\n",
" qubits = kernel.qalloc(qubit_count)\n",
"\n",
" kernel.h(qubits[0])\n",
"\n",
" for i in range(1, qubit_count):\n",
" kernel.cx(qubits[0], qubits[i])\n",
"\n",
" kernel.mz(qubits)\n",
"\n",
" result = cudaq.sample(kernel, shots_count=1000)\n",
"\n",
" return result"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### QPP-based CPU Backend"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{ 00:475 11:525 }\n"
]
}
],
"source": [
"cpu_result = ghz_state(qubit_count=2, target=\"qpp-cpu\")\n",
"\n",
"cpu_result.dump()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Acceleration via NVIDIA GPUs\n",
"\n",
"Users will notice a **200x speedup** in executing the circuit below on NVIDIA GPUs vs CPUs."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{ 0000000000000000000000000:510 1111111111111111111111111:490 }\n"
]
}
],
"source": [
"gpu_result = ghz_state(qubit_count=25, target=\"nvidia\")\n",
"\n",
"gpu_result.dump()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple NVIDIA GPUs\n",
"\n",
"A $n$ qubit quantum state has $2^n$ complex amplitudes, each of which require 8 bytes of memory to store. Hence the total memory required to store a $n$ qubit quantum state is $8$ bytes $\\times 2^n$. For $n = 30$ qubits, this is roughly $8$ GB but for $n = 40$, this exponentially increases to 8700 GB. \n",
"\n",
"If one incrementally increases the qubit count in their circuit, we reach a limit where the memory required is beyond the capabilities of a single GPU. The `nvidia-mgpu` target allows for memory from additional GPUs to be pooled enabling qubit counts to be scaled. \n",
"\n",
"Execution on the `nvidia-mgpu` backed is enabled via `mpirun`. Users need to create a `.py` file with their code and run the command below in terminal:\n",
"\n",
"`mpirun -np 4 python3 test.py`\n",
"\n",
"where 4 is the number of GPUs one has access to and `test` is the file name chosen.\n"
]
},
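{
 "attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The sketch below shows what such a file could contain, reusing the GHZ construction from above on the `nvidia-mgpu` target. The file name `test.py`, the qubit count of 32 and the GPU count of 4 are illustrative assumptions; the qubit count you can actually reach depends on the pooled GPU memory available.\n",
  "\n",
  "```python\n",
  "# test.py -- run with: mpirun -np 4 python3 test.py\n",
  "import cudaq\n",
  "\n",
  "cudaq.set_target(\"nvidia-mgpu\")\n",
  "\n",
  "# 32 qubits require roughly 8 bytes x 2**32 = 34 GB of state-vector memory,\n",
  "# which is pooled across the GPUs made available to mpirun.\n",
  "qubit_count = 32\n",
  "\n",
  "kernel = cudaq.make_kernel()\n",
  "qubits = kernel.qalloc(qubit_count)\n",
  "kernel.h(qubits[0])\n",
  "for i in range(1, qubit_count):\n",
  "    kernel.cx(qubits[0], qubits[i])\n",
  "kernel.mz(qubits)\n",
  "\n",
  "result = cudaq.sample(kernel, shots_count=1000)\n",
  "result.dump()\n",
  "```"
 ]
},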
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multiple QPU's\n",
"\n",
"The `nvidia-mqpu` backend allows for future workflows made possible via GPU simulation today. \n",
"\n",
"\n",
"\n",
"### Asynchronous data collection via batching Hamiltonian terms\n",
"\n",
"Expectation value computations of multi-term hamiltonians can be asynchronously processed via the `mqpu` platform.\n"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"For workflows involving multiple GPUs, save the code below in a `filename.py` file and execute via: `mpirun -np n python3 filename.py` where `n` is an integer specifying the number of GPUs you have access to.\n"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mpi is initialized? True\n",
"rank 0 num_ranks 1\n"
]
}
],
"source": [
"import cudaq\n",
"from cudaq import spin\n",
"\n",
"cudaq.set_target(\"nvidia-mqpu\")\n",
"\n",
"cudaq.mpi.initialize()\n",
"num_ranks = cudaq.mpi.num_ranks()\n",
"rank = cudaq.mpi.rank()\n",
"\n",
"print('mpi is initialized? ', cudaq.mpi.is_initialized())\n",
"print('rank', rank, 'num_ranks', num_ranks)\n",
"\n",
"qubit_count = 15\n",
"term_count = 100000\n",
"\n",
"kernel = cudaq.make_kernel()\n",
"qubits = kernel.qalloc(qubit_count)\n",
"kernel.h(qubits[0])\n",
"for i in range(1, qubit_count):\n",
" kernel.cx(qubits[0], qubits[i])\n",
"\n",
"# We create a random hamiltonian\n",
"hamiltonian = cudaq.SpinOperator.random(qubit_count, term_count)\n",
"\n",
"# The observe calls allows us to calculate the expectation value of the Hamiltonian with respect to a specified kernel.\n",
"\n",
"# Single node, single GPU.\n",
"result = cudaq.observe(kernel, hamiltonian)\n",
"result.expectation()\n",
"\n",
"# If we have multiple GPUs/ QPUs available, we can parallelize the workflow with the addition of an argument in the observe call.\n",
"\n",
"# Single node, multi-GPU.\n",
"result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.thread)\n",
"result.expectation()\n",
"\n",
"# Multi-node, multi-GPU.\n",
"result = cudaq.observe(kernel, hamiltonian, execution=cudaq.parallel.mpi)\n",
"result.expectation()\n",
"\n",
"cudaq.mpi.finalize()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
" Asynchronous data collection via circuit batching\n",
"\n",
"Execution of parameterized circuits with different parameters can be executed asynchronously via the `mqpu` platform.\n",
"\n",
"
\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"import cudaq\n",
"from cudaq import spin\n",
"import numpy as np\n",
"\n",
"np.random.seed(1)\n",
"\n",
"cudaq.set_target(\"nvidia-mqpu\")\n",
"\n",
"qubit_count = 5\n",
"sample_count = 10000\n",
"h = spin.z(0)\n",
"parameter_count = qubit_count\n",
"\n",
"# Below we run a circuit for 10000 different input parameters.\n",
"parameters = np.random.default_rng(13).uniform(low=0,\n",
" high=1,\n",
" size=(sample_count,\n",
" parameter_count))\n",
"\n",
"kernel, params = cudaq.make_kernel(list)\n",
"\n",
"qubits = kernel.qalloc(qubit_count)\n",
"qubits_list = list(range(qubit_count))\n",
"\n",
"for i in range(qubit_count):\n",
" kernel.rx(params[i], qubits[i])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"31.7 s ± 990 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit result = cudaq.observe(kernel, h, parameters) # Single GPU result."
]
},
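{
 "attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Before batching the parameters, we can check how many virtual QPUs (GPUs) the `nvidia-mqpu` platform exposes. The short sketch below assumes the target set earlier in this notebook is still active; on the 4-GPU machine used here it is expected to report 4.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Query the number of virtual QPUs exposed by the current target.\n",
  "target = cudaq.get_target()\n",
  "qpu_count = target.num_qpus()\n",
  "print('Number of QPUs:', qpu_count)"
 ]
},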
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"We have 10000 parameters which we would like to execute\n",
"We split this into 4 batches of 2500 , 2500 , 2500 , 2500\n"
]
}
],
"source": [
"print('We have', parameters.shape[0],\n",
" 'parameters which we would like to execute')\n",
"\n",
"xi = np.split(\n",
" parameters,\n",
" 4) # We split our parameters into 4 arrays since we have 4 GPUs available.\n",
"\n",
"print('We split this into', len(xi), 'batches of', xi[0].shape[0], ',',\n",
" xi[1].shape[0], ',', xi[2].shape[0], ',', xi[3].shape[0])"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"85.3 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%%timeit\n",
"\n",
"# Timing the execution on a single GPU vs 4 GPUs, users will see a 4x performance improvement\n",
"\n",
"asyncresults = []\n",
"\n",
"for i in range(len(xi)):\n",
" for j in range(xi[i].shape[0]):\n",
" asyncresults.append(\n",
" cudaq.observe_async(kernel, h, xi[i][j, :], qpu_id=i))"
]
}
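,
{
 "attachments": {},
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "The `%%timeit` cell above only submits the asynchronous jobs: `observe_async` returns a future for each call, and variables assigned inside `%%timeit` are not retained in the notebook namespace. The sketch below re-runs the submission loop outside of the timing magic and shows one way to gather the expectation values from the futures via `get()`.\n"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# Re-submit the batched jobs and collect the results from the returned futures.\n",
  "asyncresults = []\n",
  "\n",
  "for i in range(len(xi)):\n",
  "    for j in range(xi[i].shape[0]):\n",
  "        asyncresults.append(\n",
  "            cudaq.observe_async(kernel, h, xi[i][j, :], qpu_id=i))\n",
  "\n",
  "# Each `get()` call blocks until that job has finished on its assigned GPU.\n",
  "expectation_values = [res.get().expectation() for res in asyncresults]\n",
  "print(len(expectation_values), 'expectation values collected')"
 ]
}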
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}