NVIDIA Quantum Cloud

NVIDIA Quantum Cloud (NVQC) offers universal access to the world’s most powerful computing platform, enabling every quantum researcher to do their life’s work. To learn more about NVQC, visit this link.

Apply for early access here. Access to the Quantum Cloud early access program requires an NVIDIA Developer account.

Quick Start

Once you have been approved for early access to NVQC, follow these instructions to use it.

  1. Follow the instructions in your NVQC Early Access welcome email to obtain an API Key for NVQC. You can also find the instructions here (link available only for approved users).

  2. Set the environment variable NVQC_API_KEY to the API Key obtained above.

export NVQC_API_KEY=<your NVQC API key>

You may wish to persist that environment variable between bash sessions, e.g., by adding it to your $HOME/.bashrc file.
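For example (assuming bash; adjust accordingly for another shell):

echo 'export NVQC_API_KEY=<your NVQC API key>' >> $HOME/.bashrc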

  3. Run your first NVQC example.

The following is a typical CUDA-Q kernel example. By selecting the nvqc target, the quantum circuit simulation runs on NVQC in the cloud rather than locally.

import cudaq

cudaq.set_target("nvqc")
num_qubits = 25
# Define a simple quantum kernel to execute on NVQC.
kernel = cudaq.make_kernel()
qubits = kernel.qalloc(num_qubits)
# Maximally entangled state between 25 qubits.
kernel.h(qubits[0])
for i in range(num_qubits - 1):
    kernel.cx(qubits[i], qubits[i + 1])
kernel.mz(qubits)

counts = cudaq.sample(kernel)
print(counts)
[2024-03-14 19:26:31.438] Submitting jobs to NVQC service with 1 GPU(s). Max execution time: 3600 seconds (excluding queue wait time).

================ NVQC Device Info ================
GPU Device Name: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version: 12.2 / 11.8
Total global memory (GB): 79.1
Memory Clock Rate (MHz): 2619.000
GPU Clock Rate (MHz): 1980.000
==================================================
{ 1111111111111111111111111:486 0000000000000000000000000:514 }
The same example in C++:

#include <cudaq.h>

// Define a simple quantum kernel to execute on NVQC.
struct ghz {
  // Maximally entangled state between 25 qubits.
  auto operator()() __qpu__ {
    constexpr int NUM_QUBITS = 25;
    cudaq::qvector q(NUM_QUBITS);
    h(q[0]);
    for (int i = 0; i < NUM_QUBITS - 1; i++) {
      x<cudaq::ctrl>(q[i], q[i + 1]);
    }
    auto result = mz(q);
  }
};

int main() {
  auto counts = cudaq::sample(ghz{});
  counts.dump();
}

The code above is saved in nvqc_intro.cpp and compiled with the following command, targeting the nvqc platform:

nvq++ nvqc_intro.cpp -o nvqc_intro.x --target nvqc
./nvqc_intro.x

[2024-03-14 19:25:05.545] Submitting jobs to NVQC service with 1 GPU(s). Max execution time: 3600 seconds (excluding queue wait time).

================ NVQC Device Info ================
GPU Device Name: "NVIDIA H100 80GB HBM3"
CUDA Driver Version / Runtime Version: 12.2 / 11.8
Total global memory (GB): 79.1
Memory Clock Rate (MHz): 2619.000
GPU Clock Rate (MHz): 1980.000
==================================================
{
__global__ : { 1111111111111111111111111:487 0000000000000000000000000:513 }
result : { 1111111111111111111111111:487 0000000000000000000000000:513 }
}

Simulator Backend Selection

NVQC hosts all CUDA-Q simulator backends (see CUDA-Q Simulation Backends). You may use the backend option (Python) or the --nvqc-backend option (C++) to select the simulator to be used by the service.

For example, to request the tensornet simulator backend, do the following in Python or C++, respectively:

cudaq.set_target("nvqc", backend="tensornet")
nvq++ nvqc_sample.cpp -o nvqc_sample.x --target nvqc --nvqc-backend tensornet

Note

By default, the single-GPU single-precision custatevec-fp32 simulator backend will be selected if backend information is not specified.
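For example, to request the double-precision state vector simulator explicitly (Python shown; the C++ --nvqc-backend option works the same way):

cudaq.set_target("nvqc", backend="custatevec-fp64")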

Multiple GPUs

Some CUDA-Q simulator backends are capable of multi-GPU distribution as detailed in CUDA-Q Simulation Backends. For example, the nvidia-mgpu backend can partition and distribute a state vector simulation across multiple GPUs to simulate a larger number of qubits, whose state vector would otherwise exceed the memory of a single GPU.
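As a rough sizing guide (a back-of-the-envelope sketch; state_vector_gb is an illustrative helper, not part of CUDA-Q), a dense state vector of n qubits holds 2^n complex amplitudes at 8 bytes each in single precision or 16 bytes each in double precision:

# Illustrative helper (not a CUDA-Q API): estimate dense state vector memory.
def state_vector_gb(num_qubits: int, bytes_per_amplitude: int = 16) -> float:
    return (2**num_qubits) * bytes_per_amplitude / 1e9

# A 33-qubit double-precision state vector needs ~137 GB, beyond a single
# 80 GB GPU, so a multi-GPU backend such as nvidia-mgpu would be required.
print(state_vector_gb(33))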

To select a specific number of GPUs on the NVQC managed service, use the ngpus option (Python) or the --nvqc-ngpus option (C++):

cudaq.set_target("nvqc", backend="nvidia-mgpu", ngpus=4)
nvq++ nvqc_sample.cpp -o nvqc_sample.x --target nvqc --nvqc-backend nvidia-mgpu --nvqc-ngpus 4

Note

If your NVQC subscription does not include service instances with the specified number of GPUs, you may encounter the following error:

Unable to find NVQC deployment with 16 GPUs.
Available deployments have {1, 2, 4, 8} GPUs.
Please check your `ngpus` value (Python) or `--nvqc-ngpus` value (C++).

Note

Not all simulator backends are capable of utilizing multiple GPUs. When requesting a multi-GPU service instance with a single-GPU simulator backend, you might encounter the following log message:

The requested backend simulator (custatevec-fp32) is not capable of using all 4 GPUs requested.
Only one GPU will be used for simulation.
Please refer to CUDA-Q documentation for a list of multi-GPU capable simulator backends.

If you do not need a multi-GPU backend, consider removing the ngpus value (Python) or the --nvqc-ngpus value (C++) to fall back to the default of one GPU and make better use of NVQC resources.

Please refer to the table below for a list of backend simulator names along with their GPU acceleration and multi-GPU capabilities.

Simulator Backends

Name              Description                                                               GPU Accelerated   Multi-GPU
qpp               CPU-only state vector simulator                                           no                no
dm                CPU-only density matrix simulator                                         no                no
custatevec-fp32   Single-precision cuStateVec simulator                                     yes               no
custatevec-fp64   Double-precision cuStateVec simulator                                     yes               no
tensornet         Double-precision cuTensorNet full tensor network contraction simulator    yes               yes
tensornet-mps     Double-precision cuTensorNet matrix-product state simulator               yes               no
nvidia-mgpu       Double-precision cuStateVec multi-GPU simulator                           yes               yes

Multiple QPUs Asynchronous Execution

NVQC provides scalable QPU virtualization services, whereby clients can submit asynchronous jobs simultaneously to NVQC. These jobs are handled by a pool of service worker instances.

For example, in the following code snippet, using the nqpus (Python) or --nvqc-nqpus (C++) configuration option, the user instantiates 3 virtual QPU instances and submits simulation jobs to NVQC that compute the expectation value and its parameter-shift gradient concurrently.

import cudaq
from cudaq import spin
import math

# Use NVQC with 3 virtual QPUs
cudaq.set_target("nvqc", nqpus=3)

print("Number of QPUs:", cudaq.get_target().num_qpus())
# Create the parameterized ansatz
kernel, theta = cudaq.make_kernel(float)
qreg = kernel.qalloc(2)
kernel.x(qreg[0])
kernel.ry(theta, qreg[1])
kernel.cx(qreg[1], qreg[0])

# Define its spin Hamiltonian.
hamiltonian = (5.907 - 2.1433 * spin.x(0) * spin.x(1) -
               2.1433 * spin.y(0) * spin.y(1) + 0.21829 * spin.z(0) -
               6.125 * spin.z(1))


def opt_gradient(parameter_vector):
    # Evaluate energy and gradient on different remote QPUs
    # (i.e., concurrent job submissions to NVQC)
    energy_future = cudaq.observe_async(kernel,
                                        hamiltonian,
                                        parameter_vector[0],
                                        qpu_id=0)
    plus_future = cudaq.observe_async(kernel,
                                      hamiltonian,
                                      parameter_vector[0] + 0.5 * math.pi,
                                      qpu_id=1)
    minus_future = cudaq.observe_async(kernel,
                                       hamiltonian,
                                       parameter_vector[0] - 0.5 * math.pi,
                                       qpu_id=2)
    return (energy_future.get().expectation(), [
        (plus_future.get().expectation() - minus_future.get().expectation()) /
        2.0
    ])


optimizer = cudaq.optimizers.LBFGS()
optimal_value, optimal_parameters = optimizer.optimize(1, opt_gradient)
print("Ground state energy =", optimal_value)
print("Optimal parameters =", optimal_parameters)
The same example in C++:

#include <cudaq.h>
#include <cudaq/algorithm.h>
#include <cudaq/gradients.h>
#include <cudaq/optimizers.h>
#include <iostream>

int main() {
  using namespace cudaq::spin;
  cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                     .21829 * z(0) - 6.125 * z(1);

  auto [ansatz, theta] = cudaq::make_kernel<double>();
  auto q = ansatz.qalloc();
  auto r = ansatz.qalloc();
  ansatz.x(q);
  ansatz.ry(theta, r);
  ansatz.x<cudaq::ctrl>(r, q);

  // Run VQE with a gradient-based optimizer.
  // Delegate cost function and gradient computation across different NVQC-based
  // QPUs.
  // Note: this needs to be compiled with `--nvqc-nqpus 3` to create 3 virtual
  // QPUs.
  cudaq::optimizers::lbfgs optimizer;
  auto [opt_val, opt_params] = optimizer.optimize(
      /*dim=*/1, /*opt_function*/ [&](const std::vector<double> &params,
                                      std::vector<double> &grads) {
        // Queue asynchronous jobs to do energy evaluations across multiple QPUs
        auto energy_future =
            cudaq::observe_async(/*qpu_id=*/0, ansatz, h, params[0]);
        const double paramShift = M_PI_2;
        auto plus_future = cudaq::observe_async(/*qpu_id=*/1, ansatz, h,
                                                params[0] + paramShift);
        auto minus_future = cudaq::observe_async(/*qpu_id=*/2, ansatz, h,
                                                 params[0] - paramShift);
        grads[0] = (plus_future.get().expectation() -
                    minus_future.get().expectation()) /
                   2.0;
        return energy_future.get().expectation();
      });
  std::cout << "Minimum energy = " << opt_val << " (expected -1.74886).\n";
}

The code above is saved in nvqc_vqe.cpp and compiled with the following command, targeting the nvqc platform with 3 virtual QPUs:

nvq++ nvqc_vqe.cpp -o nvqc_vqe.x --target nvqc --nvqc-nqpus 3
./nvqc_vqe.x

Note

The NVQC managed service has a pool of worker instances processing incoming requests on a first-come-first-served basis. Thus, the attainable speedup from using multiple virtual QPUs versus sequential execution on a single QPU depends on the NVQC service load. For example, if the number of free workers is greater than the number of requested virtual QPUs, a linear (ideal) speedup could be achieved. On the other hand, if all the service workers are busy, multi-QPU distribution may not deliver any substantial speedup.
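A minimal sketch of how one might measure this in practice (illustrative only: the observable and parameter values are arbitrary, and the measured speedup depends on service load):

import math
import time

import cudaq
from cudaq import spin

cudaq.set_target("nvqc", nqpus=3)

kernel, theta = cudaq.make_kernel(float)
q = kernel.qalloc(2)
kernel.x(q[0])
kernel.ry(theta, q[1])

h = spin.z(0) * spin.z(1)  # arbitrary observable, for timing only
params = [0.1, 0.1 + 0.5 * math.pi, 0.1 - 0.5 * math.pi]

# Sequential: each job waits for the previous one to complete.
start = time.time()
sequential = [cudaq.observe(kernel, h, p).expectation() for p in params]
print(f"sequential: {time.time() - start:.1f} s")

# Concurrent: all three jobs are in flight at once, one per virtual QPU.
start = time.time()
futures = [
    cudaq.observe_async(kernel, h, p, qpu_id=i) for i, p in enumerate(params)
]
concurrent = [f.get().expectation() for f in futures]
print(f"concurrent: {time.time() - start:.1f} s")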

FAQ

  1. How do I get more information about my NVQC API submission?

The environment variable NVQC_LOG_LEVEL can be used to turn on and off certain logs. There are three levels:

  • Info (info): basic information about NVQC is logged to the console. This is the default.

  • Off (off or 0): disable all NVQC logging.

  • Trace (trace): log additional information for each NVQC job execution (including timing).
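For example, to enable the most verbose logging:

export NVQC_LOG_LEVEL=trace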

  2. I want to persist my API key to a configuration file.

You may persist your NVQC API Key to a credential configuration file in lieu of using the NVQC_API_KEY environment variable. The configuration file can be generated as follows, replacing <api_key> with your NVQC API Key.

echo "key: <api_key>" >> $HOME/.nvqc_config