Welcome to CUDA-Q! On this page we will illustrate CUDA-Q with several examples.

We’re going to take a look at how to construct quantum programs through CUDA-Q’s Kernel API.

When you create a Kernel and invoke its methods, a quantum program is constructed that can then be executed by calling, for example, cudaq.sample. Let’s take a closer look!

import cudaq

# We begin by defining the `Kernel` that we will construct our
# program with.
@cudaq.kernel
def kernel():
    '''
    This is our first CUDA-Q kernel.
    '''
    # Next, we can allocate a single qubit to the kernel via `qubit()`.
    qubit = cudaq.qubit()

    # Now we can begin adding instructions to apply to this qubit!
    # Here we'll just add every non-parameterized
    # single qubit gate that is supported by CUDA-Q.
    h(qubit)
    x(qubit)
    y(qubit)
    z(qubit)
    t(qubit)
    s(qubit)

    # Next, we add a measurement to the kernel so that we can sample
    # the measurement results on our simulator!
    mz(qubit)

# Finally, we can execute this kernel on the state vector simulator
# by calling `cudaq.sample`. This will execute the provided kernel
# `shots_count` number of times and return the sampled distribution
# as a `cudaq.SampleResult` dictionary.
result = cudaq.sample(kernel)

# Now let's take a look at the `SampleResult` we've gotten back!
print(result)
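
To get a feel for what the sampled distribution should look like, the sketch below reproduces the single-qubit circuit with plain Python and no CUDA-Q dependency. It assumes the non-parameterized single-qubit gates in question are h, x, y, z, t and s, applied in that order; the helper names `apply_gate` and `measurement_probabilities` are made up for this illustration and are not part of CUDA-Q.

```python
import cmath
import math

# 2x2 matrices for the assumed gate set, written as nested tuples so the
# sketch needs only the standard library.
SQRT_HALF = 1 / math.sqrt(2)
GATES = {
    "h": ((SQRT_HALF, SQRT_HALF), (SQRT_HALF, -SQRT_HALF)),
    "x": ((0, 1), (1, 0)),
    "y": ((0, -1j), (1j, 0)),
    "z": ((1, 0), (0, -1)),
    "t": ((1, 0), (0, cmath.exp(1j * math.pi / 4))),
    "s": ((1, 0), (0, 1j)),
}


def apply_gate(gate, state):
    """Multiply a 2x2 gate matrix into a 2-element state vector."""
    (a, b), (c, d) = gate
    return (a * state[0] + b * state[1], c * state[0] + d * state[1])


def measurement_probabilities(gate_names):
    """Return (P(0), P(1)) after applying the named gates to |0>."""
    state = (1 + 0j, 0j)
    for name in gate_names:
        state = apply_gate(GATES[name], state)
    return abs(state[0]) ** 2, abs(state[1]) ** 2


p0, p1 = measurement_probabilities(["h", "x", "y", "z", "t", "s"])
print(f"P(0) = {p0:.3f}, P(1) = {p1:.3f}")  # P(0) = 0.500, P(1) = 0.500
```

Under these assumptions the gates after the Hadamard only change phases, so a noiseless simulator should return the outcomes 0 and 1 about equally often.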

We’re going to take a look at how to construct quantum programs using CUDA-Q kernels.

In C++, a CUDA-Q kernel is any typed callable that is annotated with the __qpu__ attribute. Let’s take a look at a very simple “Hello World” example, specifically a CUDA-Q kernel that prepares a GHZ state on a programmer-specified number of qubits.

// Compile and run with:
// ```
// nvq++ static_kernel.cpp -o ghz.x && ./ghz.x
// ```

#include <cudaq.h>

// Define a CUDA-Q kernel that is fully specified
// at compile time via templates.
template <std::size_t N>
struct ghz {
  auto operator()() __qpu__ {

    // Compile-time sized array like std::array
    cudaq::qarray<N> q;
    h(q[0]);
    for (int i = 0; i < N - 1; i++) {
      x<cudaq::ctrl>(q[i], q[i + 1]);
    }
    mz(q);
  }
};

int main() {

  auto kernel = ghz<10>{};
  auto counts = cudaq::sample(kernel);

  if (!cudaq::mpi::is_initialized() || cudaq::mpi::rank() == 0) {
    counts.dump();

    // Fine grain access to the bits and counts
    for (auto &[bits, count] : counts) {
      printf("Observed: %s, %lu\n", bits.data(), count);
    }
  }

  return 0;
}

Here we see that we can define a custom struct that is templated on a size_t parameter. Our kernel expression is free to use this template parameter to allocate a register of qubits whose size is known at compile time. Within the kernel, we are free to apply various quantum operations, such as a Hadamard on qubit 0, h(q[0]). Controlled operations are modifications of single-qubit operations, like the x<cudaq::ctrl>(q[0],q[1]) operation, which implements a controlled-X gate. We can measure single qubits or entire registers.
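
The effect of a Hadamard followed by a chain of controlled-X gates can be checked by hand with a small state-tracking sketch. This is plain Python, not CUDA-Q: the function name `ghz_amplitudes` and the dictionary representation (bitstring mapped to amplitude, qubit 0 as the leftmost character) are assumptions chosen for illustration only.

```python
import math


def ghz_amplitudes(num_qubits):
    """Track the state as {bitstring: amplitude}, starting from |00...0>."""
    state = {"0" * num_qubits: 1.0}

    # Hadamard on qubit 0: each basis state splits into two branches.
    after_h = {}
    for bits, amp in state.items():
        sign = -1.0 if bits[0] == "1" else 1.0
        for new_bit, factor in (("0", 1.0), ("1", sign)):
            key = new_bit + bits[1:]
            after_h[key] = after_h.get(key, 0.0) + factor * amp / math.sqrt(2)
    state = after_h

    # Chain of controlled-X gates: flip qubit i + 1 whenever qubit i is 1.
    for i in range(num_qubits - 1):
        flipped = {}
        for bits, amp in state.items():
            if bits[i] == "1":
                target = "1" if bits[i + 1] == "0" else "0"
                bits = bits[: i + 1] + target + bits[i + 2:]
            flipped[bits] = amp
        state = flipped
    return state


state = ghz_amplitudes(4)
for bits, amp in sorted(state.items()):
    print(bits, round(amp ** 2, 3))
# 0000 0.5
# 1111 0.5
```

Only the all-zeros and all-ones bitstrings keep a non-zero amplitude, which is the defining property of a GHZ state.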

In this example we are interested in sampling the final state produced by this CUDA-Q kernel. To do so, we leverage the generic cudaq::sample function, which returns a data type encoding the qubit measurement strings and the corresponding number of times that string was observed (here the default number of shots is used, 1000).
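
As a rough picture of the data that sampling returns, the following sketch mimics a noiseless GHZ experiment with the Python standard library. It is a stand-in rather than CUDA-Q's implementation: `sample_ideal_ghz` is a hypothetical helper, and a `collections.Counter` plays the role of the mapping from measured bitstring to the number of times it was observed.

```python
import random
from collections import Counter


def sample_ideal_ghz(num_qubits, shots=1000, seed=None):
    """Mimic sampling a noiseless GHZ state: each shot is all 0s or all 1s."""
    rng = random.Random(seed)
    outcomes = ("0" * num_qubits, "1" * num_qubits)
    return Counter(rng.choice(outcomes) for _ in range(shots))


counts = sample_ideal_ghz(10, shots=1000, seed=7)

# Iterate over bitstrings and counts, much like the C++ loop in `main`.
for bits, count in sorted(counts.items()):
    print(f"Observed: {bits}, {count}")
```

With 1000 shots, each of the two bitstrings should appear roughly 500 times, and the counts always add up to the number of shots.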

The following example illustrates how to compile and execute this code.

nvq++ static_kernel.cpp -o ghz.x
./ghz.x