Running Your First CUDA-Q Program

Now that you have defined your first quantum kernel, let’s look at the different options for executing it. In CUDA-Q, quantum circuits are expressed as quantum kernels. To estimate the probability distribution of a measured quantum state, use the sample function call; to analyze the individual return values from multiple executions, use the run function call; and to compute the expectation value of a quantum state with respect to a given observable, use the observe function call.

Sample

Quantum states collapse upon measurement and hence need to be sampled many times to gather statistics. The CUDA-Q sample call enables this.

The cudaq.sample() method takes a kernel and its arguments as inputs, and returns a cudaq.SampleResult.

The cudaq::sample method takes a kernel and its arguments as inputs, and returns a cudaq::sample_result.

This result dictionary contains the distribution of measured states for the system.

Continuing with the GHZ kernel defined in Building Your First CUDA-Q Program, we will set the concrete value of our qubit_count to be two.

qubit_count = 2
print(cudaq.draw(kernel, qubit_count))
results = cudaq.sample(kernel, qubit_count)
# Should see a roughly 50/50 distribution between the |00> and
# |11> states. Example: {00: 505  11: 495}
print("Measurement distribution:" + str(results))
int main() {

  int qubit_count = 2;
  auto result_0 = cudaq::sample(kernel, /* kernel args */ qubit_count);
  // Should see a roughly 50/50 distribution between the |00> and
  // |11> states. Example: {00: 505  11: 495}
  result_0.dump();

The code above can be run like any other program:

Assuming the program is saved in the file sample.py, we can execute it with the command

python3 sample.py

Assuming the program is saved in the file sample.cpp, we can now compile this file with the nvq++ toolchain, and then run the compiled executable.

nvq++ sample.cpp
./a.out

By default, sample produces an ensemble of 1000 shots. This can be changed by specifying an integer value for the shots_count argument.

# With an increased shots count, we will still see the same 50/50 distribution,
# but now with 10,000 total measurements instead of the default 1000.
# Example: {00: 5005  11: 4995}
results = cudaq.sample(kernel, qubit_count, shots_count=10000)
print("Measurement distribution:" + str(results))
  // With an increased shots count, we will still see the same 50/50
  // distribution, but now with 10,000 total measurements instead of the default
  // 1000. Example: {00: 5005  11: 4995}
  int shots_count = 10000;
  auto result_1 = cudaq::sample(shots_count, kernel, qubit_count);
  result_1.dump();

Note that there is a subtle difference in how sample is executed depending on whether the target is a simulator or a QPU. When run on a simulator, the quantum state is built once and then sampled repeatedly, where the number of samples is defined by shots_count. When executed on quantum hardware, the quantum state collapses upon measurement and hence has to be rebuilt for every shot.

A variety of methods can be used to extract useful information from a SampleResult. For example, to return the most probable measurement and its respective probability:

most_probable_result = results.most_probable()
probability = results.probability(most_probable_result)
print("Most probable result: " + most_probable_result)
print("Measured with probability " + str(probability), end='\n\n')

See the API specification for further information.

  std::cout << result_1.most_probable() << "\n"; // prints: `00`
  std::cout << result_1.probability(result_1.most_probable())
            << "\n"; // prints: `0.5005`
}

See the API specification for further information.

Sampling a distribution can be a time-intensive task. An asynchronous version of sample exists and can be useful for parallelizing your application. Asynchronous programming is a technique that enables your program to start a potentially long-running task and still remain responsive to other events while that task runs, rather than having to wait until the task has finished. Once the task completes, your program is presented with the result.

Asynchronous execution makes it easy to parallelize the execution of multiple kernels on a multi-processor platform. Such a platform is available, for example, by choosing the nvidia target with the mqpu option:

@cudaq.kernel
def kernel2(qubit_count: int):
    # Allocate our qubits.
    qvector = cudaq.qvector(qubit_count)
    # Place all qubits in a uniform superposition.
    h(qvector)
    # Measure the qubits.
    mz(qvector)


num_gpus = cudaq.num_available_gpus()
if num_gpus > 1:
    # Set the target to include multiple virtual QPUs.
    cudaq.set_target("nvidia", option="mqpu")
    # Asynchronous execution on multiple virtual QPUs, each simulated by an NVIDIA GPU.
    result_1 = cudaq.sample_async(kernel,
                                  qubit_count,
                                  shots_count=1000,
                                  qpu_id=0)
    result_2 = cudaq.sample_async(kernel2,
                                  qubit_count,
                                  shots_count=1000,
                                  qpu_id=1)
else:
    # Schedule for execution on the same virtual QPU.
    result_1 = cudaq.sample_async(kernel,
                                  qubit_count,
                                  shots_count=1000,
                                  qpu_id=0)
    result_2 = cudaq.sample_async(kernel2,
                                  qubit_count,
                                  shots_count=1000,
                                  qpu_id=0)

print("Measurement distribution for kernel:" + str(result_1.get()))
print("Measurement distribution for kernel2:" + str(result_2.get()))

Note

This kind of parallelization is most effective if you actually have multiple QPUs or GPUs available. Otherwise, the sampling will still execute sequentially due to resource constraints.

More information about parallelizing execution can be found at the Simulate Multiple QPUs in Parallel page.

Run

The run method executes a quantum kernel multiple times and returns each individual result. Unlike sample, which collects measurement statistics as counts, run preserves each individual return value from every execution. This is useful when you need to analyze the distribution of returned values rather than just aggregated measurement counts. The run method also supports returning various types of values from the quantum kernel, including scalar types (bool, int, float, and their variants) and user-defined data structures.

The cudaq.run method takes a kernel and its arguments as inputs, and returns a list containing the result values from each execution. The kernel must return a non-void value.

The cudaq::run method takes a kernel and its arguments as inputs, and returns a std::vector containing the result values from each execution. The kernel must return a non-void value.

Below is an example of a quantum kernel that creates a GHZ state, measures all qubits, and returns the total count of qubits in state \(|1\rangle\):

import cudaq


# Define a quantum kernel that returns an integer
@cudaq.kernel
def ghz_kernel(qubit_count: int) -> int:
    # Allocate qubits
    qubits = cudaq.qvector(qubit_count)

    # Create GHZ state
    h(qubits[0])
    for i in range(1, qubit_count):
        x.ctrl(qubits[0], qubits[i])

    # Measure and count the number of qubits in state |1⟩
    result = 0
    for i in range(qubit_count):
        if mz(qubits[i]):
            result += 1

    return result


# Execute the kernel multiple times and collect individual results
qubit_count = 3
results = cudaq.run(ghz_kernel, qubit_count, shots_count=10)
print(f"Executed {len(results)} shots")
print(f"Results: {results}")
#include <algorithm>
#include <cudaq.h>
#include <iostream>
#include <map>
#include <numeric>

// Define a quantum kernel that returns an integer
__qpu__ int ghz_kernel(int qubit_count) {
  // Allocate qubits
  cudaq::qvector qubits(qubit_count);

  // Create GHZ state
  h(qubits[0]);
  for (int i = 1; i < qubit_count; ++i) {
    x<cudaq::ctrl>(qubits[0], qubits[i]);
  }

  // Measure and count the number of qubits in state |1⟩
  int result = 0;
  for (int i = 0; i < qubit_count; ++i) {
    if (mz(qubits[i])) {
      result += 1;
    }
  }

  return result;
}

int main() {
  // Execute the kernel multiple times and collect individual results
  int qubit_count = 3;
  auto results = cudaq::run(10, ghz_kernel, qubit_count);

  std::cout << "Executed " << results.size() << " shots\n";
  std::cout << "Results: ";
  for (auto result : results) {
    std::cout << result << " ";
  }
  std::cout << "\n";

The code above will execute the kernel multiple times (defined by shots_count) and return a list of individual results. By default, the shots_count for run is 100.
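
For instance, a quick sketch calling run without specifying a shots_count, reusing the ghz_kernel defined above:

# With no shots_count argument, run defaults to 100 executions.
default_results = cudaq.run(ghz_kernel, qubit_count)
print(f"Executed {len(default_results)} shots")  # prints: Executed 100 shots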

You can process the results to get statistics or other insights:

# Count occurrences of each result
value_counts = {}
for value in results:
    value_counts[value] = value_counts.get(value, 0) + 1

print("\nCounts of each result:")
for value, count in sorted(value_counts.items()):
    print(f"Result {value}: {count} times")

# Analyze patterns in the results
zero_count = results.count(0)
full_count = results.count(qubit_count)
other_count = len(results) - zero_count - full_count
print(f"\nGHZ state analysis:")
print(
    f"  All qubits in |0⟩: {zero_count} times ({zero_count/len(results)*100:.1f}%)"
)
print(
    f"  All qubits in |1⟩: {full_count} times ({full_count/len(results)*100:.1f}%)"
)
print(
    f"  Other states: {other_count} times ({other_count/len(results)*100:.1f}%)"
)
  // Count occurrences of each result
  std::map<int, int> value_counts;
  for (auto value : results) {
    value_counts[value]++;
  }

  std::cout << "\nCounts of each result:\n";
  for (auto &[value, count] : value_counts) {
    std::cout << "Result " << value << ": " << count << " times\n";
  }

  // Analyze patterns in the results
  int zero_count = std::count(results.begin(), results.end(), 0);
  int full_count = std::count(results.begin(), results.end(), qubit_count);
  int other_count = results.size() - zero_count - full_count;

  std::cout << "\nGHZ state analysis:\n";
  std::cout << "  All qubits in |0⟩: " << zero_count << " times ("
            << (float)zero_count / results.size() * 100.0 << "%)\n";
  std::cout << "  All qubits in |1⟩: " << full_count << " times ("
            << (float)full_count / results.size() * 100.0 << "%)\n";
  std::cout << "  Other states: " << other_count << " times ("
            << (float)other_count / results.size() * 100.0 << "%)\n";

Note

Currently, run supports kernels returning scalar types (bool, int, float) and custom data structures.

Note

When using custom data structures, they must be defined with slots=True in Python or as simple aggregates in C++.
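
As an illustration, the following sketch returns a small user-defined structure from a kernel. It assumes a CUDA-Q version that supports custom data types with run; the MeasureResult class and bell_pair kernel are purely illustrative names:

from dataclasses import dataclass

import cudaq


# Custom return types must be simple aggregates defined with slots=True.
@dataclass(slots=True)
class MeasureResult:
    first: bool
    second: bool


@cudaq.kernel
def bell_pair() -> MeasureResult:
    qubits = cudaq.qvector(2)
    h(qubits[0])
    x.ctrl(qubits[0], qubits[1])
    b0 = mz(qubits[0])
    b1 = mz(qubits[1])
    return MeasureResult(b0, b1)


# Each shot yields its own MeasureResult instance.
for r in cudaq.run(bell_pair, shots_count=5):
    print(r.first, r.second)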

Similar to sample_async, the run API also supports asynchronous execution through run_async. This is particularly useful for parallelizing execution of multiple kernels on a multi-processor platform:

# Define a simple kernel for asynchronous execution
@cudaq.kernel
def simple_kernel(theta: float) -> bool:
    q = cudaq.qubit()
    rx(theta, q)
    return mz(q)


# Check if we have multiple GPUs
num_gpus = cudaq.num_available_gpus()
if num_gpus > 1:
    # Set the target to include multiple virtual QPUs
    cudaq.set_target("nvidia", option="mqpu")

    # Run kernels asynchronously with different parameters
    future1 = cudaq.run_async(simple_kernel, 0.0, shots_count=100, qpu_id=0)
    future2 = cudaq.run_async(simple_kernel, 3.14159, shots_count=100, qpu_id=1)
else:
    # Schedule for execution on the same virtual QPU, defaulting to `qpu_id=0`
    future1 = cudaq.run_async(simple_kernel, 0.0, shots_count=100)
    future2 = cudaq.run_async(simple_kernel, 3.14159, shots_count=100)

# Get results when ready
results1 = future1.get()
results2 = future2.get()

# Analyze the results
print("\nAsynchronous execution results:")
true_count1 = sum(1 for res in results1 if res)
true_count2 = sum(1 for res in results2 if res)
print(f"Kernel with theta=0.0: {true_count1}/100 times measured |1⟩")
print(f"Kernel with theta=π: {true_count2}/100 times measured |1⟩")
  // Define a simple kernel for async execution
  auto simple_kernel = [](float theta) __qpu__ -> bool {
    cudaq::qubit q;
    rx(theta, q);
    return mz(q);
  };

  // Check if we have multiple QPUs available
  // Note: In C++ API, we would check this differently
  // Here we'll use the target setting directly
  bool has_multiple_qpus = false;

  if (has_multiple_qpus) {
    // Set the target to include multiple virtual QPUs
    // In a real application, this would involve proper target configuration

    // Run kernels asynchronously with different parameters
    auto future1 = cudaq::run_async(0, 100, simple_kernel, 0.0);
    auto future2 = cudaq::run_async(1, 100, simple_kernel, 3.14159);

    // Get results when ready
    auto results1 = future1.get();
    auto results2 = future2.get();

    // Analyze the results
    std::cout << "\nAsynchronous execution results:\n";
    int true_count1 = std::count(results1.begin(), results1.end(), true);
    int true_count2 = std::count(results2.begin(), results2.end(), true);

    std::cout << "Kernel with theta=0.0: " << true_count1
              << "/100 times measured |1⟩\n";
    std::cout << "Kernel with theta=π: " << true_count2
              << "/100 times measured |1⟩\n";
  } else {
    // Schedule for execution on the same QPU
    auto future1 = cudaq::run_async(0, 100, simple_kernel, 0.0);
    auto future2 = cudaq::run_async(0, 100, simple_kernel, 3.14159);

    // Get results when ready
    auto results1 = future1.get();
    auto results2 = future2.get();

    // Analyze the results
    std::cout << "\nAsynchronous execution results:\n";
    int true_count1 = std::count(results1.begin(), results1.end(), true);
    int true_count2 = std::count(results2.begin(), results2.end(), true);

    std::cout << "Kernel with theta=0.0: " << true_count1
              << "/100 times measured |1⟩\n";
    std::cout << "Kernel with theta=π: " << true_count2
              << "/100 times measured |1⟩\n";
  }

More information about parallelizing execution can be found at the Simulate Multiple QPUs in Parallel page.

Note

Currently, run and run_async are only supported on simulator targets.

Observe

The observe function allows us to calculate the expectation value of a defined quantum operator, that is, the value of \(\bra{\psi}H\ket{\psi}\), where \(H\) is the desired operator and \(\ket{\psi}\) is the quantum state after executing a given kernel.

The cudaq.observe() method takes a kernel and its arguments as inputs, along with a cudaq.operators.spin.SpinOperator.

Using the cudaq.spin module, operators may be defined as a linear combination of Pauli strings. Functions such as cudaq.spin.i(), cudaq.spin.x(), cudaq.spin.y(), and cudaq.spin.z() may be used to construct more complex spin Hamiltonians on multiple qubits.

The cudaq::observe method takes a kernel and its arguments as inputs, along with a cudaq::spin_op.

Operators may be defined as a linear combination of Pauli strings. Functions such as cudaq::spin_op::i, cudaq::spin_op::x, cudaq::spin_op::y, and cudaq::spin_op::z may be used to construct more complex spin Hamiltonians on multiple qubits.
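
For example, here is a minimal sketch of a multi-qubit operator built from these primitives (the coefficients are arbitrary illustrative values):

from cudaq import spin

# Illustrative two-qubit Hamiltonian: 0.5*Z0 + 1.5*X0*X1 - 2.0*Y0*Y1
hamiltonian = 0.5 * spin.z(0) + 1.5 * spin.x(0) * spin.x(1) - 2.0 * spin.y(0) * spin.y(1)
print(hamiltonian)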

Below is an example of a spin operator object consisting of a Z(0) operator, that is, a Pauli Z operator acting on qubit zero. This is followed by the construction of a kernel that places a single qubit in an equal superposition. The Hamiltonian is printed to confirm it has been constructed properly.

import cudaq
from cudaq import spin

operator = spin.z(0)
print(operator)  # prints: [1+0j] Z


@cudaq.kernel
def kernel():
    qubit = cudaq.qubit()
    h(qubit)
#include <cudaq.h>
#include <cudaq/algorithm.h>

#include <iostream>

__qpu__ void kernel() {
  cudaq::qubit qubit;
  h(qubit);
}

int main() {
  auto spin_operator = cudaq::spin_op::z(0);
  std::cout << spin_operator.to_string() << "\n";

The observe function takes a kernel, any kernel arguments, and a spin operator as inputs and produces an ObserveResult object. The expectation value can be printed using the expectation method.

Note

It is important not to include a measurement in the kernel; otherwise, the expectation value would be computed from a collapsed classical state. For this example, the kernel prepares \(\ket{\psi} = (\ket{0}+\ket{1})/\sqrt{2}\), so the expected result \(\bra{\psi}Z\ket{\psi} = 0.0\) is produced.

result = cudaq.observe(kernel, operator)
print(result.expectation())  # prints: 0.0
  auto result_0 = cudaq::observe(kernel, spin_operator);
  // Expectation value of kernel with respect to single `Z` term
  // should print: 0.0
  std::cout << "<kernel | spin_operator | kernel> = " << result_0.expectation()
            << "\n";

Unlike sample, the default shots_count for observe is 1. This result is deterministic and equivalent to the expectation value in the limit of infinite shots. To instead produce an approximate expectation value from sampling, shots_count can be set to any positive integer.

result = cudaq.observe(kernel, operator, shots_count=1000)
print(result.expectation())  # prints non-zero value
  auto result_1 = cudaq::observe(1000, kernel, spin_operator);
  // Expectation value of kernel with respect to single `Z` term,
  // but instead of a single deterministic execution of the kernel,
  // we sample over 1000 shots. We should now print an expectation
  // value that is close to, but not quite, zero.
  // Example: 0.025
  std::cout << "<kernel | spin_operator | kernel> = " << result_1.expectation()
            << "\n";
}

Similar to sample_async above, observe also supports asynchronous execution. More information about parallelizing execution can be found at the Simulate Multiple QPUs in Parallel page.
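
For instance, a minimal sketch using the asynchronous variant with the kernel and operator defined above (assuming the default virtual QPU with qpu_id=0):

# Launch the expectation-value computation asynchronously.
future = cudaq.observe_async(kernel, operator, qpu_id=0)
# ... other work can run here while the computation proceeds ...
print(future.get().expectation())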

Running on a GPU

Using cudaq.set_target(), different targets can be specified for kernel execution.

Using the --target argument to nvq++, different targets can be specified for kernel execution.

If a local GPU is detected, the target will default to nvidia. Otherwise, the CPU-based simulation target, qpp-cpu, will be selected.
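
To verify which backend will be used, the active target can be queried; a small sketch assuming the cudaq.get_target() accessor:

import cudaq

# Print the name of the simulation target that will be used.
print(cudaq.get_target().name)  # e.g. 'nvidia' if a GPU is available, otherwise 'qpp-cpu'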

We will demonstrate the benefits of using a GPU by sampling our GHZ kernel with 25 qubits and a shots_count of 1 million. Using a GPU accelerates this task by more than 35x. To learn about all of the available targets and ways to accelerate kernel execution, visit the Backends page.

import sys
import timeit

# Will time the execution of our sample call.
code_to_time = 'cudaq.sample(kernel, qubit_count, shots_count=1000000)'
qubit_count = int(sys.argv[1]) if 1 < len(sys.argv) else 25

# Execute on CPU backend.
cudaq.set_target('qpp-cpu')
print('CPU time')  # Example: 27.57462 s.
print(timeit.timeit(stmt=code_to_time, globals=globals(), number=1))

if cudaq.num_available_gpus() > 0:
    # Execute on GPU backend.
    cudaq.set_target('nvidia')
    print('GPU time')  # Example: 0.773286 s.
    print(timeit.timeit(stmt=code_to_time, globals=globals(), number=1))

To compare the performance, we can create a simple timing script that isolates just the call to cudaq::sample. We are still using the same GHZ kernel as earlier, but with the following modification made to the main function:

int main(int argc, char *argv[]) {
  auto qubit_count = 1 < argc ? atoi(argv[1]) : 25;
  auto shots_count = 1000000;
  auto start = std::chrono::high_resolution_clock::now();

  // Timing just the sample execution.
  auto result = cudaq::sample(shots_count, kernel, qubit_count);

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration<double>(stop - start);
  std::cout << "It took " << duration.count() << " seconds.\n";
}

First we execute on the CPU backend:

nvq++ --target=qpp-cpu sample.cpp
./a.out

This produces output on the order of: It took 22.8337 seconds.

Now we can execute on the GPU-enabled backend:

nvq++ --target=nvidia sample.cpp
./a.out

This produces output on the order of: It took 3.18988 seconds.