6. Quantum Kernels

To differentiate between host and quantum device code, the CUDA Quantum programming model defines the concept of a quantum kernel. CUDA Quantum specifically differentiates between kernels invoked from host code and those invoked from within another quantum kernel. The former are denoted entry-point quantum kernels; the latter are pure device quantum kernels. All quantum kernels must be annotated to indicate that they are to be compiled for, and executed on, a specified quantum coprocessor. CUDA Quantum requires the __qpu__ function attribute for quantum kernel declarations.

Quantum kernel function bodies are programmed in a subset of C++. They can be composed of the following:

  • quantum intrinsic operations and measurements

  • in-scope kernel calls with or without any kernel modifiers

  • classical control flow constructs (if, for, while, etc.)

  • primitive variable declarations and arithmetic manipulations (std::is_arithmetic<T> types)

  • coherent conditional execution ( if ( qubit ) { x (another_qubit); } )

  • novel syntax for common quantum programming patterns (e.g. compute-action-uncompute).

  • references to and calls of previously defined quantum kernels (kernels cannot operate on global data)

An entry-point quantum kernel must be defined as a typed callable (i.e. a lambda, or a struct with R operator()(Args...) implemented) that can be annotated with an appropriate function attribute. This requirement on typed callables directly enables the implementation of generic libraries of quantum algorithms parameterized on user-specified quantum kernels. CUDA Quantum requires these typed quantum callables to be annotated with the __qpu__ attribute preceding the left brace that opens the function body:

auto my_first_kernel = [](double x) __qpu__ { ... quantum code ... };

struct my_second_kernel {
  void operator()(double x) __qpu__ {
    ... quantum code ...
  }
};

Entry-point quantum kernels expressed as structs or classes with an operator()(...) overload may use primitive class members within the kernel body, specifically members of any type for which std::is_arithmetic evaluates to true.

All quantum kernels can specify a return type from the set {void, T : std::is_arithmetic<T> == true, std::vector<bool>}. All quantum kernels can take as input any type in the set {T : std::is_arithmetic<T> == true, std::vector<T>, std::span<T>}. All kernels can also take cudaq::spin_op instances as input. Entry-point quantum kernels cannot take quantum input arguments because quantum memory cannot be allocated from within host code.

Pure device quantum kernels can be expressed as typed callables, but can also be represented as annotated free functions. Pure device quantum kernels can take cudaq::qudit<N> specializations and containers (e.g. cudaq::qview, cudaq::qvector) as input.

auto my_first_device_kernel = [](cudaq::qvector<>& q) __qpu__ {
   ... quantum code using q ...
};

struct my_second_device_kernel {
  void operator()(cudaq::qubit& q, double x) __qpu__ {
    ... quantum code ...
  }
};

__qpu__ void my_third_device_kernel(cudaq::qubit& qb) {
    ... quantum code using qb ...
}

Classical arithmetic data can be instantiated and manipulated within any quantum kernel and is modeled implicitly using the quantum device classical control memory space. Returning classical data requires an implicit data transfer from device to host, and this should be configured by the compiler implementation.

CUDA Quantum kernels expressed as lambda expressions can capture simple arithmetic variables by value. Specifically, any valid input type for a CUDA Quantum kernel function argument can also be provided as a variable captured by value.
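Capture by value means the kernel sees the value a variable had when the lambda was created, not later host-side updates. The following is a host-only sketch of that behaviour in standard C++; the __qpu__ annotation and all quantum operations are left out so that it compiles with an ordinary compiler, and the function name is hypothetical.

```cpp
namespace sketch {

// Hypothetical host-only analogue of a kernel lambda that captures an
// arithmetic variable by value. In CUDA Quantum the lambda would carry
// the __qpu__ attribute and contain quantum code.
constexpr double angle_seen_by_kernel() {
  double theta = 0.5;

  // theta is copied into the closure at this point.
  auto kernel = [theta]() { return 2.0 * theta; };

  // A later host-side update does not affect the captured copy.
  theta = 3.0;

  return kernel(); // uses the captured 0.5, so the result is 1.0
}

} // namespace sketch

static_assert(sketch::angle_seen_by_kernel() == 1.0);
```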

All quantum kernel invocations are synchronous calls by default.

6.1. Kernel Composability

CUDA Quantum kernels can also serve as input to other quantum kernels. This is a typical pattern in quantum computing, where an algorithm depends on some input sub-circuit, e.g. for state preparation or oracle invocation. To support this pattern, CUDA Quantum kernels can be passed as arguments for indirect invocation.

CUDA Quantum builds upon C++ to enable this capability. To support CUDA Quantum kernel parameterization on callable quantum kernel code, programmers can leverage standard C++ template definitions:

struct MyStatePrep {
  void operator()(cudaq::qview<> qubits) __qpu__ {
    ... apply state prep operations on qubits ...
  }
};

struct MyGenericAlgorithm {
  template<typename StatePrep>
  void operator()(StatePrep&& statePrep) __qpu__ {
    cudaq::qarray<10> q;
    statePrep(q);
  }
};

// -or- with placeholder type specifiers
struct MyGenericAlgorithm2 {
  void operator()(auto&& statePrep) __qpu__ {
    cudaq::qarray<10> q;
    statePrep(q);
  }
};

// In host code:
MyGenericAlgorithm algorithm;
algorithm(MyStatePrep{});

MyGenericAlgorithm2 anotherVersion;
anotherVersion(MyStatePrep{});
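The two forms above, an explicit template parameter and a placeholder auto parameter, are interchangeable ways of accepting a callable. The following host-only sketch illustrates this with a classical stand-in for the qubit register (a hypothetical sketch::Register alias, not a CUDA Quantum type). The __qpu__ annotation is omitted and the algorithms return the register so that the effect of the passed-in callable can be checked at compile time.

```cpp
#include <array>

namespace sketch {

// Hypothetical classical stand-in for a 10-qubit register: each entry
// counts how many operations were applied at that position.
using Register = std::array<int, 10>;

// Plays the role of a state preparation kernel.
struct MarkEveryQubit {
  constexpr void operator()(Register &reg) const {
    for (int &slot : reg)
      slot += 1;
  }
};

// Generic algorithm parameterized with an explicit template parameter.
struct GenericAlgorithm {
  template <typename StatePrep>
  constexpr Register operator()(StatePrep &&statePrep) const {
    Register reg{};
    statePrep(reg); // indirect invocation of the passed-in callable
    return reg;
  }
};

// The same algorithm written with a placeholder type specifier.
struct GenericAlgorithm2 {
  constexpr Register operator()(auto &&statePrep) const {
    Register reg{};
    statePrep(reg);
    return reg;
  }
};

// Sums all entries of the stand-in register.
constexpr int total(const Register &reg) {
  int sum = 0;
  for (int value : reg)
    sum += value;
  return sum;
}

} // namespace sketch

// Both forms invoke the callable once on all ten positions.
static_assert(sketch::total(sketch::GenericAlgorithm{}(sketch::MarkEveryQubit{})) == 10);
static_assert(sketch::total(sketch::GenericAlgorithm2{}(sketch::MarkEveryQubit{})) == 10);
```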

CUDA Quantum kernel inputs can also be constrained using C++20 concepts:

namespace cudaq {

  // Generic constraint on Kernel Function Signatures
  template <typename Kernel, typename Signature>
  concept signature = std::is_convertible_v<Kernel, std::function<Signature>>;

  // Specialized for taking a single qubit
  template<typename Kernel>
  concept takes_qubit = signature<Kernel, void(qubit&)>;

} // namespace cudaq

struct MyGenericAlgorithmOnQarray {
  void operator()(cudaq::signature<void(cudaq::qarray<10>&)> auto&& statePrep) __qpu__ {
    cudaq::qarray<10> q;
    statePrep(q);
  }
};

struct MyGenericAlgorithmOnQubit {
  void operator()(cudaq::takes_qubit auto&& statePrep) __qpu__ {
    cudaq::qarray<10> q;
    statePrep(q[0]);
  }
};

This approach enables the development of generic libraries of quantum algorithms that are parameterized on sub-units of the global circuit representation.

6.2. Allowed Kernel Classical Function Invocations

To be filled in…