12. Quantum Algorithmic Primitives

[1] The general philosophy of the CUDA-Q specification is that quantum device code should be encapsulated as stand-alone callable instances of generic signature, and that operations or primitive algorithms targeting a quantum coprocessor be implemented as adaptors on those callable instances. Adaptors, by definition, are generic functions that take any quantum kernel as input along with the runtime arguments for that kernel. Runtime arguments passed to adaptor functions flow through the adaptor to the provided kernel. This pattern allows general pre- and post-processing around concrete kernel execution.

12.1. cudaq::sample

[1] A common task for near-term quantum execution is to sample the state of a given quantum circuit for a specified number of shots (circuit executions). The result of this task is typically a mapping of observed measurement bit strings to the number of times each was observed. This is typically termed the counts dictionary in the community.

[2] The CUDA-Q model enables this functionality via template functions within the cudaq namespace with the following structure:

template <typename ReturnType>
concept HasVoidReturnType = std::is_void_v<ReturnType>;

// Kernel only
template<typename QuantumKernel, typename... Args>
  requires HasVoidReturnType<QuantumKernel, Args...>
sample_result sample(QuantumKernel&& kernel, Args&&... args);

// Specify shots
template<typename QuantumKernel, typename... Args>
  requires HasVoidReturnType<QuantumKernel, Args...>
sample_result sample(std::size_t shots, QuantumKernel&& kernel, Args&&... args);

// Specify sample options (including shots and noise model)
template<typename QuantumKernel, typename... Args>
  requires HasVoidReturnType<QuantumKernel, Args...>
sample_result sample(const sample_options &options,
                     QuantumKernel&& kernel, Args&&... args);

[3] This function takes as input a quantum kernel instance followed by the concrete arguments at which the kernel should be invoked. CUDA-Q kernels passed to this function must be entry-point kernels and return void.

[4] Overloaded functions exist for specifying the number of shots to sample and the noise model to apply.

[5] The function returns an instance of the cudaq::sample_result type which encapsulates the counts dictionary produced by the sampling task. Programmers can extract the result information in the following manner:

auto bell = []() __qpu__ { ... };

// Sample the state generated by bel
auto counts = cudaq::sample(bell)

// Print to standard out
counts.dump();

// Fine-grained access to the bits and counts
for (auto& [bits, count] : counts) {
  printf("Observed: %s, %lu\n", bits, count);
}
@cudaq.kernel()
def bell():
   ...

# Sample the state generated by bell
counts = cudaq.sample(bell)

# Print to standard out
counts.dump()

# Fine-grained access to the bits and counts
for bits, count in counts:
    print('Observed: {}, {}'.format(bits, count))

[6] CUDA-Q specifies the following structure for cudaq::sample_result:

namespace cudaq {
  using CountsDictionary = std::unordered_map<std::string, std::size_t>;
  inline static const std::string GlobalRegisterName = "__global__";
  class sample_result {
    public:
      sample_result() = default;
      sample_result(const sample_result &);
      ~sample_result();

      std::vector<std::string> register_names();

      std::size_t count(std::string_view bitString,
                  const std::string_view registerName = GlobalRegisterName);

      std::vector<std::string>
      sequential_data(const std::string_view registerName = GlobalRegisterName);

      CountsDictionary
      to_map(const std::string_view registerName = GlobalRegisterName);

      sample_result
      get_marginal(const std::vector<std::size_t> &&marginalIndices,
             const std::string_view registerName = GlobalRegisterName);

      double expectation(const std::string_view registerName == GlobalRegisterName);
      double probability(std::string_view bitString, const std::string_view registerName == GlobalRegisterName);
      std::size_t size(const std::string_view registerName == GlobalRegisterName);

      void dump();
      void clear();

      CountsDictionary::iterator begin();
      CountsDictionary::iterator end();
  };
}

[7] The sample_result type enables one to encode measurement results from a quantum circuit sampling task. It keeps track of a list of sample results, each one corresponding to a measurement action during the sampling process and represented by a unique register name. It also tracks a unique global register, the implicit sampling of the state at the end of circuit execution. The API gives fine-grain access to the measurement results for each register. To illustrate this, observe

auto kernel = []() __qpu__ {
  cudaq::qubit q;
  h(q);
  auto reg1 = mz(q);
  reset (q);
  x(q);
};
cudaq::sample(kernel).dump();
@cudaq.kernel()
def kernel():
   q = cudaq.qubit()
   h(q)
   reg1 = mz(q)
   reset(q)
   x(q)

cudaq.sample(kernel).dump()

should produce

{
  __global__ : { 1:1000 }
  reg1 : { 0:501 1:499 }
}

Here we see that we have measured a qubit in a uniform superposition to a register named reg1, and followed it with a reset and the application of an NOT operation. The sample_result returned for this sampling tasks contains the default __global__ register as well as the user specified reg1 register.

The contents of the __global__ register will depend on how your kernel is written:

  1. If no measurements appear in the kernel, then the __global__ register is formed with implicit measurements being added for all the qubits defined in the kernel, and the measurements all occur at the end of the kernel. The order of the bits in the bitstring corresponds to the qubit allocation order specified in the kernel. That is - the [0] element in the __global__ bitstring corresponds with the first declared qubit in the kernel. For example,

auto kernel = []() __qpu__ {
  cudaq::qubit a, b;
  x(a);
};
cudaq::sample(kernel).dump();
@cudaq.kernel
def kernel():
    a, b = cudaq.qubit(), cudaq.qubit()
    x(a)

cudaq.sample(kernel).dump()

should produce

{
  __global__ : { 10:1000 }
}
  1. Conversely, if any measurements appear in the kernel, then only the measured qubits will appear in the __global__ register. Similar to #1, the bitstring corresponds to the qubit allocation order specified in the kernel. Also (again, similar to #1), the values of the sampled qubits always correspond to the values at the end of the kernel execution. That is - if a qubit is measured in the middle of a kernel and subsequent operations change the state of the qubit, the qubit will be implicitly re-measured at the end of the kernel, and that re-measured value is the value that will appear in the __global__ register. For example,

auto kernel = []() __qpu__ {
  cudaq::qubit a, b;
  x(a);
  mz(b);
  mz(a);
};
cudaq::sample(kernel).dump();
@cudaq.kernel
def kernel():
    a, b = cudaq.qubit(), cudaq.qubit()
    x(a)
    mz(b)
    mz(a)

cudaq.sample(kernel).dump()

should produce

{
  __global__ : { 10:1000 }
}

Note

If you don’t specify any measurements in your kernel and allow the nvq++ compiler to perform passes that introduce ancilla qubits into your kernel, it may be difficult to discern which qubits are the ancilla qubits vs which ones are your qubits. In this case, it is recommended that you provide explicit measurements in your kernel in order to only receive measurements from your qubits and silently discard the measurements from the ancillary qubits.

[8] The API exposed by the sample_result data type allows one to extract the information contained at a variety of levels and for each available register name. One can get the number of times a bit string was observed via sample_result::count, extract a std::unordered_map representation via sample_result::to_map, get a new sample_result instance over a subset of measured qubits via sample_result::get_marginal, and extract the measurement data as it was produced sequentially (a vector of bit string observations for each shot in the sampling process). One can also compute probabilities and expectation values.

[9] There are specific requirements on input quantum kernels for the use of the sample function which must be enforced by compiler implementations. The kernel must be an entry-point kernel that returns void.

[10] CUDA-Q also provides an asynchronous version of this function (cudaq::sample_async) which returns a sample_async_result.

template<typename QuantumKernel, typename... Args>
async_sample_result sample_async(const std::size_t qpu_id, QuantumKernel&& kernel, Args&&... args);

Programmers can asynchronously launch sampling tasks on any qpu_id.

[11] The async_sample_result wraps a std::future<sample_result> and exposes the same get() functionality to extract the results after asynchronous execution.

[12] For remote QPU systems with long queue times, the async_sample_result type encodes job ID information and can be persisted to file and loaded from file at a later time. After loading from file, and when remote queue jobs are completed, one can invoke get() and the results will be retrieved and returned.

12.2. cudaq::observe

[1] A common task in variational algorithms is the computation of the expected value of a given observable with respect to a parameterized quantum circuit (\(\langle H \rangle(𝚹) = \langle \psi(𝚹)|H|\psi(𝚹) \rangle\)).

[2] The cudaq::observe function is provided to enable one to quickly compute this expectation value via execution of the parameterized quantum circuit with repeated measurements in the bases of the provided spin_op terms. The function has the following signature:

// Kernel only
template<typename QuantumKernel, typename... Args>
observe_result observe(QuantumKernel&&, cudaq::spin_op&, Args&&... args);

// Specify shots
template<typename QuantumKernel, typename... Args>
observe_result observe(std::size_t shots, QuantumKernel&&, cudaq::spin_op&, Args&&... args);

// Specify sample options (including shots and noise model)
template<typename QuantumKernel, typename... Args>
observe_result observe(const cudaq::observe_options &options,
                      QuantumKernel&&, cudaq::spin_op&, Args&&... args);

[3] cudaq::observe takes as input an instantiated quantum kernel, the cudaq::spin_op whose expectation is requested, and the concrete arguments used as input to the parameterized quantum kernel.

[4] cudaq::observe returns an instance of the observe_result type which can be implicitly converted to a double expectation value, but also retains all data directly generated and used as part of that expectation value computation. The observe_result takes on the following form:

class observe_result {
  public:
    observe_result(double &e, spin_op &H);
    observe_result(double &e, spin_op &H, MeasureCounts counts);

    sample_results raw_data() { return data; };
    operator double();
    double expectation();

    template <typename SpinOpType>
    double expectation(SpinOpType term);

    template <typename SpinOpType>
    sample_result counts(SpinOpType term);
    double id_coefficient()
    void dump();
};

[5] The public API for observe_result enables one to extract the sample_result data for each term in the provided spin_op. This return type can be used in the following way.

// I only care about the expected value, discard
// the fine-grain data produced
double expVal = cudaq::observe(kernel, spinOp, args...);

// I require the result with all generated data
auto result = cudaq::observe(kernel, spinOp, args...);
auto expVal = result.expectation();
auto X0X1Exp = result.expectation(x(0)*x(1));
auto X0X1Data = result.counts(x(0)*x(1));
result.dump();
# I require the result with all generated data
result = cudaq::observe(kernel, spinOp, *args)
expVal = result.expectation()
X0X1Exp = result.expectation(x(0)*x(1))
X0X1Data = result.counts(x(0)*x(1))
result.dump()

Here is an example of the utility of the cudaq::observe function:

struct ansatz {
  auto operator()(double theta) __qpu__ {
    cudaq::qarray<2> q;
    x(q[0]);
    ry(theta, q[1]);
    x<cudaq::ctrl>(q[1], q[0]);
  }
};

int main() {
  using namespace cudaq::spin; // make it easier to use pauli X,Y,Z below

  spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
              .21829 * z(0) - 6.125 * z(1);

  double energy = cudaq::observe(ansatz{}, h, .59);
  printf("Energy is %lf\n", energy);
  return 0;
}
@cudaq.kernel()
def ansatz(theta : float):
   q = cudaq.qvector(2)
   x(q[0])
   ry(theta, q[1])
   x.ctrl(q[1], q[0])

h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
                .21829 * z(0) - 6.125 * z(1)
energy = cudaq.observe(ansatz, h, .59).expectation()
print('Energy is {}'.format(energy))

[5] There are specific requirements on input quantum kernels for the use of the observe function which must be enforced by compiler implementations. The kernel must be an entry-point kernel that does not contain any conditional or measurement statements.

[6] By default on simulation backends, cudaq::observe computes the true analytic expectation value (i.e. without stochastic noise due to shots-based sampling). If a specific shot count is provided then the returned expectation value will contain some level of statistical noise. Overloaded observe functions are provided to specify the number of shots and/or specify the noise model to apply.

[7] CUDA-Q also provides an asynchronous version of this function (cudaq::observe_async) which returns a async_observe_result.

template<typename QuantumKernel, typename... Args>
async_observe_result observe_async(const std::size_t qpu_id, QuantumKernel&& kernel, cudaq::spin_op&, Args&&... args);

Programmers can asynchronously launch sampling tasks on any qpu_id.

[8] For remote QPU systems with long queue times, the async_observe_result type encodes job ID information for each execution and can be persisted to file and loaded from file at a later time. After loading from file, and when remote queue jobs are completed, one can invoke get() and the results will be retrieved and returned.

12.3. cudaq::optimizer (deprecated, functionality moved to CUDA-Q libraries)

The primary use case for cudaq::observe is to leverage it as the core of a broader objective function optimization workflow. cudaq::observe produces the expected value of a specified spin_op with respect to a given parameterized ansatz at a concrete set of parameters, and often programmers will require an extremal value of that expected value at a specific set of concrete parameters. This will directly require abstractions for gradient-based and gradient-free optimization strategies.

The CUDA-Q model provides a cudaq::optimizer data type that exposes an optimize() method that takes as input an optimizable_function to optimize and the number of independent function dimensions. Implementations are free to implement this abstraction in any way that is pertinent, but it is expected that most approaches will enable optimization strategy extensibility. For example, programmers should be able to instantiate a specific cudaq::optimizer sub-type, thereby dictating the underlying optimization algorithm in a type-safe manner. Moreover, the optimizer should expose a public API of pertinent optimizer-specific options that the programmer can customize.

CUDA-Q models the cudaq::optimizer as follows:

namespace cudaq {
  // Encode the optimal value and optimal parameters
  using optimization_result = std::tuple<double, std::vector<double>>;
  // Initialized with user specified callable of a specific signature
  // Clients can query if the function computes gradients or not
  class optimizable_function {
    public:
      template<typename Callable>
      optimizable_function(Callable&&);
      bool providesGradients() { return _providesGradients; }
      double operator()(const std::vector<double> &x, std::vector<double> &dx);
  };
  class optimizer {
    public:
      virtual bool requiresGradients() = 0;
      virtual optimization_result optimize(const int dimensions,
                                          optimizable_function&& opt_function) = 0;
  };
}

Here, optimization_result should encode the optimal value and optimal parameters achieved during the optimization workflow (i.e. a tuple<double, std::vector<double>>). The optimize method takes as input the number of parameters (or dimensions of the objective function), and a function-like object (i.e. std::function or a lambda, something optimizable_function can be constructed from) that takes a const std::vector<double>& and std::vector<double>& for the function input parameters and gradient vector, respectively. The objective function must return a double representing the scalar cost for the objective function (e.g. the expected value from cudaq::observe()).

Here is an example of how the cudaq::optimizer is intended to be used:

auto ansatz = [](double theta, double phi) __qpu__ {...};
cudaq::spin_op H = ... ;

cudaq::optimizers::cobyla optimizer;
optimizer.max_eval = 200;

auto [opt_energy, opt_params] = optimizer.optimize(
      2, [&](const std::vector<double> &x, std::vector<double> &grad_vec) {
        return cudaq::observe(ansatz, H, x[0], x[1]);
      });

12.4. cudaq::gradient (deprecated, functionality moved to CUDA-Q libraries)

Typical optimization use cases will require the computation of gradients for the specified objective function. The gradient is a vector over all ansatz circuit parameters \(∂H(𝚹) / ∂𝚹_i\). There are a number of potential strategies for computing this gradient vector, but most require additional evaluations of the ansatz circuit on the quantum processor.

To enable true extensibility in gradient strategies, CUDA-Q programmers can instantiate custom sub-types of the cudaq::gradient type. The cudaq::gradient type defines a compute(...) method that takes a mutable reference to the current gradient vector and is free to update that vector in a strategy-specific way. The method also takes the current evaluation parameter vector, the cudaq::spin_op used in the current variational task, and the computed expected value at the given parameters. The gradient strategy type takes the following form:

namespace cudaq {
  class gradient {
    public:
      gradient(std::function<void(std::vector<double>)> &&kernel);

      template <typename QuantumKernel, typename ArgsMapper>
      gradient(QuantumKernel &&kernel, ArgsMapper &&argsMapper);

      virtual void compute(std::vector<double>& x, std::vector<double> &dx,
                        spin_op& h, double exp_h) = 0;

      virtual std::vector<double>
      compute(const std::vector<double> &x,
              std::function<double(std::vector<double>)> &func) = 0;

  };

  // gradient is intended for subclassing
  class central_difference : public gradient {
    public:
      void compute(std::vector<double>& x, std::vector<double> &dx, spin_op& h,
              double exp_h) override { ... }
  };
}

The compute function can make use of the quantum kernel parameterized ansatz, the spin_op for which the expected value is being computed, the pre-computed expected value at the current iteration’s parameter, and the concrete arguments for the given quantum kernel at this iteration.

A non-trivial aspect of the computation of gradients (in an extensible manner) is that we model the gradient as a derivative over concrete parameters for the circuit ansatz represented as a std::vector<double> when the actual quantum kernel may be defined with general variadic Args... types. To address this issue, programmers can provide a default translation mechanism for mapping common quantum kernel ansatz functional expressions to a vector<double> representation - the ArgsMapper callable template type. This type must implement the std::tuple<Args...>(std::vector<double>&) callable concept.

The overall CUDA-Q workflow for leveraging the cudaq::optimizer will work as follows (here we demonstrate with an ansatz without the default std::vector<double> signature):

auto deuteron_n3_ansatz = [](double x0, double x1) __qpu__ {
  cudaq::qarray<3> q;
  x(q[0]);
  ry(x0, q[1]);
  ry(x1, q[2]);
  x<cudaq::ctrl>(q[2], q[0]);
  x<vctrl>(q[0], q[1]);
  ry(-x0, q[1]);
  x<cudaq::ctrl>(q[0], q[1]);
  x<cudaq::ctrl>(q[1], q[0]);
};

cudaq::spin_op h = 5.907 - 2.1433 * x(0) * x(1) - 2.1433 * y(0) * y(1) +
          .21829 * z(0) - 6.125 * z(1);
cudaq::spin_op h3 = h + 9.625 - 9.625 * z(2) - 3.913119 * x(1) * x(2) -
            3.913119 * y(1) * y(2);

// The above ansatz takes 2 doubles, not a single std::vector<double>, which
// the gradient type is expecting. So we must provide an ArgsMapper callable type
auto argsMapper = [](std::vector<double> x) {return std::make_tuple(x[0],x[1]);};

// Create the gradient strategy
cudaq::gradients::central_difference gradient(deuteron_n3_ansatz, argsMapper);

// Create the L-BFGS optimizer, requires gradients
cudaq::optimizers::lbfgs optimizer;

// Run the optimization routine.
auto [min_val, opt_params] = optimizer.optimize(
    2, [&](const std::vector<double>& x, std::vector<double>& grad_vec) {
      // Compute the cost, here its an energy
      auto cost = cudaq::observe(deuteron_n3_ansatz, h3, x);

      // Compute the gradient, results written to the grad_vec reference
      gradient.compute(x, grad_vec, h3, cost);

      // Return the cost to the optimizer
      return cost;
    });

// Print the results
printf("Optimizer found %lf at [%lf,%lf]\n", min_val, opt_params[0], opt_params[1]);