cub::DeviceScan

Defined in cub/device/device_scan.cuh

struct DeviceScan

DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.

Overview

Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction. When the input and output sequences are the same, the scan is performed in-place.

As of CUB 1.0.1 (2013), CUB’s device-wide scan APIs have implemented our “decoupled look-back” algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report 1. The central idea is to leverage a small, constant factor of redundant work in order to overlap the latencies of global prefix propagation with local computation. As such, our algorithm requires only ~2*n* data movement (n inputs are read, n outputs are written), and typically proceeds at “memcpy” speeds. Our algorithm supports inplace operations.

1

Duane Merrill and Michael Garland. Single-pass Parallel Prefix Scan with Decoupled Look-back, NVIDIA Technical Report NVR-2016-002, 2016.

Usage Considerations

  • Dynamic parallelism. DeviceScan methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Performance

The work-complexity of prefix scan as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.

Exclusive scans

template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix sum. The value of 0 is applied as the initial value, and is assigned to *d_out.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix sum of an int device vector.

#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_items;      // e.g., 7
int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix sum
cub::DeviceScan::ExclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items);

// d_out <-- [0, 8, 14, 21, 26, 29, 29]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename NumItemsT>
static inline cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix sum in-place. The value of 0 is applied as the initial value, and is assigned to *d_data.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix sum of an int device vector.

#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_items;      // e.g., 7
int  *d_data;        // e.g., [8, 6, 7, 5, 3, 0, 9]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_data, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix sum
cub::DeviceScan::ExclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_data, num_items);

// d_data <-- [0, 8, 14, 21, 26, 29, 29]

Template Parameters
  • IteratorT[inferred] Random-access iterator type for reading scan inputs and wrigin scan outputs

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Random-access iterator to the sequence of data items

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is applied as the initial value, and is assigned to *d_out.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix min-scan of an int device vector

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
CustomMin    min_op;
...

// Determine temporary device storage requirements for exclusive
// prefix scan
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, (int) INT_MAX, num_items);

// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, (int) INT_MAX, num_items);

// d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value used Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the exclusive scan (and is assigned to *d_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is applied as the initial value, and is assigned to *d_data.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix min-scan of an int device vector:

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_data;        // e.g., [8, 6, 7, 5, 3, 0, 9]
CustomMin    min_op;
...

// Determine temporary device storage requirements for exclusive
// prefix scan
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_data, min_op, (int) INT_MAX, num_items);

// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_data, min_op, (int) INT_MAX, num_items);

// d_data <-- [2147483647, 8, 6, 6, 5, 3, 0]

Template Parameters
  • IteratorT[inferred] Random-access input iterator type for reading scan inputs and writing scan outputs

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value used Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Random-access iterator to the sequence of data items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the exclusive scan (and is assigned to *d_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename InitValueIterT = InitValueT*, typename NumItemsT = int>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, FutureValue<InitValueT, InitValueIterT> init_value, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is provided as a future value.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix min-scan of an int device vector

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
int          *d_init_iter;   // e.g., INT_MAX
CustomMin    min_op;

auto future_init_value =
  cub::FutureValue<InitialValueT, IterT>(d_init_iter);

...

// Determine temporary device storage requirements for exclusive
// prefix scan
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, future_init_value, num_items);

// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, future_init_value, num_items);

// d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value used Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Pointer to the input sequence of data items

  • d_out[out] Pointer to the output sequence of data items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the exclusive scan (and is assigned to *d_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename ScanOpT, typename InitValueT, typename InitValueIterT = InitValueT*, typename NumItemsT = int>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, FutureValue<InitValueT, InitValueIterT> init_value, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide exclusive prefix scan using the specified binary scan_op functor. The init_value value is provided as a future value.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix min-scan of an int device vector

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_data;        // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_init_iter;   // e.g., INT_MAX
CustomMin    min_op;

auto future_init_value =
  cub::FutureValue<InitialValueT, IterT>(d_init_iter);

...

// Determine temporary device storage requirements for exclusive
// prefix scan
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_data, min_op, future_init_value, num_items);

// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_data, min_op, future_init_value, num_items);

// d_data <-- [2147483647, 8, 6, 6, 5, 3, 0]

Template Parameters
  • IteratorT[inferred] Random-access input iterator type for reading scan inputs and writing scan outputs

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value used Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Pointer to the sequence of data items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the exclusive scan (and is assigned to *d_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

Inclusive scans

template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide inclusive prefix sum.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix sum of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_items;      // e.g., 7
int  *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements for inclusive
// prefix sum
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items);

// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix sum
cub::DeviceScan::InclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, num_items);

// d_out <-- [8, 14, 21, 26, 29, 29, 38]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename IteratorT, typename NumItemsT>
static inline cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide inclusive prefix sum in-place.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix sum of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int  num_items;      // e.g., 7
int  *d_data;        // e.g., [8, 6, 7, 5, 3, 0, 9]
...

// Determine temporary device storage requirements for inclusive
// prefix sum
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_data, num_items);

// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix sum
cub::DeviceScan::InclusiveSum(
  d_temp_storage, temp_storage_bytes,
  d_data, num_items);

// d_data <-- [8, 14, 21, 26, 29, 29, 38]

Template Parameters
  • IteratorT[inferred] Random-access input iterator type for reading scan inputs and writing scan outputs

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[inout] Random-access iterator to the sequence of data items

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename NumItemsT>
static inline cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix min-scan of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_in;          // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_out;         // e.g., [ ,  ,  ,  ,  ,  ,  ]
CustomMin    min_op;
...

// Determine temporary device storage requirements for inclusive
// prefix scan
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, num_items);

// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix min-scan
cub::DeviceScan::InclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, num_items);

// d_out <-- [8, 6, 6, 5, 3, 0, 0]

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • scan_op[in] Binary scan functor

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t InclusiveScanInit(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide inclusive prefix scan using the specified binary scan_op functor. The result of applying the scan_op binary operator to init_value value and *d_in is assigned to *d_out.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_in and d_out are equal, the scan is performed in-place. The range [d_in, d_in + num_items) and [d_out, d_out + num_items) shall not overlap in any other way.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive max-scan of an int device vector.

thrust::device_vector<int> input{0, -1, 2, -3, 4, -5, 6};
thrust::device_vector<int> out(input.size());

int init = 1;
size_t temp_storage_bytes{};

cub::DeviceScan::InclusiveScanInit(
  nullptr, temp_storage_bytes, input.begin(), out.begin(), cub::Max{}, init, static_cast<int>(input.size()));

// Allocate temporary storage for inclusive scan
thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes);

// Run inclusive prefix sum
cub::DeviceScan::InclusiveScanInit(
  thrust::raw_pointer_cast(temp_storage.data()),
  temp_storage_bytes,
  input.begin(),
  out.begin(),
  cub::Max{},
  init,
  static_cast<int>(input.size()));

thrust::host_vector<int> expected{1, 1, 2, 2, 4, 4, 6};

Template Parameters
  • InputIteratorT[inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)

  • OutputIteratorT[inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)

  • ScanOpT[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to the size in bytes of the d_temp_storage allocation

  • d_in[in] Random-access iterator to the input sequence of data items

  • d_out[out] Random-access iterator to the output sequence of data items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the inclusive scan (scan_op(init_value, d_in[0]) is assigned to *d_out)

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in] CUDA stream to launch kernels within.

template<typename IteratorT, typename ScanOpT, typename NumItemsT>
static inline cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, NumItemsT num_items, cudaStream_t stream = 0)

Computes a device-wide inclusive prefix scan using the specified binary scan_op functor.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix min-scan of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_data;        // e.g., [8, 6, 7, 5, 3, 0, 9]
CustomMin    min_op;
...

// Determine temporary device storage requirements for inclusive
// prefix scan
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_data, min_op, num_items);

// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix min-scan
cub::DeviceScan::InclusiveScan(
  d_temp_storage, temp_storage_bytes,
  d_in, d_out, min_op, num_items);

// d_data <-- [8, 6, 6, 5, 3, 0, 0]

Template Parameters
  • IteratorT[inferred] Random-access input iterator type for reading scan inputs and writing scan outputs

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • NumItemsT[inferred] An integral type representing the number of input elements

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_data[in] Random-access iterator to the sequence of data items

  • scan_op[in] Binary scan functor

  • num_items[in] Total number of input items (i.e., the length of d_in)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename EqualityOpT = Equality>
static inline cudaError_t ExclusiveSumByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, int num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0)

Computes a device-wide exclusive prefix sum-by-key with key equality defined by equality_op. The value of 0 is applied as the initial value, and is assigned to the beginning of each segment in d_values_out.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • d_keys_in may equal d_values_out but the range [d_keys_in, d_keys_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • d_values_in may equal d_values_out but the range [d_values_in, d_values_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix sum-by-key of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int num_items;      // e.g., 7
int *d_keys_in;     // e.g., [0, 0, 1, 1, 1, 2, 2]
int *d_values_in;   // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_values_out;  // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveSumByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, num_items);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix sum
cub::DeviceScan::ExclusiveSumByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, num_items);

// d_values_out <-- [0, 8, 0, 7, 12, 0, 0]

Template Parameters
  • KeysInputIteratorT[inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)

  • ValuesInputIteratorT[inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)

  • ValuesOutputIteratorT[inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)

  • EqualityOpT[inferred] Functor type having member T operator()(const T &a, const T &b) for binary operations that defines the equality of keys

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access input iterator to the input sequence of key items

  • d_values_in[in] Random-access input iterator to the input sequence of value items

  • d_values_out[out] Random-access output iterator to the output sequence of value items

  • num_items[in] Total number of input items (i.e., the length of d_keys_in and d_values_in)

  • equality_op[in] Binary functor that defines the equality of keys. Default is cub::Equality().

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename ScanOpT, typename InitValueT, typename EqualityOpT = Equality>
static inline cudaError_t ExclusiveScanByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, ScanOpT scan_op, InitValueT init_value, int num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0)

Computes a device-wide exclusive prefix scan-by-key using the specified binary scan_op functor. The key equality is defined by equality_op. The init_value value is applied as the initial value, and is assigned to the beginning of each segment in d_values_out.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • d_keys_in may equal d_values_out but the range [d_keys_in, d_keys_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • d_values_in may equal d_values_out but the range [d_values_in, d_values_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the exclusive prefix min-scan-by-key of an int device vector

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// CustomEqual functor
struct CustomEqual
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return a == b;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_keys_in;     // e.g., [0, 0, 1, 1, 1, 2, 2]
int          *d_values_in;   // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_values_out;  // e.g., [ ,  ,  ,  ,  ,  ,  ]
CustomMin    min_op;
CustomEqual  equality_op;
...

// Determine temporary device storage requirements for exclusive
// prefix scan
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::ExclusiveScanByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, min_op,
  (int) INT_MAX, num_items, equality_op);

// Allocate temporary storage for exclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run exclusive prefix min-scan
cub::DeviceScan::ExclusiveScanByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, min_op,
  (int) INT_MAX, num_items, equality_op);

// d_values_out <-- [2147483647, 8, 2147483647, 7, 5, 2147483647, 0]

Template Parameters
  • KeysInputIteratorT[inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)

  • ValuesInputIteratorT[inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)

  • ValuesOutputIteratorT[inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • InitValueT[inferred] Type of the init_value value used in Binary scan functor type having member T operator()(const T &a, const T &b)

  • EqualityOpT[inferred] Functor type having member T operator()(const T &a, const T &b) for binary operations that defines the equality of keys

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access input iterator to the input sequence of key items

  • d_values_in[in] Random-access input iterator to the input sequence of value items

  • d_values_out[out] Random-access output iterator to the output sequence of value items

  • scan_op[in] Binary scan functor

  • init_value[in] Initial value to seed the exclusive scan (and is assigned to the beginning of each segment in d_values_out)

  • num_items[in] Total number of input items (i.e., the length of d_keys_in and d_values_in)

  • equality_op[in] Binary functor that defines the equality of keys. Default is cub::Equality().

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename EqualityOpT = Equality>
static inline cudaError_t InclusiveSumByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, int num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0)

Computes a device-wide inclusive prefix sum-by-key with key equality defined by equality_op.

  • Supports non-commutative sum operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • d_keys_in may equal d_values_out but the range [d_keys_in, d_keys_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • d_values_in may equal d_values_out but the range [d_values_in, d_values_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix sum-by-key of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int num_items;      // e.g., 7
int *d_keys_in;     // e.g., [0, 0, 1, 1, 1, 2, 2]
int *d_values_in;   // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_values_out;  // e.g., [ ,  ,  ,  ,  ,  ,  ]
...

// Determine temporary device storage requirements for inclusive prefix sum
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceScan::InclusiveSumByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, num_items);

// Allocate temporary storage for inclusive prefix sum
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix sum
cub::DeviceScan::InclusiveSumByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, num_items);

// d_out <-- [8, 14, 7, 12, 15, 0, 9]

Template Parameters
  • KeysInputIteratorT[inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)

  • ValuesInputIteratorT[inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)

  • ValuesOutputIteratorT[inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)

  • EqualityOpT[inferred] Functor type having member T operator()(const T &a, const T &b) for binary operations that defines the equality of keys

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access input iterator to the input sequence of key items

  • d_values_in[in] Random-access input iterator to the input sequence of value items

  • d_values_out[out] Random-access output iterator to the output sequence of value items

  • num_items[in] Total number of input items (i.e., the length of d_keys_in and d_values_in)

  • equality_op[in] Binary functor that defines the equality of keys. Default is cub::Equality().

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename ScanOpT, typename EqualityOpT = Equality>
static inline cudaError_t InclusiveScanByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, ScanOpT scan_op, int num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0)

Computes a device-wide inclusive prefix scan-by-key using the specified binary scan_op functor. The key equality is defined by equality_op.

  • Supports non-commutative scan operators.

  • Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.

  • d_keys_in may equal d_values_out but the range [d_keys_in, d_keys_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • d_values_in may equal d_values_out but the range [d_values_in, d_values_in + num_items) and the range [d_values_out, d_values_out + num_items) shall not overlap otherwise.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the inclusive prefix min-scan-by-key of an int device vector.

#include <cub/cub.cuh>   // or equivalently <cub/device/device_scan.cuh>
#include <climits>       // for INT_MAX

// CustomMin functor
struct CustomMin
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return (b < a) ? b : a;
    }
};

// CustomEqual functor
struct CustomEqual
{
    template <typename T>
    __host__ __device__ __forceinline__
    T operator()(const T &a, const T &b) const {
        return a == b;
    }
};

// Declare, allocate, and initialize device-accessible pointers for
// input and output
int          num_items;      // e.g., 7
int          *d_keys_in;     // e.g., [0, 0, 1, 1, 1, 2, 2]
int          *d_values_in;   // e.g., [8, 6, 7, 5, 3, 0, 9]
int          *d_values_out;  // e.g., [ ,  ,  ,  ,  ,  ,  ]
CustomMin    min_op;
CustomEqual  equality_op;
...

// Determine temporary device storage requirements for inclusive prefix scan
void *d_temp_storage = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceScan::InclusiveScanByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, min_op, num_items, equality_op);

// Allocate temporary storage for inclusive prefix scan
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run inclusive prefix min-scan
cub::DeviceScan::InclusiveScanByKey(
  d_temp_storage, temp_storage_bytes,
  d_keys_in, d_values_in, d_values_out, min_op, num_items, equality_op);

// d_out <-- [8, 6, 7, 5, 3, 0, 0]

Template Parameters
  • KeysInputIteratorT[inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)

  • ValuesInputIteratorT[inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)

  • ValuesOutputIteratorT[inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)

  • ScanOp[inferred] Binary scan functor type having member T operator()(const T &a, const T &b)

  • EqualityOpT[inferred] Functor type having member T operator()(const T &a, const T &b) for binary operations that defines the equality of keys

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Random-access input iterator to the input sequence of key items

  • d_values_in[in] Random-access input iterator to the input sequence of value items

  • d_values_out[out] Random-access output iterator to the output sequence of value items

  • scan_op[in] Binary scan functor

  • num_items[in] Total number of input items (i.e., the length of d_keys_in and d_values_in)

  • equality_op[in] Binary functor that defines the equality of keys. Default is cub::Equality().

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.