cub::DeviceScan
Defined in cub/device/device_scan.cuh
-
struct DeviceScan
DeviceScan provides device-wide, parallel operations for computing a prefix scan across a sequence of data items residing within device-accessible memory.
Overview
Given a sequence of input elements and a binary reduction operator, a prefix scan produces an output sequence where each element is computed to be the reduction of the elements occurring earlier in the input sequence. Prefix sum connotes a prefix scan with the addition operator. The term inclusive indicates that the ith output reduction incorporates the ith input. The term exclusive indicates the ith input is not incorporated into the ith output reduction. When the input and output sequences are the same, the scan is performed in-place.
As of CUB 1.0.1 (2013), CUB’s device-wide scan APIs have implemented our “decoupled look-back” algorithm for performing global prefix scan with only a single pass through the input data, as described in our 2016 technical report 1. The central idea is to leverage a small, constant factor of redundant work in order to overlap the latencies of global prefix propagation with local computation. As such, our algorithm requires only
~2*n*
data movement (n
inputs are read,n
outputs are written), and typically proceeds at “memcpy” speeds. Our algorithm supports inplace operations.- 1
Duane Merrill and Michael Garland. Single-pass Parallel Prefix Scan with Decoupled Look-back, NVIDIA Technical Report NVR-2016-002, 2016.
Usage Considerations
Dynamic parallelism. DeviceScan methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Performance
The work-complexity of prefix scan as a function of input size is linear, resulting in performance throughput that plateaus with problem sizes large enough to saturate the GPU.
Exclusive scans
-
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix sum. The value of
0
is applied as the initial value, and is assigned to*d_out
.Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix sum of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [ , , , , , , ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveSum( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix sum cub::DeviceScan::ExclusiveSum( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // d_out <-- [0, 8, 14, 21, 26, 29, 29]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename IteratorT, typename NumItemsT>
static inline cudaError_t ExclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix sum in-place. The value of
0
is applied as the initial value, and is assigned to*d_data
.Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix sum of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_data; // e.g., [8, 6, 7, 5, 3, 0, 9] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveSum( d_temp_storage, temp_storage_bytes, d_data, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix sum cub::DeviceScan::ExclusiveSum( d_temp_storage, temp_storage_bytes, d_data, num_items); // d_data <-- [0, 8, 14, 21, 26, 29, 29]
- Template Parameters
IteratorT – [inferred] Random-access iterator type for reading scan inputs and wrigin scan outputs
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_data – [inout] Random-access iterator to the sequence of data items
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix scan using the specified binary
scan_op
functor. Theinit_value
value is applied as the initial value, and is assigned to*d_out
.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix min-scan of an
int
device vector#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [ , , , , , , ] CustomMin min_op; ... // Determine temporary device storage requirements for exclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) INT_MAX, num_items); // Allocate temporary storage for exclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix min-scan cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, (int) INT_MAX, num_items); // d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
used Binary scan functor type having memberT operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the exclusive scan (and is assigned to
*d_out
)num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename IteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix scan using the specified binary
scan_op
functor. Theinit_value
value is applied as the initial value, and is assigned to*d_data
.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix min-scan of an
int
device vector:#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_data; // e.g., [8, 6, 7, 5, 3, 0, 9] CustomMin min_op; ... // Determine temporary device storage requirements for exclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_data, min_op, (int) INT_MAX, num_items); // Allocate temporary storage for exclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix min-scan cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_data, min_op, (int) INT_MAX, num_items); // d_data <-- [2147483647, 8, 6, 6, 5, 3, 0]
- Template Parameters
IteratorT – [inferred] Random-access input iterator type for reading scan inputs and writing scan outputs
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
used Binary scan functor type having memberT operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_data – [inout] Random-access iterator to the sequence of data items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the exclusive scan (and is assigned to
*d_out
)num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename InitValueIterT = InitValueT*, typename NumItemsT = int>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, FutureValue<InitValueT, InitValueIterT> init_value, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix scan using the specified binary
scan_op
functor. Theinit_value
value is provided as a future value.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix min-scan of an
int
device vector#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [ , , , , , , ] int *d_init_iter; // e.g., INT_MAX CustomMin min_op; auto future_init_value = cub::FutureValue<InitialValueT, IterT>(d_init_iter); ... // Determine temporary device storage requirements for exclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, future_init_value, num_items); // Allocate temporary storage for exclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix min-scan cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, future_init_value, num_items); // d_out <-- [2147483647, 8, 6, 6, 5, 3, 0]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
used Binary scan functor type having memberT operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Pointer to the input sequence of data items
d_out – [out] Pointer to the output sequence of data items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the exclusive scan (and is assigned to
*d_out
)num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename IteratorT, typename ScanOpT, typename InitValueT, typename InitValueIterT = InitValueT*, typename NumItemsT = int>
static inline cudaError_t ExclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, FutureValue<InitValueT, InitValueIterT> init_value, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide exclusive prefix scan using the specified binary
scan_op
functor. Theinit_value
value is provided as a future value.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix min-scan of an
int
device vector#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_data; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_init_iter; // e.g., INT_MAX CustomMin min_op; auto future_init_value = cub::FutureValue<InitialValueT, IterT>(d_init_iter); ... // Determine temporary device storage requirements for exclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_data, min_op, future_init_value, num_items); // Allocate temporary storage for exclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix min-scan cub::DeviceScan::ExclusiveScan( d_temp_storage, temp_storage_bytes, d_data, min_op, future_init_value, num_items); // d_data <-- [2147483647, 8, 6, 6, 5, 3, 0]
- Template Parameters
IteratorT – [inferred] Random-access input iterator type for reading scan inputs and writing scan outputs
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
used Binary scan functor type having memberT operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_data – [inout] Pointer to the sequence of data items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the exclusive scan (and is assigned to
*d_out
)num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
Inclusive scans
-
template<typename InputIteratorT, typename OutputIteratorT, typename NumItemsT>
static inline cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide inclusive prefix sum.
Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix sum of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [ , , , , , , ] ... // Determine temporary device storage requirements for inclusive // prefix sum void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveSum( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // Allocate temporary storage for inclusive prefix sum cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix sum cub::DeviceScan::InclusiveSum( d_temp_storage, temp_storage_bytes, d_in, d_out, num_items); // d_out <-- [8, 14, 21, 26, 29, 29, 38]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename IteratorT, typename NumItemsT>
static inline cudaError_t InclusiveSum(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide inclusive prefix sum in-place.
Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix sum of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_data; // e.g., [8, 6, 7, 5, 3, 0, 9] ... // Determine temporary device storage requirements for inclusive // prefix sum void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveSum( d_temp_storage, temp_storage_bytes, d_data, num_items); // Allocate temporary storage for inclusive prefix sum cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix sum cub::DeviceScan::InclusiveSum( d_temp_storage, temp_storage_bytes, d_data, num_items); // d_data <-- [8, 14, 21, 26, 29, 29, 38]
- Template Parameters
IteratorT – [inferred] Random-access input iterator type for reading scan inputs and writing scan outputs
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_data – [inout] Random-access iterator to the sequence of data items
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename NumItemsT>
static inline cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide inclusive prefix scan using the specified binary
scan_op
functor.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix min-scan of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_out; // e.g., [ , , , , , , ] CustomMin min_op; ... // Determine temporary device storage requirements for inclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items); // Allocate temporary storage for inclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix min-scan cub::DeviceScan::InclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items); // d_out <-- [8, 6, 6, 5, 3, 0, 0]
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
scan_op – [in] Binary scan functor
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIteratorT, typename OutputIteratorT, typename ScanOpT, typename InitValueT, typename NumItemsT>
static inline cudaError_t InclusiveScanInit(void *d_temp_storage, size_t &temp_storage_bytes, InputIteratorT d_in, OutputIteratorT d_out, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide inclusive prefix scan using the specified binary
scan_op
functor. The result of applying thescan_op
binary operator toinit_value
value and*d_in
is assigned to*d_out
.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_in
andd_out
are equal, the scan is performed in-place. The range[d_in, d_in + num_items)
and[d_out, d_out + num_items)
shall not overlap in any other way.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive max-scan of an
int
device vector.thrust::device_vector<int> input{0, -1, 2, -3, 4, -5, 6}; thrust::device_vector<int> out(input.size()); int init = 1; size_t temp_storage_bytes{}; cub::DeviceScan::InclusiveScanInit( nullptr, temp_storage_bytes, input.begin(), out.begin(), cub::Max{}, init, static_cast<int>(input.size())); // Allocate temporary storage for inclusive scan thrust::device_vector<std::uint8_t> temp_storage(temp_storage_bytes); // Run inclusive prefix sum cub::DeviceScan::InclusiveScanInit( thrust::raw_pointer_cast(temp_storage.data()), temp_storage_bytes, input.begin(), out.begin(), cub::Max{}, init, static_cast<int>(input.size())); thrust::host_vector<int> expected{1, 1, 2, 2, 4, 4, 6};
- Template Parameters
InputIteratorT – [inferred] Random-access input iterator type for reading scan inputs (may be a simple pointer type)
OutputIteratorT – [inferred] Random-access output iterator type for writing scan outputs (may be a simple pointer type)
ScanOpT – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to the size in bytes of the
d_temp_storage
allocationd_in – [in] Random-access iterator to the input sequence of data items
d_out – [out] Random-access iterator to the output sequence of data items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the inclusive scan (
scan_op(init_value, d_in[0])
is assigned to*d_out
)num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in] CUDA stream to launch kernels within.
-
template<typename IteratorT, typename ScanOpT, typename NumItemsT>
static inline cudaError_t InclusiveScan(void *d_temp_storage, size_t &temp_storage_bytes, IteratorT d_data, ScanOpT scan_op, NumItemsT num_items, cudaStream_t stream = 0) Computes a device-wide inclusive prefix scan using the specified binary
scan_op
functor.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix min-scan of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_data; // e.g., [8, 6, 7, 5, 3, 0, 9] CustomMin min_op; ... // Determine temporary device storage requirements for inclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveScan( d_temp_storage, temp_storage_bytes, d_data, min_op, num_items); // Allocate temporary storage for inclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix min-scan cub::DeviceScan::InclusiveScan( d_temp_storage, temp_storage_bytes, d_in, d_out, min_op, num_items); // d_data <-- [8, 6, 6, 5, 3, 0, 0]
- Template Parameters
IteratorT – [inferred] Random-access input iterator type for reading scan inputs and writing scan outputs
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
NumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_data – [in] Random-access iterator to the sequence of data items
scan_op – [in] Binary scan functor
num_items – [in] Total number of input items (i.e., the length of
d_in
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename EqualityOpT = Equality, typename NumItemsT = std::uint32_t>
static inline cudaError_t ExclusiveSumByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, NumItemsT num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0) Computes a device-wide exclusive prefix sum-by-key with key equality defined by
equality_op
. The value of0
is applied as the initial value, and is assigned to the beginning of each segment ind_values_out
.Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
d_keys_in
may equald_values_out
but the range[d_keys_in, d_keys_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.d_values_in
may equald_values_out
but the range[d_values_in, d_values_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix sum-by-key of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_keys_in; // e.g., [0, 0, 1, 1, 1, 2, 2] int *d_values_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_values_out; // e.g., [ , , , , , , ] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveSumByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, num_items); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix sum cub::DeviceScan::ExclusiveSumByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, num_items); // d_values_out <-- [0, 8, 0, 7, 12, 0, 0]
- Template Parameters
KeysInputIteratorT – [inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)
ValuesInputIteratorT – [inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)
ValuesOutputIteratorT – [inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)
EqualityOpT – [inferred] Functor type having member
T operator()(const T &a, const T &b)
for binary operations that defines the equality of keysNumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Random-access input iterator to the input sequence of key items
d_values_in – [in] Random-access input iterator to the input sequence of value items
d_values_out – [out] Random-access output iterator to the output sequence of value items
num_items – [in] Total number of input items (i.e., the length of
d_keys_in
andd_values_in
)equality_op – [in] Binary functor that defines the equality of keys. Default is cub::Equality().
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename ScanOpT, typename InitValueT, typename EqualityOpT = Equality, typename NumItemsT = std::uint32_t>
static inline cudaError_t ExclusiveScanByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, ScanOpT scan_op, InitValueT init_value, NumItemsT num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0) Computes a device-wide exclusive prefix scan-by-key using the specified binary
scan_op
functor. The key equality is defined byequality_op
. Theinit_value
value is applied as the initial value, and is assigned to the beginning of each segment ind_values_out
.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
d_keys_in
may equald_values_out
but the range[d_keys_in, d_keys_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.d_values_in
may equald_values_out
but the range[d_values_in, d_values_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the exclusive prefix min-scan-by-key of an
int
device vector#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // CustomEqual functor struct CustomEqual { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return a == b; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_keys_in; // e.g., [0, 0, 1, 1, 1, 2, 2] int *d_values_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_values_out; // e.g., [ , , , , , , ] CustomMin min_op; CustomEqual equality_op; ... // Determine temporary device storage requirements for exclusive // prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::ExclusiveScanByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, min_op, (int) INT_MAX, num_items, equality_op); // Allocate temporary storage for exclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run exclusive prefix min-scan cub::DeviceScan::ExclusiveScanByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, min_op, (int) INT_MAX, num_items, equality_op); // d_values_out <-- [2147483647, 8, 2147483647, 7, 5, 2147483647, 0]
- Template Parameters
KeysInputIteratorT – [inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)
ValuesInputIteratorT – [inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)
ValuesOutputIteratorT – [inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
InitValueT – [inferred] Type of the
init_value
value used in Binary scan functor type having memberT operator()(const T &a, const T &b)
EqualityOpT – [inferred] Functor type having member
T operator()(const T &a, const T &b)
for binary operations that defines the equality of keysNumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Random-access input iterator to the input sequence of key items
d_values_in – [in] Random-access input iterator to the input sequence of value items
d_values_out – [out] Random-access output iterator to the output sequence of value items
scan_op – [in] Binary scan functor
init_value – [in] Initial value to seed the exclusive scan (and is assigned to the beginning of each segment in
d_values_out
)num_items – [in] Total number of input items (i.e., the length of
d_keys_in
andd_values_in
)equality_op – [in] Binary functor that defines the equality of keys. Default is cub::Equality().
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename EqualityOpT = Equality, typename NumItemsT = std::uint32_t>
static inline cudaError_t InclusiveSumByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, NumItemsT num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0) Computes a device-wide inclusive prefix sum-by-key with key equality defined by
equality_op
.Supports non-commutative sum operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
d_keys_in
may equald_values_out
but the range[d_keys_in, d_keys_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.d_values_in
may equald_values_out
but the range[d_values_in, d_values_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix sum-by-key of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_keys_in; // e.g., [0, 0, 1, 1, 1, 2, 2] int *d_values_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_values_out; // e.g., [ , , , , , , ] ... // Determine temporary device storage requirements for inclusive prefix sum void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveSumByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, num_items); // Allocate temporary storage for inclusive prefix sum cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix sum cub::DeviceScan::InclusiveSumByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, num_items); // d_out <-- [8, 14, 7, 12, 15, 0, 9]
- Template Parameters
KeysInputIteratorT – [inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)
ValuesInputIteratorT – [inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)
ValuesOutputIteratorT – [inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)
EqualityOpT – [inferred] Functor type having member
T operator()(const T &a, const T &b)
for binary operations that defines the equality of keysNumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Random-access input iterator to the input sequence of key items
d_values_in – [in] Random-access input iterator to the input sequence of value items
d_values_out – [out] Random-access output iterator to the output sequence of value items
num_items – [in] Total number of input items (i.e., the length of
d_keys_in
andd_values_in
)equality_op – [in] Binary functor that defines the equality of keys. Default is cub::Equality().
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeysInputIteratorT, typename ValuesInputIteratorT, typename ValuesOutputIteratorT, typename ScanOpT, typename EqualityOpT = Equality, typename NumItemsT = std::uint32_t>
static inline cudaError_t InclusiveScanByKey(void *d_temp_storage, size_t &temp_storage_bytes, KeysInputIteratorT d_keys_in, ValuesInputIteratorT d_values_in, ValuesOutputIteratorT d_values_out, ScanOpT scan_op, NumItemsT num_items, EqualityOpT equality_op = EqualityOpT(), cudaStream_t stream = 0) Computes a device-wide inclusive prefix scan-by-key using the specified binary
scan_op
functor. The key equality is defined byequality_op
.Supports non-commutative scan operators.
Results are not deterministic for pseudo-associative operators (e.g., addition of floating-point types). Results for pseudo-associative operators may vary from run to run. Additional details can be found in the decoupled look-back description.
d_keys_in
may equald_values_out
but the range[d_keys_in, d_keys_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.d_values_in
may equald_values_out
but the range[d_values_in, d_values_in + num_items)
and the range[d_values_out, d_values_out + num_items)
shall not overlap otherwise.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the inclusive prefix min-scan-by-key of an
int
device vector.#include <cub/cub.cuh> // or equivalently <cub/device/device_scan.cuh> #include <climits> // for INT_MAX // CustomMin functor struct CustomMin { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return (b < a) ? b : a; } }; // CustomEqual functor struct CustomEqual { template <typename T> __host__ __device__ __forceinline__ T operator()(const T &a, const T &b) const { return a == b; } }; // Declare, allocate, and initialize device-accessible pointers for // input and output int num_items; // e.g., 7 int *d_keys_in; // e.g., [0, 0, 1, 1, 1, 2, 2] int *d_values_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_values_out; // e.g., [ , , , , , , ] CustomMin min_op; CustomEqual equality_op; ... // Determine temporary device storage requirements for inclusive prefix scan void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceScan::InclusiveScanByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, min_op, num_items, equality_op); // Allocate temporary storage for inclusive prefix scan cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run inclusive prefix min-scan cub::DeviceScan::InclusiveScanByKey( d_temp_storage, temp_storage_bytes, d_keys_in, d_values_in, d_values_out, min_op, num_items, equality_op); // d_out <-- [8, 6, 7, 5, 3, 0, 0]
- Template Parameters
KeysInputIteratorT – [inferred] Random-access input iterator type for reading scan keys inputs (may be a simple pointer type)
ValuesInputIteratorT – [inferred] Random-access input iterator type for reading scan values inputs (may be a simple pointer type)
ValuesOutputIteratorT – [inferred] Random-access output iterator type for writing scan values outputs (may be a simple pointer type)
ScanOp – [inferred] Binary scan functor type having member
T operator()(const T &a, const T &b)
EqualityOpT – [inferred] Functor type having member
T operator()(const T &a, const T &b)
for binary operations that defines the equality of keysNumItemsT – [inferred] An integral type representing the number of input elements
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Random-access input iterator to the input sequence of key items
d_values_in – [in] Random-access input iterator to the input sequence of value items
d_values_out – [out] Random-access output iterator to the output sequence of value items
scan_op – [in] Binary scan functor
num_items – [in] Total number of input items (i.e., the length of
d_keys_in
andd_values_in
)equality_op – [in] Binary functor that defines the equality of keys. Default is cub::Equality().
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.