cub::DeviceSegmentedRadixSort
Defined in cub/device/device_segmented_radix_sort.cuh
-
struct DeviceSegmentedRadixSort
DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.
Overview
The radix sorting method arranges items into ascending (or descending) order. The algorithm relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.
See Also
DeviceSegmentedRadixSort shares its implementation with DeviceRadixSort. See that algorithm’s documentation for more information.
Segments are not required to be contiguous. Any element of input(s) or output(s) outside the specified segments will not be accessed nor modified.
Usage Considerations
Dynamic parallelism. DeviceSegmentedRadixSort methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.
Key-value pairs
-
template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairs(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of key-value pairs into ascending order. (
~2N
auxiliary storage required)The contents of the input data are not altered by the sorting operation
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
in
be one of{d_keys_in, d_values_in}
andout
be any of{d_keys_out, d_values_out}
. The range[out, out + num_items)
shall not overlap[in, in + num_items)
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.This operation requires an allocation of temporary device storage that is
O(N+P)
, whereN
is the length of the input andP
is the number of streaming multiprocessors on the device. For sorting using onlyO(P)
temporary storage, see the sorting interface using DoubleBuffer wrappers below.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys_in[i]
,d_values_in[i]
,d_keys_out[i]
,d_values_out[i]
will not be accessed nor modified.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of
int
keys with associated vector ofint
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_keys_out; // e.g., [-, -, -, -, -, -, -] int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6] int *d_values_out; // e.g., [-, -, -, -, -, -, -] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortPairs( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortPairs( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys_out <-- [6, 7, 8, 0, 3, 5, 9] // d_values_out <-- [1, 2, 0, 5, 4, 3, 6]
- Template Parameters
KeyT – [inferred] Key type
ValueT – [inferred] Value type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Device-accessible pointer to the input data of key data to sort
d_keys_out – [out] Device-accessible pointer to the sorted output sequence of key data
d_values_in – [in] Device-accessible pointer to the corresponding input sequence of associated value items
d_values_out – [out] Device-accessible pointer to the correspondingly-reordered output sequence of associated value items
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in] Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairs(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, DoubleBuffer<ValueT> &d_values, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of key-value pairs into ascending order. (
~N
auxiliary storage required)The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).
The contents of both buffers within each pair may be altered by the sorting operation.
Upon completion, the sorting operation will update the “current” indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
cur
be one of{d_keys.Current(), d_values.Current()}
andalt
be any of{d_keys.Alternate(), d_values.Alternate()}
. The range[cur, cur + num_items)
shall not overlap[alt, alt + num_items)
. Both ranges shall not overlap[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys.Current()[i]
,d_values.Current()[i]
,d_keys.Alternate()[i]
,d_values.Alternate()[i]
will not be accessed nor modified.This operation requires a relatively small allocation of temporary device storage that is
O(P)
, whereP
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input sizeN
).When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of
int
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -] int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6] int *d_value_alt_buf; // e.g., [-, -, -, -, -, -, -] ... // Create a set of DoubleBuffers to wrap pairs of device pointers cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf); cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortPairs( d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortPairs( d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys.Current() <-- [6, 7, 8, 0, 3, 5, 9] // d_values.Current() <-- [5, 4, 3, 1, 2, 0, 6]
- Template Parameters
KeyT – [inferred] Key type
ValueT – [inferred] Value type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys – [inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
d_values – [inout] Double-buffer of values whose “current” device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairsDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of key-value pairs into descending order. (
~2N
auxiliary storage required).The contents of the input data are not altered by the sorting operation
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
in
be one of{d_keys_in, d_values_in}
and out be any of{d_keys_out, d_values_out}
. The range[out, out + num_items)
shall not overlap[in, in + num_items)
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.This operation requires an allocation of temporary device storage that is
O(N+P)
, whereN
is the length of the input andP
is the number of streaming multiprocessors on the device. For sorting using onlyO(P)
temporary storage, see the sorting interface using DoubleBuffer wrappers below.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys_in[i]
,d_values_in[i]
,d_keys_out[i]
,d_values_out[i]
will not be accessed nor modified.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of
int
keys with associated vector ofint
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_keys_out; // e.g., [-, -, -, -, -, -, -] int *d_values_in; // e.g., [0, 1, 2, 3, 4, 5, 6] int *d_values_out; // e.g., [-, -, -, -, -, -, -] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortPairsDescending( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortPairsDescending( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, d_values_in, d_values_out, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys_out <-- [8, 7, 6, 9, 5, 3, 0] // d_values_out <-- [0, 2, 1, 6, 3, 4, 5]
- Template Parameters
KeyT – [inferred] Key type
ValueT – [inferred] Value type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Device-accessible pointer to the input data of key data to sort
d_keys_out – [out] Device-accessible pointer to the sorted output sequence of key data
d_values_in – [in] Device-accessible pointer to the corresponding input sequence of associated value items
d_values_out – [out] Device-accessible pointer to the correspondingly-reordered output sequence of associated value items
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairsDescending(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, DoubleBuffer<ValueT> &d_values, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of key-value pairs into descending order. (
~N
auxiliary storage required).The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).
The contents of both buffers within each pair may be altered by the sorting operation.
Upon completion, the sorting operation will update the “current” indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
cur
be one of{d_keys.Current(), d_values.Current()}
andalt
be any of{d_keys.Alternate(), d_values.Alternate()}
. The range[cur, cur + num_items)
shall not overlap[alt, alt + num_items)
. Both ranges shall not overlap[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys.Current()[i]
,d_values.Current()[i]
,d_keys.Alternate()[i]
,d_values.Alternate()[i]
will not be accessed nor modified. not to be modified.This operation requires a relatively small allocation of temporary device storage that is
O(P)
, whereP
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input sizeN
).When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of
int
keys with associated vector ofint
values.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -] int *d_value_buf; // e.g., [0, 1, 2, 3, 4, 5, 6] int *d_value_alt_buf; // e.g., [-, -, -, -, -, -, -] ... // Create a set of DoubleBuffers to wrap pairs of device pointers cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf); cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortPairsDescending( d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortPairsDescending( d_temp_storage, temp_storage_bytes, d_keys, d_values, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys.Current() <-- [8, 7, 6, 9, 5, 3, 0] // d_values.Current() <-- [0, 2, 1, 6, 3, 4, 5]
- Template Parameters
KeyT – [inferred] Key type
ValueT – [inferred] Value type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys – [inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
d_values – [inout] Double-buffer of values whose “current” device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
Keys-only
-
template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeys(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of keys into ascending order. (
~2N
auxiliary storage required)The contents of the input data are not altered by the sorting operation
An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).The range
[d_keys_out, d_keys_out + num_items)
shall not overlap[d_keys_in, d_keys_in + num_items)
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.This operation requires an allocation of temporary device storage that is
O(N+P)
, whereN
is the length of the input andP
is the number of streaming multiprocessors on the device. For sorting using onlyO(P)
temporary storage, see the sorting interface using DoubleBuffer wrappers below.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys_in[i]
,d_keys_out[i]
will not be accessed nor modified.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_keys_out; // e.g., [-, -, -, -, -, -, -] ... // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortKeys( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortKeys( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys_out <-- [6, 7, 8, 0, 3, 5, 9]
- Template Parameters
KeyT – [inferred] Key type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Device-accessible pointer to the input data of key data to sort
d_keys_out – [out] Device-accessible pointer to the sorted output sequence of key data
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeys(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of keys into ascending order. (
~N
auxiliary storage required).The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).
The contents of both buffers may be altered by the sorting operation.
Upon completion, the sorting operation will update the “current” indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
cur = d_keys.Current()
andalt = d_keys.Alternate()
. The range[cur, cur + num_items)
shall not overlap[alt, alt + num_items)
. Both ranges shall not overlap[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys.Current()[i]
,d_keys[i].Alternate()[i]
will not be accessed nor modified.This operation requires a relatively small allocation of temporary device storage that is
O(P)
, whereP
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input sizeN
).When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of
int
keys.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers for // sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -] ... // Create a DoubleBuffer to wrap the pair of device pointers cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortKeys( d_temp_storage, temp_storage_bytes, d_keys, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortKeys( d_temp_storage, temp_storage_bytes, d_keys, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys.Current() <-- [6, 7, 8, 0, 3, 5, 9]
- Template Parameters
KeyT – [inferred] Key type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys – [inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeysDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of keys into descending order. (
~2N
auxiliary storage required).The contents of the input data are not altered by the sorting operation
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.The range
[d_keys_out, d_keys_out + num_items)
shall not overlap[d_keys_in, d_keys_in + num_items)
,[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.This operation requires an allocation of temporary device storage that is
O(N+P)
, whereN
is the length of the input andP
is the number of streaming multiprocessors on the device. For sorting using onlyO(P)
temporary storage, see the sorting interface using DoubleBuffer wrappers below.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys_in[i]
,d_keys_out[i]
will not be accessed nor modified.When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of
int
keys.#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_keys_out; // e.g., [-, -, -, -, -, -, -] ... // Create a DoubleBuffer to wrap the pair of device pointers cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortKeysDescending( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortKeysDescending( d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys_out <-- [8, 7, 6, 9, 5, 3, 0]
- Template Parameters
KeyT – [inferred] Key type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys_in – [in] Device-accessible pointer to the input data of key data to sort
d_keys_out – [out] Device-accessible pointer to the sorted output sequence of key data
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeysDescending(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0) Sorts segments of keys into descending order. (
~N
auxiliary storage required).The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).
The contents of both buffers may be altered by the sorting operation.
Upon completion, the sorting operation will update the “current” indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).
When input a contiguous sequence of segments, a single sequence
segment_offsets
(of lengthnum_segments + 1
) can be aliased for both thed_begin_offsets
andd_end_offsets
parameters (where the latter is specified assegment_offsets + 1
).An optional bit subrange
[begin_bit, end_bit)
of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.Let
cur = d_keys.Current()
andalt = d_keys.Alternate()
. The range[cur, cur + num_items)
shall not overlap[alt, alt + num_items)
. Both ranges shall not overlap[d_begin_offsets, d_begin_offsets + num_segments)
nor[d_end_offsets, d_end_offsets + num_segments)
in any way.Segments are not required to be contiguous. For all index values
i
outside the specified segmentsd_keys.Current()[i]
,d_keys[i].Alternate()[i]
will not be accessed nor modified.This operation requires a relatively small allocation of temporary device storage that is
O(P)
, whereP
is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input sizeN
).When
d_temp_storage
isnullptr
, no work is done and the required allocation size is returned intemp_storage_bytes
.
Snippet
The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.
#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh> // Declare, allocate, and initialize device-accessible pointers // for sorting data int num_items; // e.g., 7 int num_segments; // e.g., 3 int *d_offsets; // e.g., [0, 3, 3, 7] int *d_key_buf; // e.g., [8, 6, 7, 5, 3, 0, 9] int *d_key_alt_buf; // e.g., [-, -, -, -, -, -, -] ... // Create a DoubleBuffer to wrap the pair of device pointers cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceSegmentedRadixSort::SortKeysDescending( d_temp_storage, temp_storage_bytes, d_keys, num_items, num_segments, d_offsets, d_offsets + 1); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run sorting operation cub::DeviceSegmentedRadixSort::SortKeysDescending( d_temp_storage, temp_storage_bytes, d_keys, num_items, num_segments, d_offsets, d_offsets + 1); // d_keys.Current() <-- [8, 7, 6, 9, 5, 3, 0]
- Template Parameters
KeyT – [inferred] Key type
BeginOffsetIteratorT – [inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)
EndOffsetIteratorT – [inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationd_keys – [inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys
num_items – [in] The total number of items within the segmented array, including items not covered by segments.
num_items
should match the largest element within the range[d_end_offsets, d_end_offsets + num_segments)
.num_segments – [in] The number of segments that comprise the sorting data
d_begin_offsets – [in]
Random-access input iterator to the sequence of beginning offsets of length
num_segments
, such thatd_begin_offsets[i]
is the first element of the ith data segment ind_keys_*
andd_values_*
d_end_offsets – [in]
Random-access input iterator to the sequence of ending offsets of length
num_segments
, such thatd_end_offsets[i] - 1
is the last element of the ith data segment ind_keys_*
andd_values_*
. Ifd_end_offsets[i] - 1 <= d_begin_offsets[i]
, the ith is considered empty.begin_bit – [in] [optional] The least-significant bit index (inclusive) needed for key comparison
end_bit – [in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g.,
sizeof(unsigned int) * 8
)stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.