cub::DeviceSegmentedRadixSort

Defined in cub/device/device_segmented_radix_sort.cuh

struct DeviceSegmentedRadixSort

DeviceSegmentedRadixSort provides device-wide, parallel operations for computing a batched radix sort across multiple, non-overlapping sequences of data items residing within device-accessible memory.

Overview

The radix sorting method arranges items into ascending (or descending) order. The algorithm relies upon a positional representation for keys, i.e., each key is comprised of an ordered sequence of symbols (e.g., digits, characters, etc.) specified from least-significant to most-significant. For a given input sequence of keys and a set of rules specifying a total ordering of the symbolic alphabet, the radix sorting method produces a lexicographic ordering of those keys.

See Also

DeviceSegmentedRadixSort shares its implementation with DeviceRadixSort. See that algorithm’s documentation for more information.

Segments are not required to be contiguous. Any element of input(s) or output(s) outside the specified segments will not be accessed nor modified.

Usage Considerations

  • Dynamic parallelism. DeviceSegmentedRadixSort methods can be called within kernel code on devices in which CUDA dynamic parallelism is supported.

Key-value pairs

template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairs(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of key-value pairs into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let in be one of {d_keys_in, d_values_in} and out be any of {d_keys_out, d_values_out}. The range [out, out + num_items) shall not overlap [in, in + num_items), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys_in[i], d_values_in[i], d_keys_out[i], d_values_out[i] will not be accessed nor modified.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]
int  *d_values_in;       // e.g., [0, 1, 2, 3, 4, 5, 6]
int  *d_values_out;      // e.g., [-, -, -, -, -, -, -]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairs(
    d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_values_in, d_values_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairs(
    d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_values_in, d_values_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys_out            <-- [6, 7, 8, 0, 3, 5, 9]
// d_values_out          <-- [1, 2, 0, 5, 4, 3, 6]

Template Parameters
  • KeyT[inferred] Key type

  • ValueT[inferred] Value type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Device-accessible pointer to the input data of key data to sort

  • d_keys_out[out] Device-accessible pointer to the sorted output sequence of key data

  • d_values_in[in] Device-accessible pointer to the corresponding input sequence of associated value items

  • d_values_out[out] Device-accessible pointer to the correspondingly-reordered output sequence of associated value items

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in] Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairs(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, DoubleBuffer<ValueT> &d_values, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of key-value pairs into ascending order. (~N auxiliary storage required)

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).

  • The contents of both buffers within each pair may be altered by the sorting operation.

  • Upon completion, the sorting operation will update the “current” indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let cur be one of {d_keys.Current(), d_values.Current()} and alt be any of {d_keys.Alternate(), d_values.Alternate()}. The range [cur, cur + num_items) shall not overlap [alt, alt + num_items). Both ranges shall not overlap [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys.Current()[i], d_values.Current()[i], d_keys.Alternate()[i], d_values.Alternate()[i] will not be accessed nor modified.

  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_key_buf;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_key_alt_buf;     // e.g., [-, -, -, -, -, -, -]
int  *d_value_buf;       // e.g., [0, 1, 2, 3, 4, 5, 6]
int  *d_value_alt_buf;   // e.g., [-, -, -, -, -, -, -]
...

// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairs(
    d_temp_storage, temp_storage_bytes, d_keys, d_values,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairs(
    d_temp_storage, temp_storage_bytes, d_keys, d_values,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys.Current()      <-- [6, 7, 8, 0, 3, 5, 9]
// d_values.Current()    <-- [5, 4, 3, 1, 2, 0, 6]

Template Parameters
  • KeyT[inferred] Key type

  • ValueT[inferred] Value type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys[inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys

  • d_values[inout] Double-buffer of values whose “current” device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairsDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, const ValueT *d_values_in, ValueT *d_values_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of key-value pairs into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let in be one of {d_keys_in, d_values_in} and out be any of {d_keys_out, d_values_out}. The range [out, out + num_items) shall not overlap [in, in + num_items), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys_in[i], d_values_in[i], d_keys_out[i], d_values_out[i] will not be accessed nor modified.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]
int  *d_values_in;       // e.g., [0, 1, 2, 3, 4, 5, 6]
int  *d_values_out;      // e.g., [-, -, -, -, -, -, -]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairsDescending(
    d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_values_in, d_values_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairsDescending(
    d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_values_in, d_values_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys_out            <-- [8, 7, 6, 9, 5, 3, 0]
// d_values_out          <-- [0, 2, 1, 6, 3, 4, 5]

Template Parameters
  • KeyT[inferred] Key type

  • ValueT[inferred] Value type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Device-accessible pointer to the input data of key data to sort

  • d_keys_out[out] Device-accessible pointer to the sorted output sequence of key data

  • d_values_in[in] Device-accessible pointer to the corresponding input sequence of associated value items

  • d_values_out[out] Device-accessible pointer to the correspondingly-reordered output sequence of associated value items

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename ValueT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortPairsDescending(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, DoubleBuffer<ValueT> &d_values, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of key-value pairs into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers and a corresponding pair of associated value buffers. Each pair is managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).

  • The contents of both buffers within each pair may be altered by the sorting operation.

  • Upon completion, the sorting operation will update the “current” indicator within each DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let cur be one of {d_keys.Current(), d_values.Current()} and alt be any of {d_keys.Alternate(), d_values.Alternate()}. The range [cur, cur + num_items) shall not overlap [alt, alt + num_items). Both ranges shall not overlap [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys.Current()[i], d_values.Current()[i], d_keys.Alternate()[i], d_values.Alternate()[i] will not be accessed nor modified. not to be modified.

  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys with associated vector of int values.

#include <cub/cub.cuh> // or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_key_buf;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_key_alt_buf;     // e.g., [-, -, -, -, -, -, -]
int  *d_value_buf;       // e.g., [0, 1, 2, 3, 4, 5, 6]
int  *d_value_alt_buf;   // e.g., [-, -, -, -, -, -, -]
...

// Create a set of DoubleBuffers to wrap pairs of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);
cub::DoubleBuffer<int> d_values(d_value_buf, d_value_alt_buf);

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortPairsDescending(
    d_temp_storage, temp_storage_bytes, d_keys, d_values,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortPairsDescending(
    d_temp_storage, temp_storage_bytes, d_keys, d_values,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys.Current()      <-- [8, 7, 6, 9, 5, 3, 0]
// d_values.Current()    <-- [0, 2, 1, 6, 3, 4, 5]

Template Parameters
  • KeyT[inferred] Key type

  • ValueT[inferred] Value type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys[inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys

  • d_values[inout] Double-buffer of values whose “current” device-accessible buffer contains the unsorted input values and, upon return, is updated to point to the sorted output values

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

Keys-only

template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeys(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of keys into ascending order. (~2N auxiliary storage required)

  • The contents of the input data are not altered by the sorting operation

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • The range [d_keys_out, d_keys_out + num_items) shall not overlap [d_keys_in, d_keys_in + num_items), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys_in[i], d_keys_out[i] will not be accessed nor modified.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]
...

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeys(
    d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeys(
    d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys_out            <-- [6, 7, 8, 0, 3, 5, 9]

Template Parameters
  • KeyT[inferred] Key type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Device-accessible pointer to the input data of key data to sort

  • d_keys_out[out] Device-accessible pointer to the sorted output sequence of key data

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeys(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of keys into ascending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).

  • The contents of both buffers may be altered by the sorting operation.

  • Upon completion, the sorting operation will update the “current” indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let cur = d_keys.Current() and alt = d_keys.Alternate(). The range [cur, cur + num_items) shall not overlap [alt, alt + num_items). Both ranges shall not overlap [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys.Current()[i], d_keys[i].Alternate()[i] will not be accessed nor modified.

  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers for
// sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_key_buf;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_key_alt_buf;     // e.g., [-, -, -, -, -, -, -]
...

// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeys(
    d_temp_storage, temp_storage_bytes, d_keys,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeys(
    d_temp_storage, temp_storage_bytes, d_keys,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys.Current()      <-- [6, 7, 8, 0, 3, 5, 9]

Template Parameters
  • KeyT[inferred] Key type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys[inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeysDescending(void *d_temp_storage, size_t &temp_storage_bytes, const KeyT *d_keys_in, KeyT *d_keys_out, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of keys into descending order. (~2N auxiliary storage required).

  • The contents of the input data are not altered by the sorting operation

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • The range [d_keys_out, d_keys_out + num_items) shall not overlap [d_keys_in, d_keys_in + num_items), [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • This operation requires an allocation of temporary device storage that is O(N+P), where N is the length of the input and P is the number of streaming multiprocessors on the device. For sorting using only O(P) temporary storage, see the sorting interface using DoubleBuffer wrappers below.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys_in[i], d_keys_out[i] will not be accessed nor modified.

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]
...

// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeysDescending(
    d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeysDescending(
    d_temp_storage, temp_storage_bytes, d_keys_in, d_keys_out,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys_out            <-- [8, 7, 6, 9, 5, 3, 0]

Template Parameters
  • KeyT[inferred] Key type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys_in[in] Device-accessible pointer to the input data of key data to sort

  • d_keys_out[out] Device-accessible pointer to the sorted output sequence of key data

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.

template<typename KeyT, typename BeginOffsetIteratorT, typename EndOffsetIteratorT>
static inline cudaError_t SortKeysDescending(void *d_temp_storage, size_t &temp_storage_bytes, DoubleBuffer<KeyT> &d_keys, int num_items, int num_segments, BeginOffsetIteratorT d_begin_offsets, EndOffsetIteratorT d_end_offsets, int begin_bit = 0, int end_bit = sizeof(KeyT) * 8, cudaStream_t stream = 0)

Sorts segments of keys into descending order. (~N auxiliary storage required).

  • The sorting operation is given a pair of key buffers managed by a DoubleBuffer structure that indicates which of the two buffers is “current” (and thus contains the input data to be sorted).

  • The contents of both buffers may be altered by the sorting operation.

  • Upon completion, the sorting operation will update the “current” indicator within the DoubleBuffer wrapper to reference which of the two buffers now contains the sorted output sequence (a function of the number of key bits specified and the targeted device architecture).

  • When input a contiguous sequence of segments, a single sequence segment_offsets (of length num_segments + 1) can be aliased for both the d_begin_offsets and d_end_offsets parameters (where the latter is specified as segment_offsets + 1).

  • An optional bit subrange [begin_bit, end_bit) of differentiating key bits can be specified. This can reduce overall sorting overhead and yield a corresponding performance improvement.

  • Let cur = d_keys.Current() and alt = d_keys.Alternate(). The range [cur, cur + num_items) shall not overlap [alt, alt + num_items). Both ranges shall not overlap [d_begin_offsets, d_begin_offsets + num_segments) nor [d_end_offsets, d_end_offsets + num_segments) in any way.

  • Segments are not required to be contiguous. For all index values i outside the specified segments d_keys.Current()[i], d_keys[i].Alternate()[i] will not be accessed nor modified.

  • This operation requires a relatively small allocation of temporary device storage that is O(P), where P is the number of streaming multiprocessors on the device (and is typically a small constant relative to the input size N).

  • When d_temp_storage is nullptr, no work is done and the required allocation size is returned in temp_storage_bytes.

Snippet

The code snippet below illustrates the batched sorting of three segments (with one zero-length segment) of int keys.

#include <cub/cub.cuh>
// or equivalently <cub/device/device_segmented_radix_sort.cuh>

// Declare, allocate, and initialize device-accessible pointers
// for sorting data
int  num_items;          // e.g., 7
int  num_segments;       // e.g., 3
int  *d_offsets;         // e.g., [0, 3, 3, 7]
int  *d_key_buf;         // e.g., [8, 6, 7, 5, 3, 0, 9]
int  *d_key_alt_buf;     // e.g., [-, -, -, -, -, -, -]
...

// Create a DoubleBuffer to wrap the pair of device pointers
cub::DoubleBuffer<int> d_keys(d_key_buf, d_key_alt_buf);

// Determine temporary device storage requirements
void     *d_temp_storage = nullptr;
size_t   temp_storage_bytes = 0;
cub::DeviceSegmentedRadixSort::SortKeysDescending(
    d_temp_storage, temp_storage_bytes, d_keys,
    num_items, num_segments, d_offsets, d_offsets + 1);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run sorting operation
cub::DeviceSegmentedRadixSort::SortKeysDescending(
    d_temp_storage, temp_storage_bytes, d_keys,
    num_items, num_segments, d_offsets, d_offsets + 1);

// d_keys.Current()      <-- [8, 7, 6, 9, 5, 3, 0]

Template Parameters
  • KeyT[inferred] Key type

  • BeginOffsetIteratorT[inferred] Random-access input iterator type for reading segment beginning offsets (may be a simple pointer type)

  • EndOffsetIteratorT[inferred] Random-access input iterator type for reading segment ending offsets (may be a simple pointer type)

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • d_keys[inout] Reference to the double-buffer of keys whose “current” device-accessible buffer contains the unsorted input keys and, upon return, is updated to point to the sorted output keys

  • num_items[in] The total number of items within the segmented array, including items not covered by segments. num_items should match the largest element within the range [d_end_offsets, d_end_offsets + num_segments).

  • num_segments[in] The number of segments that comprise the sorting data

  • d_begin_offsets[in]

    Random-access input iterator to the sequence of beginning offsets of length num_segments, such that d_begin_offsets[i] is the first element of the ith data segment in d_keys_* and d_values_*

  • d_end_offsets[in]

    Random-access input iterator to the sequence of ending offsets of length num_segments, such that d_end_offsets[i] - 1 is the last element of the ith data segment in d_keys_* and d_values_*. If d_end_offsets[i] - 1 <= d_begin_offsets[i], the ith is considered empty.

  • begin_bit[in] [optional] The least-significant bit index (inclusive) needed for key comparison

  • end_bit[in] [optional] The most-significant bit index (exclusive) needed for key comparison (e.g., sizeof(unsigned int) * 8)

  • stream[in]

    [optional] CUDA stream to launch kernels within. Default is stream0.