cub::DeviceCopy
Defined in cub/device/device_copy.cuh
-
struct DeviceCopy
cub::DeviceCopy provides device-wide, parallel operations for copying data.
Public Static Functions
-
template<typename InputIt, typename OutputIt, typename SizeIteratorT>
static inline cudaError_t Batched(void *d_temp_storage, size_t &temp_storage_bytes, InputIt input_it, OutputIt output_it, SizeIteratorT sizes, uint32_t num_ranges, cudaStream_t stream = 0) Copies data from a batch of given source ranges to their corresponding destination ranges.
Snippet
The code snippet below illustrates usage of DeviceCopy::Batched to perform a DeviceRunLength Decode operation.
struct GetIteratorToRange { __host__ __device__ __forceinline__ auto operator()(uint32_t index) { return thrust::make_constant_iterator(d_data_in[index]); } int32_t *d_data_in; }; struct GetPtrToRange { __host__ __device__ __forceinline__ auto operator()(uint32_t index) { return d_data_out + d_offsets[index]; } int32_t *d_data_out; uint32_t *d_offsets; }; struct GetRunLength { __host__ __device__ __forceinline__ uint32_t operator()(uint32_t index) { return d_offsets[index + 1] - d_offsets[index]; } uint32_t *d_offsets; }; uint32_t num_ranges = 5; int32_t *d_data_in; // e.g., [4, 2, 7, 3, 1] int32_t *d_data_out; // e.g., [0, ... ] uint32_t *d_offsets; // e.g., [0, 2, 5, 6, 9, 14] // Returns a constant iterator to the element of the i-th run thrust::counting_iterator<uint32_t> iota(0); auto iterators_in = thrust::make_transform_iterator(iota, GetIteratorToRange{d_data_in}); // Returns the run length of the i-th run auto sizes = thrust::make_transform_iterator(iota, GetRunLength{d_offsets}); // Returns pointers to the output range for each run auto ptrs_out = thrust::make_transform_iterator(iota, GetPtrToRange{d_data_out, d_offsets}); // Determine temporary device storage requirements void *d_temp_storage = nullptr; size_t temp_storage_bytes = 0; cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes, iterators_in, ptrs_out, sizes, num_ranges); // Allocate temporary storage cudaMalloc(&d_temp_storage, temp_storage_bytes); // Run batched copy algorithm (used to perform runlength decoding) cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes, iterators_in, ptrs_out, sizes, num_ranges); // d_data_out <-- [4, 4, 2, 2, 2, 7, 3, 3, 3, 1, 1, 1, 1, 1]
Note
If any input range aliases any output range the behavior is undefined. If any output range aliases another output range the behavior is undefined. Input ranges can alias one another.
- Template Parameters
InputIt – [inferred] Device-accessible random-access input iterator type providing the iterators to the source ranges
OutputIt – [inferred] Device-accessible random-access input iterator type providing the iterators to the destination ranges
SizeIteratorT – [inferred] Device-accessible random-access input iterator type providing the number of items to be copied for each pair of ranges
- Parameters
d_temp_storage – [in] Device-accessible allocation of temporary storage. When
nullptr
, the required allocation size is written totemp_storage_bytes
and no work is done.temp_storage_bytes – [inout] Reference to size in bytes of
d_temp_storage
allocationinput_it – [in] Device-accessible iterator providing the iterators to the source ranges
output_it – [in] Device-accessible iterator providing the iterators to the destination ranges
sizes – [in] Device-accessible iterator providing the number of elements to be copied for each pair of ranges
num_ranges – [in] The total number of range pairs
stream – [in]
[optional] CUDA stream to launch kernels within. Default is stream0.
-
template<typename InputIt, typename OutputIt, typename SizeIteratorT>