cub::DeviceCopy

Defined in cub/device/device_copy.cuh

struct DeviceCopy

cub::DeviceCopy provides device-wide, parallel operations for copying data.

Public Static Functions

template<typename InputIt, typename OutputIt, typename SizeIteratorT>
static inline cudaError_t Batched(void *d_temp_storage, size_t &temp_storage_bytes, InputIt input_it, OutputIt output_it, SizeIteratorT sizes, uint32_t num_ranges, cudaStream_t stream = 0)

Copies data from a batch of given source ranges to their corresponding destination ranges.

Snippet

The code snippet below illustrates the usage of DeviceCopy::Batched to perform a device-wide run-length decode operation.

struct GetIteratorToRange
{
  __host__ __device__ __forceinline__ auto operator()(uint32_t index)
  {
    return thrust::make_constant_iterator(d_data_in[index]);
  }
  int32_t *d_data_in;
};

struct GetPtrToRange
{
  __host__ __device__ __forceinline__ auto operator()(uint32_t index)
  {
    return d_data_out + d_offsets[index];
  }
  int32_t *d_data_out;
  uint32_t *d_offsets;
};

struct GetRunLength
{
  __host__ __device__ __forceinline__ uint32_t operator()(uint32_t index)
  {
    return d_offsets[index + 1] - d_offsets[index];
  }
  uint32_t *d_offsets;
};

uint32_t num_ranges = 5;
int32_t *d_data_in;           // e.g., [4, 2, 7, 3, 1]
int32_t *d_data_out;          // e.g., [0,                ...               ]
uint32_t *d_offsets;          // e.g., [0, 2, 5, 6, 9, 14]

// Returns a constant iterator to the element of the i-th run
thrust::counting_iterator<uint32_t> iota(0);
auto iterators_in = thrust::make_transform_iterator(iota, GetIteratorToRange{d_data_in});

// Returns the run length of the i-th run
auto sizes = thrust::make_transform_iterator(iota, GetRunLength{d_offsets});

// Returns pointers to the output range for each run
auto ptrs_out = thrust::make_transform_iterator(iota, GetPtrToRange{d_data_out, d_offsets});

// Determine temporary device storage requirements
void *d_temp_storage      = nullptr;
size_t temp_storage_bytes = 0;
cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes, iterators_in, ptrs_out, sizes, num_ranges);

// Allocate temporary storage
cudaMalloc(&d_temp_storage, temp_storage_bytes);

// Run batched copy algorithm (used to perform run-length decoding)
cub::DeviceCopy::Batched(d_temp_storage, temp_storage_bytes, iterators_in, ptrs_out, sizes, num_ranges);

// d_data_out       <-- [4, 4, 2, 2, 2, 7, 3, 3, 3, 1, 1, 1, 1, 1]

Note

If any input range aliases any output range the behavior is undefined. If any output range aliases another output range the behavior is undefined. Input ranges can alias one another.

Template Parameters
  • InputIt[inferred] Device-accessible random-access input iterator type providing the iterators to the source ranges

  • OutputIt[inferred] Device-accessible random-access input iterator type providing the iterators to the destination ranges

  • SizeIteratorT[inferred] Device-accessible random-access input iterator type providing the number of items to be copied for each pair of ranges

Parameters
  • d_temp_storage[in] Device-accessible allocation of temporary storage. When nullptr, the required allocation size is written to temp_storage_bytes and no work is done.

  • temp_storage_bytes[inout] Reference to size in bytes of d_temp_storage allocation

  • input_it[in] Device-accessible iterator providing the iterators to the source ranges

  • output_it[in] Device-accessible iterator providing the iterators to the destination ranges

  • sizes[in] Device-accessible iterator providing the number of elements to be copied for each pair of ranges

  • num_ranges[in] The total number of range pairs

  • stream[in] [optional] CUDA stream to launch kernels within. Default is stream 0 (the null stream).