cub::BlockAdjacentDifference

Defined in /home/runner/work/cccl/cccl/cub/cub/block/block_adjacent_difference.cuh

template<typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockAdjacentDifference

BlockAdjacentDifference provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block.

Overview

BlockAdjacentDifference calculates the differences of adjacent elements in the elements partitioned across a CUDA thread block. Because the binary operation could be noncommutative, there are two sets of methods. Methods named SubtractLeft subtract left element i - 1 of input sequence from current element i. Methods named SubtractRight subtract the right element i + 1 from the current one i:

int values[4]; // [1, 2, 3, 4]
//...
int subtract_left_result[4];  <-- [  1,  1,  1,  1 ]
int subtract_right_result[4]; <-- [ -1, -1, -1,  4 ]
  • For SubtractLeft, if the left element is out of bounds, the input value is assigned to output[0] without modification.

  • For SubtractRight, if the right element is out of bounds, the input value is assigned to the current output value without modification.

  • The block/example_block_reduce_dyn_smem.cu example under the examples/block folder illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region. This example can be easily adapted to the storage required by BlockAdjacentDifference.

A Simple Example

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    int result[4];

    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        result,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Collective constructors

inline BlockAdjacentDifference()

Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockAdjacentDifference(TempStorage &temp_storage)

Collective constructor using the specified memory allocation as temporary storage.

Parameters

temp_storage[in] Reference to memory allocation having layout type TempStorage

Read left operations

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeft(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block
    // of 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        thread_data,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractLeft(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_predecessor_item)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The last item in the previous tile:
    int tile_predecessor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        thread_data,
        CustomDifference(),
        tile_predecessor_item);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. and that tile_predecessor_item is 3. The corresponding output result in those threads will be { [1,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

  • tile_predecessor_item[in]

    thread0 only item which is going to be subtracted from the first tile item (input0 from thread0).

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeftPartialTile(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
  // Specialize BlockAdjacentDifference for a 1D block of
  // 128 threads of type int
  using BlockAdjacentDifferenceT =
     cub::BlockAdjacentDifference<int, 128>;

  // Allocate shared memory for BlockAdjacentDifference
  __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

  // Obtain a segment of consecutive items that are blocked across threads
  int thread_data[4];
  ...
  int valid_items = 9;

  // Collectively compute adjacent_difference
  BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
      thread_data,
      thread_data,
      CustomDifference(),
      valid_items);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

  • valid_items[in] Number of valid items in thread block

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeftPartialTile(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items, T tile_predecessor_item)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
  // Specialize BlockAdjacentDifference for a 1D block of
  // 128 threads of type int
  using BlockAdjacentDifferenceT =
     cub::BlockAdjacentDifference<int, 128>;

  // Allocate shared memory for BlockAdjacentDifference
  __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

  // Obtain a segment of consecutive items that are blocked across threads
  int thread_data[4];
  ...
  int valid_items = 9;
  int tile_predecessor_item = 4;

  // Collectively compute adjacent_difference
  BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
      thread_data,
      thread_data,
      CustomDifference(),
      valid_items,
      tile_predecessor_item);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [0,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

  • valid_items[in] Number of valid items in thread block

  • tile_predecessor_item[in]

    thread0 only item which is going to be subtracted from the first tile item (input0 from thread0).

Read right operations

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data,
        thread_data,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,4] }.

param output

[out] Calling thread’s adjacent difference result

param input

[in] Calling thread’s input items (may be aliased to output)

param difference_op

[in] Binary difference operator

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRight(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op)
template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRight(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_successor_item)

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The first item in the next tile:
    int tile_successor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data,
        thread_data,
        CustomDifference(),
        tile_successor_item);

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }, and that tile_successor_item is 3. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,1] }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

  • tile_successor_item[in]

    threadBLOCK_THREADS only item which is going to be subtracted from the last tile item (inputITEMS_PER_THREAD from threadBLOCK_THREADS).

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRightPartialTile(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items)

Subtracts the right element of each adjacent pair in range of elements partitioned across a CUDA thread block.

  • For multi-dimensional blocks, threads are linearly ranked in row-major order.

  • A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRightPartialTile(
        thread_data,
        thread_data,
        CustomDifference(),
        valid_items);

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }. and that valid_items is 507. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,3,3], [3,4,1,4] }.

Parameters
  • output[out] Calling thread’s adjacent difference result

  • input[in] Calling thread’s input items (may be aliased to output)

  • difference_op[in] Binary difference operator

  • valid_items[in] Number of valid items in thread block

struct TempStorage : public Uninitialized<_TempStorage>

The operations exposed by BlockAdjacentDifference require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.