cub::BlockAdjacentDifference#

template<typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1> class BlockAdjacentDifference#

BlockAdjacentDifference provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block.

Overview#

BlockAdjacentDifference calculates the differences of adjacent elements in the elements partitioned across a CUDA thread block. Because the binary operation could be noncommutative, there are two sets of methods. Methods named SubtractLeft subtract left element i - 1 of input sequence from current element i. Methods named SubtractRight subtract the right element i + 1 from the current one i:

int values[4]; // [1, 2, 3, 4]
//...
int subtract_left_result[4];  <-- [  1,  1,  1,  1 ]
int subtract_right_result[4]; <-- [ -1, -1, -1,  4 ]

For SubtractLeft, if the left element is out of bounds, the input value is assigned to output[0] without modification.
For SubtractRight, if the right element is out of bounds, the input value is assigned to the current output value without modification.
The block/example_block_reduce_dyn_smem.cu example under the examples/block folder illustrates usage of dynamically shared memory with BlockReduce and how to re-purpose the same memory region. This example can be easily adapted to the storage required by BlockAdjacentDifference.

A Simple Example#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    int result[4];

    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        result,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Collective constructors

inline BlockAdjacentDifference()#: Collective constructor using a private static allocation of shared memory as temporary storage.

inline BlockAdjacentDifference(TempStorage &temp_storage)#

Collective constructor using the specified memory allocation as temporary storage.

Parameters:: temp_storage – [in] Reference to memory allocation having layout type TempStorage

Read left operations

end member group

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT> inline void SubtractLeft( T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, )#

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block
    // of 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        thread_data,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT> inline void SubtractLeft( T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_predecessor_item, )#

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The last item in the previous tile:
    int tile_predecessor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data,
        thread_data,
        CustomDifference(),
        tile_predecessor_item);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. and that tile_predecessor_item is 3. The corresponding output result in those threads will be { [1,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
tile_predecessor_item – [in]
thread₀ only item which is going to be subtracted from the first tile item (input₀ from thread₀).

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT> inline void SubtractLeftPartialTile( T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items, )#

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
  // Specialize BlockAdjacentDifference for a 1D block of
  // 128 threads of type int
  using BlockAdjacentDifferenceT =
     cub::BlockAdjacentDifference<int, 128>;

  // Allocate shared memory for BlockAdjacentDifference
  __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

  // Obtain a segment of consecutive items that are blocked across threads
  int thread_data[4];
  ...
  int valid_items = 9;

  // Collectively compute adjacent_difference
  BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
      thread_data,
      thread_data,
      CustomDifference(),
      valid_items);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block

template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT> inline void SubtractLeftPartialTile( T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items, T tile_predecessor_item, )#

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
  // Specialize BlockAdjacentDifference for a 1D block of
  // 128 threads of type int
  using BlockAdjacentDifferenceT =
     cub::BlockAdjacentDifference<int, 128>;

  // Allocate shared memory for BlockAdjacentDifference
  __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

  // Obtain a segment of consecutive items that are blocked across threads
  int thread_data[4];
  ...
  int valid_items = 9;
  int tile_predecessor_item = 4;

  // Collectively compute adjacent_difference
  BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
      thread_data,
      thread_data,
      CustomDifference(),
      valid_items,
      tile_predecessor_item);

Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [0,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block
tile_predecessor_item – [in]
thread₀ only item which is going to be subtracted from the first tile item (input₀ from thread₀).

Read right operations

end member group

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data,
        thread_data,
        CustomDifference());

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,4] }.

param output:: [out] Calling thread’s adjacent difference result
param input:: [in] Calling thread’s input items (may be aliased to output)
param difference_op:: [in] Binary difference operator

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT> inline void SubtractRight( T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, )#

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT> inline void SubtractRight( T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_successor_item, )#

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The first item in the next tile:
    int tile_successor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data,
        thread_data,
        CustomDifference(),
        tile_successor_item);

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }, and that tile_successor_item is 3. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,1] }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
tile_successor_item – [in]
thread_{BLOCK_THREADS} only item which is going to be subtracted from the last tile item (input_{ITEMS_PER_THREAD} from thread_{BLOCK_THREADS}).

template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT> inline void SubtractRightPartialTile( T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items, )#

Subtracts the right element of each adjacent pair in range of elements partitioned across a CUDA thread block.

For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.

Snippet#

The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.

#include <cub/cub.cuh>
// or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
  template <typename DataType>
  __host__ DataType operator()(DataType &lhs, DataType &rhs)
  {
    return lhs - rhs;
  }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT =
       cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRightPartialTile(
        thread_data,
        thread_data,
        CustomDifference(),
        valid_items);

Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }. and that valid_items is 507. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,3,3], [3,4,1,4] }.

Parameters:

output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block

struct TempStorage : public Uninitialized<_TempStorage>#: The operations exposed by BlockAdjacentDifference require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.