cub::BlockAdjacentDifference
Defined in cub/block/block_adjacent_difference.cuh
-
template<typename T, int BLOCK_DIM_X, int BLOCK_DIM_Y = 1, int BLOCK_DIM_Z = 1, int LEGACY_PTX_ARCH = 0>
class BlockAdjacentDifference

BlockAdjacentDifference provides collective methods for computing the differences of adjacent elements partitioned across a CUDA thread block.
Overview
BlockAdjacentDifference computes the differences of adjacent elements partitioned across a CUDA thread block. Because the supplied binary operation may be noncommutative, there are two sets of methods. Methods named SubtractLeft subtract the left element (i - 1) of the input sequence from the current element (i). Methods named SubtractRight subtract the right element (i + 1) from the current element (i):

int values[4];                // [  1,  2,  3, 4 ]
// ...
int subtract_left_result[4];  // <-- [  1,  1,  1, 1 ]
int subtract_right_result[4]; // <-- [ -1, -1, -1, 4 ]
For SubtractLeft, if the left element is out of bounds, the input value is assigned to output[0] without modification.
For SubtractRight, if the right element is out of bounds, the input value is assigned to the corresponding output element without modification.
The block/example_block_reduce_dyn_smem.cu example under the examples/block folder illustrates the use of dynamically allocated shared memory with BlockReduce and how to re-purpose the same memory region. That example can be adapted to the storage required by BlockAdjacentDifference, as sketched below.
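A rough illustration of that adaptation follows: it aliases a dynamically allocated shared-memory region to the collective’s TempStorage. The kernel name, block size, item count, and difference functor are illustrative choices only, not part of the CUB API.

#include <cub/block/block_adjacent_difference.cuh>

// Illustrative difference functor (any device-callable binary functor works)
struct SubOp
{
    __device__ int operator()(int lhs, int rhs) const { return lhs - rhs; }
};

// Hypothetical kernel: the dynamic shared-memory region supplied at launch
// is re-interpreted as the collective's TempStorage.
__global__ void DynSmemExampleKernel()
{
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Dynamically allocated shared memory, sized by the launch configuration
    extern __shared__ char smem[];

    // Alias the raw bytes to the collective's temporary storage
    auto &temp_storage =
        *reinterpret_cast<typename BlockAdjacentDifferenceT::TempStorage *>(smem);

    int thread_data[4] = {1, 2, 3, 4};

    // Collectively compute the left differences in place
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, thread_data, SubOp());
}

// Host-side launch requests enough dynamic shared memory for TempStorage:
//   DynSmemExampleKernel<<<grid_size, 128,
//       sizeof(typename cub::BlockAdjacentDifference<int, 128>::TempStorage)>>>();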
A Simple Example
The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    int result[4];

    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, result, CustomDifference());
}
Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.
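The snippet above elides how thread_data is obtained. One common pattern, shown here only as a sketch (the kernel and pointer names are placeholders), is to load a blocked arrangement with cub::BlockLoad, compute the differences in place, and write the results back with cub::BlockStore:

#include <cub/cub.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

// Hypothetical kernel: one full tile of 128 threads x 4 items per thread
__global__ void BlockAdjDiffKernel(const int *d_in, int *d_out)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;

    using BlockLoadT  = cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    using BlockStoreT = cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD>;
    using BlockAdjacentDifferenceT =
        cub::BlockAdjacentDifference<int, BLOCK_THREADS>;

    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    const int block_offset = blockIdx.x * BLOCK_THREADS * ITEMS_PER_THREAD;

    // Load a blocked arrangement: thread i owns items [4*i, 4*i + 3] of the tile
    int thread_data[ITEMS_PER_THREAD];
    BlockLoadT().Load(d_in + block_offset, thread_data);

    // Collectively compute the left differences in place
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, thread_data, CustomDifference());

    // Write the results back out
    BlockStoreT().Store(d_out + block_offset, thread_data);
}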
Collective constructors
-
inline BlockAdjacentDifference()
Collective constructor using a private static allocation of shared memory as temporary storage.
-
inline BlockAdjacentDifference(TempStorage &temp_storage)
Collective constructor using the specified memory allocation as temporary storage.
- Parameters
temp_storage – [in] Reference to memory allocation having layout type TempStorage
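A brief sketch of the two construction modes follows; the 128-thread, int specialization and the kernel name are arbitrary choices for illustration.

#include <cub/block/block_adjacent_difference.cuh>

__global__ void ConstructorExampleKernel()
{
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // (1) Default constructor: the collective uses a private static
    //     shared-memory allocation as its temporary storage.
    BlockAdjacentDifferenceT adj_diff_private;

    // (2) Storage-passing constructor: the caller owns the allocation,
    //     which can then be reused or union'd with other collectives' storage.
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;
    BlockAdjacentDifferenceT adj_diff(temp_storage);

    (void)adj_diff_private;
    (void)adj_diff;
}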
Read left operations
-
template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeft(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block
    // of 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, thread_data, CustomDifference());
}
Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
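The barrier requirement noted above can be made concrete with a short sketch; the kernel name, block size, and the choice of second call are illustrative only.

#include <cub/cub.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ReuseStorageKernel()
{
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    int thread_data[4] = {1, 2, 3, 4};

    // First collective call using temp_storage
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, thread_data, CustomDifference());

    // Barrier before the same storage backs another collective call
    __syncthreads();

    // Second collective call reusing the same temp_storage
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data, thread_data, CustomDifference());
}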
-
template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractLeft(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_predecessor_item)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The last item in the previous tile:
    int tile_predecessor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeft(
        thread_data, thread_data, CustomDifference(), tile_predecessor_item);
}
Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }, and that tile_predecessor_item is 3. The corresponding output result in those threads will be { [1,-2,-1,0], [0,0,0,0], [1,1,0,0], [0,1,-3,3], ... }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
tile_predecessor_item – [in] thread0-only item that is subtracted from the first tile item (input0 from thread0)
-
template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeftPartialTile(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    int valid_items = 9;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
        thread_data, thread_data, CustomDifference(), valid_items);
}
Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [4,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block
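In practice, valid_items is often derived from the overall problem size when the final tile is only partially full. The sketch below shows one such derivation; num_items, the kernel name, and the device pointers are placeholders, not part of the CUB API.

#include <cub/cub.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

// Hypothetical kernel handling a final, partially full tile
__global__ void PartialTileKernel(const int *d_in, int *d_out, int num_items)
{
    constexpr int BLOCK_THREADS    = 128;
    constexpr int ITEMS_PER_THREAD = 4;
    constexpr int TILE_ITEMS       = BLOCK_THREADS * ITEMS_PER_THREAD;

    using BlockAdjacentDifferenceT =
        cub::BlockAdjacentDifference<int, BLOCK_THREADS>;

    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    const int tile_base   = blockIdx.x * TILE_ITEMS;
    const int valid_items = min(TILE_ITEMS, num_items - tile_base);

    // Guarded load of a blocked arrangement (out-of-range items default to 0)
    int thread_data[ITEMS_PER_THREAD];
    cub::BlockLoad<int, BLOCK_THREADS, ITEMS_PER_THREAD>().Load(
        d_in + tile_base, thread_data, valid_items, 0);

    // Only the first valid_items elements of the tile are differenced;
    // the remaining items are copied through unmodified.
    BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
        thread_data, thread_data, CustomDifference(), valid_items);

    // Guarded store of the results
    cub::BlockStore<int, BLOCK_THREADS, ITEMS_PER_THREAD>().Store(
        d_out + tile_base, thread_data, valid_items);
}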
-
template<int ITEMS_PER_THREAD, typename OutputType, typename DifferenceOpT>
inline void SubtractLeftPartialTile(T (&input)[ITEMS_PER_THREAD], OutputType (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items, T tile_predecessor_item)

Subtracts the left element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the left difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    int valid_items = 9;
    int tile_predecessor_item = 4;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractLeftPartialTile(
        thread_data, thread_data, CustomDifference(), valid_items, tile_predecessor_item);
}
Suppose the set of input thread_data across the block of threads is { [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4], ... }. The corresponding output result in those threads will be { [0,-2,-1,0], [0,0,0,0], [1,3,3,3], [3,4,1,4], ... }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block
tile_predecessor_item – [in] thread0-only item that is subtracted from the first tile item (input0 from thread0)
Read right operations
-
template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRight(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op)

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data, thread_data, CustomDifference());
}
Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,4] }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
-
template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRight(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, T tile_successor_item)

Subtracts the right element of each adjacent pair of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    // The first item in the next tile:
    int tile_successor_item = ...;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRight(
        thread_data, thread_data, CustomDifference(), tile_successor_item);
}
Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }, and that tile_successor_item is 3. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,0,0], [-1,3,-3,1] }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
tile_successor_item – [in] Item provided by the last thread only; it is subtracted from the last tile item (the last input of the last thread)
-
template<int ITEMS_PER_THREAD, typename OutputT, typename DifferenceOpT>
inline void SubtractRightPartialTile(T (&input)[ITEMS_PER_THREAD], OutputT (&output)[ITEMS_PER_THREAD], DifferenceOpT difference_op, int valid_items)

Subtracts the right element of each adjacent pair in a range of elements partitioned across a CUDA thread block.
For multi-dimensional blocks, threads are linearly ranked in row-major order.
A subsequent __syncthreads() threadblock barrier should be invoked after calling this method if the collective’s temporary storage (e.g., temp_storage) is to be reused or repurposed.
Snippet
The code snippet below illustrates how to use BlockAdjacentDifference to compute the right difference between adjacent elements.
#include <cub/cub.cuh>   // or equivalently <cub/block/block_adjacent_difference.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void ExampleKernel(...)
{
    // Specialize BlockAdjacentDifference for a 1D block of
    // 128 threads of type int
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;

    // Allocate shared memory for BlockAdjacentDifference
    __shared__ typename BlockAdjacentDifferenceT::TempStorage temp_storage;

    // Obtain a segment of consecutive items that are blocked across threads
    int thread_data[4];
    ...

    int valid_items = 507;

    // Collectively compute adjacent_difference
    BlockAdjacentDifferenceT(temp_storage).SubtractRightPartialTile(
        thread_data, thread_data, CustomDifference(), valid_items);
}
Suppose the set of input thread_data across the block of threads is { ...3], [4,2,1,1], [1,1,1,1], [2,3,3,3], [3,4,1,4] }, and that valid_items is 507. The corresponding output result in those threads will be { ...-1, [2,1,0,0], [0,0,0,-1], [-1,0,3,3], [3,4,1,4] }.
- Parameters
output – [out] Calling thread’s adjacent difference result
input – [in] Calling thread’s input items (may be aliased to output)
difference_op – [in] Binary difference operator
valid_items – [in] Number of valid items in thread block
-
struct TempStorage : public Uninitialized<_TempStorage>
The operations exposed by BlockAdjacentDifference require a temporary memory allocation of this nested type for thread communication. This opaque storage can be allocated directly using the __shared__ keyword. Alternatively, it can be aliased to externally allocated memory (shared or global) or union’d with other storage allocation types to facilitate memory reuse.
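A minimal sketch of the union-based reuse mentioned above, pairing BlockAdjacentDifference with BlockReduce; the kernel name and the choice of second collective are illustrative only.

#include <cub/cub.cuh>

struct CustomDifference
{
    template <typename DataType>
    __device__ DataType operator()(DataType &lhs, DataType &rhs)
    {
        return lhs - rhs;
    }
};

__global__ void UnionStorageKernel()
{
    using BlockAdjacentDifferenceT = cub::BlockAdjacentDifference<int, 128>;
    using BlockReduceT             = cub::BlockReduce<int, 128>;

    // The same shared-memory bytes back whichever collective is in use
    __shared__ union
    {
        typename BlockAdjacentDifferenceT::TempStorage adj_diff;
        typename BlockReduceT::TempStorage             reduce;
    } temp_storage;

    int thread_data[4] = {1, 2, 3, 4};

    // Collectively compute the left differences in place
    BlockAdjacentDifferenceT(temp_storage.adj_diff).SubtractLeft(
        thread_data, thread_data, CustomDifference());

    // Barrier before repurposing the storage for the reduction
    __syncthreads();

    // Reuse the same bytes for a block-wide reduction
    int block_sum = BlockReduceT(temp_storage.reduce).Sum(thread_data[0]);
    (void)block_sum;
}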