CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers

#include <predicated_tile_iterator_2dthreadtile.h>
PredicatedTileIterator2dThreadTile
Satisfies: ForwardTileIteratorConcept  ReadableContiguousTileIteratorConcept  WriteableContiguousTileIteratorConcept  MaskedTileIteratorConcept
Regular tile iterator using a precomputed control structure to minimize register liveness and integer arithmetic.
Layout is assumed to be invariant at the time the precomputed "Params" object is constructed.
Base pointer and tensor extents may be specified at the time the iterator is constructed. Subsequently, they are assumed to be immutable.
A logical coordinate offset may be added at the time the iterator is constructed. Subsequent additions to the logical coordinate offset are possible but relatively expensive.
Visitation order is intended to first visit a "residual" tile that may be partially full in both the advance dimension and the steady-state dimension. This is assumed to be the last tile in the iteration sequence. Advancing an iterator that has just been constructed moves to the first tile that is full in the advance dimension and recomputes predicates. Subsequent accesses may be performed without updating internal predicates and are efficient in terms of live register state and pointer arithmetic instructions.
To be efficient, this assumes the iterator will be dereferenced and advanced at least once outside any looping structure to minimize integer arithmetic.
Accesses out of bounds are safe so long as clear_mask() is called prior to dereferencing the iterator.
Example:
An efficient pipeline structure may be constructed as follows: