CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers

#include <pitch_linear_thread_map.h>
Strip-mines a pitch-linear tile among a given number of threads, first along the contiguous dimension and then along the strided dimension, with each thread accessing a 2D thread tile.
The tile must be evenly divisible by the thread count, so that all threads execute the same number of iterations with the same delta and together cover the tile exhaustively.
This class satisfies the "RegularThreadMapping" concept.
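
The strip-mining scheme described above can be illustrated with a small host-side model. This is a simplified sketch, not the CUTLASS implementation: the names `Coord`, `StripminedMap`, `make_map`, `initial_offset`, and `covers_exactly_once` are hypothetical, all shapes are runtime values rather than template parameters, and the divisibility requirement is enforced with assertions. Threads are placed along the contiguous dimension first; any remaining threads are stacked along the strided dimension.

```cpp
#include <cassert>
#include <vector>

// Hypothetical coordinate type: (contiguous, strided), measured in elements.
struct Coord { int contiguous; int strided; };

// Hypothetical result of the mapping: per-thread iteration counts and the
// element delta between successive iterations of the same thread.
struct StripminedMap {
  Coord iterations;
  Coord delta;
  int tiles_per_row;  // number of thread tiles along the contiguous dimension
};

// Distribute a tile among `threads`, contiguous dimension first.
inline StripminedMap make_map(Coord tile, Coord thread_tile, int threads) {
  assert(tile.contiguous % thread_tile.contiguous == 0);
  assert(tile.strided % thread_tile.strided == 0);
  int cols = tile.contiguous / thread_tile.contiguous;  // thread tiles per row
  int rows = tile.strided / thread_tile.strided;        // rows of thread tiles

  StripminedMap m;
  m.tiles_per_row = cols;
  if (threads >= cols) {
    // One pass of threads spans a full row; leftover threads stack in rows.
    assert(threads % cols == 0);
    int thread_rows = threads / cols;
    assert(rows % thread_rows == 0);
    m.iterations = {1, rows / thread_rows};
    // The contiguous delta is never applied here (only one iteration).
    m.delta = {tile.contiguous, thread_rows * thread_tile.strided};
  } else {
    // Fewer threads than thread tiles in a row: iterate along both dimensions.
    assert(cols % threads == 0);
    m.iterations = {cols / threads, rows};
    m.delta = {threads * thread_tile.contiguous, thread_tile.strided};
  }
  return m;
}

// Element coordinate of the first thread tile owned by `thread_id`.
inline Coord initial_offset(const StripminedMap& m, Coord thread_tile,
                            int thread_id) {
  return {(thread_id % m.tiles_per_row) * thread_tile.contiguous,
          (thread_id / m.tiles_per_row) * thread_tile.strided};
}

// Replays every thread's accesses and checks that each element of the tile
// is touched exactly once.
inline bool covers_exactly_once(Coord tile, Coord thread_tile, int threads) {
  StripminedMap m = make_map(tile, thread_tile, threads);
  std::vector<int> hits(tile.contiguous * tile.strided, 0);
  for (int t = 0; t < threads; ++t) {
    Coord base = initial_offset(m, thread_tile, t);
    for (int is = 0; is < m.iterations.strided; ++is)
      for (int ic = 0; ic < m.iterations.contiguous; ++ic)
        for (int r = 0; r < thread_tile.strided; ++r)
          for (int c = 0; c < thread_tile.contiguous; ++c) {
            int x = base.contiguous + ic * m.delta.contiguous + c;
            int y = base.strided + is * m.delta.strided + r;
            ++hits[y * tile.contiguous + x];
          }
  }
  for (int h : hits)
    if (h != 1) return false;
  return true;
}
```

For example, under this model a 64x8 tile with a 4x2 thread tile and 32 threads places 16 threads along the contiguous dimension and 2 along the strided dimension, so each thread performs 2 strided iterations with a strided delta of 4 elements.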