CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
Epilogue for threadblock scoped GEMMs.
Classes

class   cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >
struct  cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::Params
        Parameters structure.
struct  cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::SharedStorage
        Shared storage allocation needed by the epilogue.

Namespaces

cutlass
cutlass::epilogue
Detailed Description

This epilogue does not attempt to target any particular output layout. Instead, each threadblock streams out its accumulator elements using 128b store operations. This assumes all threadblocks have unique output tiles.
The target data layout simply gives each threadblock its own contiguous region of the workspace for its accumulator tile. This enables very fast streaming of data, completely limited by the memory system. No predication or data exchange is performed, and each threadblock is assumed to have a full region of memory to write to.
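As a rough illustration of this store pattern, here is a minimal CUDA sketch under assumed fragment and tile sizes; it is not the internals of cutlass::epilogue::EpilogueWorkspace. Each thread writes its register-resident accumulator fragment to a disjoint slice of its threadblock's workspace region using 128-bit stores, so no predication or inter-thread exchange is needed:

```cpp
#include <cuda_runtime.h>

__global__ void stream_accumulators(float *workspace) {
  // Assumed sizes for illustration only.
  constexpr int kFragmentElements = 8;  // accumulator elements held per thread
  constexpr int kAccessWidth      = 4;  // four floats form one 128b access

  // Stand-in for FragmentC_: accumulators kept in registers.
  float frag[kFragmentElements];
  #pragma unroll
  for (int i = 0; i < kFragmentElements; ++i) {
    frag[i] = static_cast<float>(threadIdx.x);  // dummy accumulator values
  }

  // Each threadblock owns a disjoint, contiguous region of the workspace,
  // and each thread owns a disjoint slice of that region.
  int tile_elements = blockDim.x * kFragmentElements;
  float *dst = workspace
             + blockIdx.x * tile_elements
             + threadIdx.x * kFragmentElements;

  // Stream the fragment out with 128b (float4) stores; `workspace` is
  // assumed to be 16-byte aligned (e.g. returned by cudaMalloc).
  #pragma unroll
  for (int i = 0; i < kFragmentElements / kAccessWidth; ++i) {
    reinterpret_cast<float4 *>(dst)[i] = make_float4(
        frag[i * kAccessWidth + 0], frag[i * kAccessWidth + 1],
        frag[i * kAccessWidth + 2], frag[i * kAccessWidth + 3]);
  }
}
```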
This epilogue establishes an upper bound on epilogue performance and is suitable for reductions across the GEMM K dimension, which require a separate workspace.
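For example, in a split-K style reduction, each K partition streams its partial accumulators into its own slice of the workspace, and a separate kernel then sums the partitions into the final output. The sketch below shows such a follow-up reduction with hypothetical names and an assumed contiguous per-partition layout; it is not part of the CUTLASS API:

```cpp
#include <cuda_runtime.h>

// Sum `partitions` slices of `elements_per_partition` partial accumulators
// element-wise into the final output.
__global__ void reduce_splitk_workspace(float const *workspace, float *output,
                                        int partitions,
                                        int elements_per_partition) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < elements_per_partition) {
    float sum = 0.0f;
    for (int p = 0; p < partitions; ++p) {
      sum += workspace[p * elements_per_partition + idx];
    }
    output[idx] = sum;
  }
}
```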