CUDA Templates for Linear Algebra Subroutines and Solvers
epilogue_workspace.h File Reference

Epilogue for threadblock-scoped GEMMs.

#include "cutlass/cutlass.h"
#include "cutlass/numeric_types.h"
#include "cutlass/array.h"

Classes

class cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >
struct cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::Params
 Parameters structure.
struct cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::SharedStorage
 Shared storage allocation needed by the epilogue.



Detailed Description

This epilogue does not attempt to target any particular output layout. Instead, each threadblock streams out its accumulator elements using 128-bit store operations. It assumes that every threadblock has a unique output tile.

The target data layout is:

This enables very fast streaming of data, limited only by the memory system. No predication or data exchange is performed, and each threadblock is assumed to have a full region of memory to write to.
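The addressing idea behind this can be sketched in plain host-side C++. This is an illustration only, not the CUTLASS implementation: `GridShape`, `linear_block_index`, and `block_offset` are hypothetical names, and the ordering of threadblock indices (with `m` changing fastest) is an assumption made for the example. It shows how many elements fit in one 128-bit access and how giving each threadblock its own contiguous region removes the need for predication.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helpers for illustration; none of these names are CUTLASS API.

// Width of one store operation, as stated in the description above.
constexpr std::size_t kAccessBits = 128;

// Number of elements of type T carried by a single 128-bit access.
template <typename T>
constexpr std::size_t elements_per_access() {
  return kAccessBits / (8 * sizeof(T));
}

static_assert(elements_per_access<float>() == 4, "four 32-bit floats per access");

// Extent of the threadblock grid along the GEMM M, N, and K dimensions.
struct GridShape {
  int m;
  int n;
  int k;
};

// Assumed ordering: m changes fastest, then n, then k.
inline std::size_t linear_block_index(GridShape grid, int m, int n, int k) {
  return static_cast<std::size_t>(m) +
         static_cast<std::size_t>(grid.m) *
             (n + static_cast<std::size_t>(grid.n) * k);
}

// Index of the first element of the region owned by threadblock (m, n, k).
// Every threadblock owns exactly tile_elements contiguous elements, so
// regions never overlap and no bounds checks are needed inside a region.
inline std::size_t block_offset(GridShape grid, int m, int n, int k,
                                std::size_t tile_elements) {
  return linear_block_index(grid, m, n, k) * tile_elements;
}
```

Under these assumptions, the workspace must hold `grid.m * grid.n * grid.k * tile_elements` elements, and `tile_elements` should be a multiple of `elements_per_access<T>()` so that every store is a full 128-bit access.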

This epilogue establishes an upper bound for epilogue performance and is suitable for reductions across the GEMM K dimension that require a separate workspace.
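As a rough illustration of how such a workspace might be consumed, the host-side C++ sketch below sums partial tiles produced by different slices of the K dimension. The function `reduce_k_slices` and the back-to-back storage of slices are assumptions made for this example; the page does not describe the actual reduction step.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of CUTLASS.
// Assumes the workspace stores k_slices partial tiles back to back, each
// holding tile_elements accumulator values for the same output tile.
inline std::vector<float> reduce_k_slices(const std::vector<float>& workspace,
                                          std::size_t tile_elements,
                                          std::size_t k_slices) {
  if (workspace.size() < tile_elements * k_slices) {
    throw std::invalid_argument("workspace is smaller than k_slices tiles");
  }

  std::vector<float> result(tile_elements, 0.0f);
  for (std::size_t k = 0; k < k_slices; ++k) {
    // Start of the partial tile written for K slice k.
    const float* slice = workspace.data() + k * tile_elements;
    for (std::size_t i = 0; i < tile_elements; ++i) {
      result[i] += slice[i];
    }
  }
  return result;
}
```

Because each slice writes to its own region, the partial results can be produced without synchronization between threadblocks, and the summation can run as a separate pass.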