CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
Epilogue for threadblock scoped GEMMs.
Classes

class   cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >
struct  cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::Params
        Parameters structure.
struct  cutlass::epilogue::EpilogueWorkspace< Shape_, WarpCount, FragmentC_ >::SharedStorage
        Shared storage allocation needed by the epilogue.

Namespaces

cutlass
cutlass::epilogue
Detailed Description

This epilogue does not attempt to target any particular output layout. Instead, each threadblock streams out its accumulator elements using 128b store operations. This assumes all threadblocks have unique output tiles.
The target data layout simply gives each threadblock its own contiguous region of the workspace for its accumulator tile. This enables very fast streaming of data, completely limited by the memory system. No predication or data exchange is performed, and each threadblock is assumed to have a full region of memory to write to.
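As a rough illustration of this store pattern, here is a minimal CUDA sketch under assumed fragment and tile sizes; it is not the internals of cutlass::epilogue::EpilogueWorkspace. Each thread writes its register-resident accumulator fragment to a disjoint slice of its threadblock's workspace region using 128-bit stores, so no predication or inter-thread exchange is needed:

```cpp
#include <cuda_runtime.h>

__global__ void stream_accumulators(float *workspace) {
  // Assumed sizes for illustration only.
  constexpr int kFragmentElements = 8;  // accumulator elements held per thread
  constexpr int kAccessWidth      = 4;  // four floats form one 128b access

  // Stand-in for FragmentC_: accumulators kept in registers.
  float frag[kFragmentElements];
  #pragma unroll
  for (int i = 0; i < kFragmentElements; ++i) {
    frag[i] = static_cast<float>(threadIdx.x);  // dummy accumulator values
  }

  // Each threadblock owns a disjoint, contiguous region of the workspace,
  // and each thread owns a disjoint slice of that region.
  int tile_elements = blockDim.x * kFragmentElements;
  float *dst = workspace
             + blockIdx.x * tile_elements
             + threadIdx.x * kFragmentElements;

  // Stream the fragment out with 128b (float4) stores; `workspace` is
  // assumed to be 16-byte aligned (e.g. returned by cudaMalloc).
  #pragma unroll
  for (int i = 0; i < kFragmentElements / kAccessWidth; ++i) {
    reinterpret_cast<float4 *>(dst)[i] = make_float4(
        frag[i * kAccessWidth + 0], frag[i * kAccessWidth + 1],
        frag[i * kAccessWidth + 2], frag[i * kAccessWidth + 3]);
  }
}
```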
This epilogue establishes an upper bound on epilogue performance and is suitable for reductions across the GEMM K dimension, which require a separate workspace.
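For example, in a split-K style reduction, each K partition streams its partial accumulators into its own slice of the workspace, and a separate kernel then sums the partitions into the final output. The sketch below shows such a follow-up reduction with hypothetical names and an assumed contiguous per-partition layout; it is not part of the CUTLASS API:

```cpp
#include <cuda_runtime.h>

// Sum `partitions` slices of `elements_per_partition` partial accumulators
// element-wise into the final output.
__global__ void reduce_splitk_workspace(float const *workspace, float *output,
                                        int partitions,
                                        int elements_per_partition) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < elements_per_partition) {
    float sum = 0.0f;
    for (int p = 0; p < partitions; ++p) {
      sum += workspace[p * elements_per_partition + idx];
    }
    output[idx] = sum;
  }
}
```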