►Ncutlass | |
►Narch | |
CMma | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, complex< double >, LayoutA, complex< double >, LayoutB, complex< double >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, complex< double >, LayoutA, double, LayoutB, complex< double >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, complex< float >, LayoutA, complex< float >, LayoutB, complex< float >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, complex< float >, LayoutA, float, LayoutB, complex< float >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, double, LayoutA, complex< double >, LayoutB, complex< double >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, double, LayoutA, double, LayoutB, double, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, Operator > | Matrix multiply-add operation - specialized for 1x1x1x1 matrix multiply operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, float, LayoutA, complex< float >, LayoutB, complex< float >, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, float, LayoutA, float, LayoutB, float, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, half_t, LayoutA, half_t, LayoutB, float, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 1 >, 1, int, LayoutA, int, LayoutB, int, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 2 >, 1, int16_t, layout::RowMajor, int16_t, layout::ColumnMajor, int, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 1, 4 >, 1, int8_t, LayoutA, int8_t, LayoutB, int, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 1, 2, 1 >, 1, half_t, LayoutA, half_t, LayoutB, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 16, 16, 4 >, 32, half_t, LayoutA, half_t, LayoutB, ElementC, LayoutC, Operator > | Matrix multiply-add operation specialized for the entire warp |
CMma< gemm::GemmShape< 16, 8, 8 >, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F32 = F16 * F16 + F32 |
CMma< gemm::GemmShape< 16, 8, 8 >, 32, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation - F16 = F16 * F16 + F16 |
CMma< gemm::GemmShape< 2, 1, 1 >, 1, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 2, 2, 1 >, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::ColumnMajor, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 2, 2, 1 >, 1, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 8, 8, 128 >, 32, uint1b_t, layout::RowMajor, uint1b_t, layout::ColumnMajor, int, layout::RowMajor, OpXorPopc > | Matrix multiply-add operation |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = S8 * S8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = S8 * S8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = S8 * U8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, int8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = S8 * U8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = U8 * S8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, uint8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = U8 * S8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = S8 * U8 + S32 |
CMma< gemm::GemmShape< 8, 8, 16 >, 32, uint8_t, layout::RowMajor, uint8_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = S8 * U8 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = S4 * S4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, int4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = S4 * S4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = S4 * U4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, int4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = S4 * U4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = U4 * S4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, uint4b_t, layout::RowMajor, int4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = U4 * S4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: S32 = U4 * U4 + S32 |
CMma< gemm::GemmShape< 8, 8, 32 >, 32, uint4b_t, layout::RowMajor, uint4b_t, layout::ColumnMajor, int, layout::RowMajor, OpMultiplyAddSaturate > | Matrix multiply-add operation: S32 = U4 * U4 + S32 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F32 = F16 * F16 + F32 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::ColumnMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F16 = F16 * F16 + F16 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F32 = F16 * F16 + F32 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::ColumnMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F16 = F16 * F16 + F16 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, float, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F32 = F16 * F16 + F32 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::RowMajor, half_t, layout::ColumnMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F16 = F16 * F16 + F16 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, float, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F32 = F16 * F16 + F32 |
CMma< gemm::GemmShape< 8, 8, 4 >, 8, half_t, layout::RowMajor, half_t, layout::RowMajor, half_t, layout::RowMajor, OpMultiplyAdd > | Matrix multiply-add operation: F16 = F16 * F16 + F16 |
CPtxWmma | WMMA Matrix multiply-add operation |
CPtxWmmaLoadA | WMMA PTX string load for A, B, and C matrices |
CPtxWmmaLoadB | |
CPtxWmmaLoadC | |
CPtxWmmaStoreD | WMMA store for matrix D |
CSm50 | |
CSm60 | |
CSm61 | |
CSm70 | |
CSm72 | |
CSm75 | |
CWmma< Shape_, cutlass::half_t, LayoutA_, cutlass::half_t, LayoutB_, ElementC_, LayoutC_, cutlass::arch::OpMultiplyAdd > | |
CWmma< Shape_, cutlass::int4b_t, LayoutA_, cutlass::int4b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd > | |
CWmma< Shape_, cutlass::uint1b_t, LayoutA_, cutlass::uint1b_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpXorPopc > | |
CWmma< Shape_, int8_t, LayoutA_, int8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd > | |
CWmma< Shape_, uint8_t, LayoutA_, uint8_t, LayoutB_, int32_t, LayoutC_, cutlass::arch::OpMultiplyAdd > | |
►Ndevice_memory | |
►Callocation | Device allocation abstraction that tracks size and capacity |
Cdeleter | Delete functor for CUDA device memory |
►Nepilogue | |
►Nthread | |
►CConvert | |
CParams | Host-constructable parameters structure |
►CLinearCombination | |
CParams | Host-constructable parameters structure |
►CLinearCombinationClamp | |
CParams | Host-constructable parameters structure |
►CLinearCombinationRelu | |
CParams | Host-constructable parameters structure |
►CLinearCombinationRelu< ElementOutput_, Count, int, float, Round > | |
CParams | Host-constructable parameters structure |
►CReductionOpPlus | |
CParams | Host-constructable parameters structure |
►Nthreadblock | |
►Ndetail | |
CRowArrangement | RowArrangement determines how one or more warps cover a region of consecutive rows |
CRowArrangement< Shape, WarpsRemaining, ElementsPerAccess, ElementSize, false > | RowArrangement in which each warp's access is a 1D tiled arrangement |
►CRowArrangement< Shape, WarpsRemaining, ElementsPerAccess, ElementSize, true > | RowArrangement in which each warp's access is a 2D tiled arrangement |
CDetail | |
CDefaultEpilogueComplexTensorOp | Defines sensible defaults for epilogues for TensorOps |
CDefaultEpilogueSimt | Defines sensible defaults for epilogues for SimtOps |
CDefaultEpilogueTensorOp | Defines sensible defaults for epilogues for TensorOps |
CDefaultEpilogueVoltaTensorOp | Defines sensible defaults for epilogues for TensorOps |
CDefaultEpilogueWmmaTensorOp | Defines sensible defaults for epilogues for WMMA TensorOps |
CDefaultInterleavedEpilogueTensorOp | |
►CDefaultInterleavedThreadMapTensorOp | Defines the optimal thread map for TensorOp accumulator layouts |
CDetail | |
►CDefaultThreadMapSimt | Defines the optimal thread map for SIMT accumulator layouts |
CDetail | |
►CDefaultThreadMapTensorOp | Defines the optimal thread map for TensorOp accumulator layouts |
CDetail | |
CDefaultThreadMapVoltaTensorOp | Defines the optimal thread map for TensorOp accumulator layouts |
►CDefaultThreadMapVoltaTensorOp< ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, float > | Defines the optimal thread map for TensorOp accumulator layouts |
CDetail | |
►CDefaultThreadMapVoltaTensorOp< ThreadblockShape_, WarpShape_, PartitionsK, ElementOutput_, ElementsPerAccess, half_t > | Defines the optimal thread map for TensorOp accumulator layouts |
CDetail | |
►CDefaultThreadMapWmmaTensorOp | Defines the optimal thread map for Wmma TensorOp accumulator layouts |
CDetail | |
►CDirectEpilogueTensorOp | Epilogue operator |
CParams | Parameters structure for host-constructible state |
CSharedStorage | Shared storage allocation needed by the epilogue |
CEpilogue | Epilogue operator without splitk |
►CEpilogueBase | Base class for epilogues defining warp-level |
CSharedStorage | Shared storage allocation needed by the epilogue |
►CInterleavedEpilogue | Epilogue operator without splitk |
CSharedStorage | Shared storage allocation needed by the epilogue |
►CInterleavedOutputTileThreadMap | |
CDetail | |
►CInterleavedPredicatedTileIterator | |
CMask | Mask object |
CParams | |
►COutputTileOptimalThreadMap | |
CCompactedThreadMap | Compacted thread map in which the 4D region is contiguous |
CDetail | |
COutputTileShape | Tuple defining point in output tile |
COutputTileThreadMap | |
►CPredicatedTileIterator | |
CMask | Mask object |
CParams | |
CSharedLoadIterator | |
►Nwarp | |
CFragmentIteratorComplexTensorOp | |
CFragmentIteratorComplexTensorOp< WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor > | Partial specialization for row-major shared memory |
CFragmentIteratorSimt | Fragment iterator for SIMT accumulator arrangements |
CFragmentIteratorSimt< WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ > | Partial specialization for row-major shared memory |
CFragmentIteratorTensorOp | |
CFragmentIteratorTensorOp< WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::ColumnMajorInterleaved< InterleavedK > > | Dedicated to interleaved layout |
CFragmentIteratorTensorOp< WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor > | Partial specialization for row-major shared memory |
CFragmentIteratorVoltaTensorOp | |
CFragmentIteratorVoltaTensorOp< WarpShape_, gemm::GemmShape< 32, 32, 4 >, float, layout::RowMajor > | Partial specialization for row-major shared memory |
CFragmentIteratorVoltaTensorOp< WarpShape_, gemm::GemmShape< 32, 32, 4 >, half_t, layout::RowMajor > | Partial specialization for row-major shared memory |
CFragmentIteratorWmmaTensorOp | |
CFragmentIteratorWmmaTensorOp< WarpShape_, OperatorShape_, OperatorElementC_, OperatorFragmentC_, layout::RowMajor > | Partial specialization for row-major shared memory |
CSimtPolicy | |
CSimtPolicy< WarpShape_, Operator_, layout::RowMajor, MmaSimtPolicy_ > | Partial specialization for row-major |
CTensorOpPolicy | Policy details related to the epilogue |
CTensorOpPolicy< WarpShape, OperatorShape, layout::ColumnMajorInterleaved< InterleavedK > > | Partial specialization for column-major-interleaved |
CTensorOpPolicy< WarpShape, OperatorShape, layout::RowMajor > | Partial specialization for row-major |
CTileIteratorSimt | Template for reading and writing tiles of accumulators to shared memory |
CTileIteratorSimt< WarpShape_, Operator_, Element_, layout::RowMajor, MmaSimtPolicy_ > | Template for reading and writing tiles of accumulators to shared memory |
CTileIteratorTensorOp | Template for reading and writing tiles of accumulators to shared memory |
►CTileIteratorTensorOp< WarpShape_, OperatorShape_, Element_, layout::RowMajor > | Template for reading and writing tiles of accumulators to shared memory |
CDetail | |
CTileIteratorVoltaTensorOp | Template for reading and writing tiles of accumulators to shared memory |
►CTileIteratorVoltaTensorOp< WarpShape_, gemm::GemmShape< 32, 32, 4 >, float, layout::RowMajor > | Template for reading and writing tiles of accumulators to shared memory |
CDetail | |
►CTileIteratorVoltaTensorOp< WarpShape_, gemm::GemmShape< 32, 32, 4 >, half_t, layout::RowMajor > | Template for reading and writing tiles of accumulators to shared memory |
CDetail | |
CTileIteratorWmmaTensorOp | Template for reading and writing tiles of accumulators to shared memory |
CTileIteratorWmmaTensorOp< WarpShape_, OperatorShape_, OperatorFragment_, layout::RowMajor > | Template for reading and writing tiles of accumulators to shared memory |
CVoltaTensorOpPolicy | Policy details related to the epilogue |
CVoltaTensorOpPolicy< WarpShape_, gemm::GemmShape< 32, 32, 4 >, float, layout::RowMajor > | Partial specialization for row-major |
CVoltaTensorOpPolicy< WarpShape_, gemm::GemmShape< 32, 32, 4 >, half_t, layout::RowMajor > | Partial specialization for row-major |
►CEpilogueWorkspace | |
CParams | Parameters structure |
CSharedStorage | Shared storage allocation needed by the epilogue |
►Ngemm | |
►Ndevice | |
CDefaultGemmConfiguration | |
CDefaultGemmConfiguration< arch::OpClassSimt, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator > | |
CDefaultGemmConfiguration< arch::OpClassSimt, ArchTag, int8_t, int8_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm70, ElementA, ElementB, ElementC, ElementAccumulator > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, ElementA, ElementB, ElementC, ElementAccumulator > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, int4b_t, int4b_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, int4b_t, uint4b_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, int8_t, int8_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, int8_t, uint8_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, uint4b_t, int4b_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, uint4b_t, uint4b_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, uint8_t, int8_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassTensorOp, arch::Sm75, uint8_t, uint8_t, ElementC, int32_t > | |
CDefaultGemmConfiguration< arch::OpClassWmmaTensorOp, ArchTag, ElementA, ElementB, ElementC, ElementAccumulator > | |
►CGemm | |
CArguments | Argument structure |
►CGemm< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, SplitKSerial, Operator_, IsBetaZero > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmBatched | |
CArguments | Argument structure |
►CGemmBatched< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, AlignmentA, AlignmentB, Operator_ > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmComplex | |
CArguments | Argument structure |
►CGemmComplex< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ThreadblockSwizzle_, Stages, TransformA, TransformB, SplitKSerial > | Partial specialization for column-major output exchanges problem size and operand |
CArguments | Argument structure |
►CGemmSplitKParallel | |
CArguments | Argument structure |
►CGemmSplitKParallel< ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, layout::ColumnMajor, ElementAccumulator_, OperatorClass_, ArchTag_, ThreadblockShape_, WarpShape_, InstructionShape_, EpilogueOutputOp_, ConvertScaledOp_, ReductionOp_, ThreadblockSwizzle_, Stages, kAlignmentA, kAlignmentB, Operator_ > | Partial specialization for column-major output |
CArguments | Argument structure |
►Nkernel | |
►Ndetail | |
CGemvBatchedStridedEpilogueScaling | |
CDefaultGemm | |
CDefaultGemm< ElementA, layout::ColumnMajorInterleaved< InterleavedK >, kAlignmentA, ElementB, layout::RowMajorInterleaved< InterleavedK >, kAlignmentB, ElementC, layout::ColumnMajorInterleaved< InterleavedK >, int32_t, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, IsBetaZero > | Partial specialization for Turing Integer Matrix Multiply Interleaved layout |
CDefaultGemm< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape< 1, 1, 1 >, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator > | Partial specialization for SIMT |
CDefaultGemm< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm70, ThreadblockShape, WarpShape, GemmShape< 8, 8, 4 >, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator > | Partial specialization for Volta architecture |
CDefaultGemm< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementC, layout::RowMajor, ElementAccumulator, arch::OpClassTensorOp, arch::Sm75, ThreadblockShape, WarpShape, InstructionShape, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator > | Partial specialization for Turing Architecture |
CDefaultGemm< int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementC, LayoutC, ElementAccumulator, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape< 1, 1, 4 >, EpilogueOutputOp, ThreadblockSwizzle, 2, SplitKSerial, Operator, false > | Partial specialization for SIMT DP4A |
CDefaultGemmSplitKParallel | |
CDefaultGemv | |
►CGemm | |
CParams | Parameters structure |
CSharedStorage | Shared memory storage structure |
►CGemmBatched | |
CParams | Parameters structure |
CSharedStorage | Shared memory storage structure |
►CGemmSplitKParallel | |
CParams | Parameters structure |
CSharedStorage | Shared memory storage structure |
►Nthread | |
►Ndetail | |
CEnableMma_Crow_SM60 | Determines whether to enable thread::Gemm<> specializations compatible with SM50 |
CMma_HFMA2 | Structure to compute the matrix product for HFMA |
CMma_HFMA2< Shape, layout::ColumnMajor, layout::ColumnMajor, layout::ColumnMajor, true > | |
CMma_HFMA2< Shape, layout::ColumnMajor, layout::ColumnMajor, layout::RowMajor, true > | |
CMma_HFMA2< Shape, layout::ColumnMajor, layout::RowMajor, layout::ColumnMajor, true > | |
CMma_HFMA2< Shape, layout::ColumnMajor, layout::RowMajor, layout::RowMajor, true > | |
CMma_HFMA2< Shape, layout::RowMajor, layout::ColumnMajor, layout::ColumnMajor, true > | |
CMma_HFMA2< Shape, layout::RowMajor, layout::ColumnMajor, layout::RowMajor, true > | |
CMma_HFMA2< Shape, layout::RowMajor, layout::RowMajor, layout::ColumnMajor, true > | |
CMma_HFMA2< Shape, layout::RowMajor, layout::RowMajor, layout::RowMajor, true > | |
CMma_HFMA2< Shape, LayoutA, LayoutB, layout::ColumnMajor, false > | |
CMma_HFMA2< Shape, LayoutA, LayoutB, layout::RowMajor, false > | |
CMma | Structure to compute the matrix product |
CMma< Shape_, ElementA_, LayoutA_, ElementB_, LayoutB_, ElementC_, LayoutC_, arch::OpMultiplyAdd, bool > | Gemplate that handles conventional layouts for FFMA and DFMA GEMM |
CMma< Shape_, half_t, LayoutA, half_t, LayoutB, half_t, LayoutC, arch::OpMultiplyAdd > | Structure to compute the matrix product |
CMma< Shape_, half_t, LayoutA_, half_t, LayoutB_, half_t, layout::RowMajor, arch::OpMultiplyAdd, typename platform::enable_if< detail::EnableMma_Crow_SM60< LayoutA_, LayoutB_ >::value >::type > | Computes matrix product when C is row-major |
CMma< Shape_, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, int8_t > | Gemplate that handles conventional layouts for IDP4A |
CMma< Shape_, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, int32_t, LayoutC_, arch::OpMultiplyAdd, bool > | Gemplate that handles conventional layouts for IDP4A |
CMmaGeneric | Gemplate that handles all packed matrix layouts |
►Nthreadblock | |
CDefaultGemvCore | |
CDefaultMma | |
CDefaultMma< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::ColumnMajorInterleaved< InterleavedK >, OperatorClass, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, true > | Specialization for column-major-interleaved output |
CDefaultMma< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false > | Specialization for row-major output (OperatorClass Simt) |
CDefaultMma< ElementA, LayoutA, kAlignmentA, ElementB, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassTensorOp, ArchTag, ThreadblockShape, WarpShape, InstructionShape, 2, Operator, false > | Specialization for row-major output (OperatorClass Simt) |
CDefaultMma< int8_t, LayoutA, kAlignmentA, int8_t, LayoutB, kAlignmentB, ElementAccumulator, layout::RowMajor, arch::OpClassSimt, ArchTag, ThreadblockShape, WarpShape, GemmShape< 1, 1, 4 >, 2, Operator, false > | |
CDefaultMmaCore | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 1 >, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 1 >, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 1 >, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_, > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 1 >, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 1 >, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 4 >, int8_t, layout::ColumnMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | Partial specialization: |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 4 >, int8_t, layout::ColumnMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 4 >, int8_t, layout::RowMajor, int8_t, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | Partial specialization: |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 1, 1, 4 >, int8_t, layout::RowMajor, int8_t, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassSimt, 2, Operator_ > | Partial specialization: |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 8, 8, 4 >, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 8, 8, 4 >, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 8, 8, 4 >, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, GemmShape< 8, 8, 4 >, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, InstructionShape_, ElementA_, layout::ColumnMajorInterleaved< InterleavedK >, ElementB_, layout::RowMajorInterleaved< InterleavedK >, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_, AccumulatorsInRowMajor > | |
CDefaultMmaCore< Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::ColumnMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CDefaultMmaCore< Shape_, WarpShape_, InstructionShape_, ElementA_, layout::RowMajor, ElementB_, layout::RowMajor, ElementC_, LayoutC_, arch::OpClassTensorOp, 2, Operator_ > | |
CGemmBatchedIdentityThreadblockSwizzle | Threadblock swizzling function for batched GEMMs |
CGemmHorizontalThreadblockSwizzle | Threadblock swizzling function for GEMMs |
CGemmIdentityThreadblockSwizzle | Threadblock swizzling function for GEMMs |
CGemmSplitKHorizontalThreadblockSwizzle | Threadblock swizzling function for split-K GEMMs |
CGemmSplitKIdentityThreadblockSwizzle | Threadblock swizzling function for split-K GEMMs |
CGemv | Structure to compute the matrix-vector product using SIMT math instructions |
CGemvBatchedStridedThreadblockDefaultSwizzle | Threadblock swizzling function for batched GEMVs |
►CMmaBase | |
CSharedStorage | Shared storage object needed by threadblock-scoped GEMM |
CMmaPipelined | Structure to compute the matrix product targeting CUDA cores and SIMT math instructions |
CMmaPolicy | Policy object describing MmaTensorOp |
CMmaSingleStage | Structure to compute the matrix product targeting CUDA cores and SIMT math instructions |
►Nwarp | |
CDefaultMmaTensorOp | Partial specialization for m-by-n-by-kgroup |
CMmaComplexTensorOp | |
CMmaComplexTensorOp< Shape_, complex< RealElementA >, LayoutA_, complex< RealElementB >, LayoutB_, complex< RealElementC >, LayoutC_, Policy_, TransformA, TransformB, Enable > | Partial specialization for complex*complex+complex => complex using real-valued TensorOps |
CMmaSimt | Structure to compute the matrix product targeting CUDA cores and SIMT math instructions |
CMmaSimtPolicy | Describes the arrangement and configuration of per-lane operations in warp-level matrix multiply |
CMmaSimtTileIterator | |
CMmaSimtTileIterator< Shape_, Operand::kA, Element_, layout::ColumnMajor, Policy_, PartitionsK, PartitionGroupSize > | |
CMmaSimtTileIterator< Shape_, Operand::kA, Element_, layout::ColumnMajorInterleaved< 4 >, Policy_, PartitionsK, PartitionGroupSize > | |
CMmaSimtTileIterator< Shape_, Operand::kB, Element_, layout::RowMajor, Policy_, PartitionsK, PartitionGroupSize > | |
CMmaSimtTileIterator< Shape_, Operand::kB, Element_, layout::RowMajorInterleaved< 4 >, Policy_, PartitionsK, PartitionGroupSize > | |
CMmaSimtTileIterator< Shape_, Operand::kC, Element_, layout::ColumnMajor, Policy_ > | |
CMmaSimtTileIterator< Shape_, Operand::kC, Element_, layout::RowMajor, Policy_ > | |
CMmaTensorOp | Structure to compute the matrix product targeting CUDA cores and SIMT math instructions |
CMmaTensorOpAccumulatorTileIterator | |
►CMmaTensorOpAccumulatorTileIterator< Shape_, Element_, cutlass::layout::ColumnMajor, InstructionShape_, OpDelta_ > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
►CMmaTensorOpAccumulatorTileIterator< Shape_, Element_, cutlass::layout::ColumnMajorInterleaved< InterleavedN >, InstructionShape_, OpDelta_ > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
►CMmaTensorOpAccumulatorTileIterator< Shape_, Element_, cutlass::layout::RowMajor, InstructionShape_, OpDelta_ > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CMmaTensorOpMultiplicandTileIterator | |
CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::ColumnMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::RowMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
►CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, 64 >, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
►CMmaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::TensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, InstructionShape_, OpDelta_, 32, PartitionsK_ > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CMmaTensorOpPolicy | Policy |
CMmaVoltaTensorOp | Structure to compute the matrix product targeting CUDA cores and SIMT math instructions |
►CMmaVoltaTensorOpAccumulatorTileIterator | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CMmaVoltaTensorOpMultiplicandTileIterator | |
CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand::kA, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value >, InstructionShape_, OpDelta_, 32 > | |
►CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand::kA, Element_, cutlass::layout::VoltaTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value >, InstructionShape_, OpDelta_, 32 > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand::kB, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandBCongruous< sizeof_bits< Element_ >::value >, InstructionShape_, OpDelta_, 32 > | |
►CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand::kB, Element_, cutlass::layout::VoltaTensorOpMultiplicandBCongruous< sizeof_bits< Element_ >::value >, InstructionShape_, OpDelta_, 32 > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, KBlock >, InstructionShape_, OpDelta_, 32 > | |
CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::RowMajorVoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, KBlock >, InstructionShape_, OpDelta_, 32 > | |
►CMmaVoltaTensorOpMultiplicandTileIterator< Shape_, Operand_, Element_, cutlass::layout::VoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, KBlock >, InstructionShape_, OpDelta_, 32 > | |
CPolicy | Internal structure of iterator - made public to enable introspection |
CWarpSize | Query the number of threads per warp |
CBatchedGemmCoord | |
CGemmCoord | |
CGemmShape | Shape of a matrix multiply-add operation |
►Nlayout | |
CColumnMajor | Mapping function for column-major matrices |
CColumnMajorBlockLinear | |
CColumnMajorInterleaved | |
CColumnMajorTensorOpMultiplicandCongruous | |
CColumnMajorTensorOpMultiplicandCrosswise | |
CColumnMajorVoltaTensorOpMultiplicandBCongruous | Template mapping a column-major view of pitch-linear memory to VoltaTensorOpMultiplicandCongruous |
CColumnMajorVoltaTensorOpMultiplicandCongruous | Template mapping a column-major view of pitch-linear memory to VoltaTensorOpMultiplicandCongruous |
CColumnMajorVoltaTensorOpMultiplicandCrosswise | |
CContiguousMatrix | |
CGeneralMatrix | |
CLayoutTranspose | Defines transposes of matrix layouts |
CLayoutTranspose< layout::ColumnMajor > | Transpose of column-major is row-major |
CLayoutTranspose< layout::RowMajor > | Transpose of row-major is column-major |
CPackedVectorLayout | Tensor layout for densely packed vectors |
CPitchLinear | Mapping function for pitch-linear memory |
CPitchLinearCoord | Coordinate in pitch-linear space |
CPitchLinearShape | Template defining a shape used by pitch-linear operators |
CRowMajor | Mapping function for row-major matrices |
CRowMajorBlockLinear | |
CRowMajorInterleaved | |
CRowMajorTensorOpMultiplicandCongruous | |
CRowMajorTensorOpMultiplicandCrosswise | |
CRowMajorVoltaTensorOpMultiplicandBCongruous | Template mapping a row-major view of pitch-linear memory to VoltaTensorOpMultiplicandCongruous |
CRowMajorVoltaTensorOpMultiplicandCongruous | Template mapping a row-major view of pitch-linear memory to VoltaTensorOpMultiplicandCongruous |
CRowMajorVoltaTensorOpMultiplicandCrosswise | |
CTensorCxRSKx | Mapping function for 4-D CxRSKx tensors |
CTensorNCHW | Mapping function for 4-D NCHW tensors |
CTensorNCxHWx | Mapping function for 4-D NC/xHWx tensors |
CTensorNHWC | Mapping function for 4-D NHWC tensors |
CTensorOpMultiplicand | |
CTensorOpMultiplicandColumnMajorInterleaved | Template based on element size (in bits) - defined in terms of pitch-linear memory |
CTensorOpMultiplicandCongruous | |
CTensorOpMultiplicandCongruous< 32, Crosswise > | |
CTensorOpMultiplicandCrosswise | |
CTensorOpMultiplicandRowMajorInterleaved | Template based on element size (in bits) - defined in terms of pitch-linear memory |
CVoltaTensorOpMultiplicandBCongruous | Template based on element size (in bits) - defined in terms of pitch-linear memory |
CVoltaTensorOpMultiplicandCongruous | Template based on element size (in bits) - defined in terms of pitch-linear memory |
CVoltaTensorOpMultiplicandCrosswise | |
►Nlibrary | |
CGemmArguments | Arguments for GEMM |
CGemmArrayArguments | Arguments for GEMM - used by all the GEMM operations |
CGemmArrayConfiguration | Configuration for batched GEMM in which multiple matrix products are computed |
CGemmBatchedConfiguration | Configuration for batched GEMM in which multiple matrix products are computed |
CGemmConfiguration | Configuration for basic GEMM operations |
CGemmDescription | Description of all GEMM computations |
CGemmPlanarComplexBatchedConfiguration | Batched complex valued GEMM in which real and imaginary parts are separated by a stride |
CGemmPlanarComplexConfiguration | Complex valued GEMM in which real and imaginary parts are separated by a stride |
CManifest | Manifest of CUTLASS Library |
CMathInstructionDescription | |
COperation | Base class for all device-wide operations |
COperationDescription | High-level description of an operation |
CTensorDescription | Structure describing the properties of a tensor |
CTileDescription | Structure describing the tiled structure of a GEMM-like computation |
►Nplatform | |
Caligned_chunk | |
Caligned_storage | Std::aligned_storage |
►Calignment_of | Std::alignment_of |
Cpad | |
Calignment_of< const value_t > | |
Calignment_of< const volatile value_t > | |
Calignment_of< double2 > | |
Calignment_of< double4 > | |
Calignment_of< float4 > | |
Calignment_of< int4 > | |
Calignment_of< long4 > | |
Calignment_of< longlong2 > | |
Calignment_of< longlong4 > | |
Calignment_of< uint4 > | |
Calignment_of< ulong4 > | |
Calignment_of< ulonglong2 > | |
Calignment_of< ulonglong4 > | |
Calignment_of< volatile value_t > | |
Cbool_constant | Std::bool_constant |
Cconditional | Std::conditional (true specialization) |
Cconditional< false, T, F > | Std::conditional (false specialization) |
Cdefault_delete | Default deleter |
Cdefault_delete< T[]> | Partial specialization for deleting array types |
Cenable_if | Std::enable_if (true specialization) |
Cenable_if< false, T > | Std::enable_if (false specialization) |
Cintegral_constant | Std::integral_constant |
Cis_arithmetic | Std::is_arithmetic |
Cis_base_of | Std::is_base_of |
►Cis_base_of_helper | Helper for std::is_base_of |
Cdummy | |
Cis_floating_point | Std::is_floating_point |
Cis_fundamental | Std::is_fundamental |
Cis_integral | Std::is_integral |
Cis_integral< char > | |
Cis_integral< const T > | |
Cis_integral< const volatile T > | |
Cis_integral< int > | |
Cis_integral< long > | |
Cis_integral< long long > | |
Cis_integral< short > | |
Cis_integral< signed char > | |
Cis_integral< unsigned char > | |
Cis_integral< unsigned int > | |
Cis_integral< unsigned long > | |
Cis_integral< unsigned long long > | |
Cis_integral< unsigned short > | |
Cis_integral< volatile T > | |
Cis_pointer | Std::is_pointer |
Cis_pointer_helper | Helper for std::is_pointer (false specialization) |
Cis_pointer_helper< T * > | Helper for std::is_pointer (true specialization) |
Cis_same | Std::is_same (false specialization) |
Cis_same< A, A > | Std::is_same (true specialization) |
Cis_trivially_copyable | |
Cis_void | Std::is_void |
Cis_volatile | Std::is_volatile |
Cis_volatile< volatile T > | |
Cnullptr_t | Std::nullptr_t |
Cremove_const | Std::remove_const (non-const specialization) |
Cremove_const< const T > | Std::remove_const (const specialization) |
Cremove_cv | Std::remove_cv |
Cremove_volatile | Std::remove_volatile (non-volatile specialization) |
Cremove_volatile< volatile T > | Std::remove_volatile (volatile specialization) |
Cunique_ptr | Std::unique_ptr |
►Nreduction | |
►Nkernel | |
►CReduceSplitK | |
CParams | Params structure |
CSharedStorage | |
►Nthread | |
CReduce | Structure to compute the thread level reduction |
CReduce< plus< half_t >, AlignedArray< half_t, N > > | Partial specializations of Reduce for AlignedArray<half_t, N> |
CReduce< plus< half_t >, Array< half_t, N > > | Partial specializations of Reduce for Array<half_t, N> |
CReduce< plus< T >, Array< T, N > > | Partial specialization of Reduce for Array<T, N> |
CReduce< plus< T >, T > | Partial Specialization of Reduce for "plus" (a functional operator) |
►CReduceAdd | Mixed-precision reduction |
CParams | |
CBatchedReduction | |
►CBatchedReductionTraits | |
CParams | |
CDefaultBlockSwizzle | |
►Nreference | |
►Ndetail | |
CCast | |
CCast< float, int8_t > | |
CCast< float, uint8_t > | |
►Ndevice | |
►Ndetail | |
►CRandomGaussianFunc | |
CParams | Parameters structure |
►CRandomUniformFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorCopyDiagonalInFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorCopyDiagonalOutFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorFillDiagonalFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorFillLinearFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorFillRandomGaussianFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorFillRandomUniformFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorUpdateDiagonalFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►CTensorUpdateOffDiagonalFunc | Computes a random Gaussian distribution |
CParams | Parameters structure |
►Nkernel | |
►Ndetail | Defines several helpers |
CTensorForEachHelper | Helper to perform for-each operation |
CTensorForEachHelper< Func, Rank, 0 > | Helper to perform for-each operation |
►Nthread | |
CGemm | Thread-level blocked general matrix product |
CBlockForEach | |
CGemm | |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, AccumulatorType, arch::OpXorPopc > | Partial specialization for XOR-popc |
CTensorDiagonalForEach | Launches a kernel calling a functor for each element along a tensor's diagonal |
CTensorForEach | Launches a kernel calling a functor for each element in a tensor's index space |
►Nhost | |
►Ndetail | Defines several helpers |
CRandomGaussianFunc | |
CRandomGaussianFunc< complex< Element > > | Partial specialization for initializing a complex value |
CRandomUniformFunc | |
CRandomUniformFunc< complex< Element > > | Partial specialization for initializing a complex value |
CTensorContainsFunc | < Layout function |
CTensorCopyIf | Helper to conditionally copy between tensor views |
CTensorEqualsFunc | < Layout function |
CTensorFillDiagonalFunc | < Layout function |
CTensorFillFunc | < Layout function |
CTensorFillGaussianFunc | Computes a random Gaussian distribution |
CTensorFillLinearFunc | < Layout function |
CTensorFillRandomUniformFunc | Computes a random Gaussian distribution |
CTensorForEachHelper | Helper to perform for-each operation |
CTensorForEachHelper< Func, Rank, 0 > | Helper to perform for-each operation |
CTensorFuncBinaryOp | Helper to apply a binary operator in place |
CTensorUpdateOffDiagonalFunc | < Layout function |
CTrivialConvert | Helper to convert between types |
CBlockForEach | |
CGemm | |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAdd > | Partial specialization for multiply-add |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpMultiplyAddSaturate > | Partial specialization for multiply-add-saturate |
CGemm< ElementA, LayoutA, ElementB, LayoutB, ElementC, LayoutC, ScalarType, ComputeType, arch::OpXorPopc > | Partial specialization for XOR-popc |
►Nthread | |
CMatrix | Per-thread matrix object storing a packed matrix |
►Ntransform | |
►Nthread | |
CTranspose | Transforms a fragment by doing a transpose |
CTranspose< ElementCount_, layout::PitchLinearShape< 4, 4 >, int8_t > | Specialization for int8_t 4x4 transpose |
►Nthreadblock | |
CPredicatedTileAccessIterator | |
CPredicatedTileAccessIterator2dThreadTile | |
►CPredicatedTileAccessIterator2dThreadTile< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator2dThreadTile< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator2dThreadTile< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator< Shape_, Element_, layout::ColumnMajorInterleaved< InterleavedK >, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileAccessIterator< Shape_, Element_, layout::RowMajorInterleaved< InterleavedK >, AdvanceRank, ThreadMap_, AccessType_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
CPredicatedTileIterator | |
CPredicatedTileIterator2dThreadTile | |
►CPredicatedTileIterator2dThreadTile< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Transpose_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator2dThreadTile< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Transpose_ > | |
CAccessType | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator2dThreadTile< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Transpose_ > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, AccessSize > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator< Shape_, Element_, layout::ColumnMajorInterleaved< InterleavedK >, AdvanceRank, ThreadMap_, AccessSize > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, AccessSize > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, AccessSize > | |
CParams | Parameters object is precomputed state and is host-constructible |
►CPredicatedTileIterator< Shape_, Element_, layout::RowMajorInterleaved< InterleavedK >, AdvanceRank, ThreadMap_, AccessSize > | |
CParams | Parameters object is precomputed state and is host-constructible |
CRegularTileAccessIterator | |
CRegularTileAccessIterator< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileAccessIterator< Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
►CRegularTileAccessIterator< Shape_, Element_, layout::TensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
►CRegularTileAccessIterator< Shape_, Element_, layout::TensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
CRegularTileIterator | |
CRegularTileIterator2dThreadTile | |
CRegularTileIterator2dThreadTile< Shape_, Element_, layout::ColumnMajorInterleaved< 4 >, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping |
CRegularTileIterator2dThreadTile< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for pitch-linear + 2d thread-tiled threadmapping |
CRegularTileIterator2dThreadTile< Shape_, Element_, layout::RowMajorInterleaved< 4 >, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for interleaved layout + 2d thread-tiled threadmapping |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajor, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for pitch-linear |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandBCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::ColumnMajorVoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Shape_::kRow >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::PitchLinear, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for pitch-linear |
CRegularTileIterator< Shape_, Element_, layout::RowMajor, AdvanceRank, ThreadMap_, Alignment > | Regular tile iterator specialized for pitch-linear |
CRegularTileIterator< Shape_, Element_, layout::RowMajorTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::RowMajorTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandBCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CRegularTileIterator< Shape_, Element_, layout::RowMajorVoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Shape_::kColumn >, AdvanceRank, ThreadMap_, Alignment > | |
►CRegularTileIterator< Shape_, Element_, layout::TensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value, int(128/sizeof(Element_))>, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
►CRegularTileIterator< Shape_, Element_, layout::TensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Crosswise >, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
►CRegularTileIterator< Shape_, Element_, layout::VoltaTensorOpMultiplicandBCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
►CRegularTileIterator< Shape_, Element_, layout::VoltaTensorOpMultiplicandCongruous< sizeof_bits< Element_ >::value >, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
►CRegularTileIterator< Shape_, Element_, layout::VoltaTensorOpMultiplicandCrosswise< sizeof_bits< Element_ >::value, Shape_::kContiguous >, AdvanceRank, ThreadMap_, Alignment > | |
CDetail | Internal details made public to facilitate introspection |
CPitchLinear2DThreadTileStripminedThreadMap | |
►CPitchLinear2DThreadTileStripminedThreadMap< Shape_, Threads, cutlass::layout::PitchLinearShape< 4, 4 > > | |
CDetail | Internal implementation details |
►CPitchLinearStripminedThreadMap | |
CDetail | Internal implementation details |
CPitchLinearTilePolicyStripminedThreadContiguous | |
CPitchLinearTilePolicyStripminedThreadStrided | |
►CPitchLinearWarpRakedThreadMap | |
CDetail | Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) |
►CPitchLinearWarpStripedThreadMap | |
CDetail | Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) |
►CTransposePitchLinearThreadMap | |
CDetail | Internal details made public to facilitate introspection Iterations along each dimension (concept: PitchLinearShape) |
CTransposePitchLinearThreadMap2DThreadTile | Thread Mapping a 2D threadtiled mapping as a transposed Pitchlinear2DThreadTile mapping |
CTransposePitchLinearThreadMapSimt | |
CAlignedArray | Aligned array type |
CAlignedBuffer | Modifies semantics of cutlass::Array<> to provide guaranteed alignment |
►CArray< T, N, false > | Statically sized array for any data type |
Cconst_iterator | Bidirectional constant iterator over elements |
Cconst_reference | Reference object extracts sub-byte items |
Cconst_reverse_iterator | Bidirectional constant iterator over elements |
Citerator | Bidirectional iterator over elements |
Creference | Reference object inserts or extracts sub-byte items |
Creverse_iterator | Bidirectional iterator over elements |
►CArray< T, N, true > | Statically sized array for any data type |
Cconst_iterator | Bidirectional constant iterator over elements |
Cconst_reverse_iterator | Bidirectional constant iterator over elements |
Citerator | Bidirectional iterator over elements |
Creverse_iterator | Bidirectional iterator over elements |
CCommandLine | |
Ccomplex | |
CConstSubbyteReference | |
CCoord | Statically-sized array specifying Coords within a tensor |
Ccuda_exception | C++ exception wrapper for CUDA cudaError_t |
CDistribution | Distribution type |
Cdivide_assert | |
Cdivides | |
Cdivides< Array< half_t, N > > | |
Cdivides< Array< T, N > > | |
CFloatType | Defines a floating-point type based on the number of exponent and mantissa bits |
CFloatType< 11, 52 > | |
CFloatType< 5, 10 > | |
CFloatType< 8, 23 > | |
Chalf_t | IEEE half-precision floating-point type |
CHostTensor | Host tensor |
CIdentityTensorLayout | |
Cinteger_subbyte | 4-bit signed integer type |
CIntegerType | Defines integers based on size and whether they are signed |
CIntegerType< 1, false > | |
CIntegerType< 1, true > | |
CIntegerType< 16, false > | |
CIntegerType< 16, true > | |
CIntegerType< 32, false > | |
CIntegerType< 32, true > | |
CIntegerType< 4, false > | |
CIntegerType< 4, true > | |
CIntegerType< 64, false > | |
CIntegerType< 64, true > | |
CIntegerType< 8, false > | |
CIntegerType< 8, true > | |
Cis_pow2 | |
CKernelLaunchConfiguration | Structure containing the basic launch configuration of a CUDA kernel |
Clog2_down | |
Clog2_down< N, 1, Count > | |
Clog2_up | |
Clog2_up< N, 1, Count > | |
CMatrixCoord | |
CMatrixShape | Describes the size of a matrix tile |
CMax | |
Cmaximum | |
Cmaximum< Array< T, N > > | |
Cmaximum< float > | |
CMin | |
Cminimum | |
Cminimum< Array< T, N > > | |
Cminimum< float > | |
Cminus | |
Cminus< Array< half_t, N > > | |
Cminus< Array< T, N > > | |
Cmultiplies | |
Cmultiplies< Array< half_t, N > > | |
Cmultiplies< Array< T, N > > | |
Cmultiply_add | Fused multiply-add |
Cmultiply_add< Array< half_t, N >, Array< half_t, N >, Array< half_t, N > > | Fused multiply-add |
Cmultiply_add< Array< T, N >, Array< T, N >, Array< T, N > > | Fused multiply-add |
Cmultiply_add< complex< T >, complex< T >, complex< T > > | Fused multiply-add |
Cmultiply_add< complex< T >, T, complex< T > > | Fused multiply-add |
Cmultiply_add< T, complex< T >, complex< T > > | Fused multiply-add |
Cnegate | |
Cnegate< Array< half_t, N > > | |
Cnegate< Array< T, N > > | |
CNumericArrayConverter | Conversion operator for Array |
CNumericArrayConverter< float, half_t, 2, Round > | Partial specialization for Array<float, 2> <= Array<half_t, 2>, round to nearest |
CNumericArrayConverter< float, half_t, N, Round > | Partial specialization for Array<half> <= Array<float> |
CNumericArrayConverter< half_t, float, 2, FloatRoundStyle::round_to_nearest > | Partial specialization for Array<half, 2> <= Array<float, 2>, round to nearest |
CNumericArrayConverter< half_t, float, N, Round > | Partial specialization for Array<half> <= Array<float> |
CNumericConverter | |
CNumericConverter< float, half_t, Round > | Partial specialization for float <= half_t |
CNumericConverter< half_t, float, FloatRoundStyle::round_to_nearest > | Specialization for round-to-nearest |
CNumericConverter< half_t, float, FloatRoundStyle::round_toward_zero > | Specialization for round-toward-zero |
CNumericConverter< int8_t, float, Round > | |
CNumericConverter< T, T, Round > | Partial specialization for float <= half_t |
CNumericConverterClamp | |
Cplus | |
Cplus< Array< half_t, N > > | |
Cplus< Array< T, N > > | |
►CPredicateVector | Statically sized array of bits implementing |
CConstIterator | An iterator implementing Predicate Iterator Concept enabling sequential read and write access to predicates |
CIterator | An iterator implementing Predicate Iterator Concept enabling sequential read and write access to predicates |
CTrivialIterator | Iterator that always returns true |
CRealType | Used to determine the real-valued underlying type of a numeric type T |
CRealType< complex< T > > | Partial specialization for complex-valued type |
CReferenceFactory | |
CReferenceFactory< Element, false > | |
CReferenceFactory< Element, true > | |
CScalarIO | Helper to enable formatted printing of CUTLASS scalar types to an ostream |
CSemaphore | CTA-wide semaphore for inter-CTA synchronization |
Csizeof_bits | Defines the size of an element in bits |
Csizeof_bits< Array< T, N, RegisterSized > > | Statically sized array for any data type |
Csizeof_bits< bin1_t > | Defines the size of an element in bits - specialized for bin1_t |
Csizeof_bits< int4b_t > | Defines the size of an element in bits - specialized for int4b_t |
Csizeof_bits< uint1b_t > | Defines the size of an element in bits - specialized for uint1b_t |
Csizeof_bits< uint4b_t > | Defines the size of an element in bits - specialized for uint4b_t |
Csqrt_est | |
CSubbyteReference | |
CTensor4DCoord | Defines a canonical 4D coordinate used by tensor operations |
CTensorRef | |
CTensorView | |
CTypeTraits | |
►CTypeTraits< complex< double > > | |
Cinteger_type | |
Cunsigned_type | |
CTypeTraits< complex< float > > | |
CTypeTraits< complex< half > > | |
CTypeTraits< complex< half_t > > | |
CTypeTraits< double > | |
CTypeTraits< float > | |
CTypeTraits< half_t > | |
CTypeTraits< int > | |
CTypeTraits< int64_t > | |
CTypeTraits< int8_t > | |
CTypeTraits< uint64_t > | |
CTypeTraits< uint8_t > | |
CTypeTraits< unsigned > | |
Cxor_add | Fused multiply-add |
►Nstd | STL namespace |
Cnumeric_limits< cutlass::half_t > | Numeric limits |
CDebugType | |
CDebugValue | |