CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
File List
Here is a list of all files with brief descriptions:
 aligned_buffer.h - AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory
 arch.h - Defines tags for architecture-specific configurations
 array.h - Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union
 array_subbyte.h - Statically sized array of sub-byte elements; accommodates the CUTLASS-supported numeric types smaller than one byte and is safe to use in a union
 batched_reduction.h - Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C
 batched_reduction_traits.h - Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C
 command_line.h
 complex.h
 conversion_op.h - Functor performing conversion operations used by epilogues
 coord.h - A Coord is a coordinate of arbitrary rank into a tensor or matrix
 core_io.h - Helpers for printing cutlass/core objects
 cutlass.h - Basic include for CUTLASS
 include/cutlass/util/debug.h - Debugging and logging functionality
 tools/util/include/cutlass/util/debug.h - Contains code for debugging cutlass code
 default_epilogue_complex_tensor_op.h - Epilogue for threadblock scoped complex GEMMs using Tensor Ops
 default_epilogue_simt.h - Epilogue for threadblock scoped GEMMs using SIMT
 default_epilogue_tensor_op.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 default_epilogue_volta_tensor_op.h - Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta
 default_epilogue_wmma_tensor_op.h - Epilogue for threadblock scoped GEMMs using WMMA Tensor Ops
 default_gemm.h - Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue
 default_gemm_configuration.h - Definitions for GEMM structures
 default_gemm_splitk_parallel.h - Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue
 default_gemv.h
 default_gemv_core.h - Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 default_mma_core.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_simt.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm50.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm70.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm75.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_wmma.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_tensor_op.h - Default warp-level GEMM operators selected by data type, size, and layouts of operands
 default_mma_wmma_tensor_op.h - Default warp-level GEMM operators selected by data type, size, and layouts of operands
 default_thread_map_simt.h
 default_thread_map_tensor_op.h
 default_thread_map_volta_tensor_op.h
 default_thread_map_wmma_tensor_op.h
 device_dump.h - C++ interface to dump fragments and shared memory contents for debugging
 device_kernel.h - Template for generic CUTLASS kernel
 device_memory.h - C++ interface to CUDA device memory management functions
 direct_epilogue_tensor_op.h - Epilogue for tensor operations
 distribution.h - Defines a class that parametrizes a statistical distribution function
 epilogue.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 epilogue_base.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 epilogue_workspace.h - Epilogue for threadblock scoped GEMMs
 exceptions.h - C++ exception semantics for CUDA error codes
 fast_math.h - Math utilities
 fragment_iterator_complex_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_simt.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_volta_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_wmma_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 functional.h - Defines basic numeric operators with specializations for Array<T, N>, SIMD-ized where possible (see the Array sketch after this list)
 include/cutlass/gemm/device/gemm.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K (a usage sketch follows this list)
 include/cutlass/gemm/gemm.h - Defines common types used for all GEMM-like operators
 include/cutlass/gemm/kernel/gemm.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 tools/util/include/cutlass/util/reference/device/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/device/kernel/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/device/thread/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/host/gemm.h - Reference implementation for GEMM in host-side code
 device/gemm_batched.h - Template for a pipelined batched GEMM kernel
 kernel/gemm_batched.h - Template for a pipelined batched GEMM kernel
 include/cutlass/gemm/device/gemm_complex.h - Template for a pipelined complex-valued GEMM kernel. Does not compute batching or support split-K
 tools/util/include/cutlass/util/reference/host/gemm_complex.h - Reference implementation for complex-valued GEMM in host-side code
 gemm_pipelined.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 device/gemm_splitk_parallel.h - Template for GEMM performing a reduction over K partitions in parallel
 kernel/gemm_splitk_parallel.h - Template for GEMM performing a reduction over K partitions in parallel
 gemv.h - Template for a threadblock-scoped GEMV kernel
 gemv_batched_strided.h
 half.h - Defines a class for using IEEE half-precision floating-point types in host or device code (a usage sketch follows this list)
 host_reorder.h - Reorder data from the host side
 host_tensor.h - HostTensor manages allocations for both host and device memory (a usage sketch follows this list)
 inner_product.h - Reference implementation for GEMM in host-side code
 integer_subbyte.h - Defines a class for using integer types smaller than one byte in host or device code
 interleaved_epilogue.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 kernel_launch.h - Defines structures and helpers to launch CUDA kernels within CUTLASS
 layout.h - Defines layout functions used by TensorRef and derived classes
 library.h - CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS
 linear_combination.h - Functor performing linear combination operations used by epilogues
 linear_combination_clamp.h - Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type
 linear_combination_relu.h - Functor performing linear combination operations used by epilogues. Values are clamped before converting to the output element type
 manifest.h - Manifest of CUTLASS Library
 layout/matrix.h - Defines layout functions used by TensorRef and derived classes
 thread/matrix.h - Defines a matrix object intended for storing data in registers and operations within a CUDA thread
 matrix_coord.h - Defines a canonical coordinate for rank=2 matrices offering named indices
 matrix_shape.h - Defines a Shape template for matrix tiles
 matrix_traits.h - Defines properties of matrices used to denote layout and operands to GEMM kernels
 memory.h - Architecture-specific operators on memory
 memory_sm75.h - Architecture-specific operators on memory added for SM75
 arch/mma.h - Templates exposing architecture support for multiply-add operations
 gemm/thread/mma.h - Templates exposing architecture support for thread-level multiply-add operations
 gemm/warp/mma.h - Templates exposing architecture support for warp-level multiply-add operations
 mma_base.h - Template for a double-buffered threadblock-scoped GEMM kernel
 mma_complex_tensor_op.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_pipelined.h - Template for a double-buffered threadblock-scoped GEMM kernel
 mma_simt.h - Templates implementing warp-level matrix multiply-accumulate operations
 mma_simt_policy.h - Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions
 mma_simt_tile_iterator.h - Defines iterators used by warp-level matrix multiply operations targeting SIMT instructions
 mma_singlestage.h - Template for a single-stage threadblock-scoped GEMM kernel
 arch/mma_sm50.h - Matrix multiply
 gemm/thread/mma_sm50.h - Templates exposing architecture support for multiply-add operations
 arch/mma_sm60.h - Matrix multiply
 gemm/thread/mma_sm60.h - Templates exposing architecture support for multiply-add operations
 arch/mma_sm61.h - Matrix multiply
 gemm/thread/mma_sm61.h - Templates exposing architecture support for multiply-add operations
 mma_sm70.h - Matrix multiply
 mma_sm75.h - Matrix multiply for SM75
 mma_tensor_op.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_tensor_op_policy.h - Policy describing implementation details of warp-level GEMM targeting Tensor Cores
 mma_tensor_op_sm70.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_tensor_op_tile_iterator.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_tile_iterator_sm70.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_tile_iterator_wmma.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_wmma.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 numeric_conversion.h - Boost-like numeric conversion operator for CUTLASS numeric types (a usage sketch follows this list)
 numeric_types.h - Top-level include for all CUTLASS numeric types
 output_tile_thread_map.h - Metaprogram for determining the mapping of output elements to threads for epilogue tiles
 pitch_linear.h - Defines layout functions used by TensorRef and derived classes for pitch-linear memory
 pitch_linear_thread_map.h - Templates implementing how threads are mapped to a given tile
 platform.h - C++ features that may be otherwise unimplemented for CUDA device functions
 predicate_vector.h - Defines container classes and iterators for managing a statically sized vector of boolean predicates
 predicated_tile_access_iterator.h - Templates calculating the addresses and predicates needed to load tiles from pitch-linear rank=2 tensors
 predicated_tile_access_iterator_2dthreadtile.h - Templates calculating the addresses and predicates needed to load tiles from pitch-linear rank=2 tensors
 epilogue/threadblock/predicated_tile_iterator.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 transform/threadblock/predicated_tile_iterator.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 predicated_tile_iterator_2dthreadtile.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 real.h
 reduce.h - Defines basic thread level reduction with specializations for Array<T, N>
 reduce_split_k.h - Kernel performing a reduction over densely packed tensors in global memory
 reduction_op.h - Functor performing reduction operations used by epilogues
 reduction_operators.h - Kernel performing a reduction over densely packed tensors in global memory
 regular_tile_access_iterator.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_access_iterator_pitch_linear.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_access_iterator_tensor_op.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_iterator.h - Templates implementing storing of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_pitch_linear.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_pitch_linear_2dthreadtile.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_tensor_op.h - Templates implementing storing of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_tensor_op_sm70.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 relatively_equal.h
 semaphore.h - Implementation of a CTA-wide semaphore for inter-CTA synchronization
 shared_load_iterator.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 simd.h - Templates exposing SIMD operators
 simd_sm60.h - Templates exposing SIMD operators for SM60
 simd_sm61.h - Templates exposing SIMD operators for SM61
 simt_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration
 subbyte_reference.h - Provides a mechanism for packing and unpacking elements smaller than one byte
 tensor.h - Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats
 device/tensor_compare.h
 host/tensor_compare.h
 tensor_coord.h - Defines a canonical coordinate for rank=4 tensors offering named indices
 tensor_copy.h
 device/kernel/tensor_elementwise.h
 host/tensor_elementwise.h
 device/tensor_fill.h
 host/tensor_fill.h
 device/kernel/tensor_foreach.h
 device/tensor_foreach.h
 host/tensor_foreach.h
 tensor_norm.h
 tensor_op_multiplicand_sm70.h
 tensor_op_multiplicand_sm75.h
 tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
 tensor_ref.h - Defines a structure containing strides, bounds, and a pointer to tensor data
 tensor_view.h - Defines a structure containing strides and a pointer to tensor data
 tensor_view_io.h
 gemm/threadblock/threadblock_swizzle.h - Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems
 reduction/threadblock_swizzle.h - Defines functors for mapping blockIdx to partitions of the batched reduction computation
 tile_iterator_simt.h
 tile_iterator_tensor_op.h
 tile_iterator_volta_tensor_op.h
 tile_iterator_wmma_tensor_op.h
 transpose.h - Basic copy routines for tensor views
 type_traits.h - Type traits for common CUDA types
 vector.h - Defines layout functions used for rank=1 vectors
 volta_tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
 wmma.h - Templates exposing architecture support for warp matrix multiply-add (WMMA) operations
 wmma_array.h - Statically sized array of WMMA fragments, safe to use in a union
 wmma_ptx.h - Templates exposing warp matrix multiply-add (WMMA) operations
 wmma_sm70.h - Matrix multiply
 wmma_sm72.h - Matrix multiply
 wmma_sm75.h - Matrix multiply
 wmma_tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
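
Usage sketches for a few of the headers above follow. First, functional.h's operator specializations for Array<T, N>: a minimal sketch assuming CUTLASS 2.x; the element type, array width, and the function name add_arrays are illustrative.

    #include "cutlass/array.h"
    #include "cutlass/functional.h"

    // Element-wise sum of two statically sized arrays using the
    // Array<T, N> specialization of cutlass::plus from functional.h.
    CUTLASS_HOST_DEVICE
    cutlass::Array<float, 4> add_arrays(cutlass::Array<float, 4> const &a,
                                        cutlass::Array<float, 4> const &b) {
      cutlass::plus<cutlass::Array<float, 4>> add;
      return add(a, b);
    }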
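The device-level GEMM template in include/cutlass/gemm/device/gemm.h is typically instantiated and invoked as below: a sketch assuming single-precision, column-major operands and all default threadblock/epilogue parameters; run_sgemm and the pointer arguments are placeholders.

    #include "cutlass/gemm/device/gemm.h"

    // Single-precision GEMM with defaults for everything beyond element
    // types and layouts. Computes D = alpha * A * B + beta * C.
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
        float, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
        float, cutlass::layout::ColumnMajor>;  // ElementC/D, LayoutC/D

    cutlass::Status run_sgemm(int M, int N, int K,
                              float alpha, float const *A, int lda,
                              float const *B, int ldb,
                              float beta, float *C, int ldc) {
      Gemm gemm_op;
      // Arguments: problem size, TensorRefs for A, B, C, and D
      // (here D aliases C), then the epilogue scalars.
      return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {alpha, beta}});
    }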
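half.h's cutlass::half_t behaves like a built-in arithmetic type on both host and device; a host-side sketch (the printed value follows from the arithmetic shown).

    #include <iostream>
    #include "cutlass/numeric_types.h"   // pulls in cutlass/half.h

    int main() {
      cutlass::half_t x(2.25f);
      cutlass::half_t y(0.5f);
      // Arithmetic operators are defined directly on half_t.
      cutlass::half_t z = x * y + cutlass::half_t(1.0f);
      std::cout << float(z) << std::endl;  // 2.25 * 0.5 + 1 = 2.125
      return 0;
    }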
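host_tensor.h's HostTensor pairs a host allocation with a device allocation and copies between them on request; a sketch, assuming the tools/util headers are on the include path and an illustrative 128x64 extent.

    #include "cutlass/layout/matrix.h"
    #include "cutlass/util/host_tensor.h"

    void host_tensor_example() {
      // Allocates both host and device memory for a 128x64 column-major matrix.
      cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({128, 64});

      // Write on the host, then mirror the contents into the device allocation.
      tensor.host_view().at({0, 0}) = 1.0f;
      tensor.sync_device();

      // tensor.device_data() / tensor.device_ref() may be passed to kernels;
      // after a kernel writes results, copy them back for inspection.
      tensor.sync_host();
    }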
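Finally, numeric_conversion.h's NumericConverter applies a rounding style while converting between CUTLASS numeric types; a sketch converting float to half_t (the round style shown is the default).

    #include "cutlass/numeric_conversion.h"
    #include "cutlass/numeric_types.h"

    // Convert a float to half_t with round-to-nearest.
    cutlass::half_t convert_to_half(float x) {
      cutlass::NumericConverter<
          cutlass::half_t, float,
          cutlass::FloatRoundStyle::round_to_nearest> converter;
      return converter(x);
    }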