CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
File List
Here is a list of all files with brief descriptions:
 aligned_buffer.h - AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory
 arch.h - Defines tags for architecture-specific configurations
 array.h - Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union
 array_subbyte.h - Statically sized array of sub-byte elements; accommodates the CUTLASS-supported numeric types smaller than one byte and is safe to use in a union
 batched_reduction.h - Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C
 batched_reduction_traits.h - Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C
 command_line.h
 complex.h
 conversion_op.h - Functor performing conversion operations used by epilogues
 coord.h - A Coord is a coordinate of arbitrary rank into a tensor or matrix
 core_io.h - Helpers for printing cutlass/core objects
 cutlass.h - Basic include for CUTLASS
 include/cutlass/util/debug.h - Debugging and logging functionality
 tools/util/include/cutlass/util/debug.h - Contains code for debugging cutlass code
 default_epilogue_complex_tensor_op.h - Epilogue for threadblock scoped complex GEMMs using Tensor Ops
 default_epilogue_simt.h - Epilogue for threadblock scoped GEMMs using SIMT
 default_epilogue_tensor_op.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 default_epilogue_volta_tensor_op.h - Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta
 default_epilogue_wmma_tensor_op.h - Epilogue for threadblock scoped GEMMs using WMMA Tensor Ops
 default_gemm.h - Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue
 default_gemm_configuration.h - Definitions for GEMM structures
 default_gemm_splitk_parallel.h - Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue
 default_gemv.h
 default_gemv_core.h - Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 default_mma_core.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_simt.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm50.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm70.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_sm75.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_core_wmma.h - Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes
 default_mma_tensor_op.h - Default warp-level GEMM operators selected by data type, size, and layouts of operands
 default_mma_wmma_tensor_op.h - Default warp-level GEMM operators selected by data type, size, and layouts of operands
 default_thread_map_simt.h
 default_thread_map_tensor_op.h
 default_thread_map_volta_tensor_op.h
 default_thread_map_wmma_tensor_op.h
 device_dump.h - C++ interface to dump fragments and shared memory contents for debugging
 device_kernel.h - Template for generic CUTLASS kernel
 device_memory.h - C++ interface to CUDA device memory management functions
 direct_epilogue_tensor_op.h - Epilogue for tensor operations
 distribution.h - Defines a class that parametrizes a statistical distribution function
 epilogue.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 epilogue_base.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 epilogue_workspace.h - Epilogue for threadblock scoped GEMMs
 exceptions.h - C++ exception semantics for CUDA error codes
 fast_math.h - Math utilities
 fragment_iterator_complex_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_simt.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_volta_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 fragment_iterator_wmma_tensor_op.h - Defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation
 functional.h - Defines basic numeric operators with specializations for Array<T, N>, SIMD-ized where possible (see the Array sketch after this list)
 include/cutlass/gemm/device/gemm.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K (a usage sketch follows this list)
 include/cutlass/gemm/gemm.h - Defines common types used for all GEMM-like operators
 include/cutlass/gemm/kernel/gemm.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 tools/util/include/cutlass/util/reference/device/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/device/kernel/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/device/thread/gemm.h - Reference implementation for GEMM in device-side code
 tools/util/include/cutlass/util/reference/host/gemm.h - Reference implementation for GEMM in host-side code
 device/gemm_batched.h - Template for a pipelined batched GEMM kernel
 kernel/gemm_batched.h - Template for a pipelined batched GEMM kernel
 include/cutlass/gemm/device/gemm_complex.h - Template for a pipelined complex-valued GEMM kernel. Does not compute batching or support split-K
 tools/util/include/cutlass/util/reference/host/gemm_complex.h - Reference implementation for complex-valued GEMM in host-side code
 gemm_pipelined.h - Template for a pipelined GEMM kernel. Does not compute batching or support split-K
 device/gemm_splitk_parallel.h - Template for GEMM performing a reduction over K partitions in parallel
 kernel/gemm_splitk_parallel.h - Template for GEMM performing a reduction over K partitions in parallel
 gemv.h - Template for a threadblock-scoped GEMV kernel
 gemv_batched_strided.h
 half.h - Defines a class for using IEEE half-precision floating-point types in host or device code (a usage sketch follows this list)
 host_reorder.h - Reorder data from the host side
 host_tensor.h - HostTensor manages allocations for both host and device memory (a usage sketch follows this list)
 inner_product.h - Reference implementation for GEMM in host-side code
 integer_subbyte.h - Defines a class for using integer types smaller than one byte in host or device code
 interleaved_epilogue.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 kernel_launch.h - Defines structures and helpers to launch CUDA kernels within CUTLASS
 layout.h - Defines layout functions used by TensorRef and derived classes
 library.h - CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS
 linear_combination.h - Functor performing linear combination operations used by epilogues
 linear_combination_clamp.h - Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type
 linear_combination_relu.h - Functor performing linear combination operations used by epilogues. Values are clamped before converting to the output element type
 manifest.h - Manifest of CUTLASS Library
 layout/matrix.h - Defines layout functions used by TensorRef and derived classes
 thread/matrix.h - Defines a matrix object intended for storing data in registers and operations within a CUDA thread
 matrix_coord.h - Defines a canonical coordinate for rank=2 matrices offering named indices
 matrix_shape.h - Defines a Shape template for matrix tiles
 matrix_traits.h - Defines properties of matrices used to denote layout and operands to GEMM kernels
 memory.h - Architecture-specific operators on memory
 memory_sm75.h - Architecture-specific operators on memory added for SM75
 arch/mma.h - Templates exposing architecture support for multiply-add operations
 gemm/thread/mma.h - Templates exposing architecture support for thread-level multiply-add operations
 gemm/warp/mma.h - Templates exposing architecture support for warp-level multiply-add operations
 mma_base.h - Template for a double-buffered threadblock-scoped GEMM kernel
 mma_complex_tensor_op.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_pipelined.h - Template for a double-buffered threadblock-scoped GEMM kernel
 mma_simt.h - Templates implementing warp-level matrix multiply-accumulate operations
 mma_simt_policy.h - Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions
 mma_simt_tile_iterator.h - Defines iterators used by warp-level matrix multiply operations targeting SIMT instructions
 mma_singlestage.h - Template for a single-stage threadblock-scoped GEMM kernel
 arch/mma_sm50.h - Matrix multiply
 gemm/thread/mma_sm50.h - Templates exposing architecture support for multiply-add operations
 arch/mma_sm60.h - Matrix multiply
 gemm/thread/mma_sm60.h - Templates exposing architecture support for multiply-add operations
 arch/mma_sm61.h - Matrix multiply
 gemm/thread/mma_sm61.h - Templates exposing architecture support for multiply-add operations
 mma_sm70.h - Matrix multiply
 mma_sm75.h - Matrix multiply for SM75
 mma_tensor_op.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_tensor_op_policy.h - Policy describing implementation details of warp-level GEMM targeting Tensor Cores
 mma_tensor_op_sm70.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 mma_tensor_op_tile_iterator.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_tile_iterator_sm70.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_tile_iterator_wmma.h - Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores
 mma_tensor_op_wmma.h - Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores
 numeric_conversion.h - Boost-like numeric conversion operator for CUTLASS numeric types (a usage sketch follows this list)
 numeric_types.h - Top-level include for all CUTLASS numeric types
 output_tile_thread_map.h - Metaprogram for determining the mapping of output elements to threads for epilogue tiles
 pitch_linear.h - Defines layout functions used by TensorRef and derived classes for pitch-linear memory
 pitch_linear_thread_map.h - Templates implementing how threads are mapped to a given tile
 platform.h - C++ features that may be otherwise unimplemented for CUDA device functions
 predicate_vector.h - Defines container classes and iterators for managing a statically sized vector of boolean predicates
 predicated_tile_access_iterator.h - Templates calculating the addresses and predicates needed to load tiles from pitch-linear rank=2 tensors
 predicated_tile_access_iterator_2dthreadtile.h - Templates calculating the addresses and predicates needed to load tiles from pitch-linear rank=2 tensors
 epilogue/threadblock/predicated_tile_iterator.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 transform/threadblock/predicated_tile_iterator.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 predicated_tile_iterator_2dthreadtile.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 real.h
 reduce.h - Defines basic thread level reduction with specializations for Array<T, N>
 reduce_split_k.h - Kernel performing a reduction over densely packed tensors in global memory
 reduction_op.h - Functor performing reduction operations used by epilogues
 reduction_operators.h - Kernel performing a reduction over densely packed tensors in global memory
 regular_tile_access_iterator.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_access_iterator_pitch_linear.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_access_iterator_tensor_op.h - Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors
 regular_tile_iterator.h - Templates implementing storing of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_pitch_linear.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_pitch_linear_2dthreadtile.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_tensor_op.h - Templates implementing storing of tiles from pitch-linear rank=2 tensors
 regular_tile_iterator_tensor_op_sm70.h - Templates implementing loading of tiles from pitch-linear rank=2 tensors
 relatively_equal.h
 semaphore.h - Implementation of a CTA-wide semaphore for inter-CTA synchronization
 shared_load_iterator.h - Epilogue for threadblock scoped GEMMs using Tensor Ops
 simd.h - Templates exposing SIMD operators
 simd_sm60.h - Templates exposing SIMD operators for SM60
 simd_sm61.h - Templates exposing SIMD operators for SM61
 simt_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration
 subbyte_reference.h - Provides a mechanism for packing and unpacking elements smaller than one byte
 tensor.h - Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats
 device/tensor_compare.h
 host/tensor_compare.h
 tensor_coord.h - Defines a canonical coordinate for rank=4 tensors offering named indices
 tensor_copy.h
 device/kernel/tensor_elementwise.h
 host/tensor_elementwise.h
 device/tensor_fill.h
 host/tensor_fill.h
 device/kernel/tensor_foreach.h
 device/tensor_foreach.h
 host/tensor_foreach.h
 tensor_norm.h
 tensor_op_multiplicand_sm70.h
 tensor_op_multiplicand_sm75.h
 tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
 tensor_ref.h - Defines a structure containing strides, bounds, and a pointer to tensor data
 tensor_view.h - Defines a structure containing strides and a pointer to tensor data
 tensor_view_io.h
 gemm/threadblock/threadblock_swizzle.h - Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems
 reduction/threadblock_swizzle.h - Defines functors for mapping blockIdx to partitions of the batched reduction computation
 tile_iterator_simt.h
 tile_iterator_tensor_op.h
 tile_iterator_volta_tensor_op.h
 tile_iterator_wmma_tensor_op.h
 transpose.h - Basic copy routines for tensor views
 type_traits.h - Type traits for common CUDA types
 vector.h - Defines layout functions used for rank=1 vectors
 volta_tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
 wmma.h - Templates exposing architecture support for warp matrix multiply-add (WMMA) operations
 wmma_array.h - Statically sized array of WMMA fragments, safe to use in a union
 wmma_ptx.h - Templates exposing warp matrix multiply-add (WMMA) operations
 wmma_sm70.h - Matrix multiply
 wmma_sm72.h - Matrix multiply
 wmma_sm75.h - Matrix multiply
 wmma_tensor_op_policy.h - Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration
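
Usage sketches for a few of the headers above follow. First, functional.h's operator specializations for Array<T, N>: a minimal sketch assuming CUTLASS 2.x; the element type, array width, and the function name add_arrays are illustrative.

    #include "cutlass/array.h"
    #include "cutlass/functional.h"

    // Element-wise sum of two statically sized arrays using the
    // Array<T, N> specialization of cutlass::plus from functional.h.
    CUTLASS_HOST_DEVICE
    cutlass::Array<float, 4> add_arrays(cutlass::Array<float, 4> const &a,
                                        cutlass::Array<float, 4> const &b) {
      cutlass::plus<cutlass::Array<float, 4>> add;
      return add(a, b);
    }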
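The device-level GEMM template in include/cutlass/gemm/device/gemm.h is typically instantiated and invoked as below: a sketch assuming single-precision, column-major operands and all default threadblock/epilogue parameters; run_sgemm and the pointer arguments are placeholders.

    #include "cutlass/gemm/device/gemm.h"

    // Single-precision GEMM with defaults for everything beyond element
    // types and layouts. Computes D = alpha * A * B + beta * C.
    using Gemm = cutlass::gemm::device::Gemm<
        float, cutlass::layout::ColumnMajor,   // ElementA, LayoutA
        float, cutlass::layout::ColumnMajor,   // ElementB, LayoutB
        float, cutlass::layout::ColumnMajor>;  // ElementC/D, LayoutC/D

    cutlass::Status run_sgemm(int M, int N, int K,
                              float alpha, float const *A, int lda,
                              float const *B, int ldb,
                              float beta, float *C, int ldc) {
      Gemm gemm_op;
      // Arguments: problem size, TensorRefs for A, B, C, and D
      // (here D aliases C), then the epilogue scalars.
      return gemm_op({{M, N, K}, {A, lda}, {B, ldb}, {C, ldc}, {C, ldc}, {alpha, beta}});
    }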
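half.h's cutlass::half_t behaves like a built-in arithmetic type on both host and device; a host-side sketch (the printed value follows from the arithmetic shown).

    #include <iostream>
    #include "cutlass/numeric_types.h"   // pulls in cutlass/half.h

    int main() {
      cutlass::half_t x(2.25f);
      cutlass::half_t y(0.5f);
      // Arithmetic operators are defined directly on half_t.
      cutlass::half_t z = x * y + cutlass::half_t(1.0f);
      std::cout << float(z) << std::endl;  // 2.25 * 0.5 + 1 = 2.125
      return 0;
    }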
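host_tensor.h's HostTensor pairs a host allocation with a device allocation and copies between them on request; a sketch, assuming the tools/util headers are on the include path and an illustrative 128x64 extent.

    #include "cutlass/layout/matrix.h"
    #include "cutlass/util/host_tensor.h"

    void host_tensor_example() {
      // Allocates both host and device memory for a 128x64 column-major matrix.
      cutlass::HostTensor<float, cutlass::layout::ColumnMajor> tensor({128, 64});

      // Write on the host, then mirror the contents into the device allocation.
      tensor.host_view().at({0, 0}) = 1.0f;
      tensor.sync_device();

      // tensor.device_data() / tensor.device_ref() may be passed to kernels;
      // after a kernel writes results, copy them back for inspection.
      tensor.sync_host();
    }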
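Finally, numeric_conversion.h's NumericConverter applies a rounding style while converting between CUTLASS numeric types; a sketch converting float to half_t (the round style shown is the default).

    #include "cutlass/numeric_conversion.h"
    #include "cutlass/numeric_types.h"

    // Convert a float to half_t with round-to-nearest.
    cutlass::half_t convert_to_half(float x) {
      cutlass::NumericConverter<
          cutlass::half_t, float,
          cutlass::FloatRoundStyle::round_to_nearest> converter;
      return converter(x);
    }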