CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers
aligned_buffer.h | AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory |
arch.h | Defines tags for architecture-specific configurations |
array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
array_subbyte.h | Statically sized array specialized for elements smaller than one byte that is safe to use in a union |
batched_reduction.h | Implements an efficient, software-pipelined batched reduction: D = alpha * Reduction(A) + beta * C |
batched_reduction_traits.h | Defines structural properties of the complete batched reduction: D = alpha * Reduction(A) + beta * C |
command_line.h | Utilities for parsing command-line arguments |
complex.h | Defines complex-valued types and arithmetic for CUTLASS |
conversion_op.h | Functor performing conversion operations used by epilogues |
coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix |
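For intuition, a minimal plain-C++ sketch of the kind of rank-parameterized coordinate coord.h provides (illustrative only, not the CUTLASS API):

```cpp
#include <array>

// Illustrative sketch: a rank-N integer coordinate supporting element-wise
// addition and dot products, as used for indexing tensors and matrices.
template <int Rank>
struct Coord {
  std::array<int, Rank> idx;

  int &operator[](int i) { return idx[i]; }
  int const &operator[](int i) const { return idx[i]; }

  // Element-wise sum of two coordinates.
  Coord operator+(Coord const &other) const {
    Coord result{};
    for (int i = 0; i < Rank; ++i) result.idx[i] = idx[i] + other.idx[i];
    return result;
  }

  // Dot product: coordinate . stride yields a linear memory offset.
  long long dot(Coord const &other) const {
    long long sum = 0;
    for (int i = 0; i < Rank; ++i) sum += (long long)idx[i] * other.idx[i];
    return sum;
  }
};
```

Coordinates of this shape serve both as indices and as strides; their dot product yields a linear offset into memory.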
core_io.h | Helpers for printing cutlass/core objects |
cutlass.h | Basic include for CUTLASS |
include/cutlass/util/debug.h | Debugging and logging functionality |
tools/util/include/cutlass/util/debug.h | Debugging utilities for CUTLASS code |
default_epilogue_complex_tensor_op.h | Epilogue for threadblock scoped complex GEMMs using Tensor Ops |
default_epilogue_simt.h | Epilogue for threadblock scoped GEMMs using SIMT |
default_epilogue_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
default_epilogue_volta_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta |
default_epilogue_wmma_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
default_gemm.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
default_gemm_configuration.h | Definitions for GEMM structures |
default_gemm_splitk_parallel.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
default_gemv.h | |
default_gemv_core.h | Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
default_mma_core.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_core_simt.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_core_sm50.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_core_sm70.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_core_sm75.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_core_wmma.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
default_mma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
default_mma_wmma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
default_thread_map_simt.h | |
default_thread_map_tensor_op.h | |
default_thread_map_volta_tensor_op.h | |
default_thread_map_wmma_tensor_op.h | |
device_dump.h | C++ interface to dump fragments and shared memory contents for debugging |
device_kernel.h | Template for generic CUTLASS kernel |
device_memory.h | C++ interface to CUDA device memory management functions |
direct_epilogue_tensor_op.h | Epilogue for tensor operations |
distribution.h | This header contains a class to parametrize a statistical distribution function |
epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
epilogue_base.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
epilogue_workspace.h | Epilogue for threadblock scoped GEMMs |
exceptions.h | C++ exception semantics for CUDA error codes |
fast_math.h | Math utilities |
fragment_iterator_complex_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
fragment_iterator_simt.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
fragment_iterator_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
fragment_iterator_volta_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
fragment_iterator_wmma_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
functional.h | Define basic numeric operators with specializations for Array<T, N>. SIMD-ize where possible |
include/cutlass/gemm/device/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
include/cutlass/gemm/gemm.h | Defines common types used for all GEMM-like operators |
include/cutlass/gemm/kernel/gemm.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
tools/util/include/cutlass/util/reference/device/gemm.h | Reference implementation for GEMM in device-side code |
tools/util/include/cutlass/util/reference/device/kernel/gemm.h | Reference implementation for GEMM in device-side code |
tools/util/include/cutlass/util/reference/device/thread/gemm.h | Reference implementation for GEMM in device-side code |
tools/util/include/cutlass/util/reference/host/gemm.h | Reference implementation for GEMM in host-side code |
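The reference GEMM headers above implement the textbook triple loop; a self-contained host-side sketch (illustrative, not the CUTLASS code) of D = alpha * A * B + beta * C for row-major matrices:

```cpp
#include <vector>

// Illustrative reference GEMM: computes D = alpha * A * B + beta * C
// with A (M x K), B (K x N), C and D (M x N), all row-major.
std::vector<float> reference_gemm(
    int M, int N, int K, float alpha,
    std::vector<float> const &A,    // M x K
    std::vector<float> const &B,    // K x N
    float beta,
    std::vector<float> const &C) {  // M x N
  std::vector<float> D(M * N);
  for (int m = 0; m < M; ++m) {
    for (int n = 0; n < N; ++n) {
      float acc = 0.f;  // accumulate the inner product over K
      for (int k = 0; k < K; ++k) acc += A[m * K + k] * B[k * N + n];
      D[m * N + n] = alpha * acc + beta * C[m * N + n];
    }
  }
  return D;
}
```

This is the semantics against which the device kernels are verified; the CUTLASS kernels restructure the same arithmetic into tiled, pipelined form.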
device/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
kernel/gemm_batched.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
include/cutlass/gemm/device/gemm_complex.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
tools/util/include/cutlass/util/reference/host/gemm_complex.h | Reference implementation for complex-valued GEMM in host-side code |
gemm_pipelined.h | Template for a pipelined GEMM kernel. Does not compute batching or support split-K |
device/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
kernel/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
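For intuition on the split-K strategy these headers implement: the K dimension is divided into partitions, each partition produces a partial product, and a separate reduction (see reduce_split_k.h) sums the partials. A serial plain-C++ sketch of the arithmetic, reduced to a single dot product (illustrative only):

```cpp
#include <algorithm>
#include <vector>

// Illustrative split-K sketch: each of `partitions` slices of the K range
// accumulates independently; a final reduction sums the partials, giving
// the same result as a single pass over K.
float splitk_dot(std::vector<float> const &a, std::vector<float> const &b,
                 int partitions) {
  int K = (int)a.size();
  int slice = (K + partitions - 1) / partitions;  // ceil(K / partitions)
  std::vector<float> partial(partitions, 0.f);
  for (int p = 0; p < partitions; ++p) {          // each partition works independently
    for (int k = p * slice; k < std::min(K, (p + 1) * slice); ++k)
      partial[p] += a[k] * b[k];
  }
  float sum = 0.f;                                // the split-K reduction step
  for (float v : partial) sum += v;
  return sum;
}
```

On the GPU the partitions run in parallel CTAs, which is why the reduction must be a distinct step.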
gemv.h | Template for a threadblock-scoped GEMV kernel |
gemv_batched_strided.h | |
half.h | Defines a class for using IEEE half-precision floating-point types in host or device code |
host_reorder.h | Reorder data from the host side |
host_tensor.h | HostTensor provides management for both host and device memory |
inner_product.h | Reference implementation for GEMM in host-side code |
integer_subbyte.h | Defines a class for using integer types smaller than one byte in host or device code |
interleaved_epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
kernel_launch.h | Defines structures and helpers to launch CUDA kernels within CUTLASS |
layout.h | Defines layout functions used by TensorRef and derived classes |
library.h | CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS |
linear_combination.h | Functor performing linear combination operations used by epilogues |
linear_combination_clamp.h | Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type |
linear_combination_relu.h | Functor performing linear combination operations used by epilogues. Values are clamped before converting to the output element type |
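A sketch of what such an epilogue functor computes (plain C++, illustrative; the real functors operate on Array<T, N> fragments and include the output conversion):

```cpp
#include <algorithm>

// Illustrative epilogue functor in the spirit of linear_combination_relu.h:
// scale the accumulator, add the scaled source, clamp negatives to zero.
struct LinearCombinationRelu {
  float alpha;
  float beta;

  float operator()(float accumulator, float source) const {
    float intermediate = alpha * accumulator + beta * source;
    return std::max(0.0f, intermediate);  // ReLU applied before output store
  }
};
```

Fusing the activation into the epilogue avoids a second pass over the output tensor.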
manifest.h | Manifest of CUTLASS Library |
layout/matrix.h | Defines layout functions used by TensorRef and derived classes |
thread/matrix.h | Defines a matrix object intended for storing data in registers and operations within a CUDA thread |
matrix_coord.h | Defines a canonical coordinate for rank=2 matrices offering named indices |
matrix_shape.h | Defines a Shape template for matrix tiles |
matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels |
memory.h | Architecture-specific operators on memory |
memory_sm75.h | Architecture-specific operators on memory added for SM75 |
arch/mma.h | Templates exposing architecture support for multiply-add operations |
gemm/thread/mma.h | Templates exposing architecture support for warp-level multiply-add operations |
gemm/warp/mma.h | Templates exposing architecture support for warp-level multiply-add operations |
mma_base.h | Template for a double-buffered threadblock-scoped GEMM kernel |
mma_complex_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
mma_pipelined.h | Template for a double-buffered threadblock-scoped GEMM kernel |
mma_simt.h | Templates implementing warp-level matrix multiply-accumulate operations |
mma_simt_policy.h | Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions |
mma_simt_tile_iterator.h | Defines tile iterators used by warp-level matrix multiply operators targeting SIMT instructions |
mma_singlestage.h | Template for a double-buffered threadblock-scoped GEMM kernel |
arch/mma_sm50.h | Matrix multiply |
gemm/thread/mma_sm50.h | Templates exposing architecture support for multiply-add operations |
arch/mma_sm60.h | Matrix multiply |
gemm/thread/mma_sm60.h | Templates exposing architecture support for multiply-add operations |
arch/mma_sm61.h | Matrix multiply |
gemm/thread/mma_sm61.h | Templates exposing architecture support for multiply-add operations |
mma_sm70.h | Matrix multiply |
mma_sm75.h | Matrix multiply for SM75 |
mma_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
mma_tensor_op_policy.h | Policy describing implementation details of warp-level GEMM targeting Tensor Cores |
mma_tensor_op_sm70.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
mma_tensor_op_tile_iterator.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
mma_tensor_op_tile_iterator_sm70.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
mma_tensor_op_tile_iterator_wmma.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
mma_tensor_op_wmma.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
numeric_conversion.h | Boost-like numeric conversion operator for CUTLASS numeric types |
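A sketch of the saturating scalar conversion this header generalizes (illustrative; the hypothetical `saturate_to_int8` below is not the CUTLASS NumericConverter API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Illustrative saturating conversion: round to nearest, then clamp to the
// destination type's representable range, as required before storing
// int8 GEMM outputs.
int8_t saturate_to_int8(float x) {
  float rounded = std::nearbyint(x);                       // round to nearest
  rounded = std::min(127.0f, std::max(-128.0f, rounded));  // clamp to int8 range
  return (int8_t)rounded;
}
```

Clamping before the narrowing cast is what distinguishes a saturating conversion from a plain C-style cast, whose behavior is undefined for out-of-range values.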
numeric_types.h | Top-level include for all CUTLASS numeric types |
output_tile_thread_map.h | Metaprogram for determining the mapping of output elements to threads for epilogue tiles |
pitch_linear.h | Defines layout functions used by TensorRef and derived classes for pitch-linear memory |
pitch_linear_thread_map.h | Templates implementing how threads are mapped to a given tile |
platform.h | C++ features that may be otherwise unimplemented for CUDA device functions |
predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates |
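A sketch of the idea (illustrative, not the CUTLASS API): predicates are packed 32 per machine word so a tile iterator can cheaply test whether each access is in bounds.

```cpp
#include <cstdint>
#include <vector>

// Illustrative bit-packed predicate vector: one guard bit per access,
// stored 32 to a word.
struct PredicateVector {
  std::vector<uint32_t> words;

  explicit PredicateVector(int n) : words((n + 31) / 32, 0) {}

  void set(int i, bool value) {
    if (value) words[i / 32] |=  (1u << (i % 32));
    else       words[i / 32] &= ~(1u << (i % 32));
  }

  bool get(int i) const { return (words[i / 32] >> (i % 32)) & 1u; }
};
```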
predicated_tile_access_iterator.h | Templates calculating the address and predicates for loading tiles from pitch-linear rank=2 tensors |
predicated_tile_access_iterator_2dthreadtile.h | Templates calculating the address and predicates for loading tiles from pitch-linear rank=2 tensors |
epilogue/threadblock/predicated_tile_iterator.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
transform/threadblock/predicated_tile_iterator.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
predicated_tile_iterator_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
real.h | |
reduce.h | Defines basic thread level reduction with specializations for Array<T, N> |
reduce_split_k.h | Kernel performing a reduction over densely packed tensors in global memory |
reduction_op.h | Functor performing reduction operations used by epilogues |
reduction_operators.h | Kernel performing a reduction over densely packed tensors in global memory |
regular_tile_access_iterator.h | Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors |
regular_tile_access_iterator_pitch_linear.h | Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors |
regular_tile_access_iterator_tensor_op.h | Templates implementing address computation for storing tiles to pitch-linear rank=2 tensors |
regular_tile_iterator.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
regular_tile_iterator_pitch_linear.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
regular_tile_iterator_pitch_linear_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
regular_tile_iterator_tensor_op.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
regular_tile_iterator_tensor_op_sm70.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
relatively_equal.h | Performs comparisons of floating-point values within a relative tolerance |
semaphore.h | Implementation of a CTA-wide semaphore for inter-CTA synchronization |
shared_load_iterator.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
simd.h | Templates exposing SIMD operators |
simd_sm60.h | Templates exposing SIMD operators for SM60 |
simd_sm61.h | Templates exposing SIMD operators for SM61 |
simt_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration |
subbyte_reference.h | Provides a mechanism for packing and unpacking elements smaller than one byte |
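A sketch of why such a mechanism is needed (illustrative): two 4-bit values share one byte, so element access must shift and mask within the containing byte rather than dereference a pointer.

```cpp
#include <cstdint>

// Illustrative sub-byte packing helpers (not the CUTLASS SubbyteReference
// API): two unsigned 4-bit values packed into one byte.
uint8_t pack_u4x2(uint8_t lo, uint8_t hi) {
  return (uint8_t)((lo & 0xF) | ((hi & 0xF) << 4));
}

uint8_t unpack_u4(uint8_t byte, int index) {  // index 0 = low nibble
  return (byte >> (4 * index)) & 0xF;
}
```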
tensor.h | Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats |
device/tensor_compare.h | Comparison operations on tensors in device memory |
host/tensor_compare.h | Comparison operations on tensors in host memory |
tensor_coord.h | Defines a canonical coordinate for rank=4 tensors offering named indices |
tensor_copy.h | |
device/kernel/tensor_elementwise.h | |
host/tensor_elementwise.h | |
device/tensor_fill.h | Routines for filling tensors in device memory with values |
host/tensor_fill.h | Routines for filling tensors in host memory with values |
device/kernel/tensor_foreach.h | |
device/tensor_foreach.h | |
host/tensor_foreach.h | |
tensor_norm.h | |
tensor_op_multiplicand_sm70.h | |
tensor_op_multiplicand_sm75.h | |
tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data |
tensor_view.h | Defines a structure containing strides and a pointer to tensor data |
tensor_view_io.h | Output stream operators for TensorView objects |
gemm/threadblock/threadblock_swizzle.h | Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems |
reduction/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the batched reduction computation |
tile_iterator_simt.h | |
tile_iterator_tensor_op.h | |
tile_iterator_volta_tensor_op.h | |
tile_iterator_wmma_tensor_op.h | |
transpose.h | Basic copy routines for tensor views |
type_traits.h | Type traits for common CUDA types |
vector.h | Defines layout functions used for rank=1 vectors |
volta_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
wmma.h | Templates exposing architecture support for warp matrix multiply-add (WMMA) operations |
wmma_array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
wmma_ptx.h | Templates exposing warp matrix multiply-add (WMMA) operations |
wmma_sm70.h | Matrix multiply |
wmma_sm72.h | Matrix multiply |
wmma_sm75.h | Matrix multiply |
wmma_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |