Creating Tensors#
Basic construction of tensors in MatX is intended to be very simple with minimal parameters. This allows users of other languages to pick up the syntax quickly without understanding the underlying architecture. While using the simple API provides good performance, it lacks flexibility and can prevent your code from running at the highest performance possible. This document walks through the different ways to construct tensors, and when you should use certain methods over others.
A Quick Primer On MatX Types#
The basic type of tensor used in most examples and tests is the tensor_t object. tensor_t is the highest-level tensor class, and provides all of the abstractions for viewing and modifying data, holding storage, and any other metadata needed by a tensor. Because of their relatively large size, tensor_t objects are not meant to be passed to GPU devices. In fact, doing so will lead to a compiler error, since tensor_t uses types that are not available on the device at this time.
Within a tensor_t there is an abstract object called Storage (more on that later), and another inherited class called tensor_impl_t.
tensor_impl_t is a lightweight class containing only the minimum set of member variables needed to access the data from a GPU kernel: currently a tensor descriptor and a data pointer. Tensor descriptors are covered later in this document. tensor_impl_t also includes member functions for accessing and modifying the tensor, including the operator() functions (both const and non-const), helper functions for the shape (Size() and Stride()), and utilities for printing on the host. tensor_impl_t is the type that is passed into GPU kernels, and it contains only types that are compatible with CUDA. Furthermore, the total size of a tensor_impl_t object is kept as small as possible, since these objects can be replicated many times within a single complex expression. Reducing the size of tensor_impl_t allows for faster memory accesses, smaller copies before a kernel launch, and makes extending the code easier.
To convert between a tensor_t and a tensor_impl_t, a type trait called base_type is available and is used as follows:
typename base_type<I1>::type in1_ = in;
where in is the tensor_t object and in1_ will be a tensor_impl_t.
Tensor Constructor#
Where possible, tensors should always be created using make_tensor. This abstracts the type away from the user should any template types change in the future. The one exception is when tensors are used as class members (see below). When a tensor is created with the default constructor it is in an uninitialized state. Any access to the tensor results in undefined behavior, so it must be initialized using make_tensor before use.
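For example, here is a minimal sketch using the make_tensor forms shown later in this document:

```cpp
// Create and initialize a 2D 10x20 float tensor in one step
auto t = make_tensor<float>({10, 20});

// A default-constructed tensor is uninitialized; any access is undefined
// behavior until make_tensor is called on it
tensor_t<float, 2> t2;
make_tensor(t2, {10, 20}); // initialize in-place before use
```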
MatX Storage#
Within the tensor_t class is an abstract template parameter called Storage. Storage objects are always created from a basic_storage class, which provides all accessor functions common to the underlying storage. basic_storage can wrap raw pointers using the raw_pointer_buffer class, smart pointers using the smart_pointer_buffer class, or any RAII object that provides the required interface. If no user-defined storage is passed in, MatX defaults to allocating a raw CUDA managed memory pointer and backs it with a shared_ptr for garbage collection.
When not using implicitly-allocated memory, the user is free to define the storage container type, allocator, and ownership semantics. The container type requires const and non-const iterators, an allocate function (when applicable), a data() function to get the raw pointer, and a way to get the size. Currently both std::array and std::vector from the STL follow these semantics, as do both the raw and smart pointer MatX containers.
The allocator type is used when the user passes in a shape without a pointer to existing data. By default, the allocator is matx_allocator, a PMR-compatible allocator with stream semantics. The allocator is used for both allocation and deallocation when no user-provided pointer is passed in and ownership semantics are requested. If a pointer is provided, only the deallocator is used when ownership semantics have been requested.
In general, creating a tensor allows you to choose ownership semantics at creation. With the owning type, MatX takes ownership of the pointer and deallocates the memory when the last tensor using it goes out of scope. With the non_owning type, MatX uses the pointer, but performs no reference counting or deallocation when the tensor goes out of scope.
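As a sketch based on the raw_pointer_buffer example shown later in this document, the two ownership modes look like the following (ptr and bytes are assumed to be a valid device pointer and its allocation size):

```cpp
// Owning: MatX tracks ptr and frees it when the last tensor using it
// goes out of scope
raw_pointer_buffer<float, owning, matx_allocator<float>> ob{ptr, bytes};

// Non-owning: MatX uses ptr but performs no reference counting or deallocation
raw_pointer_buffer<float, non_owning, matx_allocator<float>> nb{ptr, bytes};
```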
Tensor Descriptors#
Tensor descriptors are a template type inside tensor_impl_t that provides information about the sizes and strides of the tensor. While descriptors are a simple concept, their implementation can have a large impact on performance if not tuned properly. Both the sizes and strides of the tensor are stored in a template class supporting iterators to access the metadata directly, plus utility functions for accessing and computing other values from the metadata. Descriptors are commonly stored as std::array types given its compile-time features, but any class meeting the accessor properties can be used.
Dynamic Descriptors#
Dynamic descriptors use storage in memory to describe the shapes and strides of a tensor. They can have lower performance than static descriptors since more memory accesses and offset calculations are needed when accessing tensors, but have higher flexibility given the data is only needed at runtime.
Dynamic descriptors should be used when either the sizes are not known at compile-time, or when interoperating with existing code. As mentioned in the introduction, the descriptor size is very important for both kernel performance and launch time. For this reason, the data types used to store both the shape and size can vary depending on the size of the tensor parameters. While shape and stride storage types must match in length, the underlying types used to store them can be different. This is useful in scenarios where the shape can be expressed as a smaller type than the strides.
Static Descriptors#
If the shapes and strides are known at compile time, static descriptors should be used. Static descriptors compute and store the shape and strides in constexpr variables, and provide constexpr functions to access both values. When used in a GPU kernel, calling either Size() or Stride() emits an immediate rvalue that the compiler can use for address calculations. This removes all loads and complex pointer arithmetic that could affect the runtime of a kernel.
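For instance, with the make_static_tensor helper shown later, the sizes are part of the tensor's type, so size queries compile down to constants:

```cpp
auto t = make_static_tensor<float, 10, 20, 30>();
// t.Size(0) is a constexpr value (10): inside a kernel the compiler can fold
// it directly into address calculations with no memory load
```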
Creating Tensors#
With the tensor terminology out of the way, it’s time to discuss how to create tensors. If there’s one thing to take from this article, it’s that you should use make_tensor or make_static_tensor wherever possible.
Note
Prefer make_tensor or make_static_tensor over constructing tensors directly.
Using these helper functions has many benefits:
- They remove the need to specify the rank of a tensor in the template parameters
- They abstract away many of the complex template types of creating a tensor directly
- They hide potentially irrelevant types from the user
All make_-style functions return a tensor_t object with the template parameters deduced or created as part of the input arguments. tensor_t only has two required template parameters (type and rank). For simple cases where only implicitly-allocated memory is needed, the default constructor will suffice. Some situations prevent using the make_ functions, such as when a tensor is a class member variable. In this case the type of the member variable must be specified in the member list. In these scenarios it’s expected that the user knows what they are doing and can spell out the types themselves. For examples of this, see the simple_radar_pipeline files.
All make functions take the data type as the first template parameter.
Make Variants#
There are currently four different variants of the make_ helper functions:
- make_ for creating a tensor with a dynamic descriptor and returning by value
- make_static_ for creating a tensor with a static descriptor and returning by value
- make_X_p for creating a tensor with a dynamic descriptor and returning a pointer
- make_static_X_p for creating a tensor with a static descriptor and returning a pointer
The _p variants return pointers allocated with new, and the caller is expected to delete them when finished. Returning smart pointers would have made this easier, but some users have their own smart pointer wrappers and wouldn’t want to unpack the standard library versions.
Within each of these types, there are usually versions both with and without user-defined pointers. These forms are used when an existing device pointer is passed to MatX rather than having the allocation done when the tensor is created.
Each of these four variants can be used with all of the construction types when applicable.
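A sketch of a pointer-returning variant is below; the helper name make_tensor_p is assumed here from the make_X_p naming pattern above, so check the MatX headers for the exact spelling:

```cpp
// Hypothetical call following the make_X_p pattern described above
auto t = make_tensor_p<float>({10, 20});
// ... use (*t) ...
delete t; // _p variants allocate with new; the caller must delete when finished
```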
Tensor Class Members#
When creating a class that has tensors as member variables, there’s an issue with the make_tensor syntax above: it depends on being able to use the auto keyword to deduce the type. Since type deduction is not possible with member variables, the type must be declared in the variable list. Once declared, a special version of make_tensor can be used in the constructor or initialization function of the class to create the tensor in-place. This allows the user to specify only the rank and type in the member list, and the size can be specified at initialization without repeating the rank and type.
class MyClass {
public:
  MyClass() {
    make_tensor(t, {10, 20});
  }

private:
  tensor_t<float, 2> t;
};
In the example above make_tensor takes an existing tensor as input and constructs it in-place. Allocation is performed only once during initialization, not when the tensor is declared.
Creating From C Array Or a Brace-Enclosed List#
Tensors can be created using a C-style shape array as an lvalue, or a brace-enclosed list as an rvalue. The following two forms result in the same make_ call:
int array[3] = {10, 20, 30};
auto t = make_tensor<float>(array);
and
auto t = make_tensor<float>({10, 20, 30});
In the former case the array is an lvalue that can be modified in memory before calling, whereas the latter case uses rvalues. For 0D tensors an empty braced list is required:
auto t0 = make_tensor<float>({});
When the sizes are known at compile time, the static version of make_ should be used:
auto t = make_static_tensor<float, 10, 20, 30>();
Notice the sizes are now template parameters instead of function parameters. Both forms can be used interchangeably in MatX code, but the static version can lead to higher performance.
Similarly, all variants can be called with a user-defined pointer:
auto t = make_tensor<float>(ptr, {10, 20, 30}); // ptr is a valid device pointer
All cases shown above use the default stride parameters. If the strides are not linear in memory, they can be passed in as well:
int shape[3] = {10, 20, 30};
int strides[3] = {1200, 60, 2};
auto t = make_tensor<float>(shape, strides);
Creating From A Conforming Shape#
As mentioned in the descriptor section, any type that conforms to the shape semantics can be used inside of a descriptor, and can also be passed into the
make_
functions:
cuda::std::array<int, 3> array = {10, 20, 30};
auto t = make_tensor<float>(array);
Creating From A Descriptor#
Descriptors (both shapes and sizes) can be used to construct tensors. This is useful when taking an existing tensor descriptor and creating a new tensor from it:
auto d = existingTensor.Descriptor();
auto t = make_tensor<float>(d);
t is now a tensor with the same shape and strides as existingTensor.
0-D Tensors#
0-D tensors are different than higher ranks since they have no meaningful shape or strides, and therefore don’t need those parameters. Empty versions of the make_ helpers exist to create these. Note that the {} is important, since the default constructor produces an uninitialized tensor:
auto t0 = make_tensor<float>({});
auto t01 = make_tensor<float>(ptr, {});
Custom Storage, Descriptors, and Allocators#
Within most of the make_ functions, there are choices in the template parameters for custom storage, descriptor, and allocator types.
Storage#
Storage types can be created by wrapping a container object in the basic_storage class. MatX has container types built-in for both raw pointers and smart pointers, but this can be extended to any conforming container type. The basic_storage class does not know about any underlying data structures or ownership; this is encapsulated inside of the template type C. For example, to create a custom storage object that wraps a raw pointer:
raw_pointer_buffer<T, owning, matx_allocator<T>> rp{ptr, static_cast<size_t>(desc.TotalSize()*sizeof(T))};
basic_storage<decltype(rp)> s{std::move(rp)};
The code above creates a new raw_pointer_buffer object with ownership semantics and the matx_allocator allocator. The constructor taking a pointer and a size does not allocate any new data, but tracks the pointer internally using a smart pointer. If non_owning had instead been passed as a template parameter, the pointer would not be tracked or freed. With the container created, the next line passes it into a basic_storage object for use inside tensor_t.
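To sketch how such a storage object might be combined with a descriptor into a tensor, one possibility is shown below; the exact tensor_t constructor signature used here is an assumption, not a documented API, so verify it against the MatX headers:

```cpp
// Assumed tensor_t constructor taking a storage object and a descriptor;
// s is the basic_storage object created above, desc an existing descriptor
tensor_t<T, RANK, decltype(s), decltype(desc)> t{std::move(s), std::move(desc)};
```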
Descriptors#
Creating a descriptor can be done by using any conforming descriptor type (see the descriptor explanation above). Within MatX, std::array is used by default when creating dynamic descriptors. Because of the variable size of the stride and shape, MatX provides helper types for creating descriptors of common types:
- tensor_desc_cr_disi_dist<RANK> is a dynamic descriptor with index_t strides and shapes. This is the default descriptor and can also be created using the type DefaultDescriptor. index_t is defined at compile-time, and defaults to 64-bit
- tensor_desc_cr_ds_t<ShapeType, StrideType, RANK> is a std::array-based descriptor with user-provided types
- tensor_desc_cr_ds_32_32_t<RANK> is a descriptor with 32-bit sizes and strides
- tensor_desc_cr_ds_64_64_t<RANK> is a descriptor with 64-bit sizes and strides
- tensor_desc_cr_ds_32_64_t<RANK> is a descriptor with 32-bit sizes and 64-bit strides
- static_tensor_desc_t<size_t I, size_t Is...> is a static-sized descriptor with the shape and strides created at compile time
To create a descriptor:
const index_t arr[3] = {10, 20, 30};
DefaultDescriptor<RANK> desc{arr};
In this case we create a default descriptor (based on index_t sizes) using a C-style array.
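As shown in the earlier section on creating from a descriptor, a descriptor built this way can then be passed to make_tensor:

```cpp
const index_t arr[3] = {10, 20, 30};
DefaultDescriptor<3> desc{arr};
auto t = make_tensor<float>(desc); // tensor with shape {10, 20, 30}
```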