3.4. Shared Tensor¶
A shared tensor (i.e., SharedTensor) is a tensor stored in the shared memory of the GPU. A shared tensor has the following attributes:
dtype: the data type of the tensor elements, which can be any scalar type.
shape: the shape of the tensor, which is a tuple of integers representing the size of each dimension.
layout: (optional) the layout of the tensor, which defines how the tensor elements are stored in the linear shared memory.
3.4.1. Shared Tensor Instructions¶
We can use shared_tensor() to define a shared tensor in Tilus Script:
self.shared_tensor(dtype=float32, shape=[32, 64])
The above code defines a shared tensor with a 32-bit float data type and a shape of (32, 64).
Unlike register tensors, every shared tensor must be explicitly allocated using shared_tensor() and explicitly freed using free_shared() when it is no longer needed. Before freeing a shared tensor, we must ensure that there are no pending asynchronous operations on it (see below).
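As an illustration, the lifetime of a shared tensor might look like the following sketch (the surrounding kernel body and the comment placeholders are assumptions for illustration; only shared_tensor() and free_shared() are named in this section):

```
# hypothetical kernel body
smem = self.shared_tensor(dtype=float32, shape=[32, 64])
# ... use smem; if asynchronous copies target it, wait for them first ...
# no pending asynchronous operations on smem may remain at this point
self.free_shared(smem)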
The following instructions are related to shared tensors:
Allocate and Free
- shared_tensor(): allocate a shared tensor.
- free_shared(): free a shared tensor.
Load and Store
- load_shared(): load a shared tensor into a register tensor.
- store_shared(): store a register tensor into a shared tensor.
Asynchronous Copy from Global Tensor
- copy_async(): copy from a global tensor to a shared tensor asynchronously.
- Commit pending async copies into a group.
- Wait for the completion of asynchronous copy groups.
- Wait for all copy_async instructions to complete.
We do not provide arithmetic instructions for shared tensors. To perform computation on shared tensors, we must first load the data into register tensors using the load_shared() instruction, perform the computation on the register tensors, and then store the results back to shared memory using the store_shared() instruction.
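Putting these pieces together, a typical compute step might be sketched as follows (the exact call signatures and the elementwise operation are assumptions based on the instruction names above, not verbatim API):

```
smem = self.shared_tensor(dtype=float32, shape=[32, 64])
# ... fill smem, e.g., via copy_async followed by a wait ...
regs = self.load_shared(smem)      # shared -> register
regs = regs + 1.0                  # arithmetic happens on register tensors only
self.store_shared(smem, regs)      # register -> shared (argument order assumed)
self.free_shared(smem)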
3.4.2. Shared Layout¶
The shared_tensor() instruction has an optional layout parameter that defines the layout of the tensor. When not specified, the layout is inferred automatically based on the shape, the data type, and the instructions operating on the tensor.
A shared layout, SharedLayout, defines how the tensor elements are stored in the linear shared memory. You can think of a shared layout as a mapping from the multi-dimensional indices of the tensor to a linear address in shared memory.
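For intuition, the simplest such mapping is a plain row-major layout, which can be modeled in a few lines of Python (this is an illustration of the idea, not the Tilus SharedLayout API):

```python
def row_major_offset(indices, shape):
    """Map multi-dimensional indices to a linear offset in row-major order."""
    offset = 0
    for idx, dim in zip(indices, shape):
        offset = offset * dim + idx
    return offset

# element (1, 2) of a [32, 64] tensor lands at offset 1 * 64 + 2 = 66
print(row_major_offset((1, 2), (32, 64)))  # -> 66
```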
All threads in the thread block can access the shared memory. However, to achieve the best performance, we need to take care of the threads' access patterns to the shared memory to avoid bank conflicts. These access patterns are determined by both the layout of the shared tensor and the layout of the register tensor that interacts with it. Our layout inference system tries to infer the best layout for the shared tensor based on the access patterns of the threads (for example, by automatically employing a swizzled layout). Users who want more fine-grained control over their kernel can still specify the layout of the shared tensor manually.
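To see why a swizzled layout helps, the following pure-Python model counts how many distinct banks are touched when 32 threads read one column of a row-major 32x32 float32 tile. It assumes 32 banks of 4-byte words (the common NVIDIA configuration) and uses a simple XOR swizzle as one representative scheme; it is not necessarily the exact layout Tilus infers:

```python
NUM_BANKS = 32  # assumed: 32 banks, one 4-byte word per bank per cycle
COLS = 32       # row-major 32x32 float32 tile

def bank(offset_words):
    """Bank index of a word offset in shared memory."""
    return offset_words % NUM_BANKS

# 32 threads each read element [t, 0] (one column) of the tile
naive = [bank(t * COLS + 0) for t in range(32)]
swizzled = [bank(t * COLS + (0 ^ t)) for t in range(32)]  # XOR-swizzled column index

print(len(set(naive)))     # 1  -> all threads hit the same bank: 32-way conflict
print(len(set(swizzled)))  # 32 -> every thread hits a distinct bank: conflict-free
```

With the naive row-major layout, every thread's offset is a multiple of 32, so the whole warp serializes on a single bank; the XOR swizzle spreads the same column across all 32 banks without changing the tensor's shape.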