tilus.Script

class tilus.Script

The Script class represents a tilus script, which defines a GPU kernel through a sequence of block-level instructions. See Tilus Script for an overview of the tilus script language.

__init__()

Initializes the script. All subclasses should call this __init__ method. A subclass can use its own __init__ method for compile-time setup, such as defining hyper-parameters or pre-computing values that will be used in the kernel code.

__call__(*args, **kwargs)

Defines the kernel code that will be executed on the GPU. This method should contain the logic of the kernel, including memory accesses, computations, and any other operations that need to be performed.
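The split between __init__ (compile-time setup) and __call__ (kernel body) can be pictured with a host-only sketch. ScriptStandIn is a hypothetical placeholder class, not tilus.Script, and ScaleKernel, block_size, and scale are made-up names; the body of __call__ runs as ordinary Python here rather than being compiled to a GPU kernel.

```python
class ScriptStandIn:
    """Hypothetical stand-in for tilus.Script (NOT the real class)."""

    def __init__(self):
        self.initialized = True


class ScaleKernel(ScriptStandIn):
    def __init__(self, block_size: int, scale: float):
        # Every subclass calls the base __init__ first.
        super().__init__()
        # Compile-time setup: hyper-parameters and pre-computed values.
        self.block_size = block_size
        self.scale = scale
        self.inv_scale = 1.0 / scale

    def __call__(self, values):
        # Stand-in for the kernel body; it simply runs on the host here.
        return [v * self.scale for v in values]


kernel = ScaleKernel(block_size=128, scale=2.0)
result = kernel([1.0, 2.0, 3.0])
print(result)
```

In a real script the values stored in __init__ would be treated as constants when the kernel is compiled, which is why they belong there rather than in __call__.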

Attributes and Variables

blockIdx

Get the block index of the current thread block.

gridDim

Get the grid dimension of the kernel.
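As a rough illustration of how blockIdx and gridDim are typically used together, the following host-side sketch splits a 1-D problem of n elements into blocks and computes the element range owned by each block. The helper names cdiv and block_ranges are hypothetical and the one-dimensional grid is an assumption; nothing here calls into tilus.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: the smallest integer q with q * b >= a."""
    return (a + b - 1) // b


def block_ranges(n: int, block_size: int):
    """Return (grid_dim, [(start, stop), ...]) for a 1-D launch."""
    grid_dim = cdiv(n, block_size)
    ranges = []
    for block_idx in range(grid_dim):
        start = block_idx * block_size
        stop = min(start + block_size, n)  # the last block may be partial
        ranges.append((start, stop))
    return grid_dim, ranges


grid_dim, ranges = block_ranges(n=1000, block_size=256)
print(grid_dim, ranges)
```

Each loop iteration plays the role of one thread block: block_idx corresponds to the value a kernel would read from blockIdx, and grid_dim to the value it would read from gridDim.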

current_num_threads

Get the number of threads in the current thread group.

current_thread_begin

Get the beginning thread index of the current thread group.

current_thread_end

Get the ending thread index of the current thread group.

Language Constructs

assume(cond)

Compiler hint to assume a condition is true.

range(start[, end, step, unroll])

Create an iterator used in a for loop.

single_thread([thread])

Create a thread group context with only one thread.

single_warp([warp])

Create a thread group context with a single warp (32 threads).

static_assert(cond, msg)

Assert a compile-time condition.

thread_group(thread_begin, num_threads)

Create a thread group context.

warp_group(warp_begin, num_warps)

Create a thread group context with multiple warps.
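The thread-group constructs can be related to one another through the thread ranges they select. The sketch below is a plain-Python model, not the tilus API: it assumes a warp is 32 consecutive threads (as stated for single_warp), that the range is half-open so that current_thread_end equals current_thread_begin plus current_num_threads, and that the optional thread and warp arguments default to 0. The last two points are assumptions that the source does not confirm.

```python
WARP_SIZE = 32


def thread_group(thread_begin: int, num_threads: int):
    """Model a thread group as a half-open range (begin, end)."""
    return (thread_begin, thread_begin + num_threads)


def warp_group(warp_begin: int, num_warps: int):
    # A warp group is a thread group aligned to warp boundaries.
    return thread_group(warp_begin * WARP_SIZE, num_warps * WARP_SIZE)


def single_warp(warp: int = 0):
    return warp_group(warp, 1)


def single_thread(thread: int = 0):
    return thread_group(thread, 1)


begin, end = warp_group(warp_begin=1, num_warps=2)
print(begin, end, end - begin)
```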

Instructions

abs(x, *[, out])

Compute the element-wise absolute value.

add(lhs, rhs[, out])

Element-wise addition with broadcasting.

all(x, *[, dim, keepdim, out])

Test whether all elements are non-zero along the specified dimension(s).

annotate_layout(tensor, layout)

Annotate the layout of a register or shared tensor.

any(x, *[, dim, keepdim, out])

Test whether any element is non-zero along the specified dimension(s).

assign(dst, src)

Assign the value of src tensor to dst tensor.

cast(x, dtype)

Cast a register tensor to a different data type.

clip(x, min, max, *[, out])

Clip element values to the range [min, max].

cos(x, *[, out])

Compute the element-wise cosine.

copy_async(src, dst, offsets[, dims, evict, ...])

Asynchronously copy a tile from global memory to shared memory.

copy_async_commit_group()

Commit async copies into a group.

copy_async_wait_all()

Wait for all copy_async instructions to complete.

copy_async_wait_group(n)

Wait for the completion of asynchronous copy groups.
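The interplay of copy_async, copy_async_commit_group, and copy_async_wait_group can be modelled on the host. This sketch assumes the semantics of the underlying PTX cp.async.wait_group instruction, where waiting with argument n blocks until at most n of the most recently committed groups are still pending; whether tilus follows this exactly is an assumption. AsyncCopyQueue and the tile names are hypothetical.

```python
class AsyncCopyQueue:
    """Host-side model of async copy groups (not the tilus API)."""

    def __init__(self):
        self.uncommitted = []     # copies issued since the last commit
        self.pending_groups = []  # committed groups, oldest first
        self.completed = []       # copies known to have finished

    def copy_async(self, name: str):
        self.uncommitted.append(name)

    def commit_group(self):
        # Bundle all copies issued so far into one group.
        self.pending_groups.append(self.uncommitted)
        self.uncommitted = []

    def wait_group(self, n: int):
        # Retire the oldest groups until at most n remain pending.
        while len(self.pending_groups) > n:
            self.completed.extend(self.pending_groups.pop(0))

    def wait_all(self):
        self.wait_group(0)


q = AsyncCopyQueue()
q.copy_async("tile0")
q.commit_group()
q.copy_async("tile1")
q.commit_group()
q.wait_group(1)              # tile0 is done, tile1 may still be in flight
after_first_wait = list(q.completed)
q.wait_all()
print(after_first_wait, q.completed)
```

Waiting with n greater than zero is what allows software pipelining: the kernel can compute on the oldest tile while newer copies are still in flight.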

dot(a, b[, c, acc_dtype, out])

Dot product.
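The parameters a, b, and c suggest a matrix multiply-accumulate of the form a x b + c on 2-D tiles. That reading is an assumption, since the summary only says "dot product". Under that assumption, the arithmetic looks like the following pure-Python reference, where dot_reference is a hypothetical helper unrelated to the tilus implementation.

```python
def dot_reference(a, b, c=None):
    """Compute a @ b (+ c) for 2-D lists; assumed semantics only."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of a and b must match")
    # Start from the accumulator c if given, otherwise from zeros.
    out = [[0] * n for _ in range(m)] if c is None else [row[:] for row in c]
    for i in range(m):
        for j in range(n):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
plain = dot_reference(a, b)
accumulated = dot_reference(a, b, c=[[1, 1], [1, 1]])
print(plain, accumulated)
```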

exp(x, *[, out])

Compute the element-wise natural exponential (e^x).

exp2(x, *[, out])

Compute the element-wise base-2 exponential (2^x).

fast_divmod(a, b)

Fast integer division and modulo using a precomputed magic multiplier.
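The general technique behind magic-multiplier division replaces a division by a fixed divisor d with one multiplication and one shift. The sketch below shows the textbook round-up variant for unsigned 32-bit dividends; it is not taken from the tilus implementation, and make_magic and fast_divmod_reference are hypothetical names. Python integers are arbitrary precision, so the wide intermediate product needs no special handling here.

```python
def make_magic(d: int, bits: int = 32):
    """Precompute (magic, shift) so that n // d == (n * magic) >> shift
    for every 0 <= n < 2**bits."""
    if d < 1:
        raise ValueError("divisor must be positive")
    log2_ceil = (d - 1).bit_length()          # ceil(log2(d))
    shift = bits + log2_ceil
    magic = ((1 << shift) + d - 1) // d       # ceil(2**shift / d)
    return magic, shift


def fast_divmod_reference(n: int, d: int, magic: int, shift: int):
    q = (n * magic) >> shift   # one multiply and one shift, no division
    r = n - q * d              # remainder from the quotient
    return q, r


# The magic pair is computed once per divisor and reused for many dividends.
magic, shift = make_magic(10)
print(fast_divmod_reference(123456789, 10, magic, shift))
```

The precomputation is only worth its cost when the same divisor is reused many times, for example when every thread decomposes a linear index by the same tensor extent.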

flatten()

Flatten a register tensor into a 1-D tensor.

free_shared(tensor)

Free a shared tensor.

global_tensor(dtype, shape, *[, layout])

Allocate a global tensor.

global_view(ptr, *, dtype, shape[, strides])

Create a global tensor view.

load_global(src, /, *, offsets, shape[, ...])

Load a slice of global tensor into a register tensor.

load_shared(src, *[, out])

Load a shared tensor into a register tensor.

lock_semaphore(semaphore, value)

Lock semaphore with a specified value.

log(x, *[, out])

Compute the element-wise natural logarithm (ln x).

max(x, *[, dim, keepdim, out])

Compute the maximum along the specified dimension(s).

maximum(lhs, rhs[, out])

Element-wise maximum with broadcasting.

min(x, *[, dim, keepdim, out])

Compute the minimum along the specified dimension(s).

print_tensor(msg, tensor[, fmt])

Print a tensor with a message.

printf(fstring, *args)

Print a formatted string.

rand(seed, offset[, n_rounds])

Generate a block of random float32 in U(0, 1) using Philox-4x32 PRNG.

randint(seed, offset[, n_rounds])

Generate a block of random int32 using Philox-4x32 PRNG.

randint4x(seed, offset[, n_rounds])

Generate four blocks of random int32 using Philox-4x32 PRNG.

randn(seed, offset[, n_rounds])

Generate a block of random float32 in N(0, 1) using Philox-4x32 PRNG.
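For orientation, the Philox-4x32 generator that the rand, randint, randint4x, and randn summaries refer to is a counter-based PRNG: it maps a 128-bit counter and a 64-bit key to four 32-bit words, with no internal state. The sketch below implements the standard round function in pure Python. How tilus maps seed and offset onto the key and counter is not stated in the source; placing the seed in the key and the offset in the first counter word is an assumption, and the helpers randint4x_reference and rand_reference are hypothetical. The words are returned as unsigned values.

```python
M0, M1 = 0xD2511F53, 0xCD9E8D57   # Philox-4x32 round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85   # key increments applied after each round
MASK = 0xFFFFFFFF


def philox4x32(counter, key, n_rounds=10):
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(n_rounds):
        p0, p1 = M0 * c0, M1 * c2             # 32x32 -> 64-bit products
        hi0, lo0 = p0 >> 32, p0 & MASK
        hi1, lo1 = p1 >> 32, p1 & MASK
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
        k0, k1 = (k0 + W0) & MASK, (k1 + W1) & MASK
    return c0, c1, c2, c3


def randint4x_reference(seed, offset, n_rounds=10):
    # Assumed mapping: seed -> 64-bit key, offset -> first counter word.
    key = (seed & MASK, (seed >> 32) & MASK)
    return philox4x32((offset & MASK, 0, 0, 0), key, n_rounds)


def rand_reference(seed, offset, n_rounds=10):
    # Scale one 32-bit word into the half-open interval [0, 1).
    return randint4x_reference(seed, offset, n_rounds)[0] / 2**32


print(randint4x_reference(42, 0), rand_reference(42, 0))
```

Because the output depends only on (seed, offset), every thread can generate its own values independently, and rerunning with the same inputs reproduces the same stream.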

register_tensor(*, dtype, shape[, init])

Create a register tensor.

release_semaphore(semaphore, value)

Release semaphore with a specified value.

repeat(x, repeats, *[, out])

Repeat elements of a register tensor along its dimensions.

repeat_interleave(x, repeats, *[, out])

Repeat elements of a register tensor along its dimensions.
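The summaries of repeat and repeat_interleave read identically, so the difference is worth spelling out. Assuming the two follow the PyTorch functions of the same names (an assumption the source does not confirm), repeat tiles the whole tensor while repeat_interleave repeats each element in place. The 1-D helpers below are hypothetical reference functions.

```python
def repeat_1d(x, repeats: int):
    """Tile the whole sequence: [a, b] -> [a, b, a, b]."""
    return list(x) * repeats


def repeat_interleave_1d(x, repeats: int):
    """Repeat each element in place: [a, b] -> [a, a, b, b]."""
    return [v for v in x for _ in range(repeats)]


tiled = repeat_1d([1, 2, 3], 2)
interleaved = repeat_interleave_1d([1, 2, 3], 2)
print(tiled, interleaved)
```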

reshape_shared(tensor, shape)

Reshape a shared tensor.

round(x, *[, out])

Round each element to the nearest integer (round-to-nearest-even).
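Round-to-nearest-even differs from the schoolbook rule only on exact ties, which go to the even neighbour. Python's built-in round uses the same tie rule, so it can illustrate the behaviour; the comparison function round_half_away is a hypothetical helper included for contrast.

```python
import math


def round_half_away(v: float) -> int:
    """Schoolbook rounding: ties move away from zero."""
    return int(math.copysign(math.floor(abs(v) + 0.5), v))


ties = [0.5, 1.5, 2.5, 3.5, -0.5, -1.5]
half_even = [round(v) for v in ties]             # [0, 2, 2, 4, 0, -2]
half_away = [round_half_away(v) for v in ties]   # [1, 2, 3, 4, -1, -2]
print(half_even, half_away)
```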

rsqrt(x, *[, out])

Compute the element-wise reciprocal square root (1/sqrt(x)).

shared_tensor(*, dtype, shape)

Allocate a shared tensor.

sin(x, *[, out])

Compute the element-wise sine.

sqrt(x, *[, out])

Compute the element-wise square root.

square(x, *[, out])

Compute the element-wise square (x^2).

squeeze(x, *, dim[, out])

Squeeze a dimension of a register tensor with size 1.

store_global(dst, src, *, offsets[, dims])

Store a register tensor into a slice of a global tensor.

store_shared(dst, src, *[, offsets, dims])

Store a register tensor into a shared tensor.

sum(x, *[, dim, keepdim, out])

Sum elements along the specified dimension(s).
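The dim and keepdim parameters shared by all, any, max, min, and sum can be illustrated with a small reference on nested lists. This sketch assumes the NumPy and PyTorch convention, in which the reduced dimension is dropped unless keepdim is set, in which case it is kept with size 1. sum_2d is a hypothetical helper limited to 2-D inputs.

```python
def sum_2d(x, dim: int, keepdim: bool = False):
    """Sum a 2-D nested list along dim 0 (rows) or dim 1 (columns)."""
    if dim == 0:
        reduced = [sum(col) for col in zip(*x)]
        return [reduced] if keepdim else reduced            # (1, n) or (n,)
    if dim == 1:
        reduced = [sum(row) for row in x]
        return [[v] for v in reduced] if keepdim else reduced  # (m, 1) or (m,)
    raise ValueError("dim must be 0 or 1 for a 2-D input")


x = [[1, 2, 3],
     [4, 5, 6]]
print(sum_2d(x, dim=0), sum_2d(x, dim=1, keepdim=True))
```

Keeping the reduced dimension is convenient when the result is broadcast back against the input, for example when subtracting a row-wise maximum.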

sync()

Perform a synchronization.

transpose(x)

Transpose a 2-D register tensor.

unsqueeze(x, *, dim[, out])

Unsqueeze a dimension of a register tensor.
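The effect of squeeze and unsqueeze is easiest to see on shapes alone. The helpers below are hypothetical shape calculators, assuming a non-negative dim and the usual convention that squeeze removes a size-1 dimension and unsqueeze inserts one; they do not touch any tensor data.

```python
def squeeze_shape(shape: tuple, dim: int) -> tuple:
    """Remove dimension `dim`, which must have size 1."""
    if shape[dim] != 1:
        raise ValueError(f"dimension {dim} has size {shape[dim]}, not 1")
    return shape[:dim] + shape[dim + 1:]


def unsqueeze_shape(shape: tuple, dim: int) -> tuple:
    """Insert a new size-1 dimension at position `dim`."""
    return shape[:dim] + (1,) + shape[dim:]


squeezed = squeeze_shape((16, 1, 8), dim=1)
unsqueezed = unsqueeze_shape((16, 8), dim=0)
print(squeezed, unsqueezed)
```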

view(x, *[, layout, dtype])

View a register tensor with a different layout or data type.

where(condition, x, y, *[, out])

Select elements from x or y based on a boolean condition.

Instruction Groups

mbarrier

Memory barrier instructions for synchronizing async memory transactions.

fence

Fence instructions for memory ordering between proxies.

tma

Tensor Memory Accelerator (TMA) async copy instructions.

tcgen05

Tensor Core Generation 05 (Blackwell) instructions.

clc

Cluster Launch Control instructions.

cluster

Block cluster synchronization and shared memory access.

wgmma

Warp Group Matrix Multiply-Accumulate (Hopper) instructions.

Script Attributes

attrs

Kernel launch configuration (blocks, warps, cluster).