tilus.Script

class tilus.Script

The Script class represents a tilus script, which defines a GPU kernel through a sequence of block-level instructions. See Tilus Script for an overview of the tilus script language.

__init__(): Initializes the script. All subclasses should call this __init__ method. A subclass can use its own __init__ method for compile-time setup, such as defining hyper-parameters or precomputing values used in the kernel code.

__call__(*args, **kwargs): Defines the kernel code that is executed on the GPU. This method should contain the kernel logic, including memory accesses, computations, and any other required operations.
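
The split between the two methods can be illustrated without a GPU. The following is a plain-Python stand-in, not tilus code: `ScriptSketch`, `ScaleSketch`, and `block_size` are hypothetical names, and the loop in `__call__` only mimics what a kernel body might do block by block. It shows a subclass calling the base `__init__` first, fixing a hyper-parameter there, and using it later in `__call__`.

```python
class ScriptSketch:
    """Hypothetical stand-in for a script base class (not tilus.Script)."""

    def __init__(self):
        # The base initializer records that setup ran; subclasses must call it.
        self._initialized = True


class ScaleSketch(ScriptSketch):
    def __init__(self, block_size):
        super().__init__()  # always call the base __init__ first
        # Compile-time setup: hyper-parameters and precomputed values.
        self.block_size = block_size

    def __call__(self, values, factor):
        # Stand-in for the kernel body: process the input one block at a time.
        out = []
        for begin in range(0, len(values), self.block_size):
            block = values[begin:begin + self.block_size]
            out.extend(v * factor for v in block)
        return out


scale = ScaleSketch(block_size=4)
result = scale([1, 2, 3, 4, 5, 6], factor=10)
print(result)
```

In this sketch the hyper-parameter is fixed once at construction time, while the arguments of `__call__` vary per invocation, which mirrors the compile-time versus kernel-code distinction described above.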

Attributes and Variables

`blockIdx`	Get the block index of the current thread block.
`gridDim`	Get the grid dimension of the kernel.
`current_num_threads`	Get the number of threads in the current thread group.
`current_thread_begin`	Get the beginning thread index of the current thread group.
`current_thread_end`	Get the ending thread index of the current thread group.

Language Constructs

`assume`(cond)	Compiler hint to assume a condition is true.
`range`(start[, end, step, unroll])	Create an iterator used in a for loop.
`single_thread`([thread])	Create a thread group context with only one thread.
`single_warp`([warp])	Create a thread group context with a single warp (32 threads).
`static_assert`(cond, msg)	Assert a compile-time condition.
`thread_group`(thread_begin, num_threads)	Create a thread group context.
`warp_group`(warp_begin, num_warps)	Create a thread group context with multiple warps.
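
The thread-group constructs can be related to the `current_thread_begin`, `current_thread_end`, and `current_num_threads` attributes with some simple arithmetic. The sketch below is plain Python, not tilus code: `WARP_SIZE`, `ThreadRange`, and the helper functions are hypothetical. It assumes that the end index is exclusive, that warp `w` covers threads `32 * w` through `32 * w + 31`, and that the optional arguments default to 0, none of which the listing states.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp, as stated for single_warp


@dataclass(frozen=True)
class ThreadRange:
    begin: int        # first thread index in the group
    num_threads: int  # number of threads in the group

    @property
    def end(self):
        # Assumption: the end index is exclusive.
        return self.begin + self.num_threads


def thread_group_range(thread_begin, num_threads):
    return ThreadRange(thread_begin, num_threads)


def warp_group_range(warp_begin, num_warps):
    # Assumption: warps are laid out contiguously, 32 threads each.
    return ThreadRange(warp_begin * WARP_SIZE, num_warps * WARP_SIZE)


def single_warp_range(warp=0):
    # Assumption: the optional warp index defaults to 0.
    return warp_group_range(warp, 1)


def single_thread_range(thread=0):
    # Assumption: the optional thread index defaults to 0.
    return ThreadRange(thread, 1)


group = warp_group_range(warp_begin=2, num_warps=2)
print(group.begin, group.end, group.num_threads)  # 64 128 64
```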

Instructions

`abs`(x, *[, out])	Compute the element-wise absolute value.
`add`(lhs, rhs[, out])	Element-wise addition with broadcasting.
`all`(x, *[, dim, keepdim, out])	Test whether all elements are non-zero along the specified dimension(s).
`annotate_layout`(tensor, layout)	Annotate the layout of a register or shared tensor.
`any`(x, *[, dim, keepdim, out])	Test whether any element is non-zero along the specified dimension(s).
`assign`(dst, src)	Assign the value of src tensor to dst tensor.
`cast`(x, dtype)	Cast a register tensor to a different data type.
`clip`(x, min, max, *[, out])	Clip element values to the range [min, max].
`cos`(x, *[, out])	Compute the element-wise cosine.
`copy_async`(src, dst, offsets[, dims, evict, ...])	Asynchronously copy a tile from global memory to shared memory.
`copy_async_commit_group`()	Commit async copies into a group.
`copy_async_wait_all`()	Wait for all copy_async instructions to complete.
`copy_async_wait_group`(n)	Wait for the completion of asynchronous copy groups.
`cumprod`(x, *, dim[, exclusive, out])	Inclusive (or exclusive) cumulative product along `dim`.
`cumsum`(x, *, dim[, exclusive, out])	Inclusive (or exclusive) cumulative sum along `dim`.
`dot`(a, b[, c, acc_dtype, out])	Dot product.
`exp`(x, *[, out])	Compute the element-wise natural exponential (e^x).
`exp2`(x, *[, out])	Compute the element-wise base-2 exponential (2^x).
`fast_divmod`(a, b)	Fast integer division and modulo using a precomputed magic multiplier.
`flatten`()	Flatten a register tensor into a 1-D tensor.
`free_shared`(tensor)	Free a shared tensor.
`global_tensor`(dtype, shape, *[, layout])	Allocate a global tensor.
`global_view`(ptr, *, dtype, shape[, strides])	Create a global tensor view.
`load_global`(src, /, *, offsets, shape[, ...])	Load a slice of global tensor into a register tensor.
`load_shared`(src, *[, out])	Load a shared tensor into a register tensor.
`lock_semaphore`(semaphore, value)	Lock semaphore with a specified value.
`log`(x, *[, out])	Compute the element-wise natural logarithm (ln x).
`max`(x, *[, dim, keepdim, out])	Compute the maximum along the specified dimension(s).
`maximum`(lhs, rhs[, out])	Element-wise maximum with broadcasting.
`min`(x, *[, dim, keepdim, out])	Compute the minimum along the specified dimension(s).
`print_tensor`(msg, tensor[, fmt])	Print a tensor with a message.
`printf`(fstring, *args)	Print a formatted string.
`rand`(seed, offset[, n_rounds])	Generate a block of random float32 in U(0, 1) using Philox-4x32 PRNG.
`randint`(seed, offset[, n_rounds])	Generate a block of random int32 using Philox-4x32 PRNG.
`randint4x`(seed, offset[, n_rounds])	Generate four blocks of random int32 using Philox-4x32 PRNG.
`randn`(seed, offset[, n_rounds])	Generate a block of random float32 in N(0, 1) using Philox-4x32 PRNG.
`register_tensor`(*, dtype, shape[, init])	Create a register tensor.
`release_semaphore`(semaphore, value)	Release semaphore with a specified value.
`repeat`(x, repeats, *[, out])	Repeat elements of a register tensor along its dimensions.
`repeat_interleave`(x, repeats, *[, out])	Repeat elements of a register tensor along its dimensions.
`reshape_shared`(tensor, shape)	Reshape a shared tensor.
`round`(x, *[, out])	Round each element to the nearest integer (round-to-nearest-even).
`rsqrt`(x, *[, out])	Compute the element-wise reciprocal square root (1/sqrt(x)).
`scan`(x, *, dim, op[, exclusive, out])	Prefix scan along `dim`.
`shared_tensor`(*, dtype, shape)	Allocate a shared tensor.
`sin`(x, *[, out])	Compute the element-wise sine.
`sqrt`(x, *[, out])	Compute the element-wise square root.
`square`(x, *[, out])	Compute the element-wise square (x^2).
`squeeze`(x, *, dim[, out])	Squeeze a dimension of a register tensor with size 1.
`store_global`(dst, src, *, offsets[, dims])	Store a register tensor into a slice of a global tensor.
`store_global_scatter`(dst, *, dim, indices, ...)	Non-atomic scatter store into a global tensor.
`store_shared`(dst, src, *[, offsets, dims])	Store a register tensor into a shared tensor.
`store_shared_scatter`(dst, *, dim, indices, ...)	Non-atomic scatter store into a shared tensor.
`sum`(x, *[, dim, keepdim, out])	Sum elements along the specified dimension(s).
`sync`()	Perform a synchronization.
`transpose`(x)	Transpose a 2-D register tensor.
`unsqueeze`(x, *, dim[, out])	Unsqueeze a dimension of a register tensor.
`view`(x, *[, layout, dtype])	View register tensor with a different layout or data type.
`where`(condition, x, y, *[, out])	Select elements from `x` or `y` based on a boolean condition.
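
The difference between the inclusive and exclusive forms of `cumsum`, `cumprod`, and `scan` is easiest to see on a small 1-D input. The reference model below is plain Python over lists, not tilus code: `scan_reference` and its `identity` parameter are hypothetical. It assumes the exclusive form shifts results by one position and starts from the operator's identity element, which is the usual convention but is not confirmed by the listing.

```python
import operator


def scan_reference(xs, op, identity, exclusive=False):
    """Prefix scan of a 1-D list with a binary operator."""
    out = []
    acc = identity
    for x in xs:
        if exclusive:
            out.append(acc)      # value before combining the current element
            acc = op(acc, x)
        else:
            acc = op(acc, x)
            out.append(acc)      # value after combining the current element
    return out


xs = [1, 2, 3, 4]
inclusive_sum = scan_reference(xs, operator.add, 0)
exclusive_sum = scan_reference(xs, operator.add, 0, exclusive=True)
inclusive_prod = scan_reference(xs, operator.mul, 1)
exclusive_prod = scan_reference(xs, operator.mul, 1, exclusive=True)
print(inclusive_sum)   # [1, 3, 6, 10]
print(exclusive_sum)   # [0, 1, 3, 6]
print(inclusive_prod)  # [1, 2, 6, 24]
print(exclusive_prod)  # [1, 1, 2, 6]
```

Under this convention the exclusive result at position `i` equals the inclusive result at position `i - 1`, with the identity element in position 0.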

Instruction Groups

mbarrier	Memory barrier instructions for synchronizing async memory transactions.
fence	Fence instructions for memory ordering between proxies.
tma	Tensor Memory Accelerator (TMA) async copy instructions.
tcgen05	Tensor Core Generation 05 (Blackwell) instructions.
clc	Cluster Launch Control instructions.
cluster	Block cluster synchronization and shared memory access.
wgmma	Warp Group Matrix Multiply-Accumulate (Hopper) instructions.
atomic	Tile-level atomic RMW (element-wise and scatter) on shared and global memory.

Script Attributes

attrs: Kernel launch configuration (blocks, warps, cluster).
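
A launch configuration of this kind is typically derived from the problem size. The sketch below is plain Python, not tilus code: `launch_config`, `cdiv`, and the tile sizes are hypothetical. It assumes one thread block per output tile and 32 threads per warp, and it leaves the cluster setting out.

```python
def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


def launch_config(m_size, n_size, block_m, block_n, warps):
    # Assumption: one thread block per (block_m x block_n) output tile.
    blocks = [cdiv(m_size, block_m), cdiv(n_size, block_n)]
    threads_per_block = warps * 32  # 32 threads per warp
    return {"blocks": blocks, "warps": warps, "threads_per_block": threads_per_block}


config = launch_config(m_size=1000, n_size=512, block_m=64, block_n=128, warps=4)
print(config)
```

Ceiling division ensures that a problem size that is not a multiple of the tile size still gets a block for its partial tile.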
