4. Instructions¶

Tilus provides a set of instructions for writing GPU kernels. Instructions are available as methods on the Script class and are called within the __call__ method of a script.

Instructions fall into two categories:

Generic instructions (self.<instruction>) — common operations available on all GPUs, such as tensor creation, load/store, arithmetic, and synchronization.
Instruction groups (self.<group>.<instruction>) — specialized hardware-specific operations organized by the hardware unit they target, such as TMA, WGMMA, and TCGEN05.

4.1. Generic Instructions¶

Hint

Please submit a feature request if your kernel requires additional instructions.

4.1.1. Tensor Creation and Free¶

Create and manage tensors in register, shared, and global memory. Register tensors hold per-thread data, shared tensors are visible to all threads in a block, and global tensors are accessible by all blocks.

`register_tensor`(*, dtype, shape[, init])	Create a register tensor.
`shared_tensor`(*, dtype, shape)	Allocate a shared tensor.
`global_tensor`(dtype, shape, *[, layout])	Allocate a global tensor.
`global_view`(ptr, *, dtype, shape[, strides])	Create a global tensor view.
`free_shared`(tensor)	Free a shared tensor.
`reshape_shared`(tensor, shape)	Reshape a shared tensor.

4.1.2. Load and Store¶

Transfer data between memory spaces. Load instructions copy data from global or shared memory into register tensors; store instructions write register data back.

`load_global`(src, /, *, offsets, shape[, ...])	Load a slice of global tensor into a register tensor.
`store_global`(dst, src, *, offsets[, dims])	Store a register tensor into a slice of a global tensor.
`load_shared`(src, *[, out])	Load a shared tensor into a register tensor.
`store_shared`(dst, src, *[, offsets, dims])	Store a register tensor into a shared tensor.

4.1.3. Asynchronous Copy (SM80+)¶

Copy data from global to shared memory asynchronously using the cp.async hardware path. Operations are grouped with copy_async_commit_group and waited on with copy_async_wait_group. For Hopper+ GPUs, prefer tma.global_to_shared which uses the TMA engine.

`copy_async`(src, dst, offsets[, dims, evict, ...])	Asynchronously copy a tile from global memory to shared memory.
`copy_async_commit_group`()	Commit async copies into a group.
`copy_async_wait_group`(n)	Wait the completion of asynchronous copy groups.
`copy_async_wait_all`()	Wait for all copy_async instructions to complete.

4.1.4. Linear Algebra¶

Matrix multiplication using tensor cores. The dot instruction automatically selects the appropriate MMA instruction based on the data types and GPU architecture. For explicit control over Hopper or Blackwell tensor cores, use wgmma.mma or tcgen05.mma instead.

dot(a, b[, c, acc_dtype, out])

Dot product.

4.1.5. Elementwise Arithmetic¶

Per-element unary and binary operations on register tensors. All elementwise operations support an optional out parameter to write results into an existing tensor, and binary operations support NumPy-style broadcasting.

`abs`(x, *[, out])	Compute the element-wise absolute value.
`add`(lhs, rhs[, out])	Element-wise addition with broadcasting.
`clip`(x, min, max, *[, out])	Clip element values to the range [min, max].
`exp`(x, *[, out])	Compute the element-wise natural exponential (e^x).
`exp2`(x, *[, out])	Compute the element-wise base-2 exponential (2^x).
`log`(x, *[, out])	Compute the element-wise natural logarithm (ln x).
`maximum`(lhs, rhs[, out])	Element-wise maximum with broadcasting.
`round`(x, *[, out])	Round each element to the nearest integer (round-to-nearest-even).
`rsqrt`(x, *[, out])	Compute the element-wise reciprocal square root (1/sqrt(x)).
`sqrt`(x, *[, out])	Compute the element-wise square root.
`square`(x, *[, out])	Compute the element-wise square (x^2).
`where`(condition, x, y, *[, out])	Select elements from `x` or `y` based on a boolean condition.

4.1.6. Reduction¶

Reduce a register tensor along one or more dimensions. Each reduction supports dim to specify which dimensions to reduce, keepdim to preserve the reduced dimension with size 1, and out for in-place output.

`all`(x, *[, dim, keepdim, out])	Test whether all elements are non-zero along the specified dimension(s).
`any`(x, *[, dim, keepdim, out])	Test whether any element is non-zero along the specified dimension(s).
`max`(x, *[, dim, keepdim, out])	Compute the maximum along the specified dimension(s).
`min`(x, *[, dim, keepdim, out])	Compute the minimum along the specified dimension(s).
`sum`(x, *[, dim, keepdim, out])	Sum elements along the specified dimension(s).

4.1.7. Transform¶

Reshape, reinterpret, or rearrange register tensor data without changing the underlying values.

`assign`(dst, src)	Assign the value of src tensor to dst tensor.
`cast`(x, dtype)	Cast a register tensor to a different data type.
`repeat`(x, repeats, *[, out])	Repeat elements of a register tensor along its dimensions.
`repeat_interleave`(x, repeats, *[, out])	Repeat elements of a register tensor along its dimensions.
`squeeze`(x, *, dim[, out])	Squeeze a dimension of a register tensor with size 1.
`transpose`(x)	Transpose a 2-D register tensor.
`unsqueeze`(x, *, dim[, out])	Unsqueeze a dimension of a register tensor.
`view`(x, *[, layout, dtype])	View register tensor with a different layout or data type.

4.1.8. Synchronization¶

Synchronize threads within a block or across a cluster. sync is the block-level barrier (equivalent to __syncthreads()). For cluster-wide synchronization, use self.cluster.sync().

sync()

Perform a synchronization.

4.1.9. Atomic and Semaphore¶

Inter-block synchronization using global memory semaphores. lock_semaphore spins until the semaphore reaches a target value; release_semaphore sets it to signal other blocks. Both must be called from a single thread (self.single_thread()).

`lock_semaphore`(semaphore, value)	Lock semaphore with a specified value.
`release_semaphore`(semaphore, value)	Release semaphore with a specified value.

4.1.10. Miscellaneous¶

Compiler hints, debugging aids, and layout annotations.

`assume`(cond)	Compiler hint to assume a condition is true.
`static_assert`(cond, msg)	Assert a compile-time condition.
`annotate_layout`(tensor, layout)	Annotate the layout of a register or shared tensor.
`fast_divmod`(a, b)	Fast integer division and modulo using precomputed magic multiplier.
`print_tensor`(msg, tensor[, fmt])	Print a tensor with a message.
`printf`(fstring, *args)	Print a formatted string.

4.2. Instruction Groups¶

Instruction groups provide access to specialized hardware units. Each group is accessed as an attribute of the script (e.g., self.tma.global_to_shared(...)).

4.2.1. Memory Barrier (`self.mbarrier`)¶

Mbarriers are synchronization primitives in shared memory that track pending arrivals and asynchronous transaction bytes (tx-count). They coordinate producer-consumer patterns in pipelined kernels, particularly with TMA and TCGEN05 async operations. See Script.mbarrier.

`alloc`(counts)	Allocate and initialize one or more mbarriers in shared memory.
`arrive`(barrier[, count, sem, scope])	Arrive at a barrier.
`arrive_and_expect_tx`(barrier, transaction_bytes)	Arrive at a barrier and declare expected asynchronous transaction bytes.
`arrive_and_expect_tx_multicast`(barrier, ...)	Arrive at barriers across multiple CTAs with expected async transactions.
`arrive_and_expect_tx_remote`(barrier, ...[, ...])	Arrive at a peer CTA's barrier with expected async transactions.
`wait`(barrier, phase[, sem, scope])	Wait for a barrier phase to complete.

4.2.2. Fence (`self.fence`)¶

Proxy fences ensure memory ordering between different memory access paths (generic proxy vs. async proxy). Required when generic writes (e.g., store_shared) must be visible to async reads (e.g., tma.shared_to_global). See Script.fence.

`proxy_async`([space])	Bidirectional async proxy fence.
`proxy_async_release`()	Unidirectional generic-to-async release proxy fence for shared memory.

4.2.3. TMA (`self.tma`)¶

The Tensor Memory Accelerator (TMA) on Hopper+ GPUs performs asynchronous bulk data transfers between global and shared memory without occupying SM compute resources. Completion is tracked via mbarriers. See Script.tma.

`global_to_shared`(*, src, dst, offsets[, ...])	Asynchronously copy a tile from global memory to shared memory via TMA.
`shared_to_global`(src, dst, offsets[, dims, ...])	Asynchronously copy a tile from shared memory to global memory via TMA.
`commit_group`()	Commit pending TMA async copy operations into a group.
`wait_group`(n[, read])	Wait for TMA async copy commit groups to complete.

4.2.4. WGMMA (`self.wgmma`)¶

Warp Group Matrix Multiply-Accumulate on Hopper GPUs. Executes asynchronous MMA using a warp group (4 warps, 128 threads) with operands in shared memory or registers. Requires a strict fence → mma → commit → wait protocol. See Script.wgmma.

`fence`()	Issue a warp group MMA fence.
`commit_group`()	Commit the previously issued warp group MMA operations.
`wait_group`(n)	Wait for warp group MMA commit groups to complete.
`mma`(a, b, d)	Perform warp group matrix multiply-accumulate (MMA) operation.

4.2.5. TCGEN05 (`self.tcgen05`)¶

Tensor Core Generation 05 on Blackwell GPUs. Introduces tensor memory (TMEM) — a dedicated on-chip accumulator space for MMA operations. Supports the full TMEM lifecycle: allocation, data movement (load/store/copy), MMA compute, and deallocation. See Script.tcgen05.

`alloc`(dtype, shape[, cta_group])	Allocate a tensor in tensor memory (TMEM).
`dealloc`(tensor)	Deallocate a tensor memory tensor.
`slice`(tensor, offsets, dims, shape)	Create a sliced view of a tensor memory tensor.
`view`(tensor, dtype, shape)	Reinterpret a tensor memory tensor with a different dtype and shape.
`relinquish_alloc_permit`(cta_group)	Relinquish the tensor memory allocation permit.
`load`(tensor)	Load data from tensor memory into registers.
`store`(tensor, src)	Store data from registers into tensor memory.
`wait_load`()	Wait for all pending tensor memory load operations to complete.
`wait_store`()	Wait for all pending tensor memory store operations to complete.
`copy`(src, dst)	Copy data from shared memory to tensor memory.
`commit`(mbarrier[, cta_group, multicast_mask])	Commit pending tcgen05 async operations and signal an mbarrier.
`mma`(a, b, d, enable_input_d[, cta_group])	Perform tensor core matrix multiply-accumulate with TMEM accumulator.

4.2.6. Cluster (`self.cluster`)¶

Block cluster operations for multi-CTA coordination on Hopper+ GPUs. Provides cluster-wide synchronization, introspection (block index/rank within the cluster), and cross-CTA shared memory addressing. See Script.cluster.

`sync`()	Synchronize all thread blocks in the current cluster.
`map_shared_addr`(addr, target_rank)	Map shared memory address(es) to the corresponding address(es) in another CTA's shared memory.
`blockIdx`	The block index within the cluster.
`blockRank`	The linear rank of the current block within the cluster.
`clusterDim`	The dimensions of the cluster.

4.2.7. CLC (`self.clc`)¶

Cluster Launch Control on Blackwell GPUs enables dynamic work scheduling by canceling not-yet-launched clusters. A scheduler CTA requests cancellation, then queries the result to take over the canceled cluster’s work. See Script.clc.

`try_cancel`(response, mbarrier, multicast)	Request cancellation of a cluster that has not yet been launched.
`query_response`(response)	Query the response from a cluster launch control try_cancel operation.

Instructions

Contents

4. Instructions¶

4.1. Generic Instructions¶

4.1.1. Tensor Creation and Free¶

4.1.2. Load and Store¶

4.1.3. Asynchronous Copy (SM80+)¶

4.1.4. Linear Algebra¶

4.1.5. Elementwise Arithmetic¶

4.1.6. Reduction¶

4.1.7. Transform¶

4.1.8. Synchronization¶

4.1.9. Atomic and Semaphore¶

4.1.10. Miscellaneous¶

4.2. Instruction Groups¶

4.2.1. Memory Barrier (self.mbarrier)¶

4.2.2. Fence (self.fence)¶

4.2.3. TMA (self.tma)¶

4.2.4. WGMMA (self.wgmma)¶

4.2.5. TCGEN05 (self.tcgen05)¶

4.2.6. Cluster (self.cluster)¶

4.2.7. CLC (self.clc)¶

4.2.1. Memory Barrier (`self.mbarrier`)¶

4.2.2. Fence (`self.fence`)¶

4.2.3. TMA (`self.tma`)¶

4.2.4. WGMMA (`self.wgmma`)¶

4.2.5. TCGEN05 (`self.tcgen05`)¶

4.2.6. Cluster (`self.cluster`)¶

4.2.7. CLC (`self.clc`)¶