Script.tma.shared_to_global

Script.tma.shared_to_global(src, dst, offsets, dims=None, cache_policy=None)[source]

Asynchronously copy a tile from shared memory to global memory via TMA.

Issues an asynchronous TMA transfer from src (in shared memory) to a region of dst (in global memory). The offsets specify where in the global tensor the tile is written, and dims specifies which global dimensions map to the shared tensor's dimensions.

Unlike global_to_shared(), this instruction does not use an mbarrier. Use commit_group() and wait_group() to synchronize completion.

Important

If the shared memory data was written via the generic proxy (e.g., store_shared()), fence.proxy_async() or fence.proxy_async_release() must be called before this instruction so that the writes are visible to the TMA engine, which operates in the async proxy.

Parameters:
  • src (SharedTensor) – The shared tensor to copy from.

  • dst (GlobalTensor) – The global tensor to copy to.

  • offsets (Sequence[Expr | int]) – Starting offsets for each dimension of the global tensor. Length must match the rank of the global tensor.

  • dims (Sequence[int], optional) – Which dimensions of the global tensor are being sliced. If not provided, defaults to all dimensions in order.

  • cache_policy (Expr, optional) – Cache eviction policy encoded as a uint64 value.

Return type:

None

Notes

  • Thread group: Must be executed by a warp-aligned thread group (i.e., a multiple of 32 threads).

  • Hardware: Requires compute capability 9.0+ (sm_90).

  • PTX: cp.async.bulk.tensor.global.shared::cta.tile (destination state space precedes the source in the PTX instruction name)
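The typical sequence — write the tile, fence, copy, commit, wait — can be sketched as below. This is an illustrative sketch, not standalone-runnable code: the kernel structure, tensor constructors, and variable names (smem, gmem, regs, bi, bj) are assumptions; only shared_to_global(), fence.proxy_async(), commit_group(), and wait_group() come from this reference.

```python
# Inside a hypothetical kernel body (names other than the documented
# TMA/fence calls are assumptions for illustration):
smem = self.shared_tensor(dtype=f16, shape=[64, 64])

self.store_shared(smem, regs)       # tile written via the generic proxy
self.fence.proxy_async()            # make generic-proxy writes visible to
                                    # the TMA engine (async proxy)
self.tma.shared_to_global(
    src=smem,                       # SharedTensor to copy from
    dst=gmem,                       # GlobalTensor of shape [M, N]
    offsets=[bi * 64, bj * 64],     # one offset per global dimension
)
self.tma.commit_group()             # commit the outstanding TMA store
self.tma.wait_group(0)              # block until all committed groups finish
```

Note that, unlike global_to_shared(), no mbarrier is involved: completion is tracked purely through the commit_group()/wait_group() pair.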