Script.tma.shared_to_global
- Script.tma.shared_to_global(src, dst, offsets, dims=None, cache_policy=None)[source]
Asynchronously copy a tile from shared memory to global memory via TMA.
Issues an asynchronous TMA transfer from src (shared) to a region of dst (global). The offsets specify where in the global tensor the tile is written, and dims specifies which global dimensions map to the shared tensor dimensions.
Unlike global_to_shared(), this instruction does not use an mbarrier. Use commit_group() and wait_group() to synchronize completion.
Important
If the shared memory data was written via the generic proxy (e.g.,
store_shared()), a fence.proxy_async() or fence.proxy_async_release() must be called before this instruction to ensure the writes are visible to the TMA engine (async proxy).
- Parameters:
src (SharedTensor) – The shared tensor to copy from.
dst (GlobalTensor) – The global tensor to copy to.
offsets (Sequence[Expr | int]) – Starting offsets for each dimension of the global tensor. Length must match the rank of the global tensor.
dims (Sequence[int], optional) – Which dimensions of the global tensor are being sliced. If not provided, defaults to all dimensions in order.
cache_policy (Optional[Expr]) – Cache eviction policy encoded as a uint64 value.
- Return type:
None
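The interaction between offsets and dims can be modeled with a small Python helper. This is a hypothetical illustration of the indexing semantics, not part of the API; in particular, the assumption that global dimensions not listed in dims receive a single element is inferred, not stated by the source.

```python
def dest_region(global_shape, shared_shape, offsets, dims=None):
    """Model the half-open [start, stop) range written in each global
    dimension (illustrative only, not the real API)."""
    rank = len(global_shape)
    if dims is None:
        dims = list(range(rank))  # default: all global dims, in order
    if len(offsets) != rank:
        raise ValueError("offsets length must match the global tensor rank")
    # Each sliced global dim gets the extent of its shared-tile dim.
    extent = dict(zip(dims, shared_shape))
    # Unsliced global dims are assumed to receive a single element.
    return [(off, off + extent.get(d, 1)) for d, off in enumerate(offsets)]

# A (64, 64) shared tile written into a (128, 256) global tensor,
# starting at row 0, column 64:
print(dest_region((128, 256), (64, 64), (0, 64)))  # [(0, 64), (64, 128)]
```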
Notes
Thread group: Must be executed by a warp-aligned thread group (i.e., a multiple of 32 threads).
Hardware: Requires compute capability 9.0+ (sm_90).
PTX:
cp.async.bulk.tensor.shared::cta.global.tile
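Putting the pieces together, the fence/copy/commit/wait ordering described above can be sketched as pseudocode. The ts namespace, the wait_group argument, and the kernel context are placeholders; only the ordering of the four steps is taken from this page.

```
# Pseudocode sketch -- identifiers and signatures are assumptions.
ts.store_shared(smem, values)        # generic-proxy write to shared memory
ts.fence.proxy_async()               # make the write visible to the TMA engine
ts.tma.shared_to_global(smem, gmem, offsets=[row, col])
ts.tma.commit_group()                # commit the outstanding TMA store
ts.tma.wait_group(0)                 # assumed: wait until no groups remain pending
```

Note that the fence is only required because store_shared() writes through the generic proxy; the TMA store itself is tracked solely by the commit/wait group mechanism, not an mbarrier.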