Script.tma.global_to_shared¶
- Script.tma.global_to_shared(*, src, dst, offsets, dims=None, mbarrier, cta_group=1, multicast_mask=None, cache_policy=None)[source]¶
Asynchronously copy a tile from global memory to shared memory via TMA.
Issues an asynchronous TMA transfer from a region of src (global) to dst (shared). The offsets specify where in the global tensor the tile starts, and dims specifies which global dimensions map to the shared tensor dimensions.

Completion is tracked via the mbarrier: this instruction automatically increases the barrier’s tx-count by the transfer size in bytes. When the TMA engine finishes, it decrements the tx-count by the same amount. Use mbarrier.wait() to block until the transfer is done.

Multicast and CTA groups:
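The tx-count increment is simply the number of tile elements times the element size in bytes. A quick sanity check in plain Python (the helper name is illustrative, not part of the API):

```python
def tma_tx_count_bytes(tile_shape, dtype_bytes):
    """Bytes moved by one TMA tile transfer: product of dims * element size.

    This is the amount by which the instruction raises the mbarrier's
    tx-count, and by which the TMA engine later lowers it on completion.
    """
    n = 1
    for d in tile_shape:
        n *= d
    return n * dtype_bytes

# A 128x64 tile of float16 (2 bytes/element) moves 16384 bytes, so the
# barrier's tx-count is raised, and later lowered, by 16384.
print(tma_tx_count_bytes((128, 64), 2))  # → 16384
```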
- cta_group=1, multicast_mask=None: single-CTA transfer. Both dst and mbarrier must be in the current CTA.
- cta_group=1, multicast_mask != None: the loaded tile is delivered to the shared memory of all CTAs specified by the mask. mbarrier must be in the current CTA.
- cta_group=2, multicast_mask=None: dst must be in the current CTA, but mbarrier can be in the current or peer CTA.
- cta_group=2, multicast_mask != None: the tile is multicast, and mbarrier can be in the current or peer CTA. Barriers at the same shared memory offset in the target CTAs are signaled.
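The multicast_mask is a plain uint16 bitmask in which bit i selects the CTA with cluster rank i. A small helper to build one (the function name is hypothetical, not part of the API):

```python
def make_multicast_mask(cta_ranks):
    """Build a uint16 multicast mask: bit i is set iff CTA rank i receives the tile."""
    mask = 0
    for r in cta_ranks:
        # A uint16 mask addresses at most 16 CTAs in the cluster.
        assert 0 <= r < 16, "cluster rank must fit in a 16-bit mask"
        mask |= 1 << r
    return mask

# Deliver the tile to CTAs 0, 1, and 3 of the cluster:
print(bin(make_multicast_mask([0, 1, 3])))  # → 0b1011
```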
- Parameters:
src (GlobalTensor) – The global tensor to copy from.
dst (SharedTensor) – The shared tensor to copy to.
offsets (Sequence[Expr | int]) – Starting offsets for each dimension of the global tensor. Length must match the rank of the global tensor.
dims (Sequence[int], optional) – Which dimensions of the global tensor are being sliced. dims[0] maps to the first dimension of the shared tensor, dims[1] to the second, and so on. If not provided, defaults to all dimensions in order.
mbarrier (Expr | RegisterTensor) – The barrier for tracking completion: a uint32 expression or single-element register tensor containing the barrier’s shared memory address.
cta_group (int) – CTA group size for the transfer. 1 (default) for single-CTA, 2 for two-CTA coordinated operations.
multicast_mask (Optional[Expr | int]) – A uint16 bitmask specifying which CTAs in the cluster receive the data. Bit i corresponds to the CTA with rank i. When None, no multicast is performed.
cache_policy (Optional[Expr]) – Cache eviction policy encoded as a uint64 value.
- Return type:
None
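The interplay of offsets and dims can be sketched in plain Python: each shared tensor dimension slices the global dimension named by dims, while unlisted global dimensions contribute a single index. The helper below is a model of that mapping, not part of the API:

```python
def shared_tile_region(global_rank, offsets, shared_shape, dims=None):
    """Compute per-global-dimension (start, extent) for the copied tile.

    dims[k] names the global dimension that shared dimension k slices;
    global dimensions not listed in dims keep extent 1.
    """
    if dims is None:
        dims = list(range(global_rank))    # default: all dims in order
    assert len(offsets) == global_rank     # offsets cover every global dim
    extent = [1] * global_rank
    for shared_dim, global_dim in enumerate(dims):
        extent[global_dim] = shared_shape[shared_dim]
    return [(o, e) for o, e in zip(offsets, extent)]

# A [64, 128] shared tile read from a rank-3 global tensor, slicing
# global dims 0 and 2 starting at offsets (8, 3, 0):
print(shared_tile_region(3, (8, 3, 0), (64, 128), dims=(0, 2)))
# → [(8, 64), (3, 1), (0, 128)]
```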
Notes
Thread group: Must be executed by a warp-aligned thread group (i.e., a multiple of 32 threads).
Hardware: Requires compute capability 9.0+ (sm_90).
PTX:
cp.async.bulk.tensor.shared::cluster.global.tile.mbarrier::complete_tx::bytes