Script.tma.global_to_shared

Script.tma.global_to_shared(*, src, dst, offsets, dims=None, mbarrier, cta_group=1, multicast_mask=None, cache_policy=None)[source]

Asynchronously copy a tile from global memory to shared memory via TMA.

Issues an asynchronous TMA transfer from a region of src (global) to dst (shared). The offsets specify where in the global tensor the tile starts, and dims specifies which global dimensions map to the shared tensor dimensions.

Completion is tracked via the mbarrier: issuing this instruction automatically increases the barrier’s tx-count by the transfer size in bytes, and the TMA engine decreases it by the same amount once the data has arrived. Use mbarrier.wait() to block until the transfer completes.
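The tx-count bookkeeping can be illustrated with a toy Python model. This is illustrative only: the real protocol is implemented by the mbarrier hardware, and the method names below are invented for the sketch.

```python
class MBarrier:
    """Toy model of the mbarrier tx-count protocol described above."""

    def __init__(self):
        self.tx_count = 0

    def expect_tx(self, nbytes):
        # Issuing the TMA copy raises tx-count by the transfer size in bytes.
        self.tx_count += nbytes

    def complete_tx(self, nbytes):
        # The TMA engine lowers tx-count by the same amount on completion.
        self.tx_count -= nbytes

    def ready(self):
        # wait() unblocks once every expected byte has arrived.
        return self.tx_count == 0
```

Multiple in-flight copies can target the same barrier: each issue adds its byte count, and the barrier only becomes ready after all of them complete.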

Multicast and CTA groups:

  • cta_group=1, multicast_mask=None: single-CTA transfer. Both dst and mbarrier must be in the current CTA.

  • cta_group=1, multicast_mask is not None: the loaded tile is delivered to the shared memory of every CTA whose bit is set in the mask. mbarrier must be in the current CTA.

  • cta_group=2, multicast_mask=None: dst must be in the current CTA, but mbarrier can be in the current or peer CTA.

  • cta_group=2, multicast_mask is not None: the tile is multicast, and mbarrier can be in the current or peer CTA. The barrier at the same shared memory offset in each target CTA is signaled.
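A multicast_mask covering a chosen set of CTA ranks is a plain bitmask, as described in the parameter list below. The helper here is a hypothetical illustration, not part of the API:

```python
def multicast_mask_for(cta_ranks):
    """Build a uint16 multicast mask: bit i set means CTA rank i receives the tile."""
    mask = 0
    for rank in cta_ranks:
        assert 0 <= rank < 16  # a uint16 mask addresses at most 16 CTA ranks
        mask |= 1 << rank
    return mask
```

For example, multicast_mask_for([0, 1, 2, 3]) produces 0b1111, delivering the tile to the first four CTAs of the cluster.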

Parameters:
  • src (GlobalTensor) – The global tensor to copy from.

  • dst (SharedTensor) – The shared tensor to copy to.

  • offsets (Sequence[Expr | int]) – Starting offsets for each dimension of the global tensor. Length must match the rank of the global tensor.

  • dims (Sequence[int], optional) – Which dimensions of the global tensor are being sliced. dims[0] maps to the first dimension of the shared tensor, dims[1] to the second, etc. If not provided, defaults to all dimensions in order.

  • mbarrier (Expr | RegisterTensor) – The barrier for tracking completion. A uint32 expression or single-element register tensor containing the barrier’s shared memory address.

  • cta_group (int) – CTA group size for the transfer. 1 (default) for single-CTA, 2 for two-CTA coordinated operations.

  • multicast_mask (Optional[Expr | int]) – A uint16 bitmask specifying which CTAs in the cluster receive the data. Bit i corresponds to the CTA with rank i. When None, no multicast is performed.

  • cache_policy (Optional[Expr]) – Cache eviction policy encoded as a uint64 value.

Return type:

None
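
The offsets/dims mapping described in the parameters above can be modeled in plain Python. select_tile is a hypothetical helper that only computes which global index ranges land in each shared-tensor dimension; the actual copy is performed by the TMA hardware.

```python
def select_tile(global_shape, offsets, tile_shape, dims=None):
    """Return the global index range copied for each shared-tensor dimension."""
    if dims is None:
        dims = list(range(len(global_shape)))  # default: all dims in order
    assert len(offsets) == len(global_shape)   # offsets cover every global dim
    ranges = []
    for shared_dim, global_dim in enumerate(dims):
        start = offsets[global_dim]            # where the tile starts
        extent = tile_shape[shared_dim]        # how much this shared dim holds
        ranges.append(range(start, start + extent))
    return ranges
```

For a rank-3 global tensor of shape (128, 64, 256), offsets (32, 0, 64), and a (16, 32) shared tile with dims=(0, 2), the tile covers rows 32–47 of global dimension 0 and columns 64–95 of global dimension 2.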

Notes

  • Thread group: Must be executed by a warp-aligned thread group (i.e., a multiple of 32 threads).

  • Hardware: Requires compute capability 9.0+ (sm_90).

  • PTX: cp.async.bulk.tensor.shared::cluster.global.tile.mbarrier::complete_tx::bytes