Script.mbarrier.arrive_and_expect_tx_multicast

Script.mbarrier.arrive_and_expect_tx_multicast(barrier, transaction_bytes, multicast_mask, sem='release', scope='cluster')

Arrive at barriers across multiple CTAs with expected async transactions.

Unlike arrive() and arrive_and_expect_tx(), where every thread in the group arrives on the same barrier, this instruction elects one thread per target CTA in multicast_mask. Each elected thread arrives on the barrier located at the same shared memory offset in its assigned CTA. Each signaled barrier therefore receives exactly one arrival, and its tx-count is increased by transaction_bytes.
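The per-CTA effect described above can be illustrated with a small host-side model in plain Python. This is only a sketch of the documented semantics, not the library's implementation: ToyBarrier, target_ranks and toy_arrive_and_expect_tx_multicast are hypothetical names introduced for illustration, and the model assumes an arrival decrements a pending-arrival counter, as PTX mbarrier objects do.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToyBarrier:
    """Hypothetical stand-in for the mbarrier state held by one CTA."""
    pending_arrivals: int
    tx_count: int = 0


def target_ranks(multicast_mask: int) -> List[int]:
    """Return the CTA ranks whose bit is set in the mask (bit i -> rank i)."""
    return [i for i in range(multicast_mask.bit_length())
            if (multicast_mask >> i) & 1]


def toy_arrive_and_expect_tx_multicast(
    barriers: Dict[int, ToyBarrier],
    transaction_bytes: int,
    multicast_mask: int,
) -> None:
    """Model one elected thread per target CTA arriving on that CTA's barrier."""
    for rank in target_ranks(multicast_mask):
        barrier = barriers[rank]
        barrier.tx_count += transaction_bytes  # expected async bytes
        barrier.pending_arrivals -= 1          # exactly one arrival per barrier


# Four CTAs in a cluster, each barrier initialised to expect one arrival.
barriers = {rank: ToyBarrier(pending_arrivals=1) for rank in range(4)}
toy_arrive_and_expect_tx_multicast(barriers, 4096, 0b101)
```

In this model only the barriers of CTAs 0 and 2 change; the barriers of CTAs 1 and 3 are left untouched because their mask bits are clear.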

Parameters:
  • barrier (RegisterTensor) – A single-element uint32 register tensor with the barrier’s shared memory address in the current CTA. The same offset is used for peer CTAs.

  • transaction_bytes (Expr | int) – Expected async transfer size in bytes. Must evaluate to a non-negative int32.

  • multicast_mask (int) – Bitmask of CTAs to signal. Bit i corresponds to the CTA with rank i. E.g., 0b101 signals CTAs 0 and 2.

  • sem (str) – Memory ordering semantics. Candidates: 'relaxed', 'release'.

  • scope (str) – Synchronization scope. Candidates: 'cta', 'cluster'.
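Since multicast_mask is a plain int, it can be built from a list of CTA ranks before the call. The helper below is a minimal sketch; make_multicast_mask is a hypothetical name, not part of the library, and the default limit of 16 ranks is an assumption chosen to match the 16-thread requirement in the notes rather than something this page states.

```python
from typing import Iterable


def make_multicast_mask(ranks: Iterable[int], max_ranks: int = 16) -> int:
    """Build a bitmask with bit i set for every CTA rank i in `ranks`.

    `max_ranks` is an assumed upper bound on the cluster size.
    """
    mask = 0
    for rank in ranks:
        if not 0 <= rank < max_ranks:
            raise ValueError(f"CTA rank {rank} is outside [0, {max_ranks})")
        mask |= 1 << rank
    return mask


mask = make_multicast_mask([0, 2])
```

Here mask equals 0b101, the example value given for multicast_mask above. Duplicate ranks are harmless because setting a bit twice leaves the mask unchanged.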

Return type:

None

Notes

  • Thread group: Must be executed by a thread group with at least 16 threads.

  • Hardware: Requires compute capability 9.0+ (sm_90).

  • PTX: mbarrier.arrive.expect_tx.shared::cluster.b64 with mapa.shared::cluster