Script.atomic.shared_add

Script.atomic.shared_add(dst, values, *, sem='relaxed', scope='cta', output=None)

Atomically performs the element-wise update dst[i] = dst[i] + values[i] on a tile in shared memory.

dst.shape and values.shape must be equal; each lane contributes its own slice of values to the matching slice of dst, with no broadcasting or reduction.
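The semantics can be captured by a small reference model. This is plain Python, not the hardware instruction: it shows the element-wise fetch-and-add over two equal-shape tiles (flattened to lists here for simplicity), where the values read before the update are what the optional output tile receives.

```python
def shared_add_ref(dst, values):
    """Reference model of element-wise atomic add (single-lane view).

    Mutates dst in place and returns the pre-RMW values, mirroring
    what `output` would receive. Not the real instruction; names
    and the list-based layout are illustrative only.
    """
    assert len(dst) == len(values), "shapes must match; no broadcasting"
    pre = []
    for i, v in enumerate(values):
        pre.append(dst[i])   # value at dst[i] before the update
        dst[i] += v          # the RMW itself
    return pre

tile = [10, 20, 30]
old = shared_add_ref(tile, [1, 2, 3])
# tile is now [11, 22, 33]; old holds the pre-RMW values [10, 20, 30]
```

On hardware, each lane performs its slice of these updates atomically, so concurrent lanes writing to overlapping locations still accumulate correctly.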

Parameters:
  • dst (SharedTensor) – Destination tile in shared memory.

  • values (RegisterTensor) – Per-lane contribution; same shape as dst.

  • sem (str) – PTX memory-ordering qualifier. Candidates: 'relaxed', 'acquire', 'release', 'acq_rel'.

  • scope (str) – PTX sync scope. Candidates: 'cta', 'cluster', 'gpu', 'sys'.

  • output (RegisterTensor, optional) – If provided, the per-element pre-RMW value at each location is written into this register tile (same shape as dst).

Returns:

The pre-RMW register tile when output is consumed downstream; None when the result is unused (in that case the dead-code-elimination pass rewrites the instruction to the cheaper red.* form, which returns nothing).

Return type:

RegisterTensor or None

Notes

  • Thread group: Can be executed by a thread group of any size.

  • Hardware: Requires compute capability 7.0+ (sm_70).

  • PTX: atom.{sem}.{scope}.shared.add.s32 (or red.* when the output is unused).
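A common use of the pre-RMW output is fetch-and-add slot allocation: each lane bumps a shared counter and uses the old value as a unique, non-overlapping offset. The sketch below simulates that pattern in plain Python, with a lock standing in for the hardware atomic and threads standing in for lanes; all names here are illustrative, not part of the API.

```python
import threading

counter = [0]            # one-element "shared tile" acting as a counter
lock = threading.Lock()  # stands in for the hardware atomic RMW
slots = {}

def lane(lane_id, n_items):
    # Atomic fetch-and-add: read the old value and bump the counter
    # in one critical section, like atom.add returning the pre-RMW value.
    with lock:
        old = counter[0]
        counter[0] += n_items
    slots[lane_id] = old  # each lane gets a disjoint range of n_items slots

threads = [threading.Thread(target=lane, args=(i, 4)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter[0] == 32; the claimed offsets are the disjoint multiples of 4
```

When the pre-RMW value is not needed, omitting output lets the instruction lower to red.*, which avoids returning data to registers.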