Script.atomic.shared_add
- Script.atomic.shared_add(dst, values, *, sem='relaxed', scope='cta', output=None)
Element-wise `dst[i] = dst[i] + values[i]`, performed atomically on shared memory. `dst.shape` and `values.shape` must be equal, and each lane contributes its own slice of `values` to the matching slice of `dst` with no broadcast or reduction.
- Parameters:
  - dst (SharedTensor) – Destination tile in shared memory.
  - values (RegisterTensor) – Per-lane contribution; same shape as `dst`.
  - sem (str) – PTX memory-ordering qualifier. Candidates: `'relaxed'`, `'acquire'`, `'release'`, `'acq_rel'`.
  - scope (str) – PTX sync scope. Candidates: `'cta'`, `'cluster'`, `'gpu'`, `'sys'`.
  - output (RegisterTensor, optional) – If provided, the per-element pre-RMW value at each location is written into this register tile (same shape as `dst`).
- Returns:
  The pre-RMW register tile when `output` is consumed downstream; `None` when unused (the DCE pass rewrites the instruction to the cheaper `red.*` form in that case).
- Return type:
  RegisterTensor or None
Notes
Thread group: Can be executed by a thread group of any size.
Hardware: Requires compute capability 7.0+ (sm_70).
PTX:
`atom.{sem}.{scope}.shared.add.s32` (or `red.*` when the output is unused).
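The element-wise semantics above can be modeled in plain Python. This is a sketch only: the function name and list-based tiles are stand-ins, and a sequential model cannot exhibit the atomicity of the real read-modify-write on GPU shared memory; it only shows the per-element update and the pre-RMW capture.

```python
def shared_add(dst, values, output=None):
    """Plain-Python model of the documented semantics (hypothetical helper).

    dst[i] += values[i] element-wise; if `output` is supplied, the value
    each element held *before* the update (the pre-RMW value) is written
    into it and returned. Without `output`, returns None — mirroring the
    case where the instruction lowers to the cheaper red.* form.
    """
    assert len(dst) == len(values), "dst and values must have equal shape"
    pre = list(dst) if output is not None else None  # snapshot pre-RMW values
    for i, v in enumerate(values):
        dst[i] += v  # atomic on real hardware; plain add in this model
    if output is not None:
        output[:] = pre
        return output
    return None


tile = [0, 1, 2]          # models the shared-memory destination tile
out = [0, 0, 0]           # models the optional pre-RMW register tile
shared_add(tile, [10, 10, 10], output=out)
# tile is now [10, 11, 12]; out holds the pre-RMW values [0, 1, 2]
```

Requesting `output` is what forces the `atom.*` form; when the returned tile is never consumed, dead-code elimination drops it and the instruction becomes a `red.*` reduction.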