Script.atomic.shared_add

Script.atomic.shared_add(dst, values, *, sem='relaxed', scope='cta', output=None)

Atomically performs the element-wise update dst[i] = dst[i] + values[i] on a tile in shared memory.

dst.shape and values.shape must be equal; each lane contributes its own slice of values to the matching slice of dst, with no broadcasting or reduction.
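The semantics can be captured by a small reference model. This is plain Python, not the hardware instruction: it shows the element-wise fetch-and-add over two equal-shape tiles (flattened to lists here for simplicity), where the values read before the update are what the optional output tile receives.

```python
def shared_add_ref(dst, values):
    """Reference model of element-wise atomic add (single-lane view).

    Mutates dst in place and returns the pre-RMW values, mirroring
    what `output` would receive. Not the real instruction; names
    and the list-based layout are illustrative only.
    """
    assert len(dst) == len(values), "shapes must match; no broadcasting"
    pre = []
    for i, v in enumerate(values):
        pre.append(dst[i])   # value at dst[i] before the update
        dst[i] += v          # the RMW itself
    return pre

tile = [10, 20, 30]
old = shared_add_ref(tile, [1, 2, 3])
# tile is now [11, 22, 33]; old holds the pre-RMW values [10, 20, 30]
```

On hardware, each lane performs its slice of these updates atomically, so concurrent lanes writing to overlapping locations still accumulate correctly.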

Parameters:
  • dst (SharedTensor) – Destination tile in shared memory.

  • values (RegisterTensor) – Per-lane contribution; same shape as dst.

  • sem (str) – PTX memory-ordering qualifier. Candidates: 'relaxed', 'acquire', 'release', 'acq_rel'.

  • scope (str) – PTX sync scope. Candidates: 'cta', 'cluster', 'gpu', 'sys'.

  • output (RegisterTensor, optional) – If provided, the per-element pre-RMW value at each location is written into this register tile (same shape as dst).

Returns:

The pre-RMW register tile when output is consumed downstream; None when the result is unused (in that case the dead-code-elimination pass rewrites the instruction to the cheaper red.* form, which returns nothing).

Return type:

RegisterTensor or None

Notes

  • Thread group: Can be executed by a thread group of any size.

  • Hardware: Requires compute capability 7.0+ (sm_70).

  • PTX: atom.{sem}.{scope}.shared.add.s32 (or red.* when the output is unused).
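A common use of the pre-RMW output is fetch-and-add slot allocation: each lane bumps a shared counter and uses the old value as a unique, non-overlapping offset. The sketch below simulates that pattern in plain Python, with a lock standing in for the hardware atomic and threads standing in for lanes; all names here are illustrative, not part of the API.

```python
import threading

counter = [0]            # one-element "shared tile" acting as a counter
lock = threading.Lock()  # stands in for the hardware atomic RMW
slots = {}

def lane(lane_id, n_items):
    # Atomic fetch-and-add: read the old value and bump the counter
    # in one critical section, like atom.add returning the pre-RMW value.
    with lock:
        old = counter[0]
        counter[0] += n_items
    slots[lane_id] = old  # each lane gets a disjoint range of n_items slots

threads = [threading.Thread(target=lane, args=(i, 4)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter[0] == 32; the claimed offsets are the disjoint multiples of 4
```

When the pre-RMW value is not needed, omitting output lets the instruction lower to red.*, which avoids returning data to registers.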