Script.atomic.shared_scatter_add

Script.atomic.shared_scatter_add(dst, *, dim, indices, values, sem='relaxed', scope='cta', output=None)

Scatter-add into a shared tile along dim.

For each tile element k, atomically performs dst[..., indices[k], ...] += values[k], where indices[k] selects the position along dim and the remaining (non-scatter) axes are taken from the element's own tile position.

indices and values must have identical shapes and an identical RegisterLayout; dst's axes other than dim must match the corresponding axes of indices exactly. Out-of-range index values are undefined behavior; there is no runtime bounds check. The sketch below illustrates the intended semantics.
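
As an illustration only, the following non-atomic NumPy sketch mirrors these semantics. The helper name scatter_add_reference and the explicit element loop are assumptions made for exposition; the real instruction performs the additions atomically in shared memory.

    import numpy as np

    def scatter_add_reference(dst, dim, indices, values):
        """Non-atomic reference of the scatter-add semantics described above."""
        pre = np.empty_like(values)              # mirrors the optional `output`
        for k in np.ndindex(indices.shape):      # every tile element
            d = list(k)
            d[dim] = indices[k]                  # redirect the scatter axis
            d = tuple(d)
            pre[k] = dst[d]                      # pre-RMW value
            dst[d] += values[k]                  # accumulate
        return pre

    # Example: scatter 3 contributions per row into a (4, 8) destination.
    dst = np.zeros((4, 8), dtype=np.int64)
    idx = np.array([[1, 1, 5], [0, 7, 7], [2, 2, 2], [3, 0, 3]])
    val = np.ones_like(idx)
    scatter_add_reference(dst, dim=1, indices=idx, values=val)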

Parameters:
  • dst (SharedTensor) – Destination tile in shared memory.

  • dim (int) – Compile-time scatter axis into dst.

  • indices (RegisterTensor) – Per-lane integer indices along dim.

  • values (RegisterTensor) – Per-lane contributions; same shape and layout as indices.

  • sem (str) – PTX memory-ordering qualifier. See AtomicInstructionGroup for the accepted values.

  • scope (str) – PTX sync scope. See AtomicInstructionGroup.

  • output (RegisterTensor, optional) – If provided, receives the per-element pre-RMW value at each scattered location (same shape as indices).

Returns:

The pre-RMW values when output is provided and consumed downstream; None when it is unused (dead-code elimination then rewrites the instruction to the cheaper red.* form).

Return type:

RegisterTensor or None

Notes

  • Thread group: Can be executed by a thread group of any size.

  • Hardware: Requires compute capability 7.0+ (sm_70).

  • PTX: atom.{sem}.{scope}.shared.add.s32 (or red.* when the output is unused).
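
A hedged usage sketch based only on the signature documented above: smem_hist, bin_ids, and counts are hypothetical placeholders for a SharedTensor and two RegisterTensors created earlier in the kernel body, and the self.atomic access simply follows the Script.atomic path shown in the heading.

    # Hypothetical kernel-body excerpt; smem_hist, bin_ids, and counts are assumed
    # to have been allocated/loaded earlier and are not constructed here.
    self.atomic.shared_scatter_add(
        smem_hist,         # SharedTensor destination tile
        dim=0,             # compile-time scatter axis into smem_hist
        indices=bin_ids,   # per-lane integer indices along dim
        values=counts,     # per-lane contributions, same shape/layout as bin_ids
        sem='relaxed',     # default memory-ordering qualifier
        scope='cta',       # default sync scope
    )                      # output omitted, so the lowering uses the cheaper red.* form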