Script.atomic.shared_scatter_add

Script.atomic.shared_scatter_add(dst, *, dim, indices, values, sem='relaxed', scope='cta', output=None)

Scatter-add into a shared tile along dim.

For each tile element k, atomically performs dst[..., indices[k], ...] += values[k], where indices[k] selects the position along dim and the remaining (non-scatter) axes are taken from the element's own tile position.

indices and values must have identical shapes and an identical RegisterLayout; dst's axes other than dim must match the corresponding axes of indices exactly. Out-of-range index values are undefined behavior; there is no runtime bounds check. The sketch below illustrates the intended semantics.
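
As an illustration only, the following non-atomic NumPy sketch mirrors these semantics. The helper name scatter_add_reference and the explicit element loop are assumptions made for exposition; the real instruction performs the additions atomically in shared memory.

    import numpy as np

    def scatter_add_reference(dst, dim, indices, values):
        """Non-atomic reference of the scatter-add semantics described above."""
        pre = np.empty_like(values)              # mirrors the optional `output`
        for k in np.ndindex(indices.shape):      # every tile element
            d = list(k)
            d[dim] = indices[k]                  # redirect the scatter axis
            d = tuple(d)
            pre[k] = dst[d]                      # pre-RMW value
            dst[d] += values[k]                  # accumulate
        return pre

    # Example: scatter 3 contributions per row into a (4, 8) destination.
    dst = np.zeros((4, 8), dtype=np.int64)
    idx = np.array([[1, 1, 5], [0, 7, 7], [2, 2, 2], [3, 0, 3]])
    val = np.ones_like(idx)
    scatter_add_reference(dst, dim=1, indices=idx, values=val)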

Parameters:
  • dst (SharedTensor) – Destination tile in shared memory.

  • dim (int) – Compile-time scatter axis into dst.

  • indices (RegisterTensor) – Per-lane integer indices along dim.

  • values (RegisterTensor) – Per-lane contributions; same shape and layout as indices.

  • sem (str) – PTX memory-ordering qualifier. See AtomicInstructionGroup for the accepted values.

  • scope (str) – PTX sync scope. See AtomicInstructionGroup.

  • output (RegisterTensor, optional) – If provided, receives the per-element pre-RMW value at each scattered location (same shape as indices).

Returns:

The pre-RMW values when output is provided and consumed downstream; None when it is unused (dead-code elimination then rewrites the instruction to the cheaper red.* form).

Return type:

RegisterTensor or None

Notes

  • Thread group: Can be executed by a thread group of any size.

  • Hardware: Requires compute capability 7.0+ (sm_70).

  • PTX: atom.{sem}.{scope}.shared.add.s32 (or red.* when the output is unused).
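
A hedged usage sketch based only on the signature documented above: smem_hist, bin_ids, and counts are hypothetical placeholders for a SharedTensor and two RegisterTensors created earlier in the kernel body, and the self.atomic access simply follows the Script.atomic path shown in the heading.

    # Hypothetical kernel-body excerpt; smem_hist, bin_ids, and counts are assumed
    # to have been allocated/loaded earlier and are not constructed here.
    self.atomic.shared_scatter_add(
        smem_hist,         # SharedTensor destination tile
        dim=0,             # compile-time scatter axis into smem_hist
        indices=bin_ids,   # per-lane integer indices along dim
        values=counts,     # per-lane contributions, same shape/layout as bin_ids
        sem='relaxed',     # default memory-ordering qualifier
        scope='cta',       # default sync scope
    )                      # output omitted, so the lowering uses the cheaper red.* form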