Script.atomic

Tile-level atomic read-modify-write instructions.

Two flavours are offered. The element-wise form (shared_add() and friends) operates tile-to-tile: each lane contributes its own element of a register tile into the matching element of a shared or global tile, under the requested PTX atom.* operation. The scatter form (shared_scatter_add() and friends) follows torch.scatter_add_ semantics — a compile-time dim axis plus a per-lane indices tile picks the destination offset; the non-scatter axes come from the tile’s global position.

The element-wise family is add / sub / min / max / exch / cas; the scatter family keeps only add / sub / min / max, since exch and cas have unclear semantics under duplicate indices.

All methods accept two optional PTX qualifiers, sem (memory ordering, default 'relaxed') and scope (sync scope, default 'cta' on shared ops and 'gpu' on global ops), plus an optional output register tile that receives the per-element pre-RMW value at each target location. When no downstream code uses the returned register, the dead-code-elimination pass rewrites the instruction to carry output=None and codegen lowers to the destination-less red.* PTX form, so the return value is free when unused. Each of these points is expanded in the sections below.

See also

tilus.Script.store_shared_scatter()

Non-atomic scatter stores that share the same indices / values contract. Use them when the scatter is guaranteed collision-free.
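
A quick contrast (a sketch; the tile names are illustrative, and store_shared_scatter is assumed to mirror the atomic scatter signature):

# `perm` is a permutation → no two lanes target the same slot, so a
# plain (non-atomic) scatter store suffices.
self.store_shared_scatter(dst, dim=0, indices=perm, values=vals)

# `bins` may contain duplicates → the atomic form is required.
self.atomic.shared_scatter_add(dst, dim=0, indices=bins, values=vals)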

Element-wise vs. scatter

Two shapes of tile-level atomic RMW are exposed:

Element-wise: shared_add() and friends. dst.shape == values.shape, and each lane contributes its own slice of values into the matching slice of dst under the chosen PTX atom.* op. There is no broadcast and no reduction: this is the right primitive when each address is written independently, e.g. updating a per-element counter or accumulating partial results across thread groups.

# Each lane contributes its own value to dst[lane]; one atomic op per lane.
self.atomic.shared_add(dst, values)

Scatter: shared_scatter_add() and friends. Modelled after torch.scatter_add_: a compile-time dim plus an indices register tile picks the destination along that axis, while the non-scatter axes come from the lane's own tile position. This is the right primitive for histograms and other data-dependent address patterns.

# Each lane picks bin = indices[lane] and atomic-adds 1 into that bin.
self.atomic.shared_scatter_add(
    hist, dim=0, indices=bins, values=ones)

Op family

The element-wise family is add / sub / min / max / exch / cas. The scatter family drops exch and cas — their semantics under duplicate indices are not well defined — and keeps add / sub / min / max.

PTX has no native atom.sub, so sub variants lower to atom.add with a negated operand at codegen time.
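
In user code the call looks like any other op; only codegen differs. A minimal sketch, with hypothetical tile names:

# PTX has no atom.sub / red.sub: this emits red.relaxed.cta.shared.add.s32
# with `deltas` negated into a scratch register first.
self.atomic.shared_sub(counters, deltas)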

In v1 only the int32 dtype is supported. float32 and uint32 coverage can be added by extending the dtype table in the underlying hidet primitive layer.

sem and scope qualifiers

All instructions accept two PTX qualifiers:

  • sem: memory-ordering qualifier, one of 'relaxed', 'acquire', 'release', 'acq_rel'. Defaults to 'relaxed'. Matches the atom.{sem}.* PTX syntax.

  • scope: sync-scope qualifier, one of 'cta', 'cluster', 'gpu', 'sys'. Defaults to 'cta' on shared-memory ops and 'gpu' on global-memory ops.

Both are passed through to the generated atom.{sem}.{scope}.{space}.{op}.{dtype} (or red.*) instruction.
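
For example (a sketch; flags and ones are hypothetical int32 tiles of matching shape):

# Release-ordered, GPU-scoped add: emits atom.release.gpu.global.add.s32
# (or the red.* form when the return value is unused).
self.atomic.global_add(flags, ones, sem='release', scope='gpu')

# Defaults apply when the qualifiers are omitted: shared-memory ops get
# sem='relaxed', scope='cta'.
self.atomic.shared_min(dst, values)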

Optional output register

Every atomic method accepts an optional output register tile that receives the pre-RMW value at each destination location. When the returned register is not consumed by any downstream instruction, the DCE pass rewrites the instruction to carry output=None and codegen lowers it to the destination-less red.* PTX form instead of atom.*. The net effect is that you only pay for the register return when your code actually uses it:

# No caller reads the return value → lowers to `red.*`.
self.atomic.shared_scatter_add(hist, dim=0, indices=bins, values=ones)

# Caller consumes `old` → lowers to `atom.*` with a destination register.
old = self.atomic.shared_cas(lock, compare=zero, values=one)
with self.if_then(old == 0):
    ...

exch and cas have no red.* counterpart in PTX, so their output is effectively always bound — if unused the register simply goes to waste.
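
For instance (sketch; slot and new_vals are hypothetical tiles):

# Always lowers to atom.exch with a bound destination register, even
# though `old` is never read afterwards.
old = self.atomic.global_exch(slot, new_vals)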

Instructions

Element-wise (shared memory)

shared_add(dst, values, *[, sem, scope, output])

Element-wise dst[i] = dst[i] + values[i] atomically, on shared memory.

shared_sub(dst, values, *[, sem, scope, output])

Element-wise dst[i] = dst[i] - values[i] atomically, on shared memory.

shared_min(dst, values, *[, sem, scope, output])

Element-wise dst[i] = min(dst[i], values[i]) atomically, on shared memory.

shared_max(dst, values, *[, sem, scope, output])

Element-wise dst[i] = max(dst[i], values[i]) atomically, on shared memory.

shared_exch(dst, values, *[, sem, scope, output])

Element-wise old = dst[i]; dst[i] = values[i] atomically (exchange).

shared_cas(dst, compare, values, *[, sem, ...])

Element-wise compare-and-swap on shared memory.

Element-wise (global memory)

global_add(dst, values, *[, sem, scope, output])

Element-wise dst[i] = dst[i] + values[i] atomically, on global memory.

global_sub(dst, values, *[, sem, scope, output])

Element-wise dst[i] = dst[i] - values[i] atomically, on global memory.

global_min(dst, values, *[, sem, scope, output])

Element-wise dst[i] = min(dst[i], values[i]) atomically, on global memory.

global_max(dst, values, *[, sem, scope, output])

Element-wise dst[i] = max(dst[i], values[i]) atomically, on global memory.

global_exch(dst, values, *[, sem, scope, output])

Element-wise atomic exchange on global memory.

global_cas(dst, compare, values, *[, sem, ...])

Element-wise compare-and-swap on global memory.

Scatter (shared memory)

shared_scatter_add(dst, *, dim, indices, values)

Scatter-add into a shared tile along dim.

shared_scatter_sub(dst, *, dim, indices, values)

Scatter-sub into a shared tile along dim; lowered to atom.add with a negated value.

shared_scatter_min(dst, *, dim, indices, values)

Scatter-min into a shared tile along dim.

shared_scatter_max(dst, *, dim, indices, values)

Scatter-max into a shared tile along dim.

Scatter (global memory)

global_scatter_add(dst, *, dim, indices, values)

Scatter-add into a global tile along dim.

global_scatter_sub(dst, *, dim, indices, values)

Scatter-sub into a global tile along dim; lowered to atom.add with a negated value.

global_scatter_min(dst, *, dim, indices, values)

Scatter-min into a global tile along dim.

global_scatter_max(dst, *, dim, indices, values)

Scatter-max into a global tile along dim.
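
Putting it together

To tie the two families together, here is a hedged end-to-end sketch of the histogram pattern mentioned above. The tile names and the non-atomic helpers (self.sync(), self.load_shared()) are illustrative assumptions, not part of the documented API; only the self.atomic.* calls follow the signatures on this page.

# Per-CTA histogram: scatter into a shared int32 tile, then merge the
# partial result into a global tile with element-wise atomics.
self.atomic.shared_scatter_add(hist_smem, dim=0, indices=bins, values=ones)
self.sync()                               # assumed CTA barrier
hist_regs = self.load_shared(hist_smem)   # assumed shared→register load

# The return value is unused, so DCE sets output=None and this lowers
# to the destination-less red.* form.
self.atomic.global_add(hist_gmem, hist_regs, scope='gpu')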