Script.atomic¶
Tile-level atomic read-modify-write instructions.
Two flavours are offered. The element-wise form (shared_add() and
friends) operates tile-to-tile: each lane contributes its own element of a
register tile into the matching element of a shared or global tile, under
the requested PTX atom.* operation. The scatter form
(shared_scatter_add() and friends) follows torch.scatter_add_
semantics — a compile-time dim axis plus a per-lane indices tile
picks the destination offset; the non-scatter axes come from the tile’s
global position.
The full element-wise family is add / sub / min / max /
exch / cas. The scatter family drops exch and cas (their
semantics under duplicate indices are unclear) and keeps just add /
sub / min / max.
All methods accept two optional PTX qualifiers:
- sem: the memory-ordering qualifier. Candidates: 'relaxed', 'acquire', 'release', 'acq_rel'. Defaults to 'relaxed'.
- scope: the sync-scope qualifier. Candidates: 'cta', 'cluster', 'gpu', 'sys'. Defaults to 'cta' on shared ops and 'gpu' on global ops.
All methods also accept an optional output register tensor that receives the per-element
pre-RMW value at each target location. When no downstream code uses the
returned register, the dead-code-elimination pass rewrites the instruction
to carry output=None and codegen lowers to the destination-less
red.* PTX form — so the return value is free when unused.
See also
tilus.Script.store_shared_scatter(): Non-atomic scatter stores that share the same
indices/values contract. Use them when the scatter is guaranteed collision-free.
Element-wise vs. scatter¶
Two shapes of tile-level atomic RMW are exposed:
Element-wise — shared_add() and friends.
dst.shape == values.shape and each lane contributes its own slice of
values into the matching slice of dst under the chosen PTX
atom.* op. There is no broadcast and no reduction: this is the right
primitive when each address is written independently, e.g. updating a
per-element counter or applying a reduction across thread groups.
# Each lane contributes its own value to dst[lane]; one atomic op per lane.
self.atomic.shared_add(dst, values)
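The shape contract can be checked with a tiny NumPy model of the net sequential effect (the tile shapes below are made up for illustration; on the GPU each element update is a separate atom.add):

```python
import numpy as np

# Hypothetical tile shapes; dst and values must match exactly (no broadcast).
dst = np.zeros((4, 8), dtype=np.int32)     # stands in for the shared tile
values = np.ones((4, 8), dtype=np.int32)   # one element per lane

assert dst.shape == values.shape  # the contract shared_add() enforces

# Net sequential effect of the per-element atomic adds:
dst += values
```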
Scatter — shared_scatter_add() and
friends. Modelled after torch.scatter_add_: a compile-time dim plus
an indices register tile picks the destination along that axis, while
the non-scatter axes come from the lane’s own tile position. This is the
right primitive for histograms and other data-dependent address patterns.
# Each lane picks bin = indices[lane] and atomic-adds 1 into that bin.
self.atomic.shared_scatter_add(
    hist, dim=0, indices=bins, values=ones)
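The scatter-add contract can be modelled with NumPy's unbuffered np.add.at, which, like the atomic form, accumulates correctly under duplicate indices (the bin values below are made up):

```python
import numpy as np

hist = np.zeros(8, dtype=np.int32)       # 8 hypothetical bins
bins = np.array([0, 3, 3, 7, 3, 0])      # per-lane bin index
ones = np.ones_like(bins, dtype=np.int32)

# np.add.at is unbuffered: duplicate indices each contribute, which is
# the same guarantee the atomic scatter-add gives on the GPU.
np.add.at(hist, bins, ones)
# hist -> [2, 0, 0, 3, 0, 0, 0, 1]
```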
Op family¶
The element-wise family is add / sub / min / max / exch
/ cas. The scatter family drops exch and cas — their
semantics under duplicate indices are not well defined — and keeps
add / sub / min / max.
PTX has no native atom.sub, so sub variants lower to atom.add
with a negated operand at codegen time.
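A sequential sketch of that lowering (the helper below is illustrative, not part of the codegen):

```python
def atomic_sub_via_add(cell, operand):
    # Model of the lowering: sub is an add of the negated operand,
    # and the pre-RMW value is returned either way.
    old = cell[0]
    cell[0] = old + (-operand)
    return old

cell = [10]
assert atomic_sub_via_add(cell, 3) == 10  # pre-RMW value
assert cell[0] == 7
```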
In v1 only the int32 dtype is supported. float32 and uint32
coverage can be added by extending the dtype table in the underlying hidet
primitive layer.
sem and scope qualifiers¶
All instructions accept two PTX qualifiers:
- sem: memory-ordering qualifier, one of 'relaxed', 'acquire', 'release', 'acq_rel'. Defaults to 'relaxed'. Matches the atom.{sem}.* PTX syntax.
- scope: sync-scope qualifier, one of 'cta', 'cluster', 'gpu', 'sys'. Defaults to 'cta' on shared-memory ops and 'gpu' on global-memory ops.
Both are passed through to the generated atom.{sem}.{scope}.{space}.{op}.{dtype} (or red.*) instruction.
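A sketch of how the qualifiers compose into the opcode string (the helper name, the defaulting logic, and the s32 suffix are illustrative assumptions, not the tilus codegen):

```python
def ptx_opcode(op, dtype, space, sem="relaxed", scope=None, output_used=True):
    # Defaults mirror the docs: 'cta' for shared ops, 'gpu' for global ops.
    if scope is None:
        scope = "cta" if space == "shared" else "gpu"
    # An unused output lowers to the destination-less red.* form.
    base = "atom" if output_used else "red"
    return f"{base}.{sem}.{scope}.{space}.{op}.{dtype}"

assert ptx_opcode("add", "s32", "shared") == "atom.relaxed.cta.shared.add.s32"
assert ptx_opcode("add", "s32", "global", output_used=False) == "red.relaxed.gpu.global.add.s32"
```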
Optional output register¶
Every atomic method accepts an optional output register tile that
receives the pre-RMW value at each destination location. When the
returned register is not consumed by any downstream instruction, the DCE
pass rewrites the instruction to carry output=None and codegen
lowers it to the destination-less red.* PTX form instead of
atom.*. The net effect is that you only pay for the register return
when your code actually uses it:
# No caller reads the return value → lowers to `red.*`.
self.atomic.shared_scatter_add(hist, dim=0, indices=bins, values=ones)
# Caller consumes `old` → lowers to `atom.*` with a destination register.
old = self.atomic.shared_cas(lock, compare=zero, values=one)
with self.if_then(old == 0):
...
exch and cas have no red.* counterpart in PTX, so their
output is effectively always bound — if unused the register simply goes
to waste.
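A sequential model of the shared_cas() lock idiom above (the list-as-cell representation is purely illustrative):

```python
def cas(cell, compare, value):
    # Write `value` only if the current value equals `compare`;
    # always return the pre-RMW value.
    old = cell[0]
    if old == compare:
        cell[0] = value
    return old

lock = [0]
assert cas(lock, compare=0, value=1) == 0  # acquired: old was 0
assert cas(lock, compare=0, value=1) == 1  # contended: old is now 1
```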
Instructions¶
Element-wise (shared memory)

- shared_add(): Element-wise atomic add on shared memory.
- shared_sub(): Element-wise atomic sub on shared memory.
- shared_min(): Element-wise atomic min on shared memory.
- shared_max(): Element-wise atomic max on shared memory.
- shared_exch(): Element-wise atomic exchange on shared memory.
- shared_cas(): Element-wise compare-and-swap on shared memory.

Element-wise (global memory)

- global_add(): Element-wise atomic add on global memory.
- global_sub(): Element-wise atomic sub on global memory.
- global_min(): Element-wise atomic min on global memory.
- global_max(): Element-wise atomic max on global memory.
- global_exch(): Element-wise atomic exchange on global memory.
- global_cas(): Element-wise compare-and-swap on global memory.

Scatter (shared memory)

- shared_scatter_add(): Scatter-add into a shared tile along dim.
- shared_scatter_sub(): Scatter-sub into a shared tile along dim.
- shared_scatter_min(): Scatter-min into a shared tile along dim.
- shared_scatter_max(): Scatter-max into a shared tile along dim.

Scatter (global memory)

- global_scatter_add(): Scatter-add into a global tile along dim.
- global_scatter_sub(): Scatter-sub into a global tile along dim.
- global_scatter_min(): Scatter-min into a global tile along dim.
- global_scatter_max(): Scatter-max into a global tile along dim.