Script.fence

Fence instructions for memory ordering between memory proxies.

CUDA GPUs have multiple memory proxies (generic, async, alias, tensormap) that can access the same memory through different paths. When one proxy writes data that another proxy needs to read, a proxy fence is required to ensure the writes are visible.

The most common scenario is coordinating between:

  • Generic proxy writes: store_shared(), register-to-shared stores

  • Async proxy reads/writes: TMA operations (tma.global_to_shared(), tma.shared_to_global())

For example, after writing to shared memory with store_shared() and before issuing tma.shared_to_global(), a proxy_async() or proxy_async_release() fence is needed to ensure the TMA engine sees the updated data.

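proxy_async_release() is a lighter-weight alternative to proxy_async() when ordering is needed in one direction only: generic-proxy writes to shared memory followed by async-proxy accesses. Use proxy_async() when ordering is needed in both directions.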
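The ordering requirement above can be illustrated with a small pure-Python toy model. This is only a sketch and does not use or reproduce the real API: the class SharedMemoryModel and its methods are hypothetical names chosen to mirror the instructions discussed here. The model assumes the worst case, in which the async proxy sees none of the generic-proxy stores until a fence is issued. Real hardware makes no guarantee either way when the fence is missing.

```python
class SharedMemoryModel:
    """Toy model of one shared-memory buffer seen through two proxies.

    Hypothetical illustration only: generic-proxy stores are staged and
    become visible to the async proxy once a proxy fence is issued.
    """

    def __init__(self, size):
        self._async_view = [0] * size  # what the async proxy (TMA) observes
        self._pending = {}             # generic-proxy stores not yet fenced

    def store_shared(self, index, value):
        # Generic-proxy write: recorded, but not yet ordered for the async proxy.
        self._pending[index] = value

    def proxy_async(self):
        # Proxy fence: publish all earlier generic-proxy writes to the async view.
        for index, value in self._pending.items():
            self._async_view[index] = value
        self._pending.clear()

    def tma_shared_to_global(self):
        # Async-proxy read: returns a copy of what the async proxy can see.
        return list(self._async_view)


smem = SharedMemoryModel(4)
for i in range(4):
    smem.store_shared(i, i + 1)

stale = smem.tma_shared_to_global()  # issued without a fence
smem.proxy_async()                   # fence between the stores and the copy
fresh = smem.tma_shared_to_global()  # issued after the fence

print(stale)
print(fresh)
```

In this model the copy issued before the fence returns the old contents, and the copy issued after the fence returns the stored values. That is the reason the fence belongs between the store_shared() calls and tma.shared_to_global().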

Instructions

proxy_async([space])

Bidirectional async proxy fence.

proxy_async_release()

Unidirectional generic-to-async release proxy fence for shared memory.