Script.fence
Fence instructions for memory ordering between memory proxies.
CUDA GPUs have multiple memory proxies (generic, async, alias, tensormap) that can access the same memory through different paths. When one proxy writes data that another proxy needs to read, a proxy fence is required to ensure the writes are visible.
The most common scenario is coordinating between:
- Generic proxy writes: store_shared(), register-to-shared stores
- Async proxy reads/writes: TMA operations (tma.global_to_shared(), tma.shared_to_global())
For example, after writing to shared memory with store_shared() and before issuing
tma.shared_to_global(), a proxy_async() or proxy_async_release() fence is needed
to ensure the TMA engine sees the updated data.
proxy_async_release() is a lighter-weight alternative to proxy_async() when only
generic-to-async ordering is needed (not bidirectional).
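The ordering rules above can be illustrated with a small pure-Python toy model. This is only a sketch under stated assumptions: the ToySharedMemory class and all of its method signatures are hypothetical, it does not call this library or touch a GPU, and it models the worst case in which an unfenced write is never observed by the other proxy (real hardware merely makes no visibility guarantee). Method names mirror the instructions mentioned in this section purely for readability.

```python
class ToySharedMemory:
    """Hypothetical toy model of two proxies viewing the same shared memory."""

    def __init__(self, size):
        self.generic_view = [0] * size  # what generic-proxy loads observe
        self.async_view = [0] * size    # what the async proxy (TMA) observes
        self._pending_generic = {}      # generic writes not yet fenced
        self._pending_async = {}        # async writes not yet fenced

    def store_shared(self, index, value):
        # Generic-proxy write: immediately visible to the generic proxy only.
        self.generic_view[index] = value
        self._pending_generic[index] = value

    def load_shared(self, index):
        # Generic-proxy read.
        return self.generic_view[index]

    def tma_global_to_shared(self, index, value):
        # Async-proxy write: immediately visible to the async proxy only.
        self.async_view[index] = value
        self._pending_async[index] = value

    def tma_shared_to_global(self, index):
        # Async-proxy read: returns what the TMA engine would copy out.
        return self.async_view[index]

    def proxy_async_release(self):
        # Unidirectional fence: publish generic writes to the async proxy.
        for index, value in self._pending_generic.items():
            self.async_view[index] = value
        self._pending_generic.clear()

    def proxy_async(self):
        # Bidirectional fence: generic -> async, then async -> generic.
        # Conflicting writes to the same index are a race and are not modeled.
        self.proxy_async_release()
        for index, value in self._pending_async.items():
            self.generic_view[index] = value
        self._pending_async.clear()


smem = ToySharedMemory(4)

# Generic write followed by a TMA read without a fence: the model returns stale data.
smem.store_shared(0, 42)
stale = smem.tma_shared_to_global(0)

# The release fence is enough for the generic-to-async direction.
smem.proxy_async_release()
fresh = smem.tma_shared_to_global(0)

# The release fence does not order async writes for generic reads.
smem.tma_global_to_shared(1, 7)
smem.proxy_async_release()
unordered = smem.load_shared(1)

# The bidirectional fence covers both directions.
smem.proxy_async()
ordered = smem.load_shared(1)
```

In this model, stale is 0 and fresh is 42, while unordered stays 0 until proxy_async() makes ordered equal 7. The takeaway matches the text: use the release variant when the only hazard is a generic write followed by an async read, and the bidirectional fence otherwise.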
Instructions
| proxy_async() | Bidirectional async proxy fence. |
| proxy_async_release() | Unidirectional generic-to-async release proxy fence for shared memory. |