cuda::apply_access_property

template <class ShapeT>
__host__ __device__
void apply_access_property(void const volatile* ptr, ShapeT shape, cuda::access_property::persisting) noexcept;
template <class ShapeT>
__host__ __device__
void apply_access_property(void const volatile* ptr, ShapeT shape, cuda::access_property::normal) noexcept;

Mandates: ShapeT is either std::size_t or cuda::aligned_size_t.

Preconditions: ptr points to a valid allocation for shape in the global memory address space.

Effects: no effects.

Hint: to prefetch shape bytes of memory starting at ptr while applying a property. Two properties are supported:

  • cuda::access_property::persisting, which hints that the memory is accessed more frequently than other data, and

  • cuda::access_property::normal, which resets that hint, i.e., the memory is accessed as frequently as other data.

Note: in Preconditions, “valid allocation for shape” means that:

  • if ShapeT is aligned_size_t<N>(sz) then ptr is aligned to an N-bytes alignment boundary, and

  • for all offsets i in the extent of shape, i.e., i in [0, shape), the expression *(ptr + i) does not exhibit undefined behavior.

Note: currently apply_access_property is ignored by nvcc and nvc++ on the host.
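For illustration, a minimal device-code sketch of the two shape forms; the buffer buf and its size of 128 floats are hypothetical, and buf is assumed to point into global memory at a 16-byte alignment boundary:

__device__ void hint_persisting(float* buf) {
    // Plain byte count: ShapeT is deduced as std::size_t.
    cuda::apply_access_property(buf, 128 * sizeof(float), cuda::access_property::persisting{});
    // Byte count carrying an alignment guarantee: ShapeT is cuda::aligned_size_t<16>.
    cuda::apply_access_property(buf, cuda::aligned_size_t<16>(128 * sizeof(float)), cuda::access_property::persisting{});
}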

Example

Given three input and output vectors x, y, and z, and two arrays of coefficients a and b, all of length N:

size_t N;
int* x, *y, *z;
int* a, *b;

the grid-strided kernel:

__global__ void update(int* const x, int const* const a, int const* const b, size_t N) {
    auto g = cooperative_groups::this_grid();
    // Grid-strided loop: each thread handles every g.size()-th element.
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        x[idx] = a[idx] * x[idx] + b[idx];
    }
}

updates x, y, and z as follows:

update<<<grid, block>>>(x, a, b, N);
update<<<grid, block>>>(y, a, b, N);
update<<<grid, block>>>(z, a, b, N);
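Neither the allocations nor the launch configuration are shown above; a minimal host-side sketch, assuming plain cudaMalloc device allocations, a 1-D launch, and an illustrative problem size, might look like:

size_t N = 1 << 20;                      // illustrative problem size
int *x, *y, *z, *a, *b;
cudaMalloc(&x, N * sizeof(int));
cudaMalloc(&y, N * sizeof(int));
cudaMalloc(&z, N * sizeof(int));
cudaMalloc(&a, N * sizeof(int));
cudaMalloc(&b, N * sizeof(int));
dim3 block(256);
dim3 grid((N + block.x - 1) / block.x);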

The elements of a and b are used in all three kernel launches. For certain values of N, parts of a and b may be evicted from the L2 cache between launches, requiring them to be re-loaded from memory by the subsequent update kernel.

With cuda::access_property and cuda::apply_access_property, we can write kernels that hint that a and b are accessed more frequently than other data (pin), and that later reset this hint so that a and b are treated as accessed as frequently as other data (unpin):

__global__ void pin(int* a, int* b, size_t N) {
    auto g = cooperative_groups::this_grid();
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        // Hint that these elements are accessed more frequently than other data.
        cuda::apply_access_property(a + idx, sizeof(int), cuda::access_property::persisting{});
        cuda::apply_access_property(b + idx, sizeof(int), cuda::access_property::persisting{});
    }
}
__global__ void unpin(int* a, int* b, size_t N) {
    auto g = cooperative_groups::this_grid();
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        // Reset the hint: these elements are accessed as frequently as other data.
        cuda::apply_access_property(a + idx, sizeof(int), cuda::access_property::normal{});
        cuda::apply_access_property(b + idx, sizeof(int), cuda::access_property::normal{});
    }
}

which we can launch before and after the update kernels:

pin<<<grid, block>>>(a, b, N);
update<<<grid, block>>>(x, a, b, N);
update<<<grid, block>>>(y, a, b, N);
update<<<grid, block>>>(z, a, b, N);
unpin<<<grid, block>>>(a, b, N);

This does not require modifying the update kernel, and for certain values of N prevents a and b from having to be re-loaded from memory.

The pin and unpin kernels can be fused into the kernels for the x and z updates by modifying these kernels.
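For example, a minimal sketch of such a fusion; the kernel names update_and_pin and update_and_unpin are hypothetical:

__global__ void update_and_pin(int* const x, int const* const a, int const* const b, size_t N) {
    auto g = cooperative_groups::this_grid();
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        x[idx] = a[idx] * x[idx] + b[idx];
        // Pin the coefficients while updating x so they persist for the y and z updates.
        cuda::apply_access_property(a + idx, sizeof(int), cuda::access_property::persisting{});
        cuda::apply_access_property(b + idx, sizeof(int), cuda::access_property::persisting{});
    }
}
__global__ void update_and_unpin(int* const x, int const* const a, int const* const b, size_t N) {
    auto g = cooperative_groups::this_grid();
    for (size_t idx = g.thread_rank(); idx < N; idx += g.size()) {
        x[idx] = a[idx] * x[idx] + b[idx];
        // Reset the access property once the last update no longer needs the coefficients.
        cuda::apply_access_property(a + idx, sizeof(int), cuda::access_property::normal{});
        cuda::apply_access_property(b + idx, sizeof(int), cuda::access_property::normal{});
    }
}

which would replace the pin launch together with the first update, and the unpin launch together with the last update, respectively:

update_and_pin<<<grid, block>>>(x, a, b, N);
update<<<grid, block>>>(y, a, b, N);
update_and_unpin<<<grid, block>>>(z, a, b, N);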