Places #

Places are abstractions that represent where code executes and where data resides across the non-uniform memory of a CUDA system. They provide a unified interface for managing execution affinity, stream pools, memory allocation, and device context switching – independently of any task-based programming model.

Places come in two flavors:

Execution places (exec_place) determine where code is executed.
Data places (data_place) specify where data is located in memory.

The places API is part of the cuda::experimental::places C++ namespace and can be used standalone via the cuda/experimental/places.cuh header, without pulling in the full CUDASTF task-graph framework. For backward compatibility, all places types are also available in the cuda::experimental::stf namespace.

Execution places #

An execution place describes a location where computation can occur. The following factory methods create the most common execution places:

exec_place::device(id) – a specific CUDA device
exec_place::host() – the host CPU
exec_place::current_device() – the CUDA device that is currently active

When an execution place is activated, it sets the appropriate CUDA context (e.g. calls cudaSetDevice). Each execution place also has an affine data place: the memory location naturally associated with it. For a device execution place the affine data place is the device’s global memory; for the host it is pinned host memory (RAM).

Data places #

A data place describes a memory location where data can reside. The following factory methods are available:

data_place::device(id) – global memory of a specific CUDA device
data_place::host() – pinned host memory
data_place::managed() – CUDA managed (unified) memory
data_place::affine() – the data place naturally associated with the current execution place

The affine data place is the default: when no data place is specified, data is placed in the memory that is local to the execution place. For example, a task running on device 0 will access data in device 0’s global memory by default.

Non-affine placement is also supported: data can be placed on a different device or in host memory regardless of where the computation runs. This is useful for sparse accesses (leveraging CUDA Unified Memory page faulting) or for addressing memory capacity constraints. Non-affine placement assumes the hardware and OS support such accesses (NVLINK, UVM, etc.).

Places as container keys #

Both exec_place and data_place can be used as keys in standard associative containers. The library provides the required comparison and hash support:

``std::map`` and ``std::set`` use operator< (strict weak ordering) for keys. Both place types implement operator<, so they can be used as ordered map or set keys.
``std::unordered_map`` and ``std::unordered_set`` require a hash function and equality. The library specializes cuda::experimental::stf::hash for both place types (accessible from both the stf and places namespaces), and both implement operator==.

This allows, for example, maintaining per-place handles (e.g. CUBLAS or CUSOLVER handles keyed by exec_place) or per-place caches keyed by data_place, using either ordered or hash-based containers as needed. The following snippet shows lazy creation of a CUBLAS handle per execution place using an std::unordered_map keyed by exec_place:

#include <cuda/experimental/places.cuh>
#include <cublas_v2.h>

using namespace cuda::experimental::places;

cublasHandle_t& get_cublas_handle(const exec_place& ep = exec_place::current_device())
{
  static std::unordered_map<exec_place, cublasHandle_t, hash<exec_place>> handles;
  auto& h = handles[ep];
  if (h == cublasHandle_t{})
  {
    exec_place_scope scope(ep);
    cuda_safe_call(cublasCreate(&h));
  }
  return h;
}

Setting the current device or context #

The exec_place::activate() method provides a generic alternative to cudaSetDevice() that works uniformly across different execution place types. This is useful when you want to set the current CUDA device or context without using tasks.

The method returns an exec_place representing the previous state, which can be used to restore the original device or context.

Behavior by execution place type:

Device places (exec_place::device(id)): Calls cudaSetDevice(id)
Green context places: Sets the current CUDA driver context via cuCtxSetCurrent()
Host places: No-op

Basic usage with devices:

exec_place place = exec_place::device(1);
exec_place prev = place.activate();  // Switch to device 1

// ... perform operations on device 1 ...

place.deactivate(prev);  // Restore previous device

Alternative restoration pattern:

You can also restore by calling activate() on the returned place:

exec_place place = exec_place::device(1);
exec_place prev = place.activate();

// ... work on device 1 ...

prev.activate();  // Equivalent to place.deactivate(prev)

Usage with green contexts (CUDA 12.4+):

Green contexts provide SM-level partitioning of GPU resources. The activate()/deactivate() methods handle the underlying driver context management:

// Create green contexts with 8 SMs each
green_context_helper gc(8, device_id);
auto view = gc.get_view(0);

exec_place gc_place = exec_place::green_ctx(view);
exec_place prev = gc_place.activate();  // Sets green context as current

// ... GPU work runs with SM affinity ...

gc_place.deactivate(prev);  // Restore original context

RAII scope for scoped activation:

For exception-safe code or when you want automatic restoration, use the exec_place_scope RAII helper:

{
    exec_place_scope scope(exec_place::device(1));
    // Device 1 is now active
    // ... perform operations on device 1 ...
}
// Previous device is automatically restored when scope goes out of scope

The guard automatically restores the previous execution place when it goes out of scope, making it useful for exception-safe code.

Stream management with execution places #

Execution places can be used independently of any task system to manage CUDA streams in a structured way. This is useful when you want to use place abstractions (devices, green contexts) for stream management without the full task-based programming model.

Stream pools for pooled places (device(N), host()) live in an exec_place_resources registry that the caller owns. Pass the registry to exec_place::pick_stream to get a CUDA stream; the per-place pool inside the registry is created lazily on first request and is destroyed when the registry is destroyed.

The method accepts an optional for_computation hint (defaults to true) that may select between computation and data transfer stream pools to improve overlapping. This is purely a performance hint, and it does not affect correctness. Not all execution places enforce it.

#include <cuda/experimental/places.cuh>
using namespace cuda::experimental::places;

// Standalone use: own the registry yourself.
exec_place_resources resources;

// Get a stream from the current device
exec_place place = exec_place::current_device();
cudaStream_t stream = place.pick_stream(resources);

// Use the stream for CUDA operations
myKernel<<<grid, block, 0, stream>>>(d_data);

// Get streams from specific devices (sharing the same registry)
cudaStream_t stream_dev0 = exec_place::device(0).pick_stream(resources);
cudaStream_t stream_dev1 = exec_place::device(1).pick_stream(resources);

Inside a CUDASTF context, the context’s async_resources_handle already holds an exec_place_resources registry. Convenience overloads accept the handle directly so call sites do not have to dereference it:

cudaStream_t stream = place.pick_stream(ctx.async_resources());

Stream pools are populated lazily – CUDA streams are only created when first requested via pick_stream(resources) (or pick_stream(ctx.async_resources()) inside CUDASTF). Self-contained places (exec_place::cuda_stream(s), green-context places) ignore the registry and return their own embedded pool instead, so the user-provided cudaStream_t / CUgreenCtx must outlive any place that wraps it.

Memory allocation with data places #

Data places provide a unified interface for memory allocation that works across different memory types (host, device, managed) and place extensions (green contexts, user-defined places). This allows you to allocate memory while benefiting from the place abstraction.

The data_place::allocate() and data_place::deallocate() methods provide raw memory allocation. The stream parameter defaults to nullptr, which is convenient for non-stream-ordered allocations (host, managed) where the stream is ignored:

#include <cuda/experimental/places.cuh>
using namespace cuda::experimental::places;

// Allocate on host (pinned memory) - stream defaults to nullptr
void* host_ptr = data_place::host().allocate(1024);
// ... use host_ptr ...
data_place::host().deallocate(host_ptr, 1024);

// Allocate on a specific device (stream-ordered)
cudaStream_t stream;
cudaStreamCreate(&stream);
void* dev_ptr = data_place::device(0).allocate(1024, stream);
// ... use dev_ptr with stream ...
data_place::device(0).deallocate(dev_ptr, 1024, stream);
cudaStreamDestroy(stream);

// Allocate managed memory - stream defaults to nullptr
void* managed_ptr = data_place::managed().allocate(1024);
// ... use managed_ptr from host or device ...
data_place::managed().deallocate(managed_ptr, 1024);

Stream-ordered vs immediate allocations:

Different data places have different allocation behaviors:

Host (data_place::host()): Uses cudaMallocHost() / cudaFreeHost() - immediate, stream parameter is ignored
Managed (data_place::managed()): Uses cudaMallocManaged() / cudaFree() - immediate, stream parameter is ignored (note: cudaFree may introduce implicit synchronization)
Device (data_place::device(id)): Uses cudaMallocAsync() / cudaFreeAsync() - stream-ordered
Extensions (green contexts, etc.): Behavior depends on the extension implementation

You can query whether a place uses stream-ordered allocation with allocation_is_stream_ordered():

data_place place = data_place::device(0);
if (place.allocation_is_stream_ordered()) {
    // Allocation is stream-ordered - synchronize via the stream
    void* ptr = place.allocate(size, stream);
    myKernel<<<grid, block, 0, stream>>>(ptr);
    place.deallocate(ptr, size, stream);
    cudaStreamSynchronize(stream);
} else {
    // Allocation is immediate - stream is ignored, safe to use right away
    void* ptr = place.allocate(size);
    // ... use ptr ...
    place.deallocate(ptr, size);
}

This abstraction is particularly useful when writing generic code that needs to work with different types of places, including custom place extensions.

VMM-based allocation with mem_create#

For advanced use cases involving CUDA’s Virtual Memory Management (VMM) API, data_place also provides the mem_create() method. This is a lower-level interface used internally by localized arrays (composite_slice) to create physical memory segments that are then mapped into a contiguous virtual address space.

Unlike allocate(), which returns a usable pointer directly, mem_create() returns a CUmemGenericAllocationHandle that must be subsequently mapped with cuMemMap() before use:

#include <cuda/experimental/places.cuh>
using namespace cuda::experimental::places;

// Create a physical memory handle for device 0
CUmemGenericAllocationHandle handle;
data_place::device(0).mem_create(&handle, size);

// The handle must be mapped to a virtual address before use
// (see CUDA VMM documentation for cuMemMap, cuMemSetAccess, etc.)

When to use each method:

Use allocate() for most cases - it provides ready-to-use memory with stream-ordered semantics where applicable.
Use mem_create() only when you need explicit control over virtual memory mapping, such as creating localized arrays that span multiple devices with a unified virtual address space.

Limitations of mem_create:

Only supports device memory and host memory (pinned)
Managed memory is not supported by the VMM API
The returned handle requires additional VMM API calls to be usable

Custom place extensions can override mem_create() to provide specialized VMM allocation behavior (e.g., memory localization for hardware partitions).

Grid of places #

It is possible to manipulate places which are a collection of multiple places. In particular, it is possible to define an execution place which corresponds to multiple device execution places.

Creating grids of places#

A grid of execution places is an exec_place that contains multiple underlying places. Grids are created with the make_grid free function:

// Create a 1D grid from a vector of places
exec_place grid = make_grid(std::vector<exec_place>{
    exec_place::device(0), exec_place::device(1)
});

The exec_place::all_devices() helper creates a grid of all available CUDA devices:

exec_place all = exec_place::all_devices();

Similarly, exec_place::n_devices(n) creates a grid from the first n devices:

exec_place first_four = exec_place::n_devices(4);

It is possible to retrieve the total number of elements in a grid using the size() method, and individual places with get_place(i):

exec_place grid = exec_place::all_devices();
for (size_t i = 0; i < grid.size(); i++) {
    exec_place dev = grid.get_place(i);
    // ...
}

Shaped grids#

Grids of places need not be 1D arrays. They can be structured as a multi-dimensional grid described with a dim4 class by passing it to make_grid or n_devices:

// Create a shaped grid: 8 devices arranged as a 2x2x2 cube
exec_place cube = exec_place::n_devices(8, dim4(2, 2, 2));

// Or from an explicit vector
exec_place shaped = make_grid(my_places, dim4(4, 2));

Note that the total size of the dim4 must match the number of places.

It is possible to query the shape of the grid using get_dims(), which returns a dim4 object. Individual places can be accessed by multi-dimensional position using get_place(pos4).

Partitioning grids#

The place_partition class partitions an execution place at a given granularity. This is useful for splitting a multi-device grid into its constituent devices, or for partitioning a device into green contexts or CUDA streams.

The partitioning granularity is specified by place_partition_scope:

place_partition_scope::cuda_device – partition into individual devices
place_partition_scope::green_context – partition into green contexts (CUDA 12.4+)
place_partition_scope::cuda_stream – partition into CUDA streams

exec_place grid = exec_place::all_devices();

// Partition into individual devices
place_partition devices(grid, place_partition_scope::cuda_device);
for (auto& dev : devices) {
    // dev is an exec_place for a single device
}

// Convert back to an exec_place grid
exec_place new_grid = devices.to_exec_place();

The exec_place::partition_by_scope() method provides a shorthand that returns a new exec_place grid directly:

exec_place grid = exec_place::all_devices();
exec_place by_device = grid.partition_by_scope(place_partition_scope::cuda_device);

Data partitioning policies#

When using a grid of places with CUDASTF constructs such as parallel_for, data partitioning policies express how data and index spaces are dispatched over the different places of a grid.

class MyPartition : public partitioner_base {
public:
    template <typename S_out, typename S_in>
    static const S_out apply(const S_in& in, pos4 position, dim4 grid_dims);

    void get_executor(pos4* result, pos4 data_coords, dim4 data_dims, dim4 grid_dims);
};

A partitioning class must implement an apply method which takes:

a reference to a shape of type S_in
a position within a grid of execution places, described using an object of type pos4
the dimension of this grid expressed as a dim4 object

apply returns a shape which corresponds to the subset of the in shape associated to this entry of the grid. Note that the output shape type S_out may be different from the S_in type of the input shape.

To support different types of shapes, appropriate overloads of the apply method should be implemented.

This apply method is typically used by the parallel_for construct in order to dispatch indices over the different places.

A partitioning class must also implement the get_executor virtual method which allows localized data allocators. This method indicates, for each entry of a shape, on which place this entry should preferably be allocated.

get_executor writes a pos4 coordinate in the execution place grid into *result, and its input arguments are:

a coordinate within the shape described as a pos4 object
the dimension of the shape expressed as a dim4 object
the dimension of the execution place grid expressed as a dim4 object

Defining the get_executor makes it possible to map a piece of data over an execution place grid. The get_executor method of a partitioning policy in an execution place grid therefore defines the affine data place of a logical data accessed on that grid.

Predefined partitioning policies#

There are currently two policies readily available:

tiled_partition<TILE_SIZE> dispatches entries of a shape using a tiled layout. For multi-dimensional shapes, the outermost dimension is dispatched into contiguous tiles of size TILE_SIZE.
blocked_partition dispatches entries of the shape using a blocked layout, where each entry of the grid of places receives approximately the same contiguous portion of the shape, dispatched along the outermost dimension.

This illustrates how a 2D shape is dispatched over 3 places using the blocked layout:

 __________________________________
|           |           |         |
|           |           |         |
|           |           |         |
|    P 0    |    P 1    |   P 2   |
|           |           |         |
|           |           |         |
|___________|___________|_________|

This illustrates how a 2D shape is dispatched over 3 places using a tiled layout, where the dimension of the tiles is indicated by the TILE_SIZE parameter:

 ________________________________
|     |     |     |     |     |  |
|     |     |     |     |     |  |
|     |     |     |     |     |  |
| P 0 | P 1 | P 2 | P 0 | P 1 |P2|
|     |     |     |     |     |  |
|     |     |     |     |     |  |
|_____|_____|_____|_____|_____|__|