CUDA Tile support#

CUDA Tile introduces a new way to program GPUs at a higher level than SIMT.

Restrictions#

With the compiler taking more control over memory and threading there are a number of restrictions in a tile program:

C++ Concurrency support#

Currently the use of inline ptx / assembly is not allowed in a tile program. All of our threading features rely on inline assembly in some capacity. Consequently, the following headers are not supported in tile mode:

<cuda/atomic>

<cuda/barrier>

<cuda/latch>

<cuda/pipeline>

<cuda/semaphore>

<cuda/std/atomic>

<cuda/std/barrier>

<cuda/std/execution>

<cuda/std/latch>

<cuda/std/semaphore>

This also affects

<cuda/cmath>

<cuda/discard_memory>

<cuda/ptx>

C++ mathematical operations#

We rely heavily on compiler builtins or cuda runtime functions to implement C++ standard math functions such as cuda::std::exp. Those compiler builtins are not currently supported in tile mode, so the following header is mostly unsupported in a tile kernel:

<cuda/std/cmath>

cuda::std::complex is supported except for the various math function overloads that are specialized for complex.

C++ customization point objects#

The standard library uses __ Customization Point Objects __ to enable user-customization of the behavior of many algorithms and ranges. We rely heavily on those for most of our iterator machinery such as e.g cuda::std::begin.

Those CPOs are currently not accessible in a tile kernel. A potential workaround is to construct an instance of the empty type, e.g. decltype(cuda::std::begin){}(container);

C++ return statements in loops and switches#

Tile currently does not support return statements inside of a switch or a loop. We can work around most places but not all. This mainly affects algorithms and arch traits.

In <cuda/std/algorithm> the following algorithms are not supported in a tile kernel

cuda::std::equal_range

cuda::std::find_end

cuda::std::find_first_of

cuda::std::partition

cuda::std::search

cuda::std::search_n

Besides that the content of the following headers is unsupported in a tile program

<cuda/std/devices>

CUDA device intrinsics#

In tile mode the compiler handles threads, warps and blocks. Consequently, the access of CUDA device intrinsics such as threadIdx is currently not allowed in a tile program. Therefore the following headers are not supported in tile mode:

<cuda/access_property>

<cuda/annotated_ptr>

<cuda/discard_memory>

<cuda/hierarchy>

<cuda/ptx>

CUDA extended floating point types#

Tile programs treat the CUDA extended floating point types as compiler builtin types. This disallows accessing their internals which we require internally. Support for extended floating point types such as __half, __nv_bfloat16 is only partial in libcu++. Support for extended floating point types of size 8, 6 and 4-bit is disabled.

Taking the address of a function#

It is currently not supported to take the address of a function in a tile program. This affects our memory resource machinery, so the following headers are unsupported in tile mode:

<cuda/memory>

<cuda/memory_resource>

Standard C++ Tile Availability Summary#

Library	`Libcu++`	Supported since	Notes
<algorithm>	<cuda/std/algorithm>	🟨 3.4	Partial support. Some algorithms rely on return statements inside of loops
<array>	<cuda/std/array>	✅ 3.4
<atomic>	`<cuda/std/atomic>`	❌	Requires ptx
<barrier>	`<cuda/std/barrier>`	❌	Requires ptx
<bit>	<cuda/std/bit>	✅ 3.4
<bitset>	<cuda/std/bitset>	✅ 3.4
<cassert>	`<cuda/std/cassert>`	✅ 3.4
<ccomplex>	`<cuda/std/ccomplex>`	🟨 3.4
<cfloat>	`<cuda/std/cfloat>`	✅ 3.4
<charconv>	`<cuda/std/charconv>`	✅ 3.4
<chrono>	<cuda/std/chrono>	🟨 3.4	Partial support, requires ptx for clock timers
<climits>	`<cuda/std/climits>`	✅ 3.4
<cmath>	`<cuda/std/cmath>`	❌	Requires unimplemented compiler builtins for math functions.
<complex>	<cuda/std/complex>	🟨 3.4	Partial support, requires unimplemented compiler builtins for math functions
<concepts>	<cuda/std/concepts>	✅ 3.4
<cstddef>	`<cuda/std/cstddef>`	✅ 3.4
<cstdint>	`<cuda/std/cstdint>`	✅ 3.4
<cstdlib>	`<cuda/std/cstdlib>`	✅ 3.4
<cstring>	`<cuda/std/cstring>`	✅ 3.4
<ctime>	`<cuda/std/ctime>`	❌	Requires ptx for clock timers
<execution>	<cuda/std/execution>	❌	Requires taking the address of member functions
<expected>	<cuda/std/expected>	✅ 3.4
<functional>	<cuda/std/functional>	🟨 3.4	Partial support. Tile C++ disallows taking the address of functions, so `mem_fn` et al are not available
<initializer_list>	`<cuda/std/initializer_list>`	✅ 3.4
<inplace_vector>	<cuda/std/inplace_vector>	✅ 3.4
<iterator>	`<cuda/std/iterator>`	🟨 3.4	Partial support. We rely heavily on CPO’s. This affects e.g `cuda::std::begin`
<latch>	`<cuda/std/latch>`	❌	Requires ptx
<limits>	`<cuda/std/limits>`	✅ 3.4
<linalg>	<cuda/std/linalg>	✅ 3.4	Accessors, transposed layout, and related functions
<mdspan>	<cuda/std/mdspan>	✅ 3.4
<memory>	<cuda/std/memory>	✅ 3.4	`cuda::std::addressof`, `cuda::std::align`, `cuda::std::assume_aligned`
<numbers>	<cuda/std/numbers>	🟨 3.4	Partial support. `double` on windows is not supported. Extended floating point types are not supported.
<numeric>	<cuda/std/numeric>	✅ 3.4
<optional>	<cuda/std/optional>	✅ 3.4
<ranges>	<cuda/std/ranges>	🟨 3.4	Partial support. We rely heavily on CPO’s. This affects e.g `cuda::std::ranges::begin`
<ratio>	`<cuda/std/ratio>`	✅ 3.4
<random>	<cuda/std/random>	🟨 3.4	Partial support. `seed_seq` relies on dynamic memory allocations, so it is not available
<semaphore>	`<cuda/std/semaphore>`	❌	Requires ptx
<source_location>	`<cuda/std/source_location>`	❌	Requires compiler support
<span>	<cuda/std/span>	✅ 3.4
<tuple>	<cuda/std/tuple>	✅ 3.4
<type_traits>	<cuda/std/type_traits>	✅ 3.4
<utility>	<cuda/std/utility>	✅ 3.4
<variant>	<cuda/std/variant>	✅ 3.4
<version>	`<cuda/std/version>`	✅ 3.4