CUDA Tile support#

CUDA Tile introduces a new way to program GPUs at a higher level than SIMT.

Restrictions#

With the compiler taking more control over memory and threading there are a number of restrictions in a tile program:

C++ Concurrency support#

Currently the use of inline ptx / assembly is not allowed in a tile program. All of our threading features rely on inline assembly in some capacity. Consequently, the following headers are not supported in tile mode:

  • <cuda/atomic>

  • <cuda/barrier>

  • <cuda/latch>

  • <cuda/pipeline>

  • <cuda/semaphore>

  • <cuda/std/atomic>

  • <cuda/std/barrier>

  • <cuda/std/execution>

  • <cuda/std/latch>

  • <cuda/std/semaphore>

This also affects

  • <cuda/cmath>

  • <cuda/discard_memory>

  • <cuda/ptx>

C++ mathematical operations#

We rely heavily on compiler builtins or cuda runtime functions to implement C++ standard math functions such as cuda::std::exp. Those compiler builtins are not currently supported in tile mode, so the following header is mostly unsupported in a tile kernel:

  • <cuda/std/cmath>

cuda::std::complex is supported except for the various math function overloads that are specialized for complex.

C++ customization point objects#

The standard library uses __ Customization Point Objects __ to enable user-customization of the behavior of many algorithms and ranges. We rely heavily on those for most of our iterator machinery such as e.g cuda::std::begin.

Those CPOs are currently not accessible in a tile kernel. A potential workaround is to construct an instance of the empty type, e.g. decltype(cuda::std::begin){}(container);

C++ return statements in loops and switches#

Tile currently does not support return statements inside of a switch or a loop. We can work around most places but not all. This mainly affects algorithms and arch traits.

In <cuda/std/algorithm> the following algorithms are not supported in a tile kernel

  • cuda::std::equal_range

  • cuda::std::find_end

  • cuda::std::find_first_of

  • cuda::std::partition

  • cuda::std::search

  • cuda::std::search_n

Besides that the content of the following headers is unsupported in a tile program

  • <cuda/std/devices>

CUDA device intrinsics#

In tile mode the compiler handles threads, warps and blocks. Consequently, the access of CUDA device intrinsics such as threadIdx is currently not allowed in a tile program. Therefore the following headers are not supported in tile mode:

  • <cuda/access_property>

  • <cuda/annotated_ptr>

  • <cuda/discard_memory>

  • <cuda/hierarchy>

  • <cuda/ptx>

CUDA extended floating point types#

Tile programs treat the CUDA extended floating point types as compiler builtin types. This disallows accessing their internals which we require internally. Support for extended floating point types such as __half, __nv_bfloat16 is only partial in libcu++. Support for extended floating point types of size 8, 6 and 4-bit is disabled.

Taking the address of a function#

It is currently not supported to take the address of a function in a tile program. This affects our memory resource machinery, so the following headers are unsupported in tile mode:

  • <cuda/memory>

  • <cuda/memory_resource>


Standard C++ Tile Availability Summary#

Library

Libcu++

Supported since

Notes

<algorithm>

<cuda/std/algorithm>

🟨 3.4

Partial support. Some algorithms rely on return statements inside of loops

<array>

<cuda/std/array>

✅ 3.4

<atomic>

<cuda/std/atomic>

Requires ptx

<barrier>

<cuda/std/barrier>

Requires ptx

<bit>

<cuda/std/bit>

✅ 3.4

<bitset>

<cuda/std/bitset>

✅ 3.4

<cassert>

<cuda/std/cassert>

✅ 3.4

<ccomplex>

<cuda/std/ccomplex>

🟨 3.4

<cfloat>

<cuda/std/cfloat>

✅ 3.4

<charconv>

<cuda/std/charconv>

✅ 3.4

<chrono>

<cuda/std/chrono>

🟨 3.4

Partial support, requires ptx for clock timers

<climits>

<cuda/std/climits>

✅ 3.4

<cmath>

<cuda/std/cmath>

Requires unimplemented compiler builtins for math functions.

<complex>

<cuda/std/complex>

🟨 3.4

Partial support, requires unimplemented compiler builtins for math functions

<concepts>

<cuda/std/concepts>

✅ 3.4

<cstddef>

<cuda/std/cstddef>

✅ 3.4

<cstdint>

<cuda/std/cstdint>

✅ 3.4

<cstdlib>

<cuda/std/cstdlib>

✅ 3.4

<cstring>

<cuda/std/cstring>

✅ 3.4

<ctime>

<cuda/std/ctime>

Requires ptx for clock timers

<execution>

<cuda/std/execution>

Requires taking the address of member functions

<expected>

<cuda/std/expected>

✅ 3.4

<functional>

<cuda/std/functional>

🟨 3.4

Partial support. Tile C++ disallows taking the address of functions, so mem_fn et al are not available

<initializer_list>

<cuda/std/initializer_list>

✅ 3.4

<inplace_vector>

<cuda/std/inplace_vector>

✅ 3.4

<iterator>

<cuda/std/iterator>

🟨 3.4

Partial support. We rely heavily on CPO’s. This affects e.g cuda::std::begin

<latch>

<cuda/std/latch>

Requires ptx

<limits>

<cuda/std/limits>

✅ 3.4

<linalg>

<cuda/std/linalg>

✅ 3.4

Accessors, transposed layout, and related functions

<mdspan>

<cuda/std/mdspan>

✅ 3.4

<memory>

<cuda/std/memory>

✅ 3.4

cuda::std::addressof, cuda::std::align, cuda::std::assume_aligned

<numbers>

<cuda/std/numbers>

🟨 3.4

Partial support. double on windows is not supported. Extended floating point types are not supported.

<numeric>

<cuda/std/numeric>

✅ 3.4

<optional>

<cuda/std/optional>

✅ 3.4

<ranges>

<cuda/std/ranges>

🟨 3.4

Partial support. We rely heavily on CPO’s. This affects e.g cuda::std::ranges::begin

<ratio>

<cuda/std/ratio>

✅ 3.4

<random>

<cuda/std/random>

🟨 3.4

Partial support. seed_seq relies on dynamic memory allocations, so it is not available

<semaphore>

<cuda/std/semaphore>

Requires ptx

<source_location>

<cuda/std/source_location>

Requires compiler support

<span>

<cuda/std/span>

✅ 3.4

<tuple>

<cuda/std/tuple>

✅ 3.4

<type_traits>

<cuda/std/type_traits>

✅ 3.4

<utility>

<cuda/std/utility>

✅ 3.4

<variant>

<cuda/std/variant>

✅ 3.4

<version>

<cuda/std/version>

✅ 3.4