# CCCL 2.x → CCCL 3.0 migration guide
The CCCL team plans breaking changes carefully and only makes them in major releases. The CCCL 2.8 release deprecated many APIs to prepare for the breaking changes made in CCCL 3.0. This page summarizes those changes and helps you migrate from CCCL 2.x to CCCL 3.0.
See also the list of all APIs deprecated in CCCL 2.8 and the list of breaking changes in CCCL 3.0.
## CUDA Toolkit changes

CCCL is moving to its own include directory within the CUDA Toolkit. This may cause build failures and some initial confusion. This section offers suggestions and mitigations to help keep builds working across both CUDA 12 and future releases.
The CTK-provided include directories are changing as follows:
| Before CUDA 13.0 | After CUDA 13.0 |
|---|---|
| `${CTK_ROOT}/include/cuda/` | `${CTK_ROOT}/include/cccl/cuda/` |
| `${CTK_ROOT}/include/cub/` | `${CTK_ROOT}/include/cccl/cub/` |
| `${CTK_ROOT}/include/thrust/` | `${CTK_ROOT}/include/cccl/thrust/` |
Due to these changes, and the fact that NVCC includes its own directories by default, you may encounter errors when including CCCL headers in source files that are compiled only by the host compiler.
For example, when compiling with GCC or MSVC alone, you may see errors reporting that `<cuda/...>`, `<cub/...>`, or `<thrust/...>` headers cannot be found.
To mitigate this, several solutions are available depending on your build system:

- **Do not** prefix the missing includes with `cccl/`; this will break.
- CMake: link `CCCL::CCCL` to your target.
  - Example: `target_link_libraries(${MY_TARGET} PRIVATE CCCL::CCCL)`
- Non-CMake (Make/other): directly add the CUDA Toolkit's CCCL directory to your include path.
  - Example: add `-I${CTK_ROOT}/include/cccl` to your compiler flags
- Use a non-bundled CCCL. CCCL is available and maintained independently of the CTK.
  - See here for compatibility.
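For CMake-based builds, the fix can be sketched as follows. This is a hypothetical `CMakeLists.txt` fragment; the project and target names are placeholders, and how `find_package(CCCL)` resolves (CTK-bundled vs. standalone CCCL) depends on your setup:

```cmake
# Minimal sketch: linking CCCL::CCCL makes the <cuda/...>, <cub/...>, and
# <thrust/...> headers visible to host-only translation units as well.
cmake_minimum_required(VERSION 3.21)
project(my_project LANGUAGES CXX CUDA)

find_package(CUDAToolkit REQUIRED)  # CTK libraries (cudart, etc.)
find_package(CCCL REQUIRED)         # provides the CCCL::CCCL target

add_executable(my_target main.cpp kernels.cu)
target_link_libraries(my_target PRIVATE CCCL::CCCL CUDA::cudart)
```

With this in place, no `-I` flags for the CCCL include directory need to be maintained by hand.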
## Removed macros

- `CUB_IS_INT128_ENABLED`: No replacement
- `CUB_MAX(a, b)`: Use the `cuda::std::max(a, b)` function instead
- `CUB_MIN(a, b)`: Use the `cuda::std::min(a, b)` function instead
- `CUB_QUOTIENT_CEILING(a, b)`: Use `cuda::ceil_div(a, b)` instead
- `CUB_QUOTIENT_FLOOR(a, b)`: Use plain integer division `a / b` instead
- `CUB_ROUND_DOWN_NEAREST(a, b)`: Use `cuda::round_down(a, b)` instead
- `CUB_ROUND_UP_NEAREST(a, b)`: Use `cuda::round_up(a, b)` instead
- `CUB_RUNTIME_ENABLED`: No replacement
- `CUB_USE_COOPERATIVE_GROUPS`: No replacement
- `CUDA_CUB_RET_IF_FAIL`: No replacement
- `[THRUST|CUB]_CLANG_VERSION`: No replacement
- `[THRUST|CUB]_DEVICE_COMPILER*`: No replacement
- `[THRUST|CUB]_GCC_VERSION`: No replacement
- `[THRUST|CUB]_HOST_COMPILER*`: No replacement
- `[THRUST|CUB]_INCLUDE_DEVICE_CODE`: No replacement
- `[THRUST|CUB]_INCLUDE_HOST_CODE`: No replacement
- `[THRUST|CUB]_IS_DEVICE_CODE`: No replacement
- `[THRUST|CUB]_IS_HOST_CODE`: No replacement
- `[THRUST|CUB]_MSVC_VERSION_FULL`: No replacement
- `[THRUST|CUB]_MSVC_VERSION`: No replacement
- `THRUST_CDP_DISPATCH`: No replacement (support for CUDA Dynamic Parallelism V1 (CDPv1) has been removed, see below)
- `THRUST_DECLTYPE_RETURNS_WITH_SFINAE_CONDITION`: No replacement
- `THRUST_DECLTYPE_RETURNS`: No replacement
- `THRUST_DEVICE_CODE`: No replacement
- `THRUST_HOST_BACKEND`: Use `THRUST_HOST_SYSTEM` instead
- `THRUST_INLINE_CONSTANT`: Use `inline constexpr` instead
- `THRUST_INLINE_INTEGRAL_MEMBER_CONSTANT`: Use `static constexpr` instead
- `THRUST_LEGACY_GCC`: No replacement
- `THRUST_MODERN_GCC_REQUIRED_NO_ERROR`: No replacement
- `THRUST_MODERN_GCC`: No replacement
- `THRUST_MVCAP`: No replacement
- `THRUST_NODISCARD`: Use `[[nodiscard]]` instead
- `THRUST_RETOF1`: No replacement
- `THRUST_RETOF2`: No replacement
- `THRUST_RETOF`: No replacement
- `THRUST_TUNING_ARCH`: No direct replacement. Use the compiler-specific `__CUDA_ARCH__` (nvcc) or `__NVCOMPILER_CUDA_ARCH__` (nvc++) instead
## Removed functions and classes

- `_ReadWriteBarrier` and `__thrust_compiler_fence`: Use `cuda::atomic` instead
- `cub::*Kernel`: Any CUB kernel entry point is considered an implementation detail. No public exposure is provided.
- `cub::Agent*`: CUB agents were considered implementation details and have all been moved to internal namespaces. No public exposure is provided.
- `cub::AliasTemporaries`: No replacement
- `cub::ArrayWrapper`: Use `cuda::std::array` instead
- `cub::BAR`: No replacement
- `cub::BaseTraits::CATEGORY`: Use the facilities from `<cuda/std/type_traits>` instead
- `cub::BaseTraits::NULL_TYPE`: No replacement
- `cub::BaseTraits::PRIMITIVE`: Use the facilities from `<cuda/std/type_traits>` instead
- `cub::BFI`: Use `cuda::bitfield_insert` instead
- `cub::BinaryOpHasIdxParam::HAS_PARAM`: Use `cub::BinaryOpHasIdxParam::value` instead
- `cub::ConstantInputIterator`: Use `thrust::constant_iterator` instead
- `cub::CountingInputIterator`: Use `thrust::counting_iterator` instead
- `cub::CTA_SYNC_AND`: Use `__syncthreads_and()` instead
- `cub::CTA_SYNC_OR`: Use `__syncthreads_or()` instead
- `cub::CTA_SYNC`: Use `__syncthreads()` instead
- `cub::Device*Policy`: These policy hubs are considered implementation details. No public exposure is provided.
- `cub::DeviceSpmv`: Use cuSPARSE instead
- `cub::Difference`: Use `cuda::std::minus` instead
- `cub::DivideAndRoundUp`: Use `cuda::ceil_div` instead
- `cub::Division`: Use `cuda::std::divides` instead
- `cub::Equality`: Use `cuda::std::equal_to` instead
- `cub::FFMA_RZ`: No replacement
- `cub::FMUL_RZ`: No replacement
- `cub::FpLimits<T>`: Use `cuda::std::numeric_limits<T>` instead
- `cub::GridBarrier`: Use the APIs from cooperative groups instead
- `cub::GridBarrierLifetime`: Use the APIs from cooperative groups instead
- `cub::IADD3`: No replacement
- `cub::Inequality`: Use `cuda::std::not_equal_to` instead
- `cub::Int2Type`: Use `cuda::std::integral_constant` instead
- `cub::IterateThreadLoad`: No replacement
- `cub::IterateThreadStore`: No replacement
- `cub::KernelConfig`: No replacement
- `cub::LaneId()`: Use `cuda::ptx::get_sreg_laneid()` instead
- `cub::LaneMaskGe()`: Use `cuda::ptx::get_sreg_lanemask_ge()` instead
- `cub::LaneMaskGt()`: Use `cuda::ptx::get_sreg_lanemask_gt()` instead
- `cub::LaneMaskLe()`: Use `cuda::ptx::get_sreg_lanemask_le()` instead
- `cub::LaneMaskLt()`: Use `cuda::ptx::get_sreg_lanemask_lt()` instead
- `cub::MakePolicyWrapper`: No replacement
- `cub::Max`: Use `cuda::maximum` instead
- `cub::max`: Use `cuda::std::max` instead
- `cub::MemBoundScaling`: No replacement
- `cub::Min`: Use `cuda::minimum` instead
- `cub::min`: Use `cuda::std::min` instead
- `cub::Mutex`: Use `std::mutex` instead
- `cub::PolicyWrapper`: No replacement
- `cub::PRMT`: Use `cuda::ptx::prmt()` instead
- `cub::RegBoundScaling`: No replacement
- `cub::SHFL_IDX_SYNC`: Use `__shfl_sync()` instead
- `cub::SHL_ADD`: No replacement
- `cub::SHR_ADD`: No replacement
- `cub::Sum`: Use `cuda::std::plus` instead
- `cub::Swap(a, b)`: Use `cuda::std::swap(a, b)` instead
- `cub::ThreadTrap()`: Use `cuda::std::terminate()` instead
- `cub::TransformInputIterator`: Use `thrust::transform_iterator` instead
- `cub::TripleChevronFactory`: No replacement for now; we are working on a new kernel launch facility
- `cub::ValueCache`: No replacement
- `cub::WARP_ALL`: Use `__all_sync()` instead
- `cub::WARP_ANY`: Use `__any_sync()` instead
- `cub::WARP_BALLOT`: Use `__ballot_sync()` instead
- `cub::WARP_SYNC`: Use `__syncwarp()` instead
- `cub::WarpId()`: Use `cuda::ptx::get_sreg_warpid()` instead
- `thrust::*::[first_argument_type|second_argument_type|result_type]`: The nested aliases have been removed for all function object types: `thrust::[plus|minus|multiplies|divides|modulus|negate|square|equal_to|not_equal_to|greater|less|greater_equal|less_equal|logical_and|logical_or|logical_not|bit_and|bit_or|bit_xor|identity|maximum|minimum|project1st|project2nd]`. No replacement.
- `thrust::[unary|binary]_function`: No replacement. If you inherit from one of these types, just remove those base classes.
- `thrust::[unary|binary]_traits`: No replacement.
- `thrust::async::*`: No replacement for now. We are working on a C++26 senders implementation. To make a Thrust algorithm skip syncing, use `thrust::cuda::par_nosync` as the execution policy.
- `thrust::bidirectional_universal_iterator_tag`: No replacement
- `thrust::conjunction_value<Ts...>`: Use `cuda::std::bool_constant<(Ts && ...)>` instead
- `thrust::conjunction_value_v<Ts...>`: Use a fold expression `Ts && ...` instead
- `thrust::cuda_cub::core::*`: These are considered implementation details. No public exposure is provided.
- `thrust::cuda_cub::counting_iterator_t`: Use `thrust::counting_iterator` instead
- `thrust::cuda_cub::identity`: Use `cuda::std::identity` instead
- `thrust::cuda_cub::launcher::triple_chevron`: No replacement for now; we are working on a new kernel launch facility
- `thrust::cuda_cub::terminate`: Use `cuda::std::terminate()` instead
- `thrust::cuda_cub::transform_input_iterator_t`: Use `thrust::transform_iterator` instead
- `thrust::cuda_cub::transform_pair_of_input_iterators_t`: Use a `thrust::transform_iterator` of a `thrust::zip_iterator` instead
- `thrust::disjunction_value<Ts...>`: Use `cuda::std::bool_constant<(Ts || ...)>` instead
- `thrust::disjunction_value_v<Ts...>`: Use a fold expression `Ts || ...` instead
- `thrust::forward_universal_iterator_tag`: No replacement
- `thrust::identity<T>`: Use `cuda::std::identity` instead. If `thrust::identity` was used to perform a cast to `T`, please define your own function object.
- `thrust::input_universal_iterator_tag`: No replacement
- `thrust::negation_value<T>`: Use `cuda::std::bool_constant<!T>` instead
- `thrust::negation_value_v<T>`: Use a plain negation `!T` instead
- `thrust::not[1|2]`: Use `cuda::std::not_fn` instead
- `thrust::null_type`: No replacement
- `thrust::numeric_limits<T>`: Use `cuda::std::numeric_limits<T>` instead
- `thrust::optional<T>`: Use `cuda::std::optional<T>` instead
- `thrust::output_universal_iterator_tag`: No replacement
- `thrust::random_access_universal_iterator_tag`: No replacement
- `thrust::remove_cvref[_t]`: Use `cuda::std::remove_cvref[_t]` instead
- `thrust::void_t`: Use `cuda::std::void_t` instead
## Deprecations with planned removal

- `CUB_LOG_SMEM_BANKS`: No replacement
- `CUB_LOG_WARP_THREADS`: No replacement
- `CUB_MAX_DEVICES`: No replacement
- `CUB_PREFER_CONFLICT_OVER_PADDING`: No replacement
- `CUB_PTX_LOG_SMEM_BANKS`: No replacement
- `CUB_PTX_LOG_WARP_THREADS`: No replacement
- `CUB_PTX_PREFER_CONFLICT_OVER_PADDING`: No replacement
- `CUB_PTX_SMEM_BANKS`: No replacement
- `CUB_PTX_SUBSCRIPTION_FACTOR`: No replacement
- `CUB_PTX_WARP_THREADS`: No replacement
- `CUB_SMEM_BANKS`: No replacement
- `CUB_SUBSCRIPTION_FACTOR`: No replacement
- `CUB_WARP_THREADS`: No replacement
- `THRUST_FALSE`: No replacement
- `THRUST_PREVENT_MACRO_SUBSTITUTION`: No replacement
- `THRUST_STATIC_ASSERT(expr)`: Use `static_assert(expr)` instead
- `THRUST_TRUE`: No replacement
- `THRUST_UNKNOWN`: No replacement
- `THRUST_UNUSED_VAR`: No replacement
- `cub::BFE`: Use `cuda::bitfield_extract` instead
- `cub::MergePathSearch`: No replacement
- `cub::Traits<T>::Max()`: Use `cuda::std::numeric_limits<T>::max()` instead
- `cub::Traits<T>::Min()`: Use `cuda::std::numeric_limits<T>::min()` instead
- `thrust::iterator_difference[_t]<T>`: Use `cuda::std::iterator_traits<T>::difference_type` or `cuda::std::iter_difference_t<T>` instead
- `thrust::iterator_pointer[_t]<T>`: Use `cuda::std::iterator_traits<T>::pointer` instead
- `thrust::iterator_reference[_t]<T>`: Use `cuda::std::iterator_traits<T>::reference` or `cuda::std::iter_reference_t<T>` instead
- `thrust::iterator_traits<T>`: Use `cuda::std::iterator_traits<T>` instead
- `thrust::iterator_value[_t]<T>`: Use `cuda::std::iterator_traits<T>::value_type` or `cuda::std::iter_value_t<T>` instead
## API breaks

- `cub::Block*`: All trailing `int LEGACY_PTX_ARCH` template parameters have been removed
- `cub::CachingAllocator`: The constructor taking a trailing `bool debug` parameter has been removed
- `cub::Device*`: All overloads with a trailing `bool debug_synchronous` parameter have been removed
- `cub::Dispatch*`: All Boolean template parameters have been replaced by enumerations to increase readability
- `cub::Dispatch*`: All policy hub template parameters have been moved to the back of the template parameter list
- `cub::DispatchScan[ByKey]`: The offset type must be an unsigned type of at least 4 bytes
- `cuda::ceil_div`: Now returns the common type of its arguments
- `thrust::pair`: Is now an alias to `cuda::std::pair` and no longer a distinct type
- `thrust::tabulate_output_iterator`: The `value_type` has been fixed to be `void`
- `thrust::transform_iterator`: Upon copying, will now always copy its contained function. If the contained function is neither copy constructible nor copy assignable, the iterator fails to compile when a copy is attempted.
- `thrust::tuple`: Is now an alias to `cuda::std::tuple` and no longer a distinct type
- `thrust::universal_host_pinned_memory_resource`: The alias has changed to a different memory resource, potentially changing pointer types derived from an allocator/container using this memory resource
- The following Thrust function object types have been made aliases to the equally-named types in `cuda::std::`: `thrust::[plus|minus|multiplies|divides|modulus|negate|equal_to|not_equal_to|greater|less|greater_equal|less_equal|logical_and|logical_or|logical_not|bit_and|bit_or|bit_xor|identity|maximum|minimum]`
- `CUB_DEFINE_DETECT_NESTED_TYPE`: The generated detector trait no longer provides a `::VALUE` member. Use `::value` instead.
### Iterator traits

`cuda::std::iterator_traits` will now correctly recognize user-provided specializations of `std::iterator_traits`.
All of Thrust's iterator traits have been redefined in terms of `cuda::std::iterator_traits`,
and users should prefer the iterator traits from libcu++.
`thrust::iterator_traits` can no longer be specialized.
When a specialization is necessary, users should prefer to specialize `cuda::std::iterator_traits` rather than `std::iterator_traits`,
so that their iterators work equally well in device code.
### CUB Traits

The functionality and internal use of `cub::Traits` has been minimized, because libcu++ provides better, standard alternatives.
Only its use for bit twiddling in CUB's radix sort implementation remains.
Floating-point limits should be obtained from `cuda::std::numeric_limits<T>` instead of `cub::FpLimits<T>`.
Types should be classified using the facilities from `<cuda/std/type_traits>` and `<cuda/type_traits>`,
notably `cuda::std::is_signed[_v]`, `cuda::std::is_integral[_v]`, etc.
There is an important difference for extended floating-point types though:
since `cuda::std::is_floating_point[_v]` recognizes only C++ standard floating-point types,
`cuda::is_floating_point[_v]` must be used to correctly classify extended floating-point types like `__half` or `__nv_bfloat16`.
`cub::BaseTraits` and `cub::Traits` can no longer be specialized for custom types, and `cub::FpLimits` has been removed.
We acknowledge the need to support user-defined floating-point types though,
e.g., registering a custom half type with CUB so it can be used in radix sort.
Therefore, users can still specialize `cub::NumericTraits` for their custom floating-point types,
inheriting from `cub::BaseTraits` and providing the necessary information for the type.
Additionally, the corresponding traits from libcu++ have to be specialized as well.
For example, a custom floating-point type `my_half` could be registered with CUB and libcu++ like this:

```cpp
template <>
inline constexpr bool ::cuda::is_floating_point_v<my_half> = true;

template <>
class ::cuda::std::numeric_limits<my_half>
{
public:
  static constexpr bool is_specialized = true;
  static __host__ __device__ my_half max() { return /* TODO */; }
  static __host__ __device__ my_half min() { return /* TODO */; }
  static __host__ __device__ my_half lowest() { return /* TODO */; }
};

template <>
struct CUB_NS_QUALIFIER::NumericTraits<my_half>
    : BaseTraits<FLOATING_POINT, true, uint16_t, my_half>
{};
```
## Behavioral changes

- `cub::DeviceReduce::[Arg][Max|Min]`: Will now use `cuda::std::numeric_limits<T>::[max|min]()` instead of `cub::Traits` to determine the initial value
- `cuda::std::mdspan`: The implementation was entirely rewritten, and you may experience subtle behavioral changes
- `thrust::transform_iterator`: The logic to determine the reference type has been reworked, especially with respect to functions that return references to their own arguments (e.g., `thrust::identity`)
- `thrust::transform_iterator::difference_type`: The logic to select the difference type has been reworked. It is now either `int` or `ptrdiff_t`.
## ABI breaks

All of libcu++'s old ABI namespaces have been removed.
## Platform support

- At least C++17 is required
- At least Clang 14 is required
- At least GCC 7 is required
- On Windows, at least Visual Studio 2019 is required (`_MSC_VER >= 1920`)
- Intel ICC (`icpx`) is no longer supported
- At least CUDA Toolkit 12.0 is required
- Support for CUDA Dynamic Parallelism V1 (CDPv1) has been removed
- At least a GPU with compute capability 50 (Maxwell) is required