`cuda.core` 0.X.Y Release Notes

Released on TBD

Highlights

This is the last release that officially supports Python 3.9.
Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.

Breaking Changes

CUDA 11 support dropped: CUDA 11 is no longer tested, and cuda.core may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.
Support for cuda-bindings (and cuda-python) < 12.6.2 is dropped. Internally, cuda.core now always requires the new binding module layout. As per the cuda-bindings support policy, CUDA 12 users are encouraged to use the latest cuda-bindings 12.9.x, which is backward-compatible with all CUDA Toolkit 12.y.
LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
When Buffer is closed, Buffer.handle is now set to None. It was previously set to 0 by accident.
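The corrected grid interpretation described above comes down to simple arithmetic, sketched here in plain Python. This sketch does not call cuda.core: `total_blocks` and `_as_dim3` are hypothetical helpers introduced only for illustration, and the per-dimension multiplication for multi-dimensional grids is an assumption that extends the one-dimensional example in the note.

```python
# Illustrative arithmetic only; nothing here calls cuda.core.
# total_blocks() and _as_dim3() are hypothetical helpers, not library API.


def _as_dim3(value):
    """Normalize an int or a tuple of up to three ints to a 3-tuple."""
    dims = (value,) if isinstance(value, int) else tuple(value)
    if not 1 <= len(dims) <= 3:
        raise ValueError("expected 1 to 3 dimensions")
    return dims + (1,) * (3 - len(dims))


def total_blocks(grid, cluster=None):
    """Return the per-dimension block count implied by grid and cluster.

    With a cluster, grid counts clusters, so the block count is
    grid * cluster in each dimension (assumed to apply per dimension).
    Without a cluster, grid already counts blocks.
    """
    grid3 = _as_dim3(grid)
    if cluster is None:
        return grid3
    cluster3 = _as_dim3(cluster)
    return tuple(g * c for g, c in zip(grid3, cluster3))


blocks = total_blocks(grid=4, cluster=2)
print(blocks)  # (8, 1, 1): 4 clusters x 2 blocks per cluster
```

Code that previously passed a block count as `grid` together with `cluster` would, under the new interpretation, need to divide that count by the cluster size to launch the same number of blocks.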

New features

Added Device.arch property that returns the compute capability as a string (e.g., "75" for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
Added CUDA 13.x testing support through the new test-cu13 dependency group.
Stream-ordered memory allocation can now be shared on Linux via DeviceMemoryResource.
Added NVVM IR support to Program. NVVM IR is now understood with code_type="nvvm".
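For reference, the manual concatenation that Device.arch is meant to replace might look like the following sketch. It runs without a GPU: `arch_string` is a hypothetical helper, and the `(major, minor)` tuple is a stand-in for the compute capability a device would report.

```python
# Stand-in for the manual concatenation that Device.arch replaces.
# arch_string() is a hypothetical helper, not part of cuda.core.


def arch_string(compute_capability):
    """Join a (major, minor) compute capability into a string such as "75"."""
    major, minor = compute_capability
    return f"{major}{minor}"


print(arch_string((7, 5)))  # 75
print(arch_string((9, 0)))  # 90
```

On a machine with a GPU, the new property should make a helper like this unnecessary, since the string can be read directly from the device object.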

New examples

None.

Fixes and enhancements

Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed the LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.
Enabled MemoryResource subclasses to accept Device objects, in addition to previously supported device ordinals.
Fixed a bug in Stream and other classes where object cleanup would error during interpreter shutdown.
A StridedMemoryView of an underlying array that uses the DLPack protocol no longer leaks memory.
General performance improvements.
Fixed incorrect index usage in the vector_add example.