`cuda.core` 0.X.Y Release Notes

Released on TBD

Highlights

Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.

LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
When Buffer is closed, Buffer.handle is now set to None. It was previously set to 0 by accident.
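The grid arithmetic from the LaunchConfig change can be sketched in plain Python. This is only an illustration: `total_blocks` and `_as_dims` are hypothetical helpers, not part of cuda.core, and the sketch assumes that scalar values are treated as one-dimensional and that the cluster-to-block conversion is applied per dimension.

```python
from math import prod


def _as_dims(value):
    """Normalize an int or a tuple of up to 3 ints to a 3-tuple, padding with 1."""
    dims = (value,) if isinstance(value, int) else tuple(value)
    if not 1 <= len(dims) <= 3:
        raise ValueError("expected 1 to 3 dimensions")
    return dims + (1,) * (3 - len(dims))


def total_blocks(grid, cluster):
    """Hypothetical helper: grid counts clusters, cluster counts blocks per cluster."""
    grid_dims = _as_dims(grid)
    cluster_dims = _as_dims(cluster)
    # Assumption: the conversion multiplies grid and cluster sizes per dimension.
    blocks_per_dim = tuple(g * c for g, c in zip(grid_dims, cluster_dims))
    return blocks_per_dim, prod(blocks_per_dim)


blocks_per_dim, n_blocks = total_blocks(grid=4, cluster=2)
print(blocks_per_dim, n_blocks)  # (8, 1, 1) 8
```

With grid=4 and cluster=2 this gives 8 blocks in total, the same count as the example in the note above. The block size (32 threads) does not enter the block count.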

Added Device.arch property that returns the compute capability as a string (e.g., ‘75’ for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
Added CUDA 13.x testing support through the new test-cu13 dependency group.
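For reference, the manual concatenation that Device.arch is meant to replace can be sketched as follows. The tuples here are plain stand-ins for a device's compute capability, and `arch_string` is a hypothetical helper, not a cuda.core API, so no GPU is needed to run it.

```python
def arch_string(compute_capability):
    """Hypothetical helper: join (major, minor) into a string such as '75'."""
    major, minor = compute_capability
    return f"{major}{minor}"


print(arch_string((7, 5)))  # 75
print(arch_string((9, 0)))  # 90
```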

None.

Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Improved Buffer creation performance.