cuda.core 0.X.Y Release Notes

Released on TBD

Highlights

  • Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.

Breaking Changes

  • LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).

  • When Buffer is closed, Buffer.handle is now set to None. It was previously set to 0 by accident.
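
The grid arithmetic described in the first item above can be sketched in plain Python. This is only an illustration of the unit conversion, not the library's implementation: the helper names `_as_dim3` and `grid_in_blocks` are hypothetical, and per-dimension multiplication for tuple-valued dimensions is an assumption extrapolated from the one-dimensional example in these notes.

```python
from math import prod


def _as_dim3(value):
    """Normalize an int or a sequence of up to three ints to a 3-tuple (hypothetical helper)."""
    dims = (value,) if isinstance(value, int) else tuple(value)
    if not 1 <= len(dims) <= 3:
        raise ValueError("expected between one and three dimensions")
    # Pad missing trailing dimensions with 1.
    return dims + (1,) * (3 - len(dims))


def grid_in_blocks(grid, cluster=None):
    """Return the launch grid measured in blocks (hypothetical helper).

    Without a cluster, `grid` already counts blocks. With a cluster,
    `grid` counts clusters, so each dimension is scaled by the cluster
    size (assumed to apply per dimension).
    """
    grid_dims = _as_dim3(grid)
    if cluster is None:
        return grid_dims
    cluster_dims = _as_dim3(cluster)
    return tuple(g * c for g, c in zip(grid_dims, cluster_dims))


# The example from the notes: 4 clusters x 2 blocks/cluster = 8 blocks.
blocks = grid_in_blocks(4, cluster=2)
print(blocks, prod(blocks))
```

Code that previously passed a block count as `grid` together with a `cluster` would need to divide that count by the cluster size to launch the same number of blocks.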

New features

  • Added Device.arch property that returns the compute capability as a string (e.g., "75" for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.

  • Added CUDA 13.x testing support through a new test-cu13 dependency group.
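
For context on the Device.arch item above, the manual concatenation it replaces might look like the following sketch. The function name `arch_string` is hypothetical and the (major, minor) tuple is passed in directly, so the snippet runs without a GPU; it only mirrors the string format described in these notes.

```python
def arch_string(compute_capability):
    """Join a (major, minor) compute capability into a string such as "75" (hypothetical helper)."""
    major, minor = compute_capability
    return f"{major}{minor}"


# Compute capability 7.5, as in the example from the notes.
print(arch_string((7, 5)))
```

With the new property, this kind of helper should no longer be needed when a Device object is at hand.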

New examples

None.

Fixes and enhancements

  • Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).

  • Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).

  • Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).

  • Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).

  • Improved Buffer creation performance.