cuda.core 0.X.Y Release Notes
Released on TBD
Highlights
Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.
Breaking Changes
CUDA 11 support dropped: CUDA 11 support is no longer tested and it may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.
LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
When Buffer is closed, Buffer.handle is now set to None. It was previously set to 0 by accident.
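The LaunchConfig grid arithmetic described above can be sketched in plain Python. This is only an illustration of the unit conversion (clusters to blocks, per dimension); `grid_in_blocks` and `_as_dims` are hypothetical helpers that do not exist in cuda.core, and the padding of missing dimensions with 1 is an assumption made for the sketch.

```python
# Hypothetical helpers, not part of cuda.core: they only mirror the
# clusters-to-blocks arithmetic described in the release note.
from math import prod


def _as_dims(value):
    """Normalize an int or a 1-3 element tuple to a 3-tuple, padding with 1."""
    dims = (value,) if isinstance(value, int) else tuple(value)
    if not 1 <= len(dims) <= 3:
        raise ValueError("expected 1 to 3 dimensions")
    return dims + (1,) * (3 - len(dims))


def grid_in_blocks(grid, cluster=None):
    """Convert a grid given in clusters to a grid given in blocks.

    Without a cluster, the grid is already expressed in blocks.
    """
    grid = _as_dims(grid)
    if cluster is None:
        return grid
    cluster = _as_dims(cluster)
    # Each cluster contributes cluster[i] blocks along dimension i.
    return tuple(g * c for g, c in zip(grid, cluster))


blocks = grid_in_blocks(grid=4, cluster=2)
print(blocks, prod(blocks))  # (8, 1, 1) 8
```

With grid=4 and cluster=2 the sketch yields 8 blocks in total, which matches the 4 clusters × 2 blocks/cluster figure quoted in the note.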
New features
Added Device.arch property that returns the compute capability as a string (e.g., '75' for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
CUDA 13.x testing support through new test-cu13 dependency group.
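For reference, the manual concatenation that Device.arch replaces might look like the sketch below. The function name `arch_from_compute_capability` is hypothetical, and the input is assumed to be a plain (major, minor) tuple; how Device.arch is actually implemented is not stated in these notes.

```python
def arch_from_compute_capability(cc):
    """Join a (major, minor) compute capability pair into a string.

    Hypothetical stand-in for the manual concatenation; not a cuda.core API.
    """
    major, minor = cc
    return f"{major}{minor}"


print(arch_from_compute_capability((7, 5)))  # 75
```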
New examples
None.
Fixes and enhancements
Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.