cuda.core 0.X.Y Release Notes
Released on TBD
Highlights
Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.
Breaking Changes
LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
When Buffer is closed, Buffer.handle is now set to None. It was previously set to 0 by accident.
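The arithmetic behind both breaking changes can be sketched in plain Python, without a GPU. The helpers below (`_as_dim3`, `total_blocks`, `is_closed_handle`) are hypothetical and not part of cuda.core. They assume that dimensions are given as an int or a tuple of up to three ints, and that a handle value can be converted with `int()`.

```python
def _as_dim3(value):
    """Normalize an int or a 1-3 element tuple to a 3-tuple, padding with 1."""
    if isinstance(value, int):
        value = (value,)
    value = tuple(value)
    if not 1 <= len(value) <= 3:
        raise ValueError("expected 1 to 3 dimensions")
    return value + (1,) * (3 - len(value))


def total_blocks(grid, cluster=None):
    """Blocks launched per dimension under the new interpretation.

    Hypothetical helper: with a cluster, grid counts clusters, so the
    block count is grid * cluster in each dimension.
    """
    grid = _as_dim3(grid)
    if cluster is None:
        return grid  # without clusters, grid already counts blocks
    cluster = _as_dim3(cluster)
    return tuple(g * c for g, c in zip(grid, cluster))


def is_closed_handle(handle):
    """Accept both the new closed sentinel (None) and the old accidental one (0).

    Assumption: a non-None handle supports int().
    """
    return handle is None or int(handle) == 0


print(total_blocks(4, cluster=2))  # (8, 1, 1)
```

Code that previously passed a block count as the grid together with a cluster would need to divide that count by the cluster size to keep launching the same number of blocks. Code that tested a closed buffer with `handle == 0` would need to test for None instead.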
New features
Added Device.arch property that returns the compute capability as a string (e.g., ‘75’ for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
CUDA 13.x testing support through new test-cu13 dependency group.
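For reference, the manual concatenation that Device.arch replaces might look like the sketch below. `arch_string` is a hypothetical stand-in, not part of cuda.core. It assumes the compute capability is available as a (major, minor) tuple.

```python
def arch_string(compute_capability):
    """Join a (major, minor) compute capability tuple into a string."""
    major, minor = compute_capability
    return f"{major}{minor}"


print(arch_string((7, 5)))  # 75
```

With the new property, callers can read the string directly from the device instead of carrying a helper like this.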
New examples
None.
Fixes and enhancements
Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation time by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.
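Why a higher release threshold helps when there are no active allocations can be illustrated with a toy model. The `ToyPool` class below is a simplified simulation written for this note, not the CUDA allocator or any cuda.core API. It assumes that a pool returns unused memory above its release threshold at each synchronization point and must grow again, the slow path, on the next allocation.

```python
class ToyPool:
    """Toy model of a memory pool with a release threshold (illustration only)."""

    def __init__(self, release_threshold=0):
        self.release_threshold = release_threshold
        self.reserved = 0  # bytes currently held by the pool
        self.used = 0  # bytes handed out to callers
        self.grow_count = 0  # number of slow-path growth requests

    def allocate(self, nbytes):
        if self.used + nbytes > self.reserved:
            # Slow path: the pool has to obtain more memory.
            self.reserved = self.used + nbytes
            self.grow_count += 1
        self.used += nbytes

    def free(self, nbytes):
        self.used -= nbytes

    def synchronize(self):
        # Release unused memory, but keep up to release_threshold bytes cached.
        if self.reserved > self.release_threshold:
            self.reserved = max(self.used, self.release_threshold)


def run(pool, iterations=5, nbytes=1024):
    """Allocate, free, and synchronize repeatedly; return slow-path count."""
    for _ in range(iterations):
        pool.allocate(nbytes)
        pool.free(nbytes)
        pool.synchronize()
    return pool.grow_count


print(run(ToyPool(release_threshold=0)))      # 5
print(run(ToyPool(release_threshold=2**20)))  # 1
```

In this model, a threshold of zero empties the pool whenever nothing is allocated, so every iteration pays for growth again, while a higher threshold keeps the memory cached after the first growth.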