cuda.core 0.X.Y Release Notes
Released on TBD
Highlights
This is the last release that officially supports Python 3.9.
Fix for LaunchConfig grid parameter unit conversion when thread block clusters are used.
Breaking Changes
CUDA 11 support dropped: CUDA 11 support is no longer tested and it may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.
Support for cuda-bindings (and cuda-python) < 12.6.2 is dropped. Internally, cuda.core now always requires the new binding module layout. As per the cuda-bindings support policy, CUDA 12 users are encouraged to use the latest cuda-bindings 12.9.x, which is backward-compatible with all CUDA Toolkit 12.y.
LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
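The arithmetic behind the new grid interpretation can be sketched in plain Python. The helper names below (`blocks_launched`, `_as_dim3`) are hypothetical and not part of cuda.core. The snippet only models the unit conversion described above, and assumes that cluster dimensions scale the grid dimensions axis by axis.

```python
from math import prod


def _as_dim3(value):
    """Normalize an int or a 1-3 element tuple to a 3-tuple, padding with 1."""
    dims = (value,) if isinstance(value, int) else tuple(value)
    if not 1 <= len(dims) <= 3:
        raise ValueError("expected 1 to 3 dimensions")
    return dims + (1,) * (3 - len(dims))


def blocks_launched(grid, cluster=None):
    """Hypothetical model: number of blocks per axis for a launch.

    Without a cluster, `grid` counts blocks. With a cluster, `grid` counts
    clusters, so each axis is multiplied by the cluster size on that axis.
    """
    grid3 = _as_dim3(grid)
    if cluster is None:
        return grid3
    cluster3 = _as_dim3(cluster)
    return tuple(g * c for g, c in zip(grid3, cluster3))


# grid=4 clusters, cluster=2 blocks per cluster -> 8 blocks in total
print(blocks_launched(4, cluster=2))        # (8, 1, 1)
print(prod(blocks_launched(4, cluster=2)))  # 8
```

Code written against the old behavior, where grid was taken as a block count even with a cluster set, would need its grid value divided by the cluster size to launch the same number of blocks.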
New features
Added Device.arch property that returns the compute capability as a string (e.g., '75' for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
CUDA 13.x testing support through new test-cu13 dependency group.
Stream-ordered memory allocation can now be shared on Linux via DeviceMemoryResource.
Added NVVM IR support to Program. NVVM IR is now understood with code_type="nvvm".
Added an ObjectCode.code_type attribute for querying the code type.
Added VirtualMemoryResource for low-level virtual memory management.
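As an illustration of what Device.arch saves, the manual concatenation it replaces might look like the sketch below. The function name `arch_string` is hypothetical, and the input is assumed to be a (major, minor) compute capability tuple. The snippet does not call cuda.core.

```python
def arch_string(compute_capability):
    """Hypothetical helper: join a (major, minor) tuple into a string."""
    major, minor = compute_capability
    return f"{major}{minor}"


# CC 7.5 becomes '75', the example given for Device.arch above.
print(arch_string((7, 5)))  # 75
```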
New examples
None.
Fixes and enhancements
Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation time performance by optimizing shape and strides tuple creation using Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.
Enabled MemoryResource subclasses to accept Device objects, in addition to previously supported device ordinals.
Fixed a bug in Stream and other classes where object cleanup would error during interpreter shutdown.
StridedMemoryView of an underlying array using the DLPack protocol will no longer leak memory.
General performance improvement.
Fixed incorrect index usage in the vector_add example.