cuda.core 0.4.0 Release Notes
Released on Oct 9, 2025
Highlights
This is the last release that officially supports Python 3.9.
Python 3.14 is supported.
Experimental free-threaded builds for Python 3.13/3.14 are now available. Please report any bugs to our GitHub repo.
Breaking Changes
CUDA 11 support dropped: CUDA 11 is no longer tested and it may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.
Support for cuda-bindings (and cuda-python) < 12.6.2 is dropped. Internally, cuda.core now always requires the new binding module layout. As per the cuda-bindings support policy, CUDA 12 users are encouraged to use the latest cuda-bindings 12.9.x, which is backward-compatible with all CUDA Toolkit 12.y.
Change in LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
Buffer objects are now deallocated on the stream that was used to allocate them, instead of on the default stream. Users who want a different deallocation stream can override it explicitly through the close() method. Establishing a proper stream order is the user's responsibility.
New features
Added Device.arch property that returns the compute capability as a string (e.g., '75' for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
CUDA 13.x testing support through new test-cu13 dependency group.
Stream-ordered memory allocation can now be shared on Linux via DeviceMemoryResource.
Added NVVM IR support to Program. NVVM IR is now understood with code_type="nvvm".
Added an ObjectCode.code_type attribute for querying the code type.
Added VirtualMemoryResource for low-level virtual memory management on Linux.
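For context on what Device.arch replaces, the manual concatenation it makes unnecessary can be sketched as follows. This is plain Python with no cuda.core import: the function name arch_from_compute_capability is hypothetical, and a literal (major, minor) tuple stands in for the compute capability a real device would report.

```python
def arch_from_compute_capability(cc):
    """Concatenate a (major, minor) compute capability into a string.

    Hypothetical helper showing the manual step; (7, 5) becomes "75".
    """
    major, minor = cc
    return f"{major}{minor}"


# A stand-in tuple for compute capability 7.5.
arch = arch_from_compute_capability((7, 5))
print(arch)          # 75
print(f"sm_{arch}")  # sm_75, the form commonly used for architecture flags
```

With the new property, the string is read directly from the device object instead of being assembled by hand.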
New examples
None.
Fixes and enhancements
Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.
Enabled MemoryResource subclasses to accept Device objects, in addition to previously supported device ordinals.
Fixed a bug in Stream and other classes where object cleanup would error during interpreter shutdown.
StridedMemoryView of an underlying array using the DLPack protocol will no longer leak memory.
General performance improvement.
Fixed incorrect index usage in the vector_add example.