cuda.core 0.4.0 Release Notes#

Released on Oct 9, 2025

Highlights#

  • This is the last release that officially supports Python 3.9.

  • Python 3.14 is supported.

  • Experimental free-threaded builds for Python 3.13/3.14 are now available. Please report any bugs to our GitHub repo.

Breaking Changes#

  • CUDA 11 support dropped: CUDA 11 is no longer tested, and cuda.core may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.

  • Support for cuda-bindings (and cuda-python) < 12.6.2 is dropped. Internally, cuda.core now always requires the new binding module layout. As per the cuda-bindings support policy, CUDA 12 users are encouraged to use the latest cuda-bindings 12.9.x, which is backward-compatible with all CUDA Toolkit 12.y releases.

  • Change in LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)). See the sketch following this list for an illustration.

  • Buffer objects now deallocate on the stream that was used to allocate them, instead of on the default stream. Users who need a different deallocation stream can override it explicitly by passing a stream to the close() method. Establishing proper stream ordering is the user's responsibility.
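
    A minimal sketch illustrating the two changes above (the cluster-aware grid interpretation and stream-ordered Buffer deallocation). It assumes the cuda.core.experimental helpers Device.create_stream(), Device.allocate(), Stream.wait(), and Stream.sync(), plus a cluster-capable GPU (compute capability 9.0+) for the cluster configuration; treat it as illustrative rather than a definitive usage pattern.

    ```python
    # Illustrative sketch only; see the lead-in above for assumed APIs.
    from cuda.core.experimental import Device, LaunchConfig

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # With cluster set, grid now counts clusters rather than blocks:
    # 4 clusters x 2 blocks/cluster = 8 total blocks of 32 threads each,
    # matching cudax::make_hierarchy(grid_dims(4), cluster_dims(2), block_dims(32)).
    config = LaunchConfig(grid=4, cluster=2, block=32)

    # Buffers now deallocate on the stream they were allocated on by default.
    buf = dev.allocate(1024, stream=stream)

    # To deallocate on a different stream, pass it to close() explicitly;
    # establishing ordering between the two streams is the user's responsibility.
    dealloc_stream = dev.create_stream()
    dealloc_stream.wait(stream)   # assumed Stream.wait() for stream ordering
    buf.close(dealloc_stream)
    dealloc_stream.sync()
    ```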

New features#

  • Added a Device.arch property that returns the compute capability as a string (e.g., ‘75’ for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple. See the sketch following this list for an example.

  • CUDA 13.x testing support through new test-cu13 dependency group.

  • Stream-ordered memory allocation can now be shared on Linux via DeviceMemoryResource.

  • Added NVVM IR support to Program: NVVM IR input is now accepted with code_type="nvvm".

  • Added an ObjectCode.code_type attribute for querying the code type.

  • Added VirtualMemoryResource for low-level virtual memory management on Linux.
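
    A short sketch of Device.arch and ObjectCode.code_type, assuming the cuda.core.experimental Program/ProgramOptions API (including ProgramOptions(arch=...)) and an available NVRTC; the NVVM path is analogous, passing NVVM IR text with code_type="nvvm". This is an illustration, not a canonical recipe.

    ```python
    # Illustrative sketch only; see the lead-in above for assumed APIs.
    from cuda.core.experimental import Device, Program, ProgramOptions

    dev = Device()
    dev.set_current()

    # Device.arch is the compute capability as a string, e.g. "75" for CC 7.5,
    # so there is no need to concatenate the compute capability tuple manually.
    print(dev.compute_capability)   # (major, minor) pair, e.g. 7 and 5
    print(dev.arch)                 # e.g. "75"

    code = r"""
    extern "C" __global__ void noop() {}
    """

    # code_type="nvvm" is now also accepted, with NVVM IR text as input.
    prog = Program(code, code_type="c++",
                   options=ProgramOptions(arch=f"sm_{dev.arch}"))
    mod = prog.compile("cubin")

    # ObjectCode.code_type reports which kind of code the object holds.
    print(mod.code_type)            # e.g. "cubin"
    ```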

New examples#

None.

Fixes and enhancements#

  • Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).

  • Improved StridedMemoryView creation performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).

  • Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).

  • Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).

  • Improved Buffer creation performance.

  • Enabled MemoryResource subclasses to accept Device objects, in addition to the previously supported device ordinals (see the sketch following this list).

  • Fixed a bug in Stream and other classes where object cleanup would error during interpreter shutdown.

  • A StridedMemoryView created from an underlying array via the DLPack protocol no longer leaks memory.

  • General performance improvement.

  • Fixed incorrect index usage in the vector_add example.
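
    A minimal sketch of the MemoryResource enhancement above, assuming cuda.core.experimental's DeviceMemoryResource, its allocate(size, stream=...) method, and the Device.device_id and Device.create_stream() helpers; both a device ordinal and a Device object should now be accepted by the constructor.

    ```python
    # Illustrative sketch only; see the lead-in above for assumed APIs.
    from cuda.core.experimental import Device, DeviceMemoryResource

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Previously only a device ordinal was accepted; a Device object now works too.
    mr_from_ordinal = DeviceMemoryResource(dev.device_id)
    mr_from_device = DeviceMemoryResource(dev)

    buf = mr_from_device.allocate(4096, stream=stream)
    # ... use the buffer in work enqueued on `stream` ...
    buf.close()        # deallocates on the allocation stream by default
    stream.sync()
    ```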