cuda.core 0.4.0 Release Notes
Released on Oct 9, 2025
Highlights
This is the last release that officially supports Python 3.9.
Python 3.14 is supported.
Experimental free-threaded builds for Python 3.13/3.14 are now available. Please report any bugs to our GitHub repo.
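When trying the experimental free-threaded builds, it can help to confirm which kind of interpreter is actually running before filing a bug. The following is a minimal sketch using only the standard library; the helper names `is_free_threaded_build` and `gil_currently_enabled` are hypothetical and are not part of cuda.core.

```python
import sys
import sysconfig


def is_free_threaded_build() -> bool:
    """Return True if this interpreter was compiled with free-threading support."""
    # Py_GIL_DISABLED is set to 1 on free-threaded builds; it is 0 or None otherwise.
    return bool(sysconfig.get_config_var("Py_GIL_DISABLED"))


def gil_currently_enabled() -> bool:
    """Return True if the GIL is active in the running process."""
    # sys._is_gil_enabled() only exists on Python 3.13+; older versions always have a GIL.
    check = getattr(sys, "_is_gil_enabled", None)
    return True if check is None else bool(check())


print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
print("free-threaded build:", is_free_threaded_build())
print("GIL enabled at runtime:", gil_currently_enabled())
```

Including this output in a bug report makes it clear whether an issue is specific to the free-threaded configuration.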
Breaking Changes
CUDA 11 support dropped: CUDA 11 is no longer tested and it may or may not work with cuda.bindings and CTK 11.x. Users are encouraged to migrate to CUDA 12.x or 13.x.
Support for cuda-bindings (and cuda-python) < 12.6.2 is dropped. Internally, cuda.core now always requires the new binding module layout. As per the cuda-bindings support policy, CUDA 12 users are encouraged to use the latest cuda-bindings 12.9.x, which is backward-compatible with all CUDA Toolkit 12.y.
Change in LaunchConfig grid parameter interpretation: When LaunchConfig.cluster is specified, the LaunchConfig.grid parameter now correctly represents the number of clusters instead of blocks. Previously, the grid parameter was incorrectly interpreted as blocks, causing a mismatch with the expected C++ behavior. This change ensures that LaunchConfig(grid=4, cluster=2, block=32) correctly produces 4 clusters × 2 blocks/cluster = 8 total blocks, matching the C++ equivalent cudax::make_hierarchy(cudax::grid_dims(4), cudax::cluster_dims(2), cudax::block_dims(32)).
Buffer objects are now deallocated on the stream that was used to allocate them, instead of on the default stream. Users can override the deallocation stream explicitly through the close() method if desired. Establishing a proper stream order is the user's responsibility.
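The grid arithmetic described in the LaunchConfig change above can be illustrated in plain Python. This is only a sketch of the unit conversion, assuming the cluster dimensions scale the grid per dimension; `total_blocks` and `_as_3tuple` are hypothetical helpers, not cuda.core functions, and the code does not touch the GPU.

```python
from math import prod


def _as_3tuple(dim):
    """Normalize an int or a 1- to 3-element sequence to an (x, y, z) tuple."""
    if isinstance(dim, int):
        dim = (dim,)
    dim = tuple(dim)
    return dim + (1,) * (3 - len(dim))


def total_blocks(grid, cluster=None):
    """Return the launch size in blocks per dimension.

    Without a cluster, `grid` already counts blocks.
    With a cluster, `grid` counts clusters, so each dimension is
    multiplied by the number of blocks per cluster in that dimension.
    """
    g = _as_3tuple(grid)
    if cluster is None:
        return g
    c = _as_3tuple(cluster)
    return tuple(gi * ci for gi, ci in zip(g, c))


blocks = total_blocks(grid=4, cluster=2)
print(blocks, prod(blocks))  # (8, 1, 1) 8
```

Code written against the earlier interpretation, where grid was read as blocks even with a cluster set, would need its grid value divided by the cluster size to launch the same number of blocks.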
New features
Added Device.arch property that returns the compute capability as a string (e.g., ‘75’ for CC 7.5), providing a convenient alternative to manually concatenating the compute capability tuple.
CUDA 13.x testing support through new test-cu13 dependency group.
Stream-ordered memory allocation can now be shared on Linux via DeviceMemoryResource.
Added NVVM IR support to Program. NVVM IR is now understood with code_type="nvvm".
Added an ObjectCode.code_type attribute for querying the code type.
Added VirtualMemoryResource for low-level virtual memory management on Linux.
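For context on the Device.arch entry above, the manual concatenation it replaces looks roughly like the sketch below. The helper `arch_string` is hypothetical and takes a plain (major, minor) tuple so that it runs without a GPU; it is meant to show the string format from the example in the notes, not how cuda.core implements the property.

```python
def arch_string(compute_capability):
    """Join a (major, minor) compute capability tuple into a string such as '75'."""
    major, minor = compute_capability
    return f"{major}{minor}"


print(arch_string((7, 5)))  # 75
```

A string in this form is convenient when building architecture options such as `sm_75` for a compiler invocation.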
New examples
None.
Fixes and enhancements
Improved DeviceMemoryResource allocation performance when there are no active allocations by setting a higher release threshold (addresses issue #771).
Improved StridedMemoryView creation time performance by optimizing shape and strides tuple creation using the Python/C API (addresses issue #449).
Fixed LaunchConfig grid unit conversion when cluster is set (addresses issue #867).
Fixed a bug in GraphBuilder.add_child where dependencies extracted from the capturing stream were passed inconsistently with the num_dependencies parameter (addresses issue #843).
Made Buffer creation more performant.
Enabled MemoryResource subclasses to accept Device objects, in addition to previously supported device ordinals.
Fixed a bug in Stream and other classes where object cleanup would error during interpreter shutdown.
StridedMemoryView of an underlying array using the DLPack protocol will no longer leak memory.
General performance improvement.
Fixed incorrect index usage in the vector_add example.