cuda.core 0.5.0 Release Notes
Highlights
Added memory management support (allocation, deallocation, copy, and fill) for CUDA graphs.
Added PinnedMemoryResource and ManagedMemoryResource for advanced memory management.
Added peer access control to DeviceMemoryResource.
Reduced Python overhead and improved performance for calling launch(), constructing LaunchConfig, and accessing DeviceMemoryResource attributes.
Breaking Changes
The support for setting VirtualMemoryResourceOptions.handle_type to "win32" is removed. Please reach out to us on GitHub if you have a use case.
All public APIs accessible under the cuda.core.experimental namespace are now moved to the top-level cuda.core namespace. For example, cuda.core.experimental.Device is now accessible as cuda.core.Device. The cuda.core.experimental namespace is still retained for backward compatibility, but is considered deprecated and will be removed by cuda.core v1.0.0.
The following APIs have been deprecated and will be removed in 0.6.0:
cuda.core.experimental.system.driver_version has been replaced with cuda.core.experimental.system.get_driver_version().
cuda.core.experimental.system.num_devices has been replaced with cuda.core.experimental.system.get_num_devices().
cuda.core.experimental.system.devices has been replaced with cuda.core.experimental.Device.get_all_devices().
Other changes:
The utils.StridedMemoryView.__init__() constructor is deprecated in favor of the new from_* classmethods (see below).
Support for Python 3.9 and 3.13t is dropped.
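The namespace move described above is mostly a mechanical rename in user code. The following is a minimal sketch of a source rewrite that could help with it, not an official migration tool: the helper name `migrate_namespace` is hypothetical, and it uses only the Python standard library. It rewrites the `cuda.core.experimental` prefix only. It does not handle the deprecated `system` attributes listed above, which need to be updated by hand.

```python
import re

# Match the deprecated namespace only as a complete dotted component, so
# unrelated names such as "cuda.core.experimental_utils" are left untouched.
_OLD_NAMESPACE = re.compile(r"\bcuda\.core\.experimental\b")


def migrate_namespace(source: str) -> str:
    """Return `source` with cuda.core.experimental replaced by cuda.core.

    Hypothetical helper for illustration; it performs a plain text
    substitution and does not parse the Python code.
    """
    return _OLD_NAMESPACE.sub("cuda.core", source)


old_code = (
    "from cuda.core.experimental import Device\n"
    "import cuda.core.experimental as cc\n"
    "dev = cuda.core.experimental.Device(0)\n"
)
new_code = migrate_namespace(old_code)
print(new_code)
```

Because this is a text substitution, it also rewrites matches inside comments and string literals, so the result is worth reviewing before committing.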
New features
Added GraphMemoryResource for allocating and deallocating memory when building a CUDA graph.
Added PinnedMemoryResource and PinnedMemoryResourceOptions for managing host-pinned memory pools with optional IPC support.
Added ManagedMemoryResource and ManagedMemoryResourceOptions for managing unified memory pools accessible from both host and device.
Added Buffer.fill() method for efficient memory initialization, supporting int, bytes, and general buffer protocol objects.
Buffer can now wrap external memory allocations with an owner object.
Added alternative constructors from_buffer(), from_dlpack(), and from_cuda_array_interface() and a new property size for StridedMemoryView.
Added ProgramOptions.as_bytes() and LinkerOptions.as_bytes() public APIs for converting options to backend-specific byte representations.
Updated Device constructor to accept either a Device instance or a device ordinal (int).
Added Device.get_all_devices() classmethod.
IPC-imported buffers can now be re-exported to other processes.
New examples
None.
Fixes and enhancements
Zero-size arrays are now supported as inputs when constructing StridedMemoryView.
Most CUDA resources can be hashed now.
Python bool objects are now converted to C++ bool type when passed as kernel arguments (previously converted to int).
Restored v0.3.x MemoryResource behaviors and missing MR attributes for backward compatibility.
Added warning when multiprocessing start method is set to 'fork'.
Fixed potential memory leaks when DLPack capsule creation is interrupted.
Fixed VirtualMemoryResource on Windows platforms.
Fixed NVRTC program name handling on Windows to avoid filesystem issues.
Improved test determinism by replacing OS sleep with GPU nanosleep kernel in event timing tests.
Fixed CUDA graph issues with cuda-python==12.6.*.
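The bool-to-C++-bool change above matters because the two types usually differ in size, which affects how a kernel whose parameter is declared as bool reads its argument. The sketch below illustrates the size difference using the standard library's ctypes module. It is an illustration only and does not show how cuda.core converts arguments internally. It assumes a typical platform where C bool occupies 1 byte and int occupies 4 bytes.

```python
import ctypes

flag = True

# The same Python value packed as a C bool and as a C int.
as_bool = bytes(ctypes.c_bool(flag))
as_int = bytes(ctypes.c_int(flag))

print("bytes when packed as bool:", len(as_bool))
print("bytes when packed as int: ", len(as_int))
```

A kernel parameter declared as bool expects the narrower representation, so passing the value as an int would not match the declared parameter size.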