cuda.core 0.5.0 Release Notes

Highlights

  • Added memory management support (allocation, deallocation, copy, and fill) for CUDA graphs.

  • Added PinnedMemoryResource and ManagedMemoryResource for advanced memory management.

  • Added peer access control to DeviceMemoryResource.

  • Reduced Python overhead and improved performance for calling launch(), constructing LaunchConfig, and accessing DeviceMemoryResource attributes.

Breaking Changes

Support for setting VirtualMemoryResourceOptions.handle_type to "win32" has been removed. Please reach out to us on GitHub if you have a use case.

All public APIs under the cuda.core.experimental namespace have moved to the top-level cuda.core namespace. For example, cuda.core.experimental.Device is now accessible as cuda.core.Device. The cuda.core.experimental namespace is retained for backward compatibility, but it is deprecated and will be removed by cuda.core v1.0.0.
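Code that has to run against both older and newer releases can look a name up in the new namespace first and fall back to the deprecated one. The following is a minimal sketch, not part of cuda.core: the helper name resolve_first and the CUDA_CORE_NAMESPACES tuple are hypothetical, and the sketch assumes that a missing namespace surfaces as an ImportError.

```python
import importlib


def resolve_first(name, module_names):
    """Return attribute `name` from the first importable module that defines it.

    Hypothetical helper for illustration; it is not a cuda.core API.
    """
    for module_name in module_names:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Namespace not available in this installation; try the next one.
            continue
        if hasattr(module, name):
            return getattr(module, name)
    raise ImportError(f"{name!r} not found in any of {list(module_names)!r}")


# Prefer the new top-level namespace, then the deprecated experimental one.
CUDA_CORE_NAMESPACES = ("cuda.core", "cuda.core.experimental")

try:
    Device = resolve_first("Device", CUDA_CORE_NAMESPACES)
except ImportError:
    Device = None  # cuda.core is not installed in this environment
```

When only 0.5.0 or later needs to be supported, a plain `from cuda.core import Device` is simpler and avoids the fallback entirely.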

The following APIs have been deprecated and will be removed in 0.6.0:

  • cuda.core.experimental.system.driver_version has been replaced with cuda.core.experimental.system.get_driver_version().

  • cuda.core.experimental.system.num_devices has been replaced with cuda.core.experimental.system.get_num_devices().

  • cuda.core.experimental.system.devices has been replaced with cuda.core.experimental.Device.get_all_devices().
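The first two replacements above turn an attribute into a function call, so callers that must work across versions can probe for the new getter before falling back to the old attribute. This is a sketch under stated assumptions: the two *_compat helpers are hypothetical, the `system` argument stands for whatever object exposes these names, and the stand-in objects and version values below are made up so that the example runs without a GPU.

```python
from types import SimpleNamespace


def driver_version_compat(system):
    """Prefer get_driver_version(); fall back to the deprecated attribute."""
    getter = getattr(system, "get_driver_version", None)
    if callable(getter):
        return getter()
    return system.driver_version


def num_devices_compat(system):
    """Prefer get_num_devices(); fall back to the deprecated attribute."""
    getter = getattr(system, "get_num_devices", None)
    if callable(getter):
        return getter()
    return system.num_devices


# Stand-ins for the old and new API shapes (illustrative values only).
old_style = SimpleNamespace(driver_version=(12, 6), num_devices=2)
new_style = SimpleNamespace(
    get_driver_version=lambda: (12, 9),
    get_num_devices=lambda: 4,
)

print(driver_version_compat(old_style), num_devices_compat(old_style))
print(driver_version_compat(new_style), num_devices_compat(new_style))
```

Once the deprecated attributes are removed, the fallback branch can be deleted and the getters called directly.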

Other changes:

  • The utils.StridedMemoryView.__init__() constructor is deprecated in favor of the new from_* classmethods (see New features).

  • Support for Python 3.9 and 3.13t has been dropped.

New features

  • Added GraphMemoryResource for allocating and deallocating memory when building a CUDA graph.

  • Added PinnedMemoryResource and PinnedMemoryResourceOptions for managing host-pinned memory pools with optional IPC support.

  • Added ManagedMemoryResource and ManagedMemoryResourceOptions for managing unified memory pools accessible from both host and device.

  • Added Buffer.fill() method for efficient memory initialization, supporting int, bytes, and general buffer protocol objects.

  • Buffer can now wrap external memory allocations with an owner object.

  • Added the alternative constructors from_buffer(), from_dlpack(), and from_cuda_array_interface(), and a new size property, to StridedMemoryView.

  • Added ProgramOptions.as_bytes() and LinkerOptions.as_bytes() public APIs for converting options to backend-specific byte representations.

  • Updated Device constructor to accept either a Device instance or a device ordinal (int).

  • Added Device.get_all_devices() classmethod.

  • IPC-imported buffers can now be re-exported to other processes.

New examples

None.

Fixes and enhancements

  • Zero-size arrays are now supported as inputs when constructing StridedMemoryView.

  • Most CUDA resources are now hashable.

  • Python bool objects are now converted to C++ bool type when passed as kernel arguments (previously converted to int).

  • Restored v0.3.x MemoryResource behaviors and missing MR attributes for backward compatibility.

  • Added a warning when the multiprocessing start method is set to 'fork'.

  • Fixed potential memory leaks when DLPack capsule creation is interrupted.

  • Fixed VirtualMemoryResource on Windows platforms.

  • Fixed NVRTC program name handling on Windows to avoid filesystem issues.

  • Improved test determinism by replacing OS sleep with GPU nanosleep kernel in event timing tests.

  • Fixed CUDA graph issues with cuda-python==12.6.*.
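The bool conversion fix listed above matters because a C++ bool and an int occupy different numbers of bytes, so a kernel that declares a bool parameter would otherwise receive an argument of the wrong width. The sketch below uses ctypes only as a stand-in to show the size difference on the host; it is not how cuda.core packs kernel arguments internally.

```python
import ctypes

flag = True

# Behavior before this release, per the note above: bool treated as an int.
as_int = ctypes.c_int(flag)
# Behavior from this release on, per the note above: bool treated as a C++ bool.
as_bool = ctypes.c_bool(flag)

int_size = ctypes.sizeof(as_int)
bool_size = ctypes.sizeof(as_bool)

# The raw bytes differ in length, which is what a kernel parameter list sees.
int_bytes = bytes(as_int)
bool_bytes = bytes(as_bool)

print(f"int argument:  {int_size} byte(s)")
print(f"bool argument: {bool_size} byte(s)")
```

To pass an integer-typed flag on purpose, it is safer to wrap the value in an explicit integer type (for example a NumPy or ctypes scalar) than to rely on how a Python bool is converted.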