cuda.core 0.6.0 Release Notes

Highlights

  • Added the cuda.core.system module for NVML-based system and device queries.

  • Several StridedMemoryView improvements, including bfloat16 DLPack support and NumPy array interoperability.

  • Improved support for Python object protocols across core API classes.

  • Performance improvements through Cythonization and reduced Python overhead.

Breaking Changes

  • Building cuda.core from source now requires cuda-bindings >= 12.9.0 (previously 12.8.0), due to Cython-level dependencies on the NVVM bindings (cynvvm). Pre-built wheels are unaffected.
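Before attempting a source build, it can help to confirm that the installed cuda-bindings meets the new minimum. The following is a minimal sketch using only the standard library. The helper names (`parse_release`, `meets_minimum`, `check_installed`) are hypothetical and not part of cuda.core, and the distribution name `cuda-bindings` is assumed to match the name used above.

```python
from importlib import metadata

MINIMUM = (12, 9, 0)  # minimum cuda-bindings version stated in these notes


def parse_release(version):
    """Return the leading numeric components, e.g. '12.9.0.post1' -> (12, 9, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(version, minimum=MINIMUM):
    """True if `version` is at least `minimum`; suffixes such as 'rc1' are ignored."""
    release = parse_release(version)
    # Pad short versions such as '12.9' with zeros so tuple comparison is fair.
    padded = release + (0,) * (len(minimum) - len(release))
    return padded >= minimum


def check_installed(dist_name="cuda-bindings"):
    """Return True/False for an installed distribution, or None if it is absent."""
    try:
        found = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
    return meets_minimum(found)


if __name__ == "__main__":
    print("cuda-bindings satisfies the source-build minimum:", check_installed())
```

This sketch deliberately ignores pre-release ordering; a project that already depends on the `packaging` library may prefer its version comparison instead.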

New features

  • Added the cuda.core.system module for NVML-based system and device queries, including device attributes, clocks, temperatures, fans, events, and PCI information.

  • StridedMemoryView improvements:

    • Added from_array_interface constructor for creating views from NumPy arrays.

    • Improved structured dtype array support.

    • Added bfloat16 DLPack support when the optional ml_dtypes package is installed.
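The name from_array_interface suggests that the constructor reads NumPy's `__array_interface__` protocol, although these notes do not state its signature. As background, the sketch below shows what that protocol carries, using only the standard library. `RawBuffer` and `describe` are hypothetical names introduced for illustration and are not cuda.core APIs.

```python
import ctypes
import math
import sys


class RawBuffer:
    """Hypothetical producer exposing the NumPy array interface (version 3)."""

    def __init__(self, shape):
        count = math.prod(shape)
        self._storage = (ctypes.c_float * count)()  # zero-initialised float32 data
        byteorder = "<" if sys.byteorder == "little" else ">"
        self.__array_interface__ = {
            "version": 3,
            "shape": tuple(shape),
            "typestr": byteorder + "f4",  # 4-byte float in native byte order
            "data": (ctypes.addressof(self._storage), False),  # (pointer, read-only)
            "strides": None,  # None means C-contiguous
        }


def describe(obj):
    """Summarise the layout advertised by an object's __array_interface__."""
    iface = obj.__array_interface__
    shape = tuple(iface["shape"])
    itemsize = int(iface["typestr"][2:])  # byte width; valid for numeric typestrs
    pointer, readonly = iface["data"]
    strides = iface.get("strides")
    if strides is None:
        # Derive C-contiguous strides (in bytes), last axis varying fastest.
        computed, step = [], itemsize
        for dim in reversed(shape):
            computed.append(step)
            step *= dim
        strides = tuple(reversed(computed))
    return {
        "shape": shape,
        "itemsize": itemsize,
        "strides": tuple(strides),
        "nbytes": itemsize * math.prod(shape),
        "pointer": pointer,
        "readonly": readonly,
    }


if __name__ == "__main__":
    print(describe(RawBuffer((2, 3))))
```

A NumPy array exposes the same dictionary through its own `__array_interface__` attribute. For the exact arguments that the new constructor accepts, consult the StridedMemoryView API reference.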

  • Added public access to default CUDA streams via module-level constants LEGACY_DEFAULT_STREAM and PER_THREAD_DEFAULT_STREAM, replacing the previous workaround of using Stream.from_handle(0).

  • Added Kernel.from_handle() for wrapping an existing CUfunction handle into a Kernel object, enabling interoperability with foreign CUDA handles.

  • Added __eq__, __hash__, __weakref__, and __repr__ support for core API classes including Buffer, LaunchConfig, Kernel, ObjectCode, Stream, and Event.
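As a rough illustration of what these protocols make possible, the sketch below uses a pure-Python stand-in class. `FakeStream` is hypothetical and is not cuda.core code, and comparing by the wrapped handle value is an assumption made for this example rather than a statement about how cuda.core defines equality.

```python
import weakref


class FakeStream:
    """Hypothetical handle wrapper used only to illustrate the protocols."""

    # Classes that define __slots__ need an explicit "__weakref__" slot
    # before instances can be weakly referenced.
    __slots__ = ("_handle", "__weakref__")

    def __init__(self, handle):
        self._handle = handle

    def __eq__(self, other):
        if not isinstance(other, FakeStream):
            return NotImplemented
        return self._handle == other._handle

    def __hash__(self):
        return hash((FakeStream, self._handle))

    def __repr__(self):
        return f"FakeStream(handle={self._handle:#x})"


class NoWeakref:
    """Same idea without the "__weakref__" slot; weak references will fail."""

    __slots__ = ("_handle",)


a = FakeStream(0x10)
b = FakeStream(0x10)

# __eq__ and __hash__: two wrappers around the same handle share a dict entry.
cache = {a: "compiled"}
lookup = cache[b]

# __weakref__: objects can be tracked without being kept alive.
ref = weakref.ref(a)

# __repr__: readable output in logs and debuggers.
text = repr(a)

try:
    weakref.ref(NoWeakref())
    weakref_failed = False
except TypeError:
    weakref_failed = True
```

With hashing and equality in place, such objects work as dictionary keys and set members, and weak-reference support lets caches such as `weakref.WeakValueDictionary` hold them without extending their lifetime.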

  • Added NVVM extra_sources and use_libdevice options to ProgramOptions for multi-module NVVM compilation and automatic libdevice loading.

  • Added a CUDA version compatibility check at import time to detect mismatches between cuda.core and the installed cuda-bindings version.

Fixes and enhancements

  • Eliminated spurious CUDA driver errors during interpreter shutdown by ensuring resources are destroyed in the correct order.

  • Fixed a bug preventing weak references to core API objects.

  • Fixed zero-sized allocations in legacy memory resources, which previously failed on certain platforms.

  • Improved performance by Cythonizing Program and ObjectCode internals.

  • Reduced StridedMemoryView construction overhead.

  • __hash__ and __eq__ on core API classes no longer require a CUDA context.

  • Device attribute queries now gracefully handle unsupported attributes on older CUDA drivers, returning sensible defaults instead of raising errors.

  • Added a warning when ManagedMemoryResource is created on platforms without concurrent managed access support.

  • Reduced wheel and installed package sizes by excluding Cython source files and build artifacts from distribution packages.

  • Slightly improved typing support.