`cuda.core` 0.6.0 Release Notes

Highlights

Added the cuda.core.system module for NVML-based system and device queries.
Several StridedMemoryView improvements, including bfloat16 DLPack support and NumPy array interoperability.
Improved support for Python object protocols across core API classes.
Performance improvements through Cythonization and reduced Python overhead.

Building cuda.core from source now requires cuda-bindings >= 12.9.0 (previously 12.8.0), due to Cython-level dependencies on the NVVM bindings (cynvvm). Pre-built wheels are unaffected.
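
Before attempting a source build, it can help to confirm that the installed cuda-bindings meets the new minimum. The following is a minimal sketch using only the standard library. The helper names `parse_version` and `bindings_meets_minimum` are hypothetical, and the simple numeric parsing is an assumption: it treats a pre-release such as `12.9.0rc1` the same as `12.9.0`.

```python
from importlib import metadata

# Minimum cuda-bindings version for source builds, as stated in these notes.
MIN_BINDINGS = (12, 9, 0)


def parse_version(text):
    """Turn '12.9.1' into (12, 9, 1), ignoring any non-numeric suffix."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # component such as 'post1' or 'dev0'
        parts.append(int(digits))
        if len(digits) != len(piece):
            break  # component such as '0rc1'
    while len(parts) < 3:
        parts.append(0)  # pad '13.0' to (13, 0, 0)
    return tuple(parts)


def bindings_meets_minimum(installed=None):
    """Return True or False, or None when cuda-bindings is not installed."""
    if installed is None:
        try:
            installed = metadata.version("cuda-bindings")
        except metadata.PackageNotFoundError:
            return None
    return parse_version(installed) >= MIN_BINDINGS


print("cuda-bindings meets minimum:", bindings_meets_minimum())
```

A result of `None` means no cuda-bindings distribution was found in the current environment.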

Added the cuda.core.system module for NVML-based system and device queries, including device attributes, clocks, temperatures, fans, events, and PCI information.
StridedMemoryView improvements:
- Added from_array_interface constructor for creating views from NumPy arrays.
- Improved structured dtype array support.
- Added bfloat16 DLPack support when the optional ml_dtypes package is installed.
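
Judging by its name, from_array_interface consumes objects that expose NumPy's `__array_interface__` protocol; that reading is an assumption, and the constructor's exact signature is not given in these notes. The sketch below only illustrates what the protocol itself exposes, using the standard library. `FloatVector` and `describe` are hypothetical names, and no cuda.core call is made.

```python
import array
import sys


class FloatVector:
    """Hypothetical array-like object backed by a stdlib array of C floats."""

    def __init__(self, values):
        self._data = array.array("f", values)

    @property
    def __array_interface__(self):
        address, length = self._data.buffer_info()
        byteorder = "<" if sys.byteorder == "little" else ">"
        return {
            "version": 3,
            "shape": (length,),
            "typestr": f"{byteorder}f{self._data.itemsize}",
            "data": (address, False),  # (pointer, read-only flag)
            "strides": None,  # None means C-contiguous
        }


def describe(obj):
    """Read the fields a consumer of the protocol would typically inspect."""
    iface = obj.__array_interface__
    itemsize = int(iface["typestr"][2:])
    count = 1
    for extent in iface["shape"]:
        count *= extent
    return {
        "shape": iface["shape"],
        "typestr": iface["typestr"],
        "nbytes": count * itemsize,
        "readonly": iface["data"][1],
    }


vec = FloatVector([1.0, 2.0, 3.0])
info = describe(vec)
print(info)
```

Any object that publishes this dictionary, NumPy arrays included, carries enough information (pointer, shape, element type, strides) to build a strided view without copying.
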
Added public access to default CUDA streams via module-level constants LEGACY_DEFAULT_STREAM and PER_THREAD_DEFAULT_STREAM, replacing the previous workaround of using Stream.from_handle(0).
Added Kernel.from_handle() for wrapping an existing CUfunction handle into a Kernel object, enabling interoperability with foreign CUDA handles.
Added __eq__, __hash__, __weakref__, and __repr__ support for core API classes including Buffer, LaunchConfig, Kernel, ObjectCode, Stream, and Event.
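
To illustrate what these protocols enable, here is a minimal pure-Python sketch. `HandleWrapper` is a hypothetical stand-in, not a cuda.core class, and defining equality by the underlying handle value is an assumption that these notes do not confirm.

```python
import weakref


class HandleWrapper:
    """Hypothetical stand-in for an object that wraps a native handle."""

    # With __slots__, weak references work only if __weakref__ is listed.
    __slots__ = ("_handle", "__weakref__")

    def __init__(self, handle):
        self._handle = int(handle)

    def __eq__(self, other):
        if not isinstance(other, HandleWrapper):
            return NotImplemented
        # Assumption: two wrappers are equal when they wrap the same handle.
        return self._handle == other._handle

    def __hash__(self):
        # Consistent with __eq__, so equal wrappers collapse in sets and dicts.
        return hash((type(self), self._handle))

    def __repr__(self):
        return f"<HandleWrapper handle={self._handle:#x}>"


a = HandleWrapper(0x1000)
b = HandleWrapper(0x1000)
c = HandleWrapper(0x2000)

unique = {a, b, c}    # a and b hash and compare equal
ref = weakref.ref(a)  # allowed because of the __weakref__ slot
print(repr(a), len(unique), ref() is a)
```

With all four protocols in place, such objects can serve as dictionary keys, be deduplicated in sets, be tracked through weak references in caches, and print readably in logs.
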
Added NVVM extra_sources and use_libdevice options to ProgramOptions for multi-module NVVM compilation and automatic libdevice loading.
Added CUDA version compatibility check at import time to detect mismatches between cuda.core and the installed cuda-bindings version.

Eliminated spurious CUDA driver errors during interpreter shutdown by ensuring resources are destroyed in the correct order.
Fixed a bug preventing weak references to core API objects.
Fixed zero-sized allocations in legacy memory resources, which previously failed on certain platforms.
Improved performance by Cythonizing Program and ObjectCode internals.
Reduced StridedMemoryView construction overhead.
__hash__ and __eq__ on core API classes no longer require a CUDA context.
Device attribute queries now gracefully handle unsupported attributes on older CUDA drivers, returning sensible defaults instead of raising errors.
Added a warning when ManagedMemoryResource is created on platforms without concurrent managed access support.
Reduced wheel and installed package sizes by excluding Cython source files and build artifacts from distribution packages.
Slightly improved typing support.