cuda.core 0.3.0 Release Notes

Released on June 11, 2025

Highlights

  • Starting with this release, cuda.core is licensed under Apache 2.0. The biggest implication of this change is that we are now open to external contributions! Please follow the Contributor Guide for detailed instructions.

  • Initial support for CUDA graphs (phase 1).
    • In this release, we support building a CUDA graph that captures kernel launches. The captured graph can be replayed to reduce latency. Graph split/join and conditional nodes are supported.

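For illustration, a minimal sketch of the phase-1 workflow: a graph builder created from the device captures kernel launches, and the completed graph is replayed on a stream. The method names below (create_graph_builder(), begin_building(), end_building(), complete(), Graph.launch()) reflect our reading of the 0.3.0 API; check the cuda.core documentation for the authoritative spelling.

```python
# Sketch of graph capture and replay; builder/graph method names are our
# reading of the 0.3.0 API and should be verified against the docs.
from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions, launch

dev = Device()
dev.set_current()
stream = dev.create_stream()

cc = dev.compute_capability
prog = Program('extern "C" __global__ void noop() {}', code_type="c++",
               options=ProgramOptions(arch=f"sm_{cc.major}{cc.minor}"))
kernel = prog.compile("cubin").get_kernel("noop")
config = LaunchConfig(grid=1, block=1)

# Capture kernel launches into a graph instead of executing them eagerly:
# the builder is passed to launch() where a stream would normally go.
gb = dev.create_graph_builder().begin_building()
launch(gb, config, kernel)
graph = gb.end_building().complete()

# Replay the captured work with a single, low-latency graph launch.
graph.launch(stream)
stream.sync()
```
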
Breaking Changes

  • The Buffer object’s __init__() method is removed; see below.

  • The Buffer object’s close() method and destructor now always defer to the underlying memory resource implementation to decide the behavior when a stream is not explicitly passed. Previously, the default stream was always used in this case, which could interfere with the memory resource’s assumptions.

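For example, deallocation can be ordered on a specific stream by passing it explicitly. A sketch, assuming allocation through the device's default memory resource (Device.memory_resource):

```python
# Sketch: pass a stream explicitly to Buffer.close() so deallocation is
# stream-ordered, rather than relying on the memory resource's default policy.
from cuda.core.experimental import Device

dev = Device()
dev.set_current()
stream = dev.create_stream()

buf = dev.memory_resource.allocate(1 << 20, stream=stream)  # 1 MiB
# ... use buf in kernels launched on `stream` ...
buf.close(stream)  # deallocation ordered after prior work on `stream`
stream.sync()
```
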
New features

  • Kernel adds num_arguments and arguments_info for introspection of kernel arguments. (#612)

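For illustration, a sketch of inspecting a compiled kernel; we assume each arguments_info entry carries the argument's byte offset and size:

```python
# Sketch: introspect the parameters of a compiled kernel. The offset/size
# fields on the arguments_info entries are an assumption; check the docs.
from cuda.core.experimental import Device, Program, ProgramOptions

dev = Device()
dev.set_current()
cc = dev.compute_capability
code = 'extern "C" __global__ void axpy(float a, const float* x, float* y) {}'
prog = Program(code, code_type="c++",
               options=ProgramOptions(arch=f"sm_{cc.major}{cc.minor}"))
krn = prog.compile("cubin").get_kernel("axpy")

print(krn.num_arguments)         # 3 parameters: a, x, y
for info in krn.arguments_info:  # one entry per kernel parameter
    print(info.offset, info.size)
```
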
  • Add pythonic access to kernel occupancy calculation functions via Kernel.occupancy. (#648)

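Continuing the sketch above, a hedged example of one such query; the method name and argument order are our assumption, mirroring cudaOccupancyMaxActiveBlocksPerMultiprocessor:

```python
# Sketch: ask how many 256-thread blocks can be resident per SM for `krn`
# (from the previous sketch). The name/signature mirror the CUDA occupancy
# API and are an assumption; consult the Kernel.occupancy documentation.
max_blocks = krn.occupancy.max_active_blocks_per_multiprocessor(
    256,  # threads per block
    0,    # dynamic shared memory per block, in bytes
)
print(max_blocks)
```
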
  • Support launching cooperative kernels by setting LaunchConfig.cooperative_launch to True.

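A sketch, with hypothetical stream and coop_kernel objects; the device code must perform grid-wide synchronization (e.g. via cooperative groups) for a cooperative launch to be meaningful:

```python
# Sketch: request a cooperative launch. `stream` and `coop_kernel` are
# hypothetical; the kernel's device code is assumed to use grid-wide sync
# (e.g. cooperative_groups::this_grid().sync()).
from cuda.core.experimental import LaunchConfig, launch

config = LaunchConfig(grid=4, block=128, cooperative_launch=True)
launch(stream, config, coop_kernel)
stream.sync()
```
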
  • A name can be assigned to ObjectCode instances generated by both Program and Linker through their respective options.

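A sketch, assuming the option is spelled name on ProgramOptions (and that LinkerOptions mirrors it):

```python
# Sketch: give the compiled module a human-readable name. The `name` option
# spelling is an assumption; LinkerOptions is assumed to offer the same.
from cuda.core.experimental import Program, ProgramOptions

code = 'extern "C" __global__ void noop() {}'
prog = Program(code, code_type="c++",
               options=ProgramOptions(name="my_module", arch="sm_80"))
module = prog.compile("cubin")  # the resulting ObjectCode carries the name
```
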
  • Expose Buffer, DeviceMemoryResource, LegacyPinnedMemoryResource, and MemoryResource to the top namespace.
    • Before this release, the internal Buffer class had an __init__() constructor. To align with the design of cuda.core objects, this constructor is removed starting with this release. Users who still need the old behavior should use the from_handle() alternative constructor.

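A sketch of the replacement path; the exact argument list of from_handle() (pointer, size in bytes, owning memory resource) is our assumption:

```python
# Sketch: wrap an externally allocated device pointer in a Buffer.
# `ptr` and `nbytes` are hypothetical values obtained from a third-party
# allocator; the from_handle() argument list is an assumption.
from cuda.core.experimental import Buffer, Device

dev = Device()
dev.set_current()
buf = Buffer.from_handle(ptr, nbytes, dev.memory_resource)
```
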
  • Add a typing annotation for the __cuda_stream__ protocol.

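The protocol is a zero-argument method returning a (version, handle) pair. A sketch of a wrapper for a foreign stream; the version value 0 and consumption via Device.create_stream() reflect our understanding:

```python
# Sketch: expose a stream owned by another library to cuda.core through
# the __cuda_stream__ protocol. `raw_handle` is a hypothetical cudaStream_t.
from cuda.core.experimental import Device


class ForeignStream:
    def __init__(self, handle: int):
        self._handle = handle

    def __cuda_stream__(self) -> tuple[int, int]:
        # (protocol version, raw stream handle)
        return (0, self._handle)


dev = Device()
dev.set_current()
stream = dev.create_stream(ForeignStream(raw_handle))
```
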
New examples

  • Add a PyTorch-based example.

  • Split the StridedMemoryView example into two (CPU/GPU).

Fixes and enhancements

  • cuda.core now raises clearer and more actionable error messages whenever possible.

  • ObjectCode can now be pickled.

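For example (a sketch; `module` stands for an ObjectCode returned by Program.compile(), and the kernel name is hypothetical):

```python
# Sketch: round-trip an ObjectCode through pickle, e.g. to cache compiled
# modules or ship them to worker processes.
import pickle

blob = pickle.dumps(module)
restored = pickle.loads(blob)
kernel = restored.get_kernel("noop")  # hypothetical kernel name
```
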
  • Looking up Event.device and Event.context (the device and CUDA context on which an event was created) is now possible (see the sketch after the next item).

  • Event-based timing has been made more robust, with better error messages.

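A combined sketch of both event improvements; EventOptions(enable_timing=True) and Stream.record() follow our reading of the API, and the subtraction is assumed to report the elapsed time (in milliseconds, per CUDA event semantics):

```python
# Sketch: time a span of stream work with events, then inspect where the
# events were created.
from cuda.core.experimental import Device, EventOptions

dev = Device()
dev.set_current()
stream = dev.create_stream()
opts = EventOptions(enable_timing=True)

start = stream.record(options=opts)
# ... launch kernels on `stream` ...
end = stream.record(options=opts)
end.sync()

print(end - start)  # elapsed time between the two events
print(end.device)   # Device on which `end` was created
print(end.context)  # CUDA context `end` belongs to
```
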
  • The launch() function’s handling of fp16 scalars was incorrect and has been fixed.

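For example, an fp16 scalar destined for a __half kernel parameter is passed as a NumPy float16 (a sketch with hypothetical stream and half_kernel objects):

```python
# Sketch: pass an fp16 scalar to a kernel with a __half parameter.
# `stream` and `half_kernel` are hypothetical.
import numpy as np
from cuda.core.experimental import LaunchConfig, launch

config = LaunchConfig(grid=1, block=32)
launch(stream, config, half_kernel, np.float16(0.5))
```
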
  • ProgramOptions.ptxas_options can now accept more than one argument.

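For example, passing a list of flags (the specific flags shown are standard ptxas options):

```python
# Sketch: forward multiple flags to ptxas in a single option.
from cuda.core.experimental import ProgramOptions

opts = ProgramOptions(arch="sm_80", ptxas_options=["-v", "--opt-level=3"])
```
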
  • The Device constructor has been made faster.

  • The CFFI-based example no longer leaves intermediate files on disk after it finishes.