cuda.core 0.3.0 Release Notes
Released on June 11, 2025
Highlights
- Starting with this release, cuda.core is licensed under Apache 2.0. The biggest implication of this change is that we are now open to external contributions! Please follow the Contributor Guide for detailed instructions.
- Initial support for CUDA graphs (phase 1). In this release, we support building a CUDA graph that captures kernel launches. The captured graph can be replayed to reduce latency. Graph split/join and conditional nodes are supported.
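The capture-and-replay flow described above could look roughly like the following. This is a minimal sketch, not an official example: the kernel source, the helper name `replay_captured_graph`, and the buffer size are made up for illustration, and it assumes the graph builder API is used as `create_graph_builder()`, `begin_building()`, `end_building()`, `complete()`, and `Graph.launch()` under `cuda.core.experimental`. Running it requires a CUDA-capable GPU.

```python
import numpy as np

# Hypothetical kernel used only for illustration.
KERNEL_SOURCE = r"""
extern "C" __global__ void add_one(float* data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}
"""


def replay_captured_graph(n=1024, replays=3):
    """Capture one kernel launch into a CUDA graph and replay it."""
    # Imported lazily so that merely defining this helper needs no GPU stack.
    from cuda.core.experimental import (
        Device, LaunchConfig, Program, ProgramOptions, launch,
    )

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Compile the kernel for the current device's architecture.
    arch = "".join(str(i) for i in dev.compute_capability)
    prog = Program(KERNEL_SOURCE, code_type="c++",
                   options=ProgramOptions(arch=f"sm_{arch}"))
    kernel = prog.compile("cubin").get_kernel("add_one")

    # Uninitialized scratch buffer of n floats; contents do not matter here.
    buf = dev.allocate(n * 4, stream=stream)
    config = LaunchConfig(grid=(n + 255) // 256, block=256)

    # Launches issued to the builder are recorded instead of executed.
    builder = stream.create_graph_builder()
    builder.begin_building()
    launch(builder, config, kernel, buf, np.uint64(n))
    graph = builder.end_building().complete()

    # Replaying the instantiated graph avoids per-launch setup cost.
    for _ in range(replays):
        graph.launch(stream)
    stream.sync()

    buf.close(stream)
    stream.close()
```

Calling `replay_captured_graph()` on a machine with a GPU runs the captured kernel three times from a single instantiated graph.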
Breaking Changes
- The Buffer object's __init__() method is removed, see below.
- The Buffer object's close() method and destructor now always defer to the underlying memory resource implementation to decide the behavior if a stream is not explicitly passed. Previously, the default stream was always used in this case, which could interfere with the memory resource's assumptions.
New features
- Kernel adds num_arguments and arguments_info for introspection of kernel arguments. (#612)
- Add Pythonic access to kernel occupancy calculation functions via Kernel.occupancy. (#648)
- Support launching cooperative kernels by setting LaunchConfig.cooperative_launch to True.
- A name can be assigned to ObjectCode instances generated by both Program and Linker through their respective options.
- Expose Buffer, DeviceMemoryResource, LegacyPinnedMemoryResource, and MemoryResource to the top namespace.
  - Before this release, the internal Buffer class had an __init__() constructor. To align with the design of cuda.core objects, this constructor is removed starting with this release. Users who still need the old behavior should use the from_handle() alternative constructor.
- Add a typing annotation for the __cuda_stream__ protocol.
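To illustrate what a typing annotation for the `__cuda_stream__` protocol buys, here is a pure-Python sketch that needs neither cuda.core nor a GPU. It assumes the protocol is a method returning a `(version, handle)` pair of integers with version 0. The names `SupportsCudaStream`, `ForeignStream`, and `unpack_stream` are hypothetical and are not the names used by the library.

```python
from typing import Protocol, Tuple, runtime_checkable


@runtime_checkable
class SupportsCudaStream(Protocol):
    """Hypothetical stand-in for a typed __cuda_stream__ protocol."""

    def __cuda_stream__(self) -> Tuple[int, int]:
        """Return (protocol version, address of the underlying stream)."""
        ...


class ForeignStream:
    """Example third-party stream wrapper exposing the protocol."""

    def __init__(self, handle: int) -> None:
        self._handle = handle

    def __cuda_stream__(self) -> Tuple[int, int]:
        # Version 0 is assumed to be the current protocol version.
        return (0, self._handle)


def unpack_stream(obj: SupportsCudaStream) -> int:
    """Validate the protocol version and return the raw stream handle."""
    version, handle = obj.__cuda_stream__()
    if version != 0:
        raise ValueError(f"unsupported __cuda_stream__ version: {version}")
    return handle
```

With an annotation like this, a static type checker can flag objects that lack `__cuda_stream__` before they reach a function that expects a stream-like argument.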
New examples
- Add a PyTorch-based example.
- Split the StridedMemoryView example into two (CPU/GPU).
Fixes and enhancements
- cuda.core now raises clearer and more actionable error messages whenever possible.
- ObjectCode can be pickled now.
- Look-up of Event.device and Event.context (the device and CUDA context in which an event was created) is now possible.
- Event-based timing is made more robust (also with better error messages).
- The launch() function's handling of fp16 scalars was incorrect and is fixed.
- ProgramOptions.ptxas_options can now accept more than one argument.
- The Device constructor is made faster.
- The CFFI-based example no longer leaves the intermediate files on disk after it finishes.
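As a rough illustration of the ObjectCode pickling support mentioned above, the sketch below compiles a trivial kernel to PTX and round-trips the result through `pickle`. The kernel, the helper name `roundtrip_object_code`, and the fixed `sm_80` target are assumptions made for the example. It requires cuda.core with NVRTC available.

```python
import pickle

# Hypothetical no-op kernel used only for illustration.
NOOP_SOURCE = 'extern "C" __global__ void noop() {}'


def roundtrip_object_code(arch="sm_80"):
    """Compile to PTX, pickle the ObjectCode, and compare after unpickling."""
    # Imported lazily so that defining this helper needs no CUDA stack.
    from cuda.core.experimental import Program, ProgramOptions

    prog = Program(NOOP_SOURCE, code_type="c++",
                   options=ProgramOptions(arch=arch))
    ptx = prog.compile("ptx")

    # Serialize and restore, for example to cache a compiled module
    # or to pass it to another process.
    restored = pickle.loads(pickle.dumps(ptx))
    return restored.code == ptx.code
```

If pickling behaves as described, `roundtrip_object_code()` should return `True`.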