cuda.core 0.3.0 Release Notes
Released on MM DD, 2025
Highlights
Starting with this release, cuda.core is licensed under Apache 2.0. The biggest implication of this change is that we are now open to external contributions! Please follow the Contributor Guide for detailed instructions.
Breaking Changes
- The Buffer object's __init__() method is removed, see below.
- The Buffer object's close() method and destructor now always defer to the underlying memory resource implementation to decide the behavior if a stream is not explicitly passed. Previously, in this case they always used the default stream, which could interfere with the memory resource's assumptions.
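The change to close() can be pictured with a small toy model. The sketch below is not the cuda.core implementation: ToyMemoryResource, ToyBuffer, preferred_stream, and the string stand-ins for streams are all hypothetical names invented for illustration. It only shows the idea that, when no stream is passed, the buffer forwards None and the memory resource picks the fallback itself.

```python
class ToyMemoryResource:
    """Hypothetical stand-in for a memory resource (not a cuda.core class)."""

    def __init__(self, preferred_stream):
        # Assumption for illustration: the resource has its own notion of
        # which stream deallocations should be ordered on.
        self.preferred_stream = preferred_stream
        self.freed = []

    def deallocate(self, ptr, size, stream=None):
        # The resource, not the buffer, decides what "no stream" means.
        chosen = self.preferred_stream if stream is None else stream
        self.freed.append((ptr, size, chosen))


class ToyBuffer:
    """Hypothetical stand-in for a buffer owned by a memory resource."""

    def __init__(self, ptr, size, mr):
        self.ptr = ptr
        self.size = size
        self.mr = mr

    def close(self, stream=None):
        if self.ptr is None:
            return  # already closed
        # No default stream is substituted here; None is passed through.
        self.mr.deallocate(self.ptr, self.size, stream)
        self.ptr = None


mr = ToyMemoryResource(preferred_stream="mr-stream")
ToyBuffer(0x1000, 256, mr).close()                      # resource decides
ToyBuffer(0x2000, 512, mr).close(stream="user-stream")  # caller decides
print(mr.freed)
```

Under the previous behavior described above, the first call would have been ordered on the default stream regardless of what the memory resource expected.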
New features
- Kernel adds num_arguments and arguments_info for introspection of kernel arguments. (#612)
- Add Pythonic access to kernel occupancy calculation functions via Kernel.occupancy. (#648)
- Support launching cooperative kernels by setting LaunchConfig.cooperative_launch to True.
- A name can be assigned to ObjectCode instances generated by both Program and Linker through their respective options.
- Expose Buffer, DeviceMemoryResource, LegacyPinnedMemoryResource, and MemoryResource to the top namespace. Before this release, the internal Buffer class had an __init__() constructor. To align with the design of cuda.core objects, this constructor is removed starting with this release. Users who still need the old behavior should use the from_handle() alternative constructor.
- Add a typing annotation for the __cuda_stream__ protocol.
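To illustrate what a typing annotation for the __cuda_stream__ protocol can look like, here is a minimal pure-Python sketch. The names SupportsCudaStream, ForeignStream, and stream_handle are hypothetical and are not the names cuda.core uses. The sketch also assumes that __cuda_stream__() returns a (version, handle) pair of integers with version 0, which is not stated in these notes.

```python
from typing import Protocol, Tuple, runtime_checkable


@runtime_checkable
class SupportsCudaStream(Protocol):
    """Hypothetical annotation for objects that expose a CUDA stream."""

    def __cuda_stream__(self) -> Tuple[int, int]:
        # Assumed return shape: (protocol version, stream handle address).
        ...


class ForeignStream:
    """Toy stream wrapper from some other library (illustrative only)."""

    def __init__(self, handle: int) -> None:
        self._handle = handle

    def __cuda_stream__(self) -> Tuple[int, int]:
        return (0, self._handle)


def stream_handle(obj: SupportsCudaStream) -> int:
    """Extract the raw handle from any object implementing the protocol."""
    version, handle = obj.__cuda_stream__()
    if version != 0:
        raise ValueError(f"unsupported __cuda_stream__ version: {version}")
    return handle


s = ForeignStream(0x7F00)
print(isinstance(s, SupportsCudaStream), stream_handle(s))
```

With an annotation of this kind, a static type checker can flag calls that pass an object lacking __cuda_stream__, while the runtime isinstance check only verifies that the method is present.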
New examples
- Add a PyTorch-based example.
- Split the StridedMemoryView example into two (CPU/GPU).
Fixes and enhancements
- cuda.core now raises clearer and more actionable error messages whenever possible.
- ObjectCode can be pickled now.
- Look-up of Event.device and Event.context (the device and CUDA context where an event was created) is now possible.
- Event-based timing is made more robust (also with better error messages).
- The launch() function's handling of fp16 scalars was incorrect and is fixed.
- ProgramOptions.ptxas_options can now accept more than one argument.
- The Device constructor is made faster.
- The CFFI-based example no longer leaves the intermediate files on disk after it finishes.