cuda.core 0.3.0 Release Notes

Released on June 11, 2025

Highlights

  • Starting with this release, cuda.core is licensed under Apache 2.0. The biggest implication of this change is that we are now open to external contributions! Please follow the Contributor Guide for detailed instructions.

  • Initial support for CUDA graphs (phase 1).
    • In this release, we support building a CUDA graph that captures kernel launches; the captured graph can be replayed to reduce launch latency. Graph split/join and conditional nodes are also supported.

Breaking Changes

  • The Buffer object’s __init__() method is removed; see below.

  • The Buffer object’s close() method and destructor now always defer to the underlying memory resource implementation to decide the behavior when a stream is not explicitly passed. Previously, the default stream was always used in this case, which could interfere with the memory resource’s assumptions.

New features

New examples

  • Add a PyTorch-based example.

  • Split the StridedMemoryView example into two (CPU/GPU).

Fixes and enhancements

  • cuda.core now raises clearer and more actionable error messages whenever possible.

  • ObjectCode objects can now be pickled.

  • Looking up Event.device and Event.context (the device and CUDA context in which an event was created) is now possible.

  • Event-based timing is now more robust, with better error messages.

  • The launch() function’s handling of fp16 scalars was incorrect and has been fixed.

  • ProgramOptions.ptxas_options can now accept more than one argument.

  • The Device constructor has been made faster.

  • The CFFI-based example no longer leaves intermediate files on disk after it finishes.