CUDA Host API

Device Management

Device detection and enquiry

The following functions are available for querying the available hardware:

numba.cuda.is_available()

Returns a boolean to indicate the availability of a CUDA GPU.

This will initialize the driver if it hasn’t been initialized.

numba.cuda.detect()

Detect supported CUDA hardware and print a summary of the detected hardware.

Returns a boolean indicating whether any supported devices were detected.

Context management

CUDA Python functions execute within a CUDA context. Each CUDA device in a system has an associated CUDA context, and Numba presently allows only one context per thread. For further details on CUDA Contexts, refer to the CUDA Driver API Documentation on Context Management and the CUDA C Programming Guide Context Documentation. CUDA Contexts are instances of the Context class:

class numba.cuda.cudadrv.driver.Context(device, handle)

This object wraps a CUDA Context resource.

Contexts should not be constructed directly by user code.

get_memory_info()

Returns (free, total) memory in bytes in the context.

pop()

Pops this context off the current CPU thread. Note that this context must be at the top of the context stack, otherwise an error will occur.

push()

Pushes this context onto the current CPU thread.

reset()

Clean up all owned resources in this context.

The following functions can be used to get or select the context:

numba.cuda.current_context(devnum=None)

Get the context for the current device, or for the device with the given number devnum, and return the CUDA context.

numba.cuda.require_context(fn)

A decorator that ensures a CUDA context is available when fn is executed.

Note: The function fn cannot switch CUDA contexts.

The following functions affect the current context:

numba.cuda.synchronize()

Synchronize the current context.

numba.cuda.close()

Explicitly clears all contexts in the current thread, and destroys all contexts if the current thread is the main thread.

Device management

Numba maintains a list of supported CUDA-capable devices:

numba.cuda.gpus

An indexable list of supported CUDA devices. This list is indexed by integer device ID.

Alternatively, the current device can be obtained:

numba.cuda.gpus.current

The currently-selected device.

Getting a device through numba.cuda.gpus always provides an instance of numba.cuda.cudadrv.devices._DeviceContextManager, which acts as a context manager for the selected device:

class numba.cuda.cudadrv.devices._DeviceContextManager(device)

Provides a context manager for executing in the context of the chosen device. The normal use of instances of this type is from numba.cuda.gpus. For example, to execute on device 2:

with numba.cuda.gpus[2]:
    d_a = numba.cuda.to_device(a)

This copies the array a onto device 2; the resulting device array is referred to by d_a.

One may also select a context and device or get the current device using the following three functions:

numba.cuda.select_device(device_id)

Make the context associated with device device_id the current context.

Returns a Device instance.

Raises exception on error.

numba.cuda.get_current_device()

Get the device associated with the current thread.

numba.cuda.list_devices()

Return a list of all detected devices.

The numba.cuda.cudadrv.driver.Device class can be used to enquire about the functionality of the selected device:

class numba.cuda.cudadrv.driver.Device

The device associated with a particular context.

compute_capability

A tuple (major, minor) indicating the supported compute capability.

id

The integer ID of the device.

name

The name of the device (e.g. “GeForce GTX 970”).

uuid

The UUID of the device (e.g. “GPU-e6489c45-5b68-3b03-bab7-0e7c8e809643”).

reset()

Delete the context for the device. This will destroy all memory allocations, events, and streams created within the context.

supports_float16

Return True if the device supports float16 operations, False otherwise.

Compilation

Numba provides an entry point for compiling a Python function without invoking any of the driver API. This can be useful for:

  • Generating PTX that is to be inlined into other PTX code (e.g. from outside the Numba / Python ecosystem).

  • Generating PTX or LTO-IR to link with objects from non-Python translation units.

  • Generating code when there is no device present.

  • Generating code prior to a fork without initializing CUDA.

Note

It is the user’s responsibility to manage any ABI issues arising from the use of compilation to PTX / LTO-IR. Passing the abi="c" keyword argument can provide a solution to most issues that may arise - see Using the C ABI.

numba.cuda.compile(pyfunc, sig, debug=None, lineinfo=False, device=True, fastmath=False, cc=None, opt=None, abi='c', abi_info=None, output='ptx')

Compile a Python function to PTX or LTO-IR for a given set of argument types.

Parameters:
  • pyfunc – The Python function to compile.

  • sig – The signature representing the function’s input and output types. If this is a tuple of argument types without a return type, the inferred return type is returned by this function. If a signature including a return type is passed, the compiled code will include a cast from the inferred return type to the specified return type, and this function will return the specified return type.

  • debug (bool) – Whether to include debug info in the compiled code.

  • lineinfo (bool) – Whether to include a line mapping from the compiled code to the source code. Usually this is used with optimized code (since debug mode would automatically include this), so we want debug info in the LLVM IR but only the line mapping in the final output.

  • device (bool) – Whether to compile a device function.

  • fastmath (bool) – Whether to enable fast math flags (ftz=1, prec_sqrt=0, prec_div=0, and fma=1).

  • cc (tuple) – Compute capability to compile for, as a tuple (MAJOR, MINOR). Defaults to (5, 0).

  • opt (bool) – Whether to enable optimizations in the compiled code.

  • abi (str) – The ABI for a compiled function - either "numba" or "c". Note that the Numba ABI is not considered stable. The C ABI is only supported for device functions at present.

  • abi_info (dict) – A dict of ABI-specific options. The "c" ABI supports one option, "abi_name", for providing the wrapper function’s name. The "numba" ABI has no options.

  • output (str) – Type of output to generate, either "ptx" or "ltoir".

Returns:

(code, resty): The compiled code and inferred return type

Return type:

tuple

The environment variable NUMBA_CUDA_DEFAULT_PTX_CC can be set to control the default compute capability targeted by compile - see Environment Variables. If code for the compute capability of the current device is required, the compile_for_current_device function can be used:

numba.cuda.compile_for_current_device(pyfunc, sig, debug=None, lineinfo=False, device=True, fastmath=False, opt=None, abi='c', abi_info=None, output='ptx')

Compile a Python function to PTX or LTO-IR for a given signature for the current device’s compute capability. This calls compile() with an appropriate cc value for the current device.

Numba also provides two functions that may be used in legacy code that specifically compile to PTX only:

numba.cuda.compile_ptx(pyfunc, sig, debug=None, lineinfo=False, device=False, fastmath=False, cc=None, opt=None, abi='numba', abi_info=None)

Compile a Python function to PTX for a given signature. See compile(). The defaults for this function are to compile a kernel with the Numba ABI, rather than compile()’s default of compiling a device function with the C ABI.

numba.cuda.compile_ptx_for_current_device(pyfunc, sig, debug=None, lineinfo=False, device=False, fastmath=False, opt=None, abi='numba', abi_info=None)

Compile a Python function to PTX for a given signature for the current device’s compute capability. See compile_ptx().

Measurement

Profiling

The NVIDIA Visual Profiler can be used directly on executing CUDA Python code - it is not a requirement to insert calls to these functions into user code. However, these functions can be used to allow profiling to be performed selectively on specific portions of the code. For further information on profiling, see the NVIDIA Profiler User’s Guide.

numba.cuda.profile_start()

Enable profile collection in the current context.

numba.cuda.profile_stop()

Disable profile collection in the current context.

numba.cuda.profiling()

Context manager that enables profiling on entry and disables profiling on exit.

Events

Events can be used to monitor the progress of execution and to record the timestamps of specific points being reached. Event creation returns immediately, and the created event can be queried to determine if it has been reached. For further information, see the CUDA C Programming Guide Events section.

The following functions are used for creating and measuring the time between events:

numba.cuda.event(timing=True)

Create a CUDA event. Timing data is only recorded by the event if it is created with timing=True.

numba.cuda.event_elapsed_time(evtstart, evtend)

Compute the elapsed time between two events in milliseconds.

Events are instances of the numba.cuda.cudadrv.driver.Event class:

class numba.cuda.cudadrv.driver.Event(context, handle, finalizer=None)
query()

Returns True if all work before the most recent record has completed; otherwise, returns False.

record(stream=0)

Set the record point of the event to the current point in the given stream.

The event will be considered to have occurred when all work that was queued in the stream at the time of the call to record() has been completed.

synchronize()

Synchronize the host thread for the completion of the event.

wait(stream=0)

All future work submitted to stream will wait until the event completes.

Stream Management

Streams allow concurrency of execution on a single device within a given context. Queued work items in the same stream execute sequentially, but work items in different streams may execute concurrently. Most operations involving a CUDA device can be performed asynchronously using streams, including data transfers and kernel execution. For further details on streams, see the CUDA C Programming Guide Streams section.

Numba defaults to using the legacy default stream as the default stream. The per-thread default stream can be made the default stream by setting the environment variable NUMBA_CUDA_PER_THREAD_DEFAULT_STREAM to 1 (see the CUDA Environment Variables section). Regardless of this setting, the objects representing the legacy and per-thread default streams can be constructed using the functions below.

Streams are instances of numba.cuda.cudadrv.driver.Stream:

class numba.cuda.cudadrv.driver.Stream(context, handle, finalizer, external=False)
add_callback(callback, arg=None)

Add a callback to a compute stream. The user-provided function is called from a driver thread once all preceding stream operations are complete.

Callback functions are called from a CUDA driver thread, not from the thread that invoked add_callback. No CUDA API functions may be called from within the callback function.

The duration of a callback function should be kept short, as the callback will block later work in the stream and may block other callbacks from being executed.

Note: The driver function underlying this method is marked for eventual deprecation and may be replaced in a future CUDA release.

Parameters:
  • callback – Callback function with arguments (stream, status, arg).

  • arg – Optional user data to be passed to the callback function.

async_done()

Return an awaitable that resolves once all preceding stream operations are complete. The result of the awaitable is the current stream.

auto_synchronize()

A context manager that waits for all commands in this stream to execute and commits any pending memory transfers upon exiting the context.

synchronize()

Wait for all commands in this stream to execute. This will commit any pending memory transfers.

To create a new stream:

numba.cuda.stream()

Create a CUDA stream that represents a command queue for the device.

To get the default stream:

numba.cuda.default_stream()

Get the default CUDA stream. CUDA semantics in general are that the default stream is either the legacy default stream or the per-thread default stream depending on which CUDA APIs are in use. In Numba, the APIs for the legacy default stream are always the ones in use, but an option to use APIs for the per-thread default stream may be provided in future.

To get the default stream with an explicit choice of whether it is the legacy or per-thread default stream:

numba.cuda.legacy_default_stream()

Get the legacy default CUDA stream.

numba.cuda.per_thread_default_stream()

Get the per-thread default CUDA stream.

To construct a Numba Stream object using a stream allocated elsewhere, the external_stream function is provided. Note that the lifetime of external streams must be managed by the user - Numba will not deallocate an external stream, and the stream must remain valid whilst the Numba Stream object is in use.

numba.cuda.external_stream(ptr)

Create a Numba stream object for a stream allocated outside Numba.

Parameters:

ptr (int) – Pointer to the external stream to wrap in a Numba Stream

Runtime

Numba generally uses the Driver API, but it provides a simple wrapper to the Runtime API so that the version of the runtime in use can be queried. This is accessed through cuda.runtime, which is an instance of the numba.cuda.cudadrv.runtime.Runtime class:

class numba.cuda.cudadrv.runtime.Runtime

Runtime object that lazily binds runtime API functions.

get_version()

Returns the CUDA Runtime version as a tuple (major, minor).

is_supported_version()

Returns True if the CUDA Runtime is a supported version.

property supported_versions

A tuple of all supported CUDA toolkit versions. Versions are given in the form (major_version, minor_version).

Whether the current runtime is officially supported and tested with the current version of Numba can also be queried:

numba.cuda.is_supported_version()

Returns True if the CUDA Runtime is a supported version.

Unsupported versions (e.g. newer versions than those known to Numba) may still work; this function provides a facility to check whether the current Numba version is tested and known to work with the current runtime version. If the current version is unsupported, the caller can decide how to act. Options include:

  • Continuing silently,

  • Emitting a warning,

  • Generating an error or otherwise preventing the use of CUDA.