Memory management#
Even though Numba CUDA MLIR can automatically transfer NumPy arrays to the device, it can only do so conservatively by always transferring device memory back to the host when a kernel finishes. To avoid the unnecessary transfer for read-only arrays, it is recommended that applications pass PyTorch Tensors or CuPy arrays to kernels instead.
Numba CUDA MLIR provides APIs for transferring data to the device and creating and managing arrays on the device, but these are intended for compatibility with existing code that was written using Numba and Numba-CUDA; they are not recommended for use in new code.
Backward Compatibility APIs#
Data Transfer#
The following APIs manually control the transfer of host arrays to the device:
In addition to the device arrays, Numba CUDA MLIR can consume any object that implements the CUDA Array Interface. These objects also can be manually converted into a Numba device array by creating a view of the GPU buffer using the following APIs:
Device Arrays#
Numba CUDA MLIR Device Arrays have the following methods. These methods are to be called in host code, not within kernels.
Note
DeviceNDArray implements the CUDA Array Interface.
Pinned Memory#
Mapped Memory#
Managed Memory#
Streams#
Streams can be passed to functions that accept them (e.g. copies between the host and device) and into kernel launch configurations so that the operations are executed asynchronously. Use of cuda.core streams is recommended; stream constructors provided by Numba CUDA MLIR are for backward compatibility with code written for Numba and Numba-CUDA.
Backward Compatibility Constructors#
CUDA streams have the following methods:
Local memory#
Local memory is an area of memory private to each thread. Using local memory helps allocate some scratchpad area when scalar local variables are not enough. The memory is allocated once for the duration of the kernel, unlike traditional dynamic memory management.
- numba_cuda_mlir.cuda.local.array(shape, type)
Allocate a local array of the given shape and type on the device. shape is either an integer or a tuple of integers representing the array’s dimensions and must be a simple constant expression. A “simple constant expression” includes, but is not limited to:
A literal (e.g. 10)
A local variable whose right-hand side is a literal or a simple constant expression (e.g. shape, where shape is defined earlier in the function as shape = 10)
A global variable that is defined in the jitted function’s globals by the time of compilation (e.g. shape, where shape is defined using any expression at global scope).
The definition must result in a Python int (i.e. not a NumPy scalar or other scalar / integer-like type). type is a Numba type of the elements needing to be stored in the array. The array is private to the current thread. An array-like object is returned which can be read and written to like any standard array (e.g. through indexing).
See also
The Local Memory Section in the CUDA Programming Guide.
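The rule that a shape must resolve to a plain Python int can be checked on the host before a kernel is compiled. The following is a minimal sketch that does not call the library: `is_plain_int_shape` is a hypothetical helper, not part of Numba CUDA MLIR, and `IntLike` is a stand-in for any integer-like type such as a NumPy scalar. It only mirrors the rule as stated above and may be stricter or looser than the compiler's actual check.

```python
import operator


def is_plain_int_shape(shape):
    """Return True if shape is a Python int or a non-empty tuple of Python ints.

    Hypothetical helper: it mirrors the documented rule, not the compiler's
    actual check.
    """
    dims = shape if isinstance(shape, tuple) else (shape,)
    # type(d) is int rejects bool, float and integer-like objects.
    return bool(dims) and all(type(d) is int for d in dims)


class IntLike:
    """Stand-in for an integer-like scalar (e.g. a NumPy scalar)."""

    def __init__(self, value):
        self.value = value

    def __index__(self):
        return self.value


# Global-scope definitions that a kernel could use as a local array shape.
TILE = 4 * 8                           # any expression that yields a Python int
BAD_TILE = IntLike(32)                 # integer-like, but not a Python int
FIXED_TILE = operator.index(BAD_TILE)  # normalised to a Python int
```

Converting with `operator.index` (or `int`) at global scope keeps the conversion out of the kernel body, so the shape is already a Python int when compilation starts.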
Constant memory#
Constant memory is a read-only, cached, off-chip area of memory that is accessible by all threads and allocated by the host. One way to create an array in constant memory is:
- numba_cuda_mlir.cuda.const.array_like(arr)
Allocate and make accessible an array in constant memory based on array-like arr.
Deallocation Behavior#
This section describes the deallocation behaviour of Numba CUDA MLIR’s internal memory management. If an External Memory Management Plugin is in use (see External Memory Management (EMM) Plugin interface), then deallocation behaviour may differ; you may refer to the documentation for the EMM Plugin to understand its deallocation behaviour.
Deallocation of all CUDA resources is tracked on a per-context basis. When the last reference to a device memory allocation is dropped, the underlying memory is scheduled for deallocation. The deallocation does not occur immediately; it is added to a queue of pending deallocations. This design has two benefits:
Resource deallocation APIs may cause the device to synchronize, breaking any asynchronous execution. Deferring the deallocation avoids this latency in performance-critical code sections.
Some deallocation errors may cause all remaining deallocations to fail. Continued deallocation errors can cause critical errors at the CUDA driver level. In some cases, this could mean a segmentation fault in the CUDA driver. In the worst case, this could cause the system GUI to freeze, recoverable only by a system reset. When an error occurs during a deallocation, the remaining pending deallocations are cancelled. Any deallocation error is reported. When the process terminates, the CUDA driver releases all resources allocated by the terminated process.
The deallocation queue is flushed automatically as soon as any of the following events occurs:
An allocation fails due to an out-of-memory error. The allocation is retried after flushing all deallocations.
The deallocation queue reaches its maximum size, which defaults to 10. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=20 increases the limit to 20.
The maximum accumulated byte size of resources pending deallocation is reached. This defaults to 20% of the device memory capacity. Users can override this by setting the environment variable NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO. For example, NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO=0.5 sets the limit to 50% of the capacity.
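The three flush triggers can be illustrated with a small pure-Python model. This is a sketch of the behaviour described above, not the library's implementation: the class `PendingDeallocs`, its methods, and the use of `MemoryError` to represent an out-of-memory condition are all assumptions made for illustration. Only the two environment variable names and their defaults come from this section.

```python
import os
from collections import deque


class PendingDeallocs:
    """Toy model of a pending-deallocation queue (not the library's code)."""

    def __init__(self, device_capacity_bytes, environ=os.environ):
        # Documented defaults: at most 10 entries, at most 20% of capacity.
        self.max_count = int(
            environ.get("NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT", "10"))
        ratio = float(
            environ.get("NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO", "0.2"))
        self.max_bytes = int(device_capacity_bytes * ratio)
        self._queue = deque()
        self._pending_bytes = 0
        self.flushes = 0

    def add(self, free_fn, nbytes):
        """Queue a deallocation; flush when either limit is reached."""
        self._queue.append(free_fn)
        self._pending_bytes += nbytes
        if (len(self._queue) >= self.max_count
                or self._pending_bytes >= self.max_bytes):
            self.flush()

    def flush(self):
        """Run every queued deallocation in the order it was added."""
        while self._queue:
            self._queue.popleft()()
        self._pending_bytes = 0
        self.flushes += 1

    def allocate(self, alloc_fn):
        """Try an allocation; on out-of-memory, flush and retry once."""
        try:
            return alloc_fn()
        except MemoryError:
            self.flush()
            return alloc_fn()
```

Passing `environ` explicitly keeps the model independent of the process environment. In a real application the variables would typically be set in the shell, or in `os.environ`, before the library is imported.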
Sometimes it is desirable to defer resource deallocation until a code section ends, most often to avoid any implicit synchronization caused by deallocation. This can be done by using the following context manager:
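As a library-free illustration of what deferral means, the sketch below models a queue whose automatic flush is suppressed inside a `with` block. `DeferrableQueue` and its `defer()` method are hypothetical names. In Numba-CUDA the corresponding context manager is `cuda.defer_cleanup()`; it is an assumption that Numba CUDA MLIR behaves the same way, and the flush-on-exit policy shown here is a simplification.

```python
from contextlib import contextmanager


class DeferrableQueue:
    """Toy queue whose automatic flush can be suspended (illustrative only)."""

    def __init__(self, max_count=10):
        self.max_count = max_count
        self.pending = []
        self.freed = []
        self._defer_depth = 0  # greater than zero while inside defer()

    def add(self, name):
        """Queue a deallocation; flush only when not deferred."""
        self.pending.append(name)
        self._maybe_flush()

    def _maybe_flush(self):
        if self._defer_depth == 0 and len(self.pending) >= self.max_count:
            self.freed.extend(self.pending)
            self.pending.clear()

    @contextmanager
    def defer(self):
        """Suspend automatic flushing until the outermost block exits."""
        self._defer_depth += 1
        try:
            yield
        finally:
            self._defer_depth -= 1
            self._maybe_flush()


queue = DeferrableQueue(max_count=2)
with queue.defer():
    for name in ("a", "b", "c"):
        queue.add(name)          # limit exceeded, but nothing is freed yet
    freed_inside = list(queue.freed)
freed_after = list(queue.freed)  # the flush happened when the block exited
```

The depth counter lets deferral blocks nest: only the exit of the outermost block re-enables flushing.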