==================
Cooperative Groups
==================

Supported features
------------------

Numba's Cooperative Groups support presently provides grid groups and grid
synchronization, along with cooperative kernel launches.

Cooperative groups are supported on Linux, and on Windows for devices in TCC
mode.

Using Grid Groups
-----------------

To get the current grid group, use the :func:`cg.this_grid()
<numba.cuda.cg.this_grid>` function:

.. code-block:: python

   g = cuda.cg.this_grid()

Synchronizing the grid is done with the :meth:`sync()
<numba.cuda.cg.GridGroup.sync>` method of the grid group:

.. code-block:: python

   g.sync()

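
Putting these two calls together, the sketch below shows one way a grid sync
can order two phases of work that cross block boundaries. This is an
illustrative sketch only, not taken from Numba's test suite; the kernel and
array names (``double_then_shift``, ``x``, ``y``) are invented for the
example:

.. code-block:: python

   from numba import cuda
   import numpy as np

   @cuda.jit
   def double_then_shift(x, y):
       i = cuda.grid(1)
       g = cuda.cg.this_grid()

       # Phase 1: every thread writes one element of y.
       if i < x.size:
           y[i] = x[i] * 2

       # Wait for the whole grid, so that reads in phase 2 see writes
       # made by threads in other blocks.
       g.sync()

       # Phase 2: read an element that may have been written by a
       # thread in a different block.
       if 0 < i < x.size:
           x[i] = y[i - 1]

   x = np.arange(16, dtype=np.int32)
   y = np.zeros_like(x)

   # Normal launch syntax - Numba detects the grid sync in the kernel
   # and performs a cooperative launch (requires a supported device).
   double_then_shift[2, 8](x, y)

Without the ``g.sync()`` call, the first thread of the second block could read
``y[7]`` before the last thread of the first block has written it.
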
Cooperative Launches
--------------------

Unlike the CUDA C/C++ API, a cooperative launch is invoked using the same
syntax as a normal kernel launch - Numba automatically determines whether a
cooperative launch is required based on whether a grid group is synchronized
in the kernel.

The grid size limit for a cooperative launch is more restrictive than for a
normal launch - the grid must be no larger than the maximum number of active
blocks on the device on which it is launched. To get the maximum grid size for
a cooperative launch of a kernel with a given block size and dynamic shared
memory requirement, use the ``max_cooperative_grid_blocks()`` method of kernel
overloads:

.. automethod:: numba.cuda.dispatcher._Kernel.max_cooperative_grid_blocks

This can be used to ensure that the kernel is launched with no more than the
maximum number of blocks. Exceeding the maximum number of blocks for the
cooperative launch will result in a ``CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE``
error.

Applications and Example
------------------------

Grid group synchronization can be used to implement a global barrier across
all threads in the grid - applications of this include a global reduction to a
single value, or looping over rows of a large matrix sequentially using the
entire grid to operate on column elements in parallel.

In the following example, rows are written sequentially by the grid. Each
thread in the grid reads a value from the previous row written by its
*opposite* thread. A grid sync is needed to ensure that threads in the grid
don't run ahead of threads in other blocks, or fail to see updates from their
opposite thread.

First we'll define our kernel:

.. literalinclude:: ../../../numba_cuda/numba/cuda/tests/doc_examples/test_cg.py
   :language: python
   :caption: from ``test_grid_sync`` of ``numba/cuda/tests/doc_examples/test_cg.py``
   :start-after: magictoken.ex_grid_sync_kernel.begin
   :end-before: magictoken.ex_grid_sync_kernel.end
   :dedent: 8
   :linenos:

Then we create some empty input data and determine the grid and block sizes:

.. literalinclude:: ../../../numba_cuda/numba/cuda/tests/doc_examples/test_cg.py
   :language: python
   :caption: from ``test_grid_sync`` of ``numba/cuda/tests/doc_examples/test_cg.py``
   :start-after: magictoken.ex_grid_sync_data.begin
   :end-before: magictoken.ex_grid_sync_data.end
   :dedent: 8
   :linenos:

Finally we launch the kernel and print the result:

.. literalinclude:: ../../../numba_cuda/numba/cuda/tests/doc_examples/test_cg.py
   :language: python
   :caption: from ``test_grid_sync`` of ``numba/cuda/tests/doc_examples/test_cg.py``
   :start-after: magictoken.ex_grid_sync_launch.begin
   :end-before: magictoken.ex_grid_sync_launch.end
   :dedent: 8
   :linenos:

The maximum grid size for ``sequential_rows`` can be enquired using:

.. code-block:: python

   overload = sequential_rows.overloads[(int32[:,::1],)]
   max_blocks = overload.max_cooperative_grid_blocks(blockdim)
   print(max_blocks)
   # 1152 (e.g. on Quadro RTX 8000 with Numba 0.52.1 and CUDA 11.0)

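
As a hypothetical follow-up, the returned value can be used to guard the
launch configuration before launching. The names ``griddim`` and ``A`` below
stand in for the grid size and input array used in the example and are
assumptions of this sketch; ``max_blocks``, ``blockdim`` and
``sequential_rows`` come from the snippets above:

.. code-block:: python

   # A cooperative launch with more blocks than max_blocks would fail
   # with CUDA_ERROR_COOPERATIVE_LAUNCH_TOO_LARGE, so check first.
   if griddim > max_blocks:
       raise ValueError(
           f"Grid of {griddim} blocks exceeds the cooperative launch "
           f"limit of {max_blocks} blocks for this kernel"
       )

   sequential_rows[griddim, blockdim](A)
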