Random Number Generation (Deprecated)#
Warning
The cuda.random module is deprecated in Numba CUDA MLIR and is
only provided for backward compatibility with code written for Numba CUDA. It
may be removed in future.
Users are encouraged to use the nvmath-python cuRAND device APIs for random number generation.
Numba CUDA MLIR provides a random number generation algorithm that can be
executed on the GPU. Numba CUDA MLIR’s GPU random number generator is not based
on cuRAND. Instead, it implements the xoroshiro128+ algorithm. The xoroshiro128+ algorithm has a period of
2**128 - 1, which is shorter than the period of the XORWOW algorithm used by
default in cuRAND, but xoroshiro128+ still passes the BigCrush tests of random
number generator quality.
When using any RNG on the GPU, it is important to make sure that each thread
has its own RNG state, and they have been initialized to produce non-overlapping
sequences. The numba_cuda_mlir.cuda.random module provides a host function
to do this, as well as CUDA device functions to obtain uniformly or normally
distributed random numbers.
Note
Numba CUDA MLIR (like cuRAND) uses the Box-Muller transform to generate normally distributed random numbers from a uniform generator. However, Box-Muller generates pairs of random numbers, and the current implementation only returns one of them. As a result, generating normally distributed values is half the speed of uniformly distributed values.
A simple example#
Here is a sample program that uses the random number generator:
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
@cuda.jit
def compute_pi(rng_states, iterations, out):
"""Find the maximum value in values and store in result[0]"""
thread_id = cuda.grid(1)
# Compute pi by drawing random (x, y) points and finding what
# fraction lie inside a unit circle
inside = 0
for i in range(iterations):
x = xoroshiro128p_uniform_float32(rng_states, thread_id)
y = xoroshiro128p_uniform_float32(rng_states, thread_id)
if x**2 + y**2 <= 1.0:
inside += 1
out[thread_id] = 4.0 * inside / iterations
threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())
An example of managing RNG state size and using a 3D grid#
The number of RNG states scales with the number of threads using the RNG, so it is often better to use strided loops in conjunction with the RNG in order to keep the state size manageable.
In the following example, which initializes a large 3D array with random
numbers, using one thread per output element would result in 453,617,100 RNG
states. This would take a long time to initialize and poorly utilize the GPU.
Instead, it uses a fixed size 3D grid with a total of 2,097,152 ((16 ** 3) *
(8 ** 3)) threads striding over the output array. The 3D thread indices
startx, starty, and startz are linearized into a 1D index,
tid, to index into the 2,097,152 RNG states.
1from numba import cuda
2from numba.cuda.random import (
3 create_xoroshiro128p_states,
4 xoroshiro128p_uniform_float32,
5)
6import numpy as np
7
8@cuda.jit
9def random_3d(arr, rng_states):
10 # Per-dimension thread indices and strides
11 startx, starty, startz = cuda.grid(3)
12 stridex, stridey, stridez = cuda.gridsize(3)
13
14 # Linearized thread index
15 tid = (startz * stridey * stridex) + (starty * stridex) + startx
16
17 # Use strided loops over the array to assign a random value to each entry
18 for i in range(startz, arr.shape[0], stridez):
19 for j in range(starty, arr.shape[1], stridey):
20 for k in range(startx, arr.shape[2], stridex):
21 arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)
22
23# Array dimensions
24X, Y, Z = 701, 900, 719
25
26# Block and grid dimensions
27bx, by, bz = 8, 8, 8
28gx, gy, gz = 16, 16, 16
29
30# Total number of threads
31nthreads = bx * by * bz * gx * gy * gz
32
33# Initialize a state for each thread
34rng_states = create_xoroshiro128p_states(nthreads, seed=1)
35
36# Generate random numbers
37arr = cuda.device_array((X, Y, Z), dtype=np.float32)
38random_3d[(gx, gy, gz), (bx, by, bz)](arr, rng_states)