Random Number Generation
Numba provides a random number generation algorithm that can be executed on
the GPU. Due to technical issues with how NVIDIA implemented cuRAND, however,
Numba’s GPU random number generator is not based on cuRAND. Instead, Numba’s
GPU RNG is an implementation of the xoroshiro128+ algorithm. The xoroshiro128+ algorithm has a period of
2**128 - 1
, which is shorter than the period of the XORWOW algorithm
used by default in cuRAND, but xoroshiro128+ still passes the BigCrush tests
of random number generator quality.
When using any RNG on the GPU, it is important to make sure that each thread has its own RNG state, and they have been initialized to produce non-overlapping sequences. The numba.cuda.random module provides a host function to do this, as well as CUDA device functions to obtain uniformly or normally distributed random numbers.
Note
Numba (like cuRAND) uses the Box-Muller transform <https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform> to generate normally distributed random numbers from a uniform generator. However, Box-Muller generates pairs of random numbers, and the current implementation only returns one of them. As a result, generating normally distributed values is half the speed of uniformly distributed values.
- numba.cuda.random.create_xoroshiro128p_states(n, seed, subsequence_start=0, stream=0)
Returns a new device array initialized for n random number generators.
This initializes the RNG states so that each state in the array corresponds subsequences in the separated by 2**64 steps from each other in the main sequence. Therefore, as long no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.
The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.
- Parameters:
n (int) – number of RNG states to create
seed (uint64) – starting seed for list of generators
subsequence_start (uint64)
stream (CUDA stream) – stream to run initialization kernel on
- numba.cuda.random.init_xoroshiro128p_states(states, seed, subsequence_start=0, stream=0)
Initialize RNG states on the GPU for parallel generators.
This initializes the RNG states so that each state in the array corresponds subsequences in the separated by 2**64 steps from each other in the main sequence. Therefore, as long no CUDA thread requests more than 2**64 random numbers, all of the RNG states produced by this function are guaranteed to be independent.
The subsequence_start parameter can be used to advance the first RNG state by a multiple of 2**64 steps.
- Parameters:
states (1D DeviceNDArray, dtype=xoroshiro128p_dtype) – array of RNG states
seed (uint64) – starting seed for list of generators
- numba.cuda.random.xoroshiro128p_normal_float32(states, index)
Return a normally distributed float32 and advance
states[index]
.The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.
- Parameters:
states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
index (int64) – offset in states to update
- Return type:
float32
- numba.cuda.random.xoroshiro128p_normal_float64(states, index)
Return a normally distributed float32 and advance
states[index]
.The return value is drawn from a Gaussian of mean=0 and sigma=1 using the Box-Muller transform. This advances the RNG sequence by two steps.
- Parameters:
states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
index (int64) – offset in states to update
- Return type:
float64
- numba.cuda.random.xoroshiro128p_uniform_float32(states, index)
Return a float32 in range [0.0, 1.0) and advance
states[index]
.- Parameters:
states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
index (int64) – offset in states to update
- Return type:
float32
- numba.cuda.random.xoroshiro128p_uniform_float64(states, index)
Return a float64 in range [0.0, 1.0) and advance
states[index]
.- Parameters:
states (1D array, dtype=xoroshiro128p_dtype) – array of RNG states
index (int64) – offset in states to update
- Return type:
float64
A simple example
Here is a sample program that uses the random number generator:
from __future__ import print_function, absolute_import
from numba import cuda
from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32
import numpy as np
@cuda.jit
def compute_pi(rng_states, iterations, out):
"""Find the maximum value in values and store in result[0]"""
thread_id = cuda.grid(1)
# Compute pi by drawing random (x, y) points and finding what
# fraction lie inside a unit circle
inside = 0
for i in range(iterations):
x = xoroshiro128p_uniform_float32(rng_states, thread_id)
y = xoroshiro128p_uniform_float32(rng_states, thread_id)
if x**2 + y**2 <= 1.0:
inside += 1
out[thread_id] = 4.0 * inside / iterations
threads_per_block = 64
blocks = 24
rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)
out = np.zeros(threads_per_block * blocks, dtype=np.float32)
compute_pi[blocks, threads_per_block](rng_states, 10000, out)
print('pi:', out.mean())
An example of managing RNG state size and using a 3D grid
The number of RNG states scales with the number of threads using the RNG, so it is often better to use strided loops in conjunction with the RNG in order to keep the state size manageable.
In the following example, which initializes a large 3D array with random
numbers, using one thread per output element would result in 453,617,100 RNG
states. This would take a long time to initialize and poorly utilize the GPU.
Instead, it uses a fixed size 3D grid with a total of 2,097,152 ((16 ** 3) *
(8 ** 3)
) threads striding over the output array. The 3D thread indices
startx
, starty
, and startz
are linearized into a 1D index,
tid
, to index into the 2,097,152 RNG states.
1from numba import cuda
2from numba.cuda.random import (create_xoroshiro128p_states,
3 xoroshiro128p_uniform_float32)
4import numpy as np
5
6@cuda.jit
7def random_3d(arr, rng_states):
8 # Per-dimension thread indices and strides
9 startx, starty, startz = cuda.grid(3)
10 stridex, stridey, stridez = cuda.gridsize(3)
11
12 # Linearized thread index
13 tid = (startz * stridey * stridex) + (starty * stridex) + startx
14
15 # Use strided loops over the array to assign a random value to each entry
16 for i in range(startz, arr.shape[0], stridez):
17 for j in range(starty, arr.shape[1], stridey):
18 for k in range(startx, arr.shape[2], stridex):
19 arr[i, j, k] = xoroshiro128p_uniform_float32(rng_states, tid)
20
21# Array dimensions
22X, Y, Z = 701, 900, 719
23
24# Block and grid dimensions
25bx, by, bz = 8, 8, 8
26gx, gy, gz = 16, 16, 16
27
28# Total number of threads
29nthreads = bx * by * bz * gx * gy * gz
30
31# Initialize a state for each thread
32rng_states = create_xoroshiro128p_states(nthreads, seed=1)
33
34# Generate random numbers
35arr = cuda.device_array((X, Y, Z), dtype=np.float32)
36random_3d[(gx, gy, gz), (bx, by, bz)](arr, rng_states)