Concurrency#

Asynchronous Operations#

Kernel Launches#

Kernels launched on a CUDA device are asynchronous with respect to the host (CPU Python thread). Launching a kernel schedules its execution on the CUDA device, but the wp.launch() function can return before the kernel execution completes. This allows us to run some CPU computations while the CUDA kernel is executing, which is an easy way to introduce parallelism into our programs.

wp.launch(kernel1, dim=n, inputs=[a], device="cuda:0")

# do some CPU work while the CUDA kernel is running
do_cpu_work()

Kernels launched on different CUDA devices can execute concurrently. This can be used to tackle independent sub-tasks in parallel on different GPUs while using the CPU to do other useful work:

# launch concurrent kernels on different devices
wp.launch(kernel1, dim=n, inputs=[a0], device="cuda:0")
wp.launch(kernel2, dim=n, inputs=[a1], device="cuda:1")

# do CPU work while kernels are running on both GPUs
do_cpu_work()

Launching kernels on the CPU is currently a synchronous operation. In other words, wp.launch() will return only after the kernel has finished executing on the CPU. To run a CUDA kernel and a CPU kernel concurrently, the CUDA kernel should be launched first:

# schedule a kernel on a CUDA device
wp.launch(kernel1, ..., device="cuda:0")

# run a kernel on the CPU while the CUDA kernel is running
wp.launch(kernel2, ..., device="cpu")

Graph Launches#

The concurrency rules for CUDA graph launches are similar to those for CUDA kernel launches, except that CUDA graphs are not available on the CPU.

# capture work on cuda:0 in a graph
with wp.ScopedCapture(device="cuda:0") as capture0:
    do_gpu0_work()

# capture work on cuda:1 in a graph
with wp.ScopedCapture(device="cuda:1") as capture1:
    do_gpu1_work()

# launch captured graphs on the respective devices concurrently
wp.capture_launch(capture0.graph)
wp.capture_launch(capture1.graph)

# do some CPU work while the CUDA graphs are running
do_cpu_work()

Array Creation#

Creating CUDA arrays is also asynchronous with respect to the host. It involves allocating memory on the device and initializing it, which is done under the hood using a kernel launch or an asynchronous CUDA memset operation.

a0 = wp.zeros(n, dtype=float, device="cuda:0")
b0 = wp.ones(n, dtype=float, device="cuda:0")

a1 = wp.empty(n, dtype=float, device="cuda:1")
b1 = wp.full(n, 42.0, dtype=float, device="cuda:1")

In this snippet, arrays a0 and b0 are created on device cuda:0 and arrays a1 and b1 are created on device cuda:1. The operations on the same device are sequential, but each device executes them independently of the other device, so they can run concurrently.

Array Copying#

Copying arrays between devices can also be asynchronous, but there are some details to be aware of.

Copying from host memory to a CUDA device and copying from a CUDA device to host memory is asynchronous only if the host array is pinned. Pinned memory allows the CUDA driver to use direct memory access (DMA) transfers, which are generally faster and do not involve the CPU. There are a couple of drawbacks to using pinned memory: allocation and deallocation are usually slower, and there are system-specific limits on how much pinned memory can be allocated. For these reasons, Warp CPU arrays are not pinned by default. You can request a pinned allocation by passing the pinned=True flag when creating a CPU array. This is a good option for arrays that are used to copy data between host and device, especially if asynchronous transfers are desired.

h = wp.zeros(n, dtype=float, device="cpu")
p = wp.zeros(n, dtype=float, device="cpu", pinned=True)
d = wp.zeros(n, dtype=float, device="cuda:0")

# host-to-device copy
wp.copy(d, h)  # synchronous
wp.copy(d, p)  # asynchronous

# device-to-host copy
wp.copy(h, d)  # synchronous
wp.copy(p, d)  # asynchronous

# wait for asynchronous operations to complete
wp.synchronize_device("cuda:0")

Copying between CUDA arrays on the same device is always asynchronous with respect to the host, since it does not involve the CPU:

a = wp.zeros(n, dtype=float, device="cuda:0")
b = wp.empty(n, dtype=float, device="cuda:0")

# asynchronous device-to-device copy
wp.copy(a, b)

# wait for transfer to complete
wp.synchronize_device("cuda:0")

Copying between CUDA arrays on different devices is also asynchronous with respect to the host. Peer-to-peer transfers require extra care, because CUDA devices are also asynchronous with respect to each other. When copying an array from one GPU to another, the copy is performed on the current stream of the destination device by default, so we need to ensure that prior work on the source GPU completes before the transfer.

a0 = wp.zeros(n, dtype=float, device="cuda:0")
a1 = wp.empty(n, dtype=float, device="cuda:1")

# wait for outstanding work on the source device to complete to ensure the source array is ready
wp.synchronize_device("cuda:0")

# asynchronous peer-to-peer copy
wp.copy(a1, a0)

# wait for the copy to complete on the destination device
wp.synchronize_device("cuda:1")

Note that peer-to-peer transfers can be accelerated using memory pool access or peer access, which enables DMA transfers between CUDA devices on supported systems.
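
On systems that support it, this can be enabled through Warp's allocator settings. The sketch below assumes the wp.is_mempool_access_supported() and wp.set_mempool_access_enabled() helpers described in the memory allocator documentation:

# allow cuda:1 to directly access cuda:0's memory pool, if supported
# (helper names assumed from the memory allocator documentation)
if wp.is_mempool_access_supported("cuda:0", "cuda:1"):
    wp.set_mempool_access_enabled("cuda:0", "cuda:1", True)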

Streams#

A CUDA stream is a sequence of operations that execute in order on the GPU. Operations from different streams may run concurrently and may be interleaved by the device scheduler.

Warp automatically creates a stream for each CUDA device during initialization. This becomes the current stream for the device. All kernel launches and memory operations issued on that device are placed on the current stream.

Creating Streams#

A stream is tied to a particular CUDA device. New streams can be created using the wp.Stream constructor:

s1 = wp.Stream("cuda:0")  # create a stream on a specific CUDA device
s2 = wp.Stream()          # create a stream on the default device

If the device parameter is omitted, the default device will be used, which can be managed using wp.ScopedDevice.
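
For example, a stream created inside a wp.ScopedDevice block is tied to that device:

with wp.ScopedDevice("cuda:0"):
    s = wp.Stream()  # created on "cuda:0"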

For interoperation with external code, it is possible to pass a CUDA stream handle to wrap an external stream:

s3 = wp.Stream("cuda:0", cuda_stream=stream_handle)

The cuda_stream argument must be a native stream handle (cudaStream_t or CUstream) passed as a Python integer. This mechanism is used internally for sharing streams with external frameworks like PyTorch or DLPack. The caller is responsible for ensuring that the external stream does not get destroyed while it is referenced by a wp.Stream object.
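
As an illustrative sketch (assuming PyTorch with CUDA support is available), a PyTorch stream can be shared with Warp through its raw handle:

import torch

torch_stream = torch.cuda.Stream(device="cuda:0")

# torch.cuda.Stream.cuda_stream exposes the native stream handle as an integer
s3 = wp.Stream("cuda:0", cuda_stream=torch_stream.cuda_stream)

# torch_stream must stay alive for as long as s3 is in use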

Using Streams#

Use wp.ScopedStream to temporarily change the current stream on a device and schedule a sequence of operations on that stream:

stream = wp.Stream("cuda:0")

with wp.ScopedStream(stream):
    a = wp.zeros(n, dtype=float)
    b = wp.empty(n, dtype=float)
    wp.launch(kernel, dim=n, inputs=[a])
    wp.copy(b, a)

Since streams are tied to a particular device, wp.ScopedStream subsumes the functionality of wp.ScopedDevice. That’s why we don’t need to explicitly specify the device argument to each of the calls.

An important benefit of streams is that they can be used to overlap compute and data transfer operations on the same device, which can improve the overall throughput of a program by doing those operations in parallel.

with wp.ScopedDevice("cuda:0"):
    a = wp.zeros(n, dtype=float)
    b = wp.empty(n, dtype=float)
    c = wp.ones(n, dtype=float, device="cpu", pinned=True)

    compute_stream = wp.Stream()
    transfer_stream = wp.Stream()

    # asynchronous kernel launch on a stream
    with wp.ScopedStream(compute_stream):
        wp.launch(kernel, dim=a.size, inputs=[a])

    # asynchronous host-to-device copy on another stream
    with wp.ScopedStream(transfer_stream):
        wp.copy(b, c)

The wp.get_stream() function can be used to get the current stream on a device:

s1 = wp.get_stream("cuda:0")  # get the current stream on a specific device
s2 = wp.get_stream()          # get the current stream on the default device

The wp.set_stream() function can be used to set the current stream on a device:

wp.set_stream(stream, device="cuda:0")  # set the stream on a specific device
wp.set_stream(stream)                   # set the stream on the default device

In general, we recommend using wp.ScopedStream rather than wp.set_stream().

Synchronization#

The wp.synchronize_stream() function can be used to block the host thread until the given stream completes:

wp.synchronize_stream(stream)

In a program that uses multiple streams, this gives a more fine-grained level of control over synchronization behavior than wp.synchronize_device(), which synchronizes all streams on the device. For example, if a program has multiple compute and transfer streams, the host might only want to wait for one transfer stream to complete, without waiting for the other streams. By synchronizing only one stream, we allow the others to continue running concurrently with the host thread.
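
As a minimal sketch of this pattern (using a placeholder kernel), the host waits only for a readback on a transfer stream while unrelated work continues on a compute stream:

with wp.ScopedDevice("cuda:0"):
    a = wp.full(n, 1.0, dtype=float)
    b = wp.zeros(n, dtype=float)
    a_host = wp.empty(n, dtype=float, device="cpu", pinned=True)

    compute_stream = wp.Stream()
    transfer_stream = wp.Stream()

    # unrelated compute work
    with wp.ScopedStream(compute_stream):
        wp.launch(kernel, dim=n, inputs=[b])

    # asynchronous readback of a
    with wp.ScopedStream(transfer_stream):
        wp.copy(a_host, a)

    # wait only for the readback; the compute stream can keep running
    wp.synchronize_stream(transfer_stream)
    print(a_host.numpy())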

Events#

Functions like wp.synchronize_device() or wp.synchronize_stream() block the CPU thread until work completes on a CUDA device, but they’re not intended to synchronize multiple CUDA streams with each other.

CUDA events provide a mechanism for device-side synchronization between streams. This kind of synchronization does not block the host thread, but it allows one stream to wait for work on another stream to complete.

Like streams, events are tied to a particular device:

e1 = wp.Event("cuda:0")  # create an event on a specific CUDA device
e2 = wp.Event()          # create an event on the default device

To wait for a stream to complete some work, we first record the event on that stream. Then we make another stream wait on that event:

stream1 = wp.Stream("cuda:0")
stream2 = wp.Stream("cuda:0")
event = wp.Event("cuda:0")

stream1.record_event(event)
stream2.wait_event(event)

Note that when recording events, the event must be from the same device as the recording stream. When waiting for events, the waiting stream can be from another device. This allows using events to synchronize streams on different GPUs.
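
For example, this allows a stream on cuda:1 to wait for work submitted on cuda:0 before consuming its results (a sketch using a placeholder kernel):

stream0 = wp.get_stream("cuda:0")
stream1 = wp.get_stream("cuda:1")

a0 = wp.zeros(n, dtype=float, device="cuda:0")
a1 = wp.empty(n, dtype=float, device="cuda:1")

# produce data on cuda:0 (on its current stream)
wp.launch(kernel, dim=n, inputs=[a0], device="cuda:0")

# the event must be from the same device as the recording stream (cuda:0)
event = wp.Event("cuda:0")
stream0.record_event(event)

# the waiting stream can be from a different device (cuda:1)
stream1.wait_event(event)

# the copy runs on cuda:1's current stream, which now waits for the cuda:0 work
wp.copy(a1, a0)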

If the record_event() method is called without an event argument, a new event will be created, recorded, and returned:

event = stream1.record_event()
stream2.wait_event(event)

The wait_stream() method combines the acts of recording and waiting on an event in one call:

stream2.wait_stream(stream1)

Warp also provides global functions wp.record_event(), wp.wait_event(), and wp.wait_stream() which operate on the current stream of the default device:

wp.record_event(event)  # record an event on the current stream
wp.wait_event(event)    # make the current stream wait for an event
wp.wait_stream(stream)  # make the current stream wait for another stream

These variants are convenient to use inside of wp.ScopedStream and wp.ScopedDevice managers.

Here is a more complete example with a producer stream that copies data into an array and a consumer stream that uses the array in a kernel:

with wp.ScopedDevice("cuda:0"):
    a = wp.empty(n, dtype=float)
    b = wp.ones(n, dtype=float, device="cpu", pinned=True)

    producer_stream = wp.Stream()
    consumer_stream = wp.Stream()

    with wp.ScopedStream(producer_stream):
        # asynchronous host-to-device copy
        wp.copy(a, b)

        # record an event to create a synchronization point for the consumer stream
        event = wp.record_event()

        # do some unrelated work in the producer stream
        do_other_producer_work()

    with wp.ScopedStream(consumer_stream):
        # do some unrelated work in the consumer stream
        do_other_consumer_work()

        # wait for the producer copy to complete
        wp.wait_event(event)

        # consume the array in a kernel
        wp.launch(kernel, dim=a.size, inputs=[a])

The function wp.synchronize_event() can be used to block the host thread until a recorded event completes. This is useful when the host wants to wait for a specific synchronization point on a stream, while allowing subsequent stream operations to continue executing asynchronously.

with wp.ScopedDevice("cpu"):
    # CPU buffers for readback
    a_host = wp.empty(N, dtype=float, pinned=True)
    b_host = wp.empty(N, dtype=float, pinned=True)

with wp.ScopedDevice("cuda:0"):
    stream = wp.get_stream()

    # initialize first GPU array
    a = wp.full(N, 17, dtype=float)
    # asynchronous readback
    wp.copy(a_host, a)
    # record event
    a_event = stream.record_event()

    # initialize second GPU array
    b = wp.full(N, 42, dtype=float)
    # asynchronous readback
    wp.copy(b_host, b)
    # record event
    b_event = stream.record_event()

    # wait for first array readback to complete
    wp.synchronize_event(a_event)
    # process first array on the CPU
    assert np.array_equal(a_host.numpy(), np.full(N, fill_value=17.0))

    # wait for second array readback to complete
    wp.synchronize_event(b_event)
    # process second array on the CPU
    assert np.array_equal(b_host.numpy(), np.full(N, fill_value=42.0))

CUDA Default Stream#

Warp avoids using the synchronous CUDA default stream, which is a special stream that synchronizes with all other streams on the same device. This stream is currently only used during readback operations that are provided for convenience, such as array.numpy() and array.list().

stream1 = wp.Stream("cuda:0")
stream2 = wp.Stream("cuda:0")

with wp.ScopedStream(stream1):
    a = wp.zeros(n, dtype=float)

with wp.ScopedStream(stream2):
    b = wp.ones(n, dtype=float)

print(a)
print(b)

In the snippet above, two arrays are initialized on different CUDA streams. Printing those arrays triggers a readback, which is done using the array.numpy() method. This readback happens on the synchronous CUDA default stream, which means that no explicit synchronization is required. The reason for this is convenience: printing an array is useful for debugging, so it's nice not to have to worry about synchronization.

The drawback of this approach is that the CUDA default stream (and any methods that use it) cannot be used during graph capture. The regular wp.copy() function should be used to capture readback operations in a graph.
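
A minimal sketch of a captured readback (assuming a placeholder kernel) might look like this:

with wp.ScopedDevice("cuda:0"):
    a = wp.zeros(n, dtype=float)
    a_host = wp.empty(n, dtype=float, device="cpu", pinned=True)

    with wp.ScopedCapture() as capture:
        wp.launch(kernel, dim=n, inputs=[a])
        wp.copy(a_host, a)  # capturable, unlike a.numpy()

    wp.capture_launch(capture.graph)

    # the host must still wait before reading the pinned array
    wp.synchronize_device("cuda:0")
    print(a_host.numpy())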

Explicit Stream Arguments#

Several Warp functions accept optional stream arguments. This allows directly specifying the stream without using a wp.ScopedStream manager. There are benefits and drawbacks to both approaches, which will be discussed below. Functions that accept stream arguments directly include wp.launch(), wp.capture_launch(), and wp.copy().

To launch a kernel on a specific stream:

wp.launch(kernel, dim=n, inputs=[...], stream=my_stream)

When launching a kernel with an explicit stream argument, the device argument should be omitted, since the device is inferred from the stream. If both stream and device are specified, the stream argument takes precedence.

To launch a graph on a specific stream:

wp.capture_launch(graph, stream=my_stream)

For both kernel and graph launches, specifying the stream directly can be faster than using wp.ScopedStream. While wp.ScopedStream is useful for scheduling a sequence of operations on a specific stream, there is some overhead in setting and restoring the current stream on the device. This overhead is negligible for larger workloads, but performance-sensitive code may benefit from specifying the stream directly instead of using wp.ScopedStream, especially for a single kernel or graph launch.

In addition to these performance considerations, specifying the stream directly can be useful when copying arrays between two CUDA devices. By default, Warp uses the following rules to determine which stream will be used for the copy:

  • If the destination array is on a CUDA device, use the current stream on the destination device.

  • Otherwise, if the source array is on a CUDA device, use the current stream on the source device.

In the case of peer-to-peer copies, specifying the stream argument allows overriding these rules, and the copy can be performed on a stream from any device.

stream0 = wp.get_stream("cuda:0")
stream1 = wp.get_stream("cuda:1")

a0 = wp.zeros(n, dtype=float, device="cuda:0")
a1 = wp.empty(n, dtype=float, device="cuda:1")

# wait for the destination array to be ready
stream0.wait_stream(stream1)

# use the source device stream to do the copy
wp.copy(a1, a0, stream=stream0)

Notice that we use wait_stream() to make the source stream wait for the destination stream prior to the copy. This is needed because of the stream-ordered memory pool allocators introduced in Warp 0.14.0: the allocation of the empty array a1 is scheduled on stream1, so to avoid use-before-alloc errors we must wait until the allocation completes before using that array on a different stream.

Stream Usage Guidance#

Stream synchronization can be a tricky business, even for experienced CUDA developers. Consider the following code:

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

wp.launch(kernel, dim=a.size, inputs=[a], stream=s)

This snippet has a stream synchronization problem that is difficult to detect at first glance. It's quite possible that the code will work just fine, but it introduces undefined behavior, which may lead to incorrect results that manifest only once in a while. The issue is that the kernel is launched on stream s, which is different from the stream used for creating array a. The array is allocated and initialized on the current stream of device cuda:0, which means that it might not be ready when stream s begins executing the kernel that consumes it.

The solution is to synchronize the streams, which can be done like this:

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

# wait for the current stream on cuda:0 to finish initializing the array
s.wait_stream(wp.get_stream("cuda:0"))

wp.launch(kernel, dim=a.size, inputs=[a], stream=s)

The wp.ScopedStream manager is designed to alleviate this common problem. It synchronizes the new stream with the previous stream on the device. Its behavior is equivalent to inserting the wait_stream() call as shown above. With wp.ScopedStream, we don’t need to explicitly sync the new stream with the previous stream:

a = wp.zeros(n, dtype=float, device="cuda:0")

s = wp.Stream("cuda:0")

with wp.ScopedStream(s):
    wp.launch(kernel, dim=a.size, inputs=[a])

This makes wp.ScopedStream the recommended way of getting started with streams in Warp. Using explicit stream arguments might be slightly more performant, but it requires more attention to stream synchronization mechanics. If you are a stream novice, consider the following trajectory for integrating streams into your Warp programs:

  • Level 1: Don’t. You don’t need to use streams to use Warp. Avoiding streams is a perfectly valid and respectable way to live. Many interesting and sophisticated algorithms can be developed without fancy stream juggling. Often it’s better to focus on solving a problem in a simple and elegant way, unencumbered by the vagaries of low-level stream management.

  • Level 2: Use wp.ScopedStream. It helps to avoid some common hard-to-catch issues. There’s a little bit of overhead, but it should be negligible if the GPU workloads are large enough. Consider adding streams into your program as a form of targeted optimization, especially if some areas like memory transfers (“feeding the beast”) are a known bottleneck. Streams are great for overlapping memory transfers with compute workloads.

  • Level 3: Use explicit stream arguments for kernel launches, array copying, etc. This will be the most performant approach that can get you close to the speed of light. You will need to take care of all stream synchronization yourself, but the results can be rewarding in the benchmarks.

Synchronization Guidance#

The general rule with synchronization is to use as little of it as possible, but not less.

Excessive synchronization can severely limit the performance of programs. Synchronization means that a stream or thread is waiting for something else to complete. While it's waiting, it's not doing any useful work, and any outstanding work cannot start until the synchronization point is reached. This limits parallel execution, which is often important for getting the most out of the available hardware.

On the other hand, insufficient synchronization can lead to errors or incorrect results if operations execute out-of-order. A fast program is no good if it can’t guarantee correct results.

Host-side Synchronization#

Host-side synchronization blocks the host thread (Python) until GPU work completes. This is necessary when you are waiting for some GPU work to complete so that you can access the results on the CPU.

wp.synchronize() is the most heavy-handed synchronization function, since it synchronizes all the devices in the system. It is almost never the right function to call if performance is important. However, it can sometimes be useful when debugging synchronization-related issues.

wp.synchronize_device(device) synchronizes a single device, which is generally better and faster. This synchronizes all the streams on the specified device, including streams created by Warp and those created by any other framework.

wp.synchronize_stream(stream) synchronizes a single stream, which is better still. If the program uses multiple streams, you can wait for a specific one to finish without waiting for the others. This is handy if you have a readback stream that is copying data from the GPU to the CPU. You can wait for the transfer to complete and start processing it on the CPU while other streams are still chugging along on the GPU, in parallel with the host code.

wp.synchronize_event(event) is the most specific host synchronization function. It blocks the host until an event previously recorded on a CUDA stream completes. This can be used to wait for a specific stream synchronization point to be reached, while allowing subsequent operations on that stream to continue asynchronously.

Device-side Synchronization#

Device-side synchronization uses CUDA events to make one stream wait for a synchronization point recorded on another stream (wp.record_event(), wp.wait_event(), wp.wait_stream()).

These functions don’t block the host thread, so the CPU can stay busy doing useful work, like preparing the next batch of data to feed the beast. Events can be used to synchronize streams on the same device or even different CUDA devices, so you can choreograph very sophisticated multi-stream and multi-device workloads that execute entirely on the available GPUs. This allows keeping host-side synchronization to a minimum, perhaps only when reading back the final results.

Synchronization and Graph Capture#

A CUDA graph captures a sequence of operations on a CUDA stream that can be replayed multiple times with low overhead. During capture, certain CUDA operations are not allowed, including host-side synchronization functions. Using the synchronous CUDA default stream is also not allowed. The only form of synchronization allowed during graph capture is event-based synchronization.

A CUDA graph capture must start and end on the same stream, but multiple streams can be used in the middle. This allows CUDA graphs to encompass multiple streams and even multiple GPUs. Events play a crucial role in multi-stream graph capture because they are used to fork and join new streams to the main capture stream, in addition to their regular synchronization duties.

Here’s an example of capturing a multi-GPU graph using a stream on each device:

stream0 = wp.Stream("cuda:0")
stream1 = wp.Stream("cuda:1")

# use stream0 as the main capture stream
with wp.ScopedCapture(stream=stream0) as capture:

    # fork stream1, which adds it to the set of streams being captured
    stream1.wait_stream(stream0)

    # launch a kernel on stream0
    wp.launch(kernel, ..., stream=stream0)

    # launch a kernel on stream1
    wp.launch(kernel, ..., stream=stream1)

    # join stream1
    stream0.wait_stream(stream1)

# launch the multi-GPU graph, which can execute the captured kernels concurrently
wp.capture_launch(capture.graph)