CUB Benchmarks

CUB comes with a set of NVBench-based benchmarks for its algorithms, which can be used to measure the performance of CUB on your system across a variety of workloads. The integration with NVBench allows you to archive and compare benchmark results, which is useful for continuous performance testing, detecting regressions, tuning, and optimization. This guide gives an introduction to CUB’s benchmarking infrastructure.

Building benchmarks

CUB benchmarks are built as part of the CCCL CMake infrastructure. Starting from scratch:

git clone https://github.com/NVIDIA/cccl.git
cd cccl
mkdir build
cd build
cmake .. --preset=cub-benchmark -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture

You clone the repository, create a build directory, and configure the build with CMake. It’s important that you enable benchmarks (CCCL_ENABLE_BENCHMARKS=ON), build in Release mode (CMAKE_BUILD_TYPE=Release), and set the GPU architecture to match your system (CMAKE_CUDA_ARCHITECTURES=XX). Tables listing the architecture numbers for the various NVIDIA GPU generations are readily available online.

We use Ninja as the CMake generator in this guide, but you can use any other generator you prefer.
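If you prefer not to use the preset, a configuration along the following lines should be equivalent (a sketch based on the options mentioned above; adjust the generator and architecture as needed):

cmake .. -GNinja \
    -DCCCL_ENABLE_BENCHMARKS=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture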

You can then proceed to build the benchmarks.

If you intend to build only selected benchmarks, you can list the available CMake build targets with:

ninja -t targets | grep '\.bench\.'
cub.bench.adjacent_difference.subtract_left.base: phony
cub.bench.copy.memcpy.base: phony
...
cub.bench.transform.babelstream3.base: phony
cub.bench.transform_reduce.sum.base: phony
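
For example, to build only the first benchmark from the list above:

ninja cub.bench.adjacent_difference.subtract_left.base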

We also provide a target to build all benchmarks:

ninja cub.all.benches

Running a benchmark

After building a benchmark, we can run it as follows:

./bin/cub.bench.adjacent_difference.subtract_left.base\
    -d 0\
    --stopping-criterion entropy\
    --json base.json\
    --md base.md

In this command, -d 0 indicates that we want to run on GPU 0 of our system. Setting --stopping-criterion entropy is advisable since it reduces runtime and increases confidence in the resulting data. It’s not the default yet, because NVBench is still evaluating it. By default, NVBench prints the benchmark results to the terminal as Markdown. --json base.json additionally saves the detailed results in a JSON file for later use. --md base.md additionally saves the Markdown output to a file, so you can easily view the results later without having to parse the JSON. More information on the available command line options can be found in the NVBench documentation.

The expected terminal output is something along the following lines (also saved to base.md), shortened for brevity:

# Log
Run:  [1/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^16]
Pass: Cold: 0.004571ms GPU, 0.009322ms CPU, 0.00s total GPU, 0.01s total wall, 334x
Run:  [2/8] base [Device=0 T{ct}=I32 OffsetT{ct}=I32 Elements{io}=2^20]
Pass: Cold: 0.015161ms GPU, 0.023367ms CPU, 0.01s total GPU, 0.02s total wall, 430x
...
# Benchmark Results
| T{ct} | OffsetT{ct} |   Elements{io}   | Samples |  CPU Time  |  Noise  |  GPU Time  | Noise  | Elem/s  | GlobalMem BW | BWUtil |
|-------|-------------|------------------|---------|------------|---------|------------|--------|---------|--------------|--------|
|   I32 |         I32 |     2^16 = 65536 |    334x |   9.322 us | 104.44% |   4.571 us | 10.87% | 14.337G | 114.696 GB/s | 14.93% |
|   I32 |         I32 |   2^20 = 1048576 |    430x |  23.367 us | 327.68% |  15.161 us |  3.47% | 69.161G | 553.285 GB/s | 72.03% |
...

If you are only interested in a subset of workloads, you can restrict benchmarking as follows:

./bin/cub.bench.adjacent_difference.subtract_left.base ...\
    -a 'T{ct}=I32'\
    -a 'OffsetT{ct}=I32'\
    -a 'Elements{io}[pow2]=[24,28]'

The -a option allows you to restrict the values for each axis available in the benchmark. See the NVBench documentation for more information on how to specify axis values. If the specified axis does not exist, the benchmark terminates with an error.
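
If you are unsure which axes and values a benchmark exposes, the NVBench-built executables can list them without running anything; at the time of writing the flag is --list (see --help for the options supported by your NVBench version):

./bin/cub.bench.adjacent_difference.subtract_left.base --list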

Comparing benchmark results

Let’s say you have a modification that you’d like to benchmark. To compare the performance you have to build and run the benchmark as described above for the unmodified code, saving the results to a JSON file, e.g. base.json. Then, you apply your code changes (e.g., switch to a different branch, git stash pop, apply a patch file, etc.), rebuild and rerun the benchmark, saving the results to a different JSON file, e.g. new.json.
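
As a sketch, using the benchmark from above (the branch name is hypothetical), the workflow might look like this:

# baseline: build and run the unmodified code
ninja cub.bench.adjacent_difference.subtract_left.base
./bin/cub.bench.adjacent_difference.subtract_left.base -d 0 --stopping-criterion entropy --json base.json

# apply your changes, e.g. by switching branches, then rebuild and rerun
git checkout my-optimization   # hypothetical branch containing your modification
ninja cub.bench.adjacent_difference.subtract_left.base
./bin/cub.bench.adjacent_difference.subtract_left.base -d 0 --stopping-criterion entropy --json new.json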

Assuming you are still in your build directory, you can now compare the two result JSON files using:

PYTHONPATH=./_deps/nvbench-src/scripts ./_deps/nvbench-src/scripts/nvbench_compare.py base.json new.json

The PYTHONPATH environment variable may not be necessary in all cases. The script prints a Markdown report showing the runtime differences between each variant of the two benchmark runs. It could look like this, again shortened for brevity:

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I32   |      I32      |      2^16      |   4.571 us |      10.87% |   4.096 us |       0.00% |  -0.475 us | -10.39% |   FAIL   |
|   I32   |      I32      |      2^20      |  15.161 us |       3.47% |  15.143 us |       3.55% |  -0.018 us |  -0.12% |   PASS   |
...

In addition to showing the absolute and relative runtime difference, NVBench reports the noise of the measurements, which corresponds to the relative standard deviation. It then uses the Status column to report whether the runtime changed from the base to the new version with statistical significance.

Running all benchmarks directly from the command line

To get a full snapshot of CUB’s performance, you can run all benchmarks and save the results. For example:

ninja cub.all.benches
benchmarks=$(ls bin | grep cub.bench); n=$(echo $benchmarks | wc -w); i=1; \
for b in $benchmarks; do \
  echo "=== Running $b ($i/$n) ==="; \
  ./bin/$b -d 0 --stopping-criterion entropy --json $b.json --md $b.md; \
  ((i++)); \
done

This will generate one JSON and one Markdown file for each benchmark. You can archive those files for later comparison or analysis.
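
If you archive two such sets of results in separate directories, say results_base and results_new (hypothetical names), a small loop can feed each pair of files to the comparison script shown earlier:

for f in results_base/*.json; do \
  b=$(basename "$f"); \
  [ -f "results_new/$b" ] && \
    PYTHONPATH=./_deps/nvbench-src/scripts \
    ./_deps/nvbench-src/scripts/nvbench_compare.py "$f" "results_new/$b"; \
done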

Running all benchmarks via tuning scripts (alternative)

The benchmark suite can also be run using the tuning infrastructure. The tuning infrastructure builds the benchmarks itself, because it records the build times. It’s therefore critical that you run it in a clean build directory without any build artifacts. A freshly configured build directory (just running cmake) is sufficient; alternatively, you can clean an existing build directory with ninja clean. Furthermore, the tuning scripts require some additional Python dependencies, which you have to install:

ninja clean
pip install --user fpzip pandas scipy

To select the appropriate CUDA GPU, first identify the GPU ID by running nvidia-smi, then set the desired GPU using export CUDA_VISIBLE_DEVICES=x, where x is the ID of the GPU you want to use (e.g., 1). This ensures the benchmarks use only the specified GPU.
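
For example:

nvidia-smi                      # identify the ID of the GPU you want to use
export CUDA_VISIBLE_DEVICES=1   # make only GPU 1 visible (example ID)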

We can then run the full benchmark suite from the build directory with:

<root_dir_to_cccl>/cccl/benchmarks/scripts/run.py

You can expect the output to look like this:

&&&& RUNNING bench
ctk:  12.2.140
cub:  812ba98d1
&&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__16 4.095999884157209e-06 -sec
&&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__20 1.2288000107218977e-05 -sec
&&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__24 0.00016998399223666638 -sec
&&&& PERF cub_bench_adjacent_difference_subtract_left_base_T_ct__I32___OffsetT_ct__I32___Elements_io__pow2__28 0.002673664130270481 -sec
...

The tuning infrastructure will build and execute all benchmarks and their variants one after another, reporting the time in seconds it took to execute each benchmark executable.

It’s also possible to benchmark a subset of algorithms and workloads:

<root_dir_to_cccl>/cccl/benchmarks/scripts/run.py -R '.*scan.exclusive.sum.*' -a 'Elements{io}[pow2]=[24,28]' -a 'T{ct}=I32'
&&&& RUNNING bench
 ctk:  12.6.77
cccl:  v2.7.0-rc0-265-g32aa6aa5a
&&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28 0.003194367978721857 -sec
&&&& PERF cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U64___Elements_io__pow2__28 0.00319383991882205 -sec
&&&& PASSED bench

The -R option allows you to specify a regular expression for selecting benchmarks. The -a option restricts the values for an axis across all benchmarks. See the NVBench documentation for more information on how to specify axis values. Contrary to running a benchmark directly, the tuning infrastructure will simply ignore an axis value that a benchmark does not support, run the benchmark regardless, and continue.

The tuning infrastructure stores results in an SQLite database called cccl_meta_bench.db in the build directory. This database persists across tuning runs. If you interrupt the benchmark script and then launch it again, only missing benchmark variants will be run. The resulting database contains all samples, which can be extracted into JSON files:

<root_dir_to_cccl>/cccl/benchmarks/scripts/analyze.py -o ./cccl_meta_bench.db

This will create a JSON file for each benchmark variant next to the database. For example:

cat cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json
[
  {
    "variant": "base ()",
    "elapsed": 2.6299014091,
    "center": 0.003194368,
    "bw": 0.8754671386,
    "samples": [
      0.003152896,
      0.0031549439,
      ...
    ],
    "Elements{io}[pow2]": "28",
    "base_samples": [
      0.003152896,
      0.0031549439,
      ...
    ],
    "speedup": 1
  }
]
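
If you just need a single number from one of these files on the command line, a generic JSON tool such as jq (not part of the benchmark infrastructure, shown here only as an example) can extract it:

# prints the center time in seconds of the variant shown above
jq '.[0].center' cub_bench_scan_exclusive_sum_base_T_ct__I32___OffsetT_ct__U32___Elements_io__pow2__28.json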