CUB Tuning Infrastructure
Device-scope algorithms in CUB have many knobs that do not affect the algorithms’ correctness but can significantly impact performance. For instance, the number of threads per block and the number of items per thread can be tuned to maximize performance for a given device and data type. This document describes CUB’s tuning infrastructure, a set of tools that facilitate selecting optimal tuning parameters for a given device and data type.
Definitions
In the definitions below, the word “tuning” is omitted but implied for all terms; the same terms may mean something else in a more general context.
Algorithms are tuned for different workloads, which are subspaces of the space of all benchmark configurations defined by NVBench via a benchmark’s axes. For instance, radix sort can be tuned for different key types, different numbers of keys, and different distributions of keys. We partition the space spanned by a benchmark’s axes into two categories:
Compile-time (ct) Workload - a workload that can be recognized at compile time. For instance, the combination of key type and offset type is a compile-time workload for radix sort. A compile-time workload is a point in the space spanned by the Cartesian product of compile-time type axes of NVBench.
Runtime (rt) Workload - a workload that can be recognized only at runtime. For instance, the number of keys along with their distribution is a runtime workload for radix sort. A runtime workload is a point in the space spanned by the Cartesian product of non-compile-time type axes of NVBench.
The tuning infrastructure can optimize algorithms only for specific compile-time workloads, aggregating results across all runtime workloads: it searches the space of parameters to find, for a given compile-time workload, the combination with the highest score:
Parameter - a parameter that can be tuned to maximize performance for a given device and data type. For instance, the number of threads per block and items per thread are tuning parameters.
Parameter Space - the set of all possible values for a given tuning parameter. The parameter space is specific to the algorithm. For instance, the parameter space for the number of threads per block is \(\{32, 64, 96, 128, \dots, 1024\}\) for radix sort, but \(\{32, 64, 128, 256, 512\}\) for merge sort.
Parameter Point - a concrete value of a tuning parameter. For instance, the parameter point for the number of threads per block is \(threads\_per\_block=128\).
Search Space - the Cartesian product of parameter spaces. For instance, the search space for an algorithm with tunable items per thread and threads per block might look like \(\{(ipt, tpb) \mid ipt \in \{1, \dots, 25\} \text{ and } tpb \in \{32, 64, 96, 128, \dots, 1024\}\}\).
Variant - a point in the corresponding search space.
Base - the variant that CUB uses by default.
Score - a single number representing the performance for a given compile-time workload across all runtime workloads. For instance, a weighted sum of the speedups of a given variant over its base across all runtime workloads is a score (see the sketch after these definitions).
Search - the process of covering all variants for all compile-time workloads to find the variant with the maximal score.
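To make these definitions concrete, here is a minimal Python sketch. It is not part of the CUB tuning scripts; the parameter ranges, the runtime workloads, and the weighting scheme are illustrative assumptions. It builds a search space as the Cartesian product of two parameter spaces and aggregates per-workload speedups into a single score:

# Minimal sketch (not part of the CUB tuning scripts): illustrates the definitions above.
import itertools

# Parameter spaces for two hypothetical tuning parameters.
parameter_spaces = {
    "ipt": list(range(1, 26)),         # items per thread: 1..25
    "tpb": list(range(32, 1025, 32)),  # threads per block: 32, 64, ..., 1024
}

# Search space: Cartesian product of the parameter spaces; each element is a variant.
search_space = list(itertools.product(*parameter_spaces.values()))
print(len(search_space))  # 25 * 32 = 800 variants

def score(speedups, weights):
    """Aggregate per-runtime-workload speedups (base time / variant time) into a
    single number. Here: a weighted sum with normalized weights, so the base
    (all speedups equal to 1.0) scores exactly 1.0."""
    return sum(w * s for w, s in zip(weights, speedups)) / sum(weights)

# One variant measured on three hypothetical runtime workloads.
print(score(speedups=[0.98, 1.05, 1.20], weights=[1.0, 2.0, 4.0]))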
Search Process
To get started with tuning, you need to configure CMake. You can use the following command:
$ mkdir build
$ cd build
$ cmake .. --preset=cub-tune -DCMAKE_CUDA_ARCHITECTURES=90 # TODO: Set your GPU architecture
You can then run the tuning search for a specific algorithm and compile-time workload:
$ ../benchmarks/scripts/search.py -R '.*merge_sort.*pairs' -a 'KeyT{ct}=I128' -a 'Elements{io}[pow2]=28'
cub.bench.merge_sort.pairs.trp_0.ld_1.ipt_13.tpb_6 0.6805093269929858
cub.bench.merge_sort.pairs.trp_0.ld_1.ipt_11.tpb_10 1.0774560502969677
...
This will search the tuning space of merge sort for key-value pairs, for the key type int128_t, on \(2^{28}\) elements.
The -R and -a options are optional. If not specified, all benchmarks are going to be tuned. The -R option can select multiple benchmarks using a regular expression. For the axis option -a, you can also specify a range of values like -a 'KeyT{ct}=[I32,I64]'. Any axis values not supported by a selected benchmark will be ignored.
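For example, assuming the build also contains a cub.bench.merge_sort.keys benchmark, an invocation along these lines would select both the keys and pairs merge sort benchmarks for a single problem size:
$ ../benchmarks/scripts/search.py -R '.*merge_sort.*' -a 'Elements{io}[pow2]=28'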
The first variant cub.bench.merge_sort.pairs.trp_0.ld_1.ipt_13.tpb_6 has a score below 1 and is thus generally slower than the baseline, whereas the second variant cub.bench.merge_sort.pairs.trp_0.ld_1.ipt_11.tpb_10 has a score above 1 and is thus an improvement over the baseline.
Note that search.py currently has a limitation: it only executes runs for the first axis value of each axis (independently of whether the axis is specified on the command line or not). Tuning for multiple axis values requires multiple runs of search.py. Please see this issue for more information.
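Until this limitation is lifted, one way to cover several axis values is to drive search.py from a small shell loop, one invocation per value (the key types below are just an example):
$ for key in I32 I64 I128; do ../benchmarks/scripts/search.py -R '.*merge_sort.*pairs' -a "KeyT{ct}=$key"; done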
The tuning framework will handle building the benchmarks (base and variants) and running them by itself. It will keep track of the build time of the base and the variants. Sometimes, a tuning variant may lead the compiler to hang or take exceptionally long to compile. To keep the tuning process going, the build of a variant is cancelled if its build time exceeds a threshold. The same applies to benchmarks that run for too long.
To get quick feedback on which benchmarks are selected and how big the search space is, you can add the -l option:
$ ../benchmarks/scripts/search.py -R '.*merge_sort.*pairs' -a 'KeyT{ct}=I128' -a 'Elements{io}[pow2]=28' -l
ctk: 12.6.85
cccl: v2.7.0
### Benchmarks
* `cub.bench.merge_sort.pairs`: 540 variants:
* `trp`: (0, 2, 1)
* `ld`: (0, 3, 1)
* `ipt`: (7, 25, 1)
* `tpb`: (6, 11, 1)
It will list all selected benchmarks as well as the total number of variants (the size of the search space), which results from the Cartesian product of all tuning parameter spaces.
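For the listing above, reading each triple as a (start, end, step) range with an exclusive end, the parameter spaces contribute 2, 3, 18, and 5 values, respectively, giving \(2 \times 3 \times 18 \times 5 = 540\) variants in total.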
The tuning infrastructure stores results in an SQLite database called cccl_meta_bench.db in the build directory. This database persists across tuning runs. If you interrupt the benchmark script and then launch it again, only the missing benchmark variants will be run.
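Since this is a regular SQLite file, you can inspect intermediate results with the standard sqlite3 command-line tool, for instance by listing the tables created so far (the exact schema is an implementation detail of the scripts and may change):
$ sqlite3 cccl_meta_bench.db '.tables'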
Tuning on multiple GPUs
Because the search process computes scores by comparing the performance of a variant to the baseline, it has to store the baseline result in the tuning database. The baseline is specific to the physical GPU on which it was obtained. Therefore, a single tuning database should not be used to run the tuning search on two different GPUs, even of the same architecture. Similarly, you should not interrupt the search and resume it on a different GPU.
Be careful when sharing build directories over network file systems. Check whether a build directory already contains a cccl_meta_bench.db from a previous run before starting a new search.
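For example, a quick check along these lines before starting a new search helps avoid accidentally mixing results from another GPU:
$ test -f cccl_meta_bench.db && echo 'warning: existing tuning database found'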
Because the search space can be separated based on different axis values, a tuning search can be run on multiple GPUs in parallel, even across multiple physical machines (e.g., on a cluster). To do this, search.py is invoked in parallel, one invocation/process per GPU, with different axis values specified for each invocation. A dedicated tuning database will be created per physical GPU. If a shared filesystem is in use, make sure that search.py is run from different directories, so the cccl_meta_bench.db files are placed into distinct paths.
It is recommended to drive a multi-GPU/multi-node search process from a script, iterating over the axis values and invoking search.py for each variant. This integrates nicely with workload managers on clusters, which allow submitting batch jobs. In such a scenario, it is recommended to submit a job per variant.
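A minimal sketch of such a driver for a single machine with two GPUs, assuming the benchmarks respect CUDA_VISIBLE_DEVICES and that each invocation runs from its own build directory (build-gpu0 and build-gpu1, both configured with the cub-tune preset; directory names and the axis split are illustrative):
$ keys=(I32 I64)
$ for i in 0 1; do
    ( cd build-gpu$i && CUDA_VISIBLE_DEVICES=$i ../benchmarks/scripts/search.py -R '.*merge_sort.*pairs' -a "KeyT{ct}=${keys[$i]}" ) &
  done
$ wait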
After tuning on multiple GPUs, the results are available in multiple tuning databases, which can be analyzed together.
Analyzing the results
The result of the search is stored in one or more cccl_meta_bench.db files. To analyze the results, you can use the analyze.py script.
The --coverage flag will show the number of variants that were covered per compile-time workload:
$ ../benchmarks/scripts/analyze.py --coverage
cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I32] coverage: 167 / 522 (31.9923%)
cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I64] coverage: 152 / 522 (29.1188%)
The --top N flag will list the best N variants for each compile-time workload:
$ ../benchmarks/scripts/analyze.py --top=5
cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I32]:
variant score mins means maxs
97 ipt_19.tpb_512 1.141015 1.039052 1.243448 1.679558
84 ipt_18.tpb_512 1.136463 1.030434 1.245825 1.668038
68 ipt_17.tpb_512 1.132696 1.020470 1.250665 1.688889
41 ipt_15.tpb_576 1.124077 1.011560 1.245011 1.722379
52 ipt_16.tpb_512 1.121044 0.995238 1.252378 1.717514
cub.bench.radix_sort.keys[T{ct}=I8, OffsetT{ct}=I64]:
variant score mins means maxs
71 ipt_19.tpb_512 1.250941 1.155738 1.321665 1.647868
86 ipt_20.tpb_512 1.250840 1.128940 1.308591 1.612382
55 ipt_17.tpb_512 1.244399 1.152033 1.327424 1.692091
98 ipt_21.tpb_448 1.231045 1.152798 1.298332 1.621110
85 ipt_20.tpb_480 1.229382 1.135447 1.294937 1.631225
The name of the variant contains the short parameter names and values used for the variant. For each variant, a score is reported. The base has a score of 1.0, so any score higher than 1.0 is an improvement over the base. However, because a single variant covers multiple runtime workloads, the minimum, mean, and maximum scores are also reported. If all three of these values are larger than 1.0, the variant is strictly better than the base. If only the mean or the maximum is larger than 1.0, the variant may perform better in most runtime workloads but regress in others. This information can be used to change the existing tuning policies in CUB.
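For example, in the first table above, ipt_19.tpb_512 has a minimum score of about 1.04, so it was faster than the base for every measured runtime workload, whereas ipt_16.tpb_512 has a minimum of about 0.995: despite a mean above 1.2, it regresses for at least one runtime workload.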
The --variant option plots the distribution of the elapsed times for the specified variants:
$ ../benchmarks/scripts/analyze.py --variant='ipt_(18|19).tpb_512'