Environment Variables
The following section lists the environment variables available to configure the cuDecomp library.
CUDECOMP_ENABLE_NCCL_UBR
(since v0.4.0, requires NCCL v2.19 or newer)
CUDECOMP_ENABLE_NCCL_UBR controls whether cuDecomp registers its communication buffers with the NCCL library using ncclCommRegister/ncclCommDeregister (i.e., user buffer registration).
Registration can improve NCCL send/receive performance in some scenarios. See the User Buffer Registration
section of the NCCL documentation for more details.
Default setting is off (0). Setting this variable to 1 will enable this feature.
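As a minimal sketch of how such a variable might be applied from a launcher script, the Python snippet below builds an environment dictionary for a child process. The helper name cudecomp_env, the mpirun command line, and the ./my_app binary are hypothetical; the sketch also assumes that the application reads the variable from its process environment at startup.

```python
import os


def cudecomp_env(overrides, base=None):
    """Return a copy of the environment with cuDecomp variables applied.

    `overrides` maps variable names to values; values are converted to
    strings because environment variables are always strings.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if not name.startswith("CUDECOMP_"):
            raise ValueError(f"not a cuDecomp variable: {name}")
        env[name] = str(value)
    return env


# Start from an empty base so the result is easy to inspect.
env = cudecomp_env({"CUDECOMP_ENABLE_NCCL_UBR": 1}, base={})
print(env["CUDECOMP_ENABLE_NCCL_UBR"])  # prints "1"

# Hypothetical launch (not executed here):
#   import subprocess
#   subprocess.run(["mpirun", "-np", "4", "./my_app"],
#                  env=cudecomp_env({"CUDECOMP_ENABLE_NCCL_UBR": 1}))
```

The same helper can carry any of the other variables described in this section.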
CUDECOMP_ENABLE_CUMEM
(since v0.5.0, requires CUDA 12.3 driver/toolkit or newer)
CUDECOMP_ENABLE_CUMEM controls whether cuDecomp uses cuMem* APIs to allocate fabric-registered workspace buffers via cudecompMalloc. This option can improve the performance of some MPI distributions on multi-node NVLink (MNNVL) capable systems.
Default setting is off (0). Setting this variable to 1 will enable this feature.
CUDECOMP_ENABLE_CUDA_GRAPHS
(since v0.5.1, requires CUDA 11.1 driver/toolkit or newer)
CUDECOMP_ENABLE_CUDA_GRAPHS controls whether cuDecomp uses CUDA Graphs APIs to capture and replay packing operations for pipelined backends. This option can improve the launch efficiency and communication overlap of packing kernels in large-scale cases.
Default setting is off (0). Setting this variable to 1 will enable this feature.
CUDECOMP_ENABLE_PERFORMANCE_REPORT
(since v0.5.1)
CUDECOMP_ENABLE_PERFORMANCE_REPORT controls whether cuDecomp performance reporting is enabled.
Default setting is off (0). Setting this variable to 1 will enable this feature.
CUDECOMP_PERFORMANCE_REPORT_DETAIL
(since v0.5.1)
CUDECOMP_PERFORMANCE_REPORT_DETAIL controls the verbosity of performance reporting when CUDECOMP_ENABLE_PERFORMANCE_REPORT is enabled. This setting determines whether individual sample data is printed in addition to the aggregated performance summary.
The following values are supported:
0: Aggregated report only - prints only the summary table with averaged performance statistics (default)
1: Per-sample reporting on rank 0 - prints individual sample data for each transpose/halo configuration, but only from rank 0
2: Per-sample reporting on all ranks - prints individual sample data for each transpose/halo configuration from all ranks, gathered and sorted by rank on rank 0
Default setting is 0.
CUDECOMP_PERFORMANCE_REPORT_SAMPLES
(since v0.5.1)
CUDECOMP_PERFORMANCE_REPORT_SAMPLES controls the number of performance samples to keep for the final performance report. This setting determines the size of the circular buffer used to store timing measurements for each transpose/halo configuration.
Default setting is 20 samples.
CUDECOMP_PERFORMANCE_REPORT_WARMUP_SAMPLES
(since v0.5.1)
CUDECOMP_PERFORMANCE_REPORT_WARMUP_SAMPLES controls the number of initial samples to ignore for each transpose/halo configuration. This helps exclude outliers caused by GPU warmup, memory allocation, and other initialization effects from the final performance statistics.
Default setting is 3 warmup samples. Setting this to 0 disables warmup sample filtering.
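To illustrate how a warmup count and a fixed-size circular buffer can interact, the following Python sketch models the behavior described above with the default values (20 samples, 3 warmup samples). It is an illustration only, not cuDecomp's implementation: the class name SampleWindow is hypothetical, and the sketch assumes that warmup samples are discarded before anything is stored in the buffer.

```python
from collections import deque


class SampleWindow:
    """Keep the most recent `samples` timings after skipping `warmup` ones."""

    def __init__(self, samples=20, warmup=3):
        self.warmup_left = warmup
        # A deque with maxlen drops the oldest entry when full,
        # which gives circular-buffer behavior.
        self.buffer = deque(maxlen=samples)

    def record(self, time_ms):
        if self.warmup_left > 0:
            # Still warming up: ignore this sample entirely.
            self.warmup_left -= 1
            return
        self.buffer.append(time_ms)

    def mean(self):
        # Average of the retained samples, or None if nothing is stored.
        return sum(self.buffer) / len(self.buffer) if self.buffer else None


window = SampleWindow()
for i in range(30):
    window.record(float(i))  # synthetic timings 0.0 .. 29.0

print(len(window.buffer), window.buffer[0], window.buffer[-1])  # prints "20 10.0 29.0"
```

With 30 recorded timings, the first 3 are skipped and only the last 20 of the remaining 27 are retained, so early measurements never enter the average.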
CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR
(since v0.5.1)
CUDECOMP_PERFORMANCE_REPORT_WRITE_DIR controls the directory where CSV performance reports are written when CUDECOMP_ENABLE_PERFORMANCE_REPORT is enabled. When this variable is set, cuDecomp will write performance data to CSV files in the specified directory.
CSV files are created with descriptive names encoding the grid configuration, for example:
cudecomp-perf-report-transpose-aggregated-tcomm_1-hcomm_1-pdims_2x2-gdims_256x256x256-memorder_012012012.csv
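When post-processing many report files, it can be convenient to recover the configuration from the file name. The Python sketch below does this for the example name above. The helper parse_report_name is hypothetical, and the "key_value tokens separated by hyphens" pattern is inferred from this single example, so other file names may not follow it exactly.

```python
from pathlib import Path


def parse_report_name(filename):
    """Extract key/value fields such as pdims or gdims from a report name."""
    stem = Path(filename).stem  # drop the directory and the .csv suffix
    fields = {}
    for token in stem.split("-"):
        key, sep, value = token.partition("_")
        if sep:  # only tokens of the form key_value carry configuration
            fields[key] = value
    return fields


name = ("cudecomp-perf-report-transpose-aggregated-tcomm_1-hcomm_1-"
        "pdims_2x2-gdims_256x256x256-memorder_012012012.csv")
fields = parse_report_name(name)
print(fields["pdims"], fields["gdims"])  # prints "2x2 256x256x256"

# Dimensions can be turned into integers when needed.
gdims = tuple(int(n) for n in fields["gdims"].split("x"))
```

Values are kept as strings so that fields such as memorder retain their leading zeros.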
The following CSV files are generated:
Aggregated transpose performance data
Aggregated halo performance data
Per-sample transpose data (when CUDECOMP_PERFORMANCE_REPORT_DETAIL > 0)
Per-sample halo data (when CUDECOMP_PERFORMANCE_REPORT_DETAIL > 0)
Each CSV file includes grid configuration information as comments at the top, followed by performance data in comma-separated format.
Default setting is unset (no CSV files written). Setting this variable to a directory path will enable CSV file output.
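A report laid out this way (comment lines first, then comma-separated data) can be read with standard tools. The Python sketch below separates the two parts. It assumes comment lines begin with "#" and that the first non-comment line is a header row; the column names in the sample text are invented for illustration and are not taken from an actual cuDecomp report.

```python
import csv
import io


def read_report(stream, comment_prefix="#"):
    """Split a report into its leading comment lines and its data rows."""
    comments, data_lines = [], []
    for line in stream:
        if line.startswith(comment_prefix):
            comments.append(line[len(comment_prefix):].strip())
        elif line.strip():  # skip blank lines
            data_lines.append(line)
    # DictReader treats the first remaining line as the header row.
    rows = list(csv.DictReader(data_lines))
    return comments, rows


# Hypothetical file contents, used only to exercise the parser.
sample = io.StringIO(
    "# pdims: 2x2\n"
    "# gdims: 256x256x256\n"
    "operation,time_ms\n"
    "TransposeXY,1.25\n"
    "TransposeYZ,1.50\n"
)
comments, rows = read_report(sample)
print(comments[0], len(rows), rows[0]["time_ms"])  # prints "pdims: 2x2 2 1.25"
```

For a file on disk, the same function can be called with an open file object in place of the in-memory sample.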