RAPIDS Accelerator for Apache Spark Configuration

The following is the list of options that rapids-plugin-4-spark supports.

On startup use: --conf [conf key]=[conf value]. For example:

${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.02.0-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.sql.concurrentGpuTasks=2

At runtime use: spark.conf.set("[conf key]", [conf value]). For example:

scala> spark.conf.set("spark.rapids.sql.concurrentGpuTasks", 2)

All configs can be set on startup, but some configs, especially those for shuffle, will not work if they are set at runtime. Check the "Applicable at" column to see when each config can be set: "Startup" means the config is only valid on startup, and "Runtime" means it is valid on both startup and runtime.
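As an illustration of that distinction, the launch command below is a minimal sketch that passes two options listed as Startup in the General Configuration table (spark.rapids.memory.pinnedPool.size and spark.rapids.filecache.enabled) together with one Runtime option. The jar name is reused from the earlier example, and the pool size (2 GiB, written in bytes) is an illustrative value, not a recommendation.

```shell
# Startup-only options must be given on the launch command line.
# spark.rapids.sql.concurrentGpuTasks is a Runtime option, so it could
# also be changed later from the session with spark.conf.set.
${SPARK_HOME}/bin/spark-shell --jars rapids-4-spark_2.12-24.02.0-cuda11.jar \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.rapids.memory.pinnedPool.size=2147483648 \
--conf spark.rapids.filecache.enabled=true \
--conf spark.rapids.sql.concurrentGpuTasks=2
```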

General Configuration

| Name | Description | Default Value | Applicable at |
|---|---|---|---|
| spark.rapids.cloudSchemes | Comma-separated list of additional URI schemes that are to be considered cloud-based filesystems. Schemes already included: abfs, abfss, dbfs, gs, s3, s3a, s3n, wasbs, cosn. Cloud-based stores are generally totally separate from the executors and likely have a higher I/O read cost. Cloud filesystems also often get better throughput when there are multiple readers in parallel. This is used with spark.rapids.sql.format.parquet.reader.type. | None | Runtime |
| spark.rapids.filecache.enabled | Controls whether the caching of input files is enabled. When enabled, input data is cached to the same local directories configured for the Spark application. The cache will use up to half the available space by default. To set an absolute cache size limit, see the spark.rapids.filecache.maxBytes configuration setting. Currently only data from Parquet files is cached. | false | Startup |
| spark.rapids.memory.gpu.maxAllocFraction | The fraction of total GPU memory that limits the maximum size of the RMM pool. The value must be greater than or equal to the setting for spark.rapids.memory.gpu.allocFraction. Note that this limit will be reduced by the reserve memory configured in spark.rapids.memory.gpu.reserve. | 1.0 | Startup |
| spark.rapids.memory.gpu.minAllocFraction | The fraction of total GPU memory that limits the minimum size of the RMM pool. The value must be less than or equal to the setting for spark.rapids.memory.gpu.allocFraction. | 0.25 | Startup |
| spark.rapids.memory.host.spillStorageSize | Amount of off-heap host memory to use for buffering spilled GPU data before spilling to local disk. Use -1 to set the amount to the combined size of the pinned and pageable memory pools. | -1 | Startup |
| spark.rapids.memory.pinnedPool.size | The size of the pinned memory pool in bytes unless otherwise specified. Use 0 to disable the pool. | 0 | Startup |
| spark.rapids.sql.batchSizeBytes | Set the target number of bytes for a GPU batch. Split sizes for input data are covered by separate configs. The maximum setting is 2 GB to avoid exceeding the cudf row count limit of a column. | 1073741824 | Runtime |
| spark.rapids.sql.concurrentGpuTasks | Set the number of tasks that can execute concurrently per GPU. Tasks may temporarily block when the number of concurrent tasks in the executor exceeds this amount. Allowing too many concurrent tasks on the same GPU may lead to GPU out of memory errors. | 2 | Runtime |
| spark.rapids.sql.enabled | Enable (true) or disable (false) SQL operations on the GPU. | true | Runtime |
| spark.rapids.sql.explain | Explain why some parts of a query were or were not placed on the GPU. Possible values are ALL: print everything, NONE: print nothing, NOT_ON_GPU: print only parts of a query that did not go on the GPU. | NOT_ON_GPU | Runtime |
| spark.rapids.sql.metrics.level | GPU plans can produce a lot more metrics than CPU plans do. In very large queries this can sometimes result in going over the max result size limit for the driver. Supported values are DEBUG, which enables all supported metrics and typically only needs to be enabled when debugging the plugin; MODERATE, which should output enough metrics to understand how long each part of the query is taking and how much data is going to each part of the query; and ESSENTIAL, which disables most metrics except those that Apache Spark CPU plans also report, or their equivalents. | MODERATE | Runtime |
| spark.rapids.sql.multiThreadedRead.numThreads | The maximum number of threads on each executor to use for reading small files in parallel. This cannot be changed at runtime after the executor has started. Used with COALESCING and MULTITHREADED readers; see spark.rapids.sql.format.parquet.reader.type, spark.rapids.sql.format.orc.reader.type, or spark.rapids.sql.format.avro.reader.type for a discussion of reader types. If it is not set explicitly and spark.executor.cores is set, the plugin tries to assign max(MULTITHREAD_READ_NUM_THREADS_DEFAULT, spark.executor.cores), where MULTITHREAD_READ_NUM_THREADS_DEFAULT = 20. | 20 | Startup |
| spark.rapids.sql.reader.batchSizeBytes | Soft limit on the maximum number of bytes the reader reads per batch. The readers will read chunks of data until this limit is met or exceeded. Note that the reader may estimate the number of bytes that will be used on the GPU in some cases based on the schema and number of rows in each batch. | 2147483647 | Runtime |
| spark.rapids.sql.reader.batchSizeRows | Soft limit on the maximum number of rows the reader will read per batch. The ORC and Parquet readers will read row groups until this limit is met or exceeded. The limit is respected by the CSV reader. | 2147483647 | Runtime |
| spark.rapids.sql.shuffle.spillThreads | Number of threads used to spill shuffle data to disk in the background. | 6 | Runtime |
| spark.rapids.sql.udfCompiler.enabled | When set to true, Scala UDFs will be considered for compilation as Catalyst expressions. | false | Runtime |
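
Two of the entries above state rules that are easy to check before launching a job: the RMM pool fractions must satisfy minAllocFraction <= allocFraction <= maxAllocFraction, and the reader thread count falls back to max(20, spark.executor.cores) when it is not set explicitly. The script below is a rough sketch of such a pre-flight check. The function names check_alloc_fractions and default_read_threads are hypothetical helpers, not part of the RAPIDS Accelerator, and the allocFraction value of 0.5 is only an illustrative input.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers; they are not shipped with the plugin.

# Exits 0 when min <= alloc <= max, the ordering the table requires for
# minAllocFraction, allocFraction, and maxAllocFraction. awk is used
# because the shell has no built-in floating-point comparison.
check_alloc_fractions() {
  awk -v min="$1" -v alloc="$2" -v max="$3" \
    'BEGIN { exit !(min + 0 <= alloc + 0 && alloc + 0 <= max + 0) }'
}

# Prints the documented fallback for
# spark.rapids.sql.multiThreadedRead.numThreads when it is not set
# explicitly: max(20, spark.executor.cores).
default_read_threads() {
  local cores="$1" floor=20
  if [ "$cores" -gt "$floor" ]; then
    echo "$cores"
  else
    echo "$floor"
  fi
}

# Example inputs: table defaults for min and max, illustrative allocFraction.
if check_alloc_fractions 0.25 0.5 1.0; then
  echo "alloc fractions are consistent"
else
  echo "alloc fractions violate min <= alloc <= max" >&2
fi

# Fallback thread count for an executor with 32 cores.
default_read_threads 32
```

With 32 executor cores the second helper prints 32; with 8 cores it would print 20, since the floor of 20 applies.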

For more advanced configs, please refer to the RAPIDS Accelerator for Apache Spark Advanced Configuration page.