GPUDirect Storage (GDS) Spilling is a beta feature!
The RAPIDS Shuffle Manager has a spillable cache that keeps GPU data in device memory, but can spill to host memory and then to disk when the GPU is out of memory. Using GPUDirect Storage (GDS), device buffers can be spilled directly to storage. This direct path increases system bandwidth and decreases both latency and the utilization load on the CPU.
After GDS is installed on the host, to enable GDS spilling:
- Make sure the RAPIDS Shuffle Manager is enabled and configured correctly.
- Make sure the Spark “scratch” directory configured by spark.local.dir supports GDS.
- Set spark.rapids.memory.gpu.direct.storage.spill.enabled=true in the Spark app.
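
As a minimal sketch of the steps above, the snippet below collects the relevant properties in one place and renders them as spark-submit arguments. It assumes that the scratch directory is set through Spark's standard spark.local.dir property; the path /mnt/nvme/spark-scratch and the helper name to_submit_args are hypothetical placeholders. The RAPIDS Shuffle Manager settings are deliberately left out, since they depend on your deployment.

```python
import shlex

# Properties for GDS spilling. The scratch path is a hypothetical
# placeholder; point it at a directory on storage that GDS can use.
gds_spill_conf = {
    "spark.local.dir": "/mnt/nvme/spark-scratch",  # hypothetical path
    "spark.rapids.memory.gpu.direct.storage.spill.enabled": "true",
}


def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Print a shell-safe fragment to paste into a spark-submit command line.
    print(shlex.join(to_submit_args(gds_spill_conf)))
```

The same key/value pairs could equally be placed in spark-defaults.conf or set on the session builder; the dict form is used here only to keep the example self-contained.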
To verify that GDS spilling is working correctly, add the following line to
When spilling happens, the log file should show information for writing to and reading from GDS.
Writing many small device buffers through GDS incurs overhead that may affect spilling performance. To mitigate this, small device buffers are concatenated before being written to disk in a batch. The batch write buffer used for this purpose takes up PCI Base Address Register (BAR) space, which can be very limited on some GPUs. For example, the NVIDIA T4 has only 256 MiB. On GPUs with a larger BAR space (e.g. the NVIDIA V100 or the NVIDIA A100), you can increase the size of the batch write buffer, which may further improve spilling performance. To change the batch write buffer size from the default 8 MiB to, say, 64 MiB, set
spark.rapids.memory.gpu.direct.storage.spill.batchWriteBuffer.size=64m in the Spark app.
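
To illustrate why a larger batch write buffer can reduce overhead, here is a host-side sketch of the batching idea described above. It is not the plugin's implementation: the class BatchWriter and the function count_writes are hypothetical names, the buffers are ordinary Python bytes rather than device memory, and the policy of writing oversized buffers directly is an assumption made for the example.

```python
import io


class BatchWriter:
    """Coalesce small buffers into one staging buffer before writing."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # any object with a write(bytes) method
        self.batch_size = batch_size  # capacity of the staging buffer, in bytes
        self.pending = bytearray()    # small buffers accumulated so far
        self.writes = 0               # number of write calls issued to the sink

    def _write(self, data):
        self.sink.write(data)
        self.writes += 1

    def flush(self):
        """Write out whatever is staged as a single batch."""
        if self.pending:
            self._write(bytes(self.pending))
            self.pending.clear()

    def append(self, buf):
        if len(buf) >= self.batch_size:
            # Too large to stage: flush first to keep ordering, then write directly.
            self.flush()
            self._write(buf)
            return
        if len(self.pending) + len(buf) > self.batch_size:
            self.flush()  # not enough room left in the staging buffer
        self.pending += buf


def count_writes(batch_size, buffers):
    """Return how many sink writes are needed to spill the given buffers."""
    writer = BatchWriter(io.BytesIO(), batch_size)
    for buf in buffers:
        writer.append(buf)
    writer.flush()
    return writer.writes


if __name__ == "__main__":
    MiB = 1024 * 1024
    small = [bytes(512 * 1024)] * 64      # 64 buffers of 512 KiB, 32 MiB total
    print(count_writes(8 * MiB, small))   # prints 4
    print(count_writes(64 * MiB, small))  # prints 1
```

In this toy model, spilling 32 MiB of 512 KiB buffers takes four writes with an 8 MiB staging buffer and a single write with a 64 MiB one. The real trade-off is the one stated above: a larger batch write buffer consumes more of the GPU's limited BAR space.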