The Hybrid(CPU/GPU) execution
Note: this is an experimental feature currently.
Overview
The Hybrid execution provides a way to offload Parquet scan onto CPU by leveraging Gluten/Velox.
Configuration
To enable Hybrid Execution, please set the following configurations:
"spark.sql.sources.useV1SourceList": "parquet"
"spark.rapids.sql.parquet.useHybridReader": "true"
"spark.rapids.sql.hybrid.loadBackend": "true"
Build
Build Gluten bundle and third party jars.
Hybrid execution targets Gluten v1.2.0 code tag. For the Gluten building, please refer to link. Start the docker Gluten project provided, then execute the following
git clone https://github.com/apache/incubator-gluten.git
git checkout v1.2.0
# Cherry pick a fix from main branch: Fix ObjectStore::stores initialized twice issue
git cherry-pick 2a6a974d6fbaa38869eb9a0b91b2e796a578884c
./dev/package.sh
Note: Should cherry-pick a fix as shown in the above steps. In the $Gluten_ROOT/package/target, you can get the bundle third_party jars.
Download Rapids Hybrid jar from Maven repo
<dependency>
<groupId>com.nvidia</groupId>
<artifactId>rapids-4-spark-hybrid_${scala.binary.version}</artifactId>
<version>${project.version}</version>
</dependency>
How to use
Decide the Spark version. Set the configurations as described in the above section. Prepare the Gluten bundle and third party jars for the Spark version as described in the above section. Get the Rapids Hybrid jar. Put the jars(Gluten two jars and the Rapids hybrid jar) in the classpath by specifying: --jars=<gluten-bundle-jar>,<gluten-thirdparty-jar>,<rapids-hybrid-jar>
Limitations
- Only supports V1 Parquet data source.
- Only supports Scala 2.12, do not support Scala 2.13.
- Support Spark 3.2.2, 3.3.1, 3.4.2, and 3.5.1, matching Gluten. Other Spark versions 32x, 33x, 34x, 35x may work, but are not fully tested.