Shim Source Code Layout Simplification with Shimplify

This document describes the next iteration of shim source code maintenance. It addresses the drawback introduced with the shim layer rework resulting in the guaranteed ABI-compatible bytecode management for each of the 14 currently supported Spark builds but at the expense of maintaining 50+ directories. Many shims are spread over an overlapping set of directories making it hard to determine where to make additions while keeping code duplication in check.

shimplify.py is the new goal in the Maven build binding to the generate-sources phase.

It defines a new simpler shim directory structure where there is only a single directory per shim, and a special comment is injected to define metadata defining all the shim builds it participates in.
It can convert all or a subset of existing shims to the new build. The build can support partially converted shims if a longer transition is desired.

Simplified Shim Source Directory Structure

In our build each supported Apache Spark build and its corresponding shim is identified by its buildver property. Every Maven submodule requiring shimming (sql-plugin, tests as of the time of this writing) have a new set of special sibling directories src/(main|test)/spark${buildver}.

Previous src/(main|test)/${buildver} and version-range-with-exceptions directories such as src/main/320until340-non330db are deprecated and are being removed as a result of the conversion to the new structure.

shimplify changes the way the source code is shared among shims by using an explicit lexicographically sorted list of buildver property values in a source-code level comment instead of the shared directories.

/*** spark-rapids-shim-json-lines
{"spark": "320"}
{"spark": "323"}
spark-rapids-shim-json-lines ***/

The content inside the tags spark-rapids-shim-json-lines is in the JSON Lines format where each line is an extensible object with the shim metadata currently consisting just of the Spark build dependency version. The top object in the comment, the minimum version in the comment intuitively represents the first version of Spark requiring shimming in the plugin, albeit it might not be the original one as support for older Spark releases is eventually dropped. This buildver is called the owner shim.

On the default read-only invocation path of the Maven build shimplify does not make any changes to shim source code files and their locations.

It analyzes the pre-shimplify directory structure and identifies the shims that through the code evolution ended up using more dedicated directories than necessary contributing avoidable complexity on top of an inherently complex directory structure already.
For the shimplify directory structure all files under src/(main|test)/spark* directories are read to parse the spark-rapids-shim-json-lines comments. It performs the following validations:
- It makes sure that the comment is present and can be parsed
- The list of shims is non-empty (e.g., has not orphaned through dropping shims) and sorted.
- The file is stored under the owner shim directory.
All files participating listing the buildver of the current Maven build session are symlinked to target/${buildver}/generated/src/(main|test)/(scala|java) except for template classes requiring spark.version.classifier in the package name.
If the package name of a class such as RapidsShuffleManager contains $_spark.version.classifier_ (because it is source-identical across shims up to the package name) it will be materialized in the target/${buildver}/generated/src/(main|test)/(scala|java) with spark.version.classifier interpolated into the package name.

Thus, instead of hardcoding distinct lists of directories for build-helper Maven plugin to add (one for each shim) after the full transition to shimplify, the pom will have only 4 add source statements that is independent of the number of supported shims.

With the shimplify format in place it is easy to review all the files for a single shim without relying on Maven:

git grep '{"spark": "323"}' '*.scala' '*.java'

Conversion to the Shimplify-based Directory Structure

Shimplify can automatically convert the prior version-range-with-exceptions directory structure to the simplified version. This allows to make it an atomic transition without having to resolve almost unavoidable merge conflicts due to the sheer size of this sweeping change while the shim development is ongoing. The conversion of the shims source code and the regular build should not be done simultaneously for faster isolation and correction of potential bugs in the conversion code.

Prior to invoking the conversion standalone, you first run

mvn clean install -DskipTests

on the current state of the spark-rapids repo.

After that you can execute conversion in one or more iterations depending on specified -D parameters

mvn generate-sources -Dshimplify=true [-D...]

With -Dshimplify=true, shimplify is put on the write call path to generate and inject spark-rapids-shim-json-lines comments to all shim source files. The files are not yet moved to their owner shim directory, and so it is easy to verify with git diff the comments being injected. If you see any issue you can fix it and re-execute the command by adding -Dshimplify.overwrite=true. However, it is usually easier to just have git restore the previous state:

git restore sql-plugin tests

Once the shim comments looks good (as expected, it was tested), you can repeat it and now actually move the files to designated locations by invoking

mvn generate-sources -Dshimplify=true -Dshimplify.move=true

Now you can run a package build with the simplified directory structure and run a few integration tests preferably in the test standalone mode with the RAPIDS Shuffle Manager on for increased coverage:

mvn clean package -DskipTests -Dbuildver=331
SPARK_HOME=~/dist/spark-3.3.1-bin-hadoop3 \
    NUM_LOCAL_EXECS=2 \
    PYSP_TEST_spark_rapids_shuffle_mode=MULTITHREADED \
    PYSP_TEST_spark_rapids_shuffle_multiThreaded_writer_threads=2 \
    PYSP_TEST_spark_rapids_shuffle_multiThreaded_reader_threads=2 \
    PYSP_TEST_spark_shuffle_manager=com.nvidia.spark.rapids.spark331.RapidsShuffleManager \
    PYSP_TEST_spark_rapids_memory_gpu_minAllocFraction=0 \
    PYSP_TEST_spark_rapids_memory_gpu_maxAllocFraction=0.1 \
    PYSP_TEST_spark_rapids_memory_gpu_allocFraction=0.1 \
    ./integration_tests/run_pyspark_from_build.sh -k test_hash_grpby_sum

If smoke testing does not reveal any issues proceed to committing the change. If there are issues you can undo with

git restore --staged sql-plugin tests
git restore sql-plugin tests

and by reviewing and removing the new directories with

git clean -f -d --dry-run

Partial Conversion

It is not expected to be really necessary but it is possible to convert a subset of the shims

Either by adding -Dshimplify.shims=buildver1,buildver2,… to the commands above
Or by specifying a list of directories you would like to delete to have a simpler directory -Dshimplify.dirs=320until340-non330db,320until330-noncdh

The latter is just a minor twist on the former. Instead of having an explicit list of shims, it first computes the list of all buildver values using provided directories. After this all the files for the shims, not just under specified directories are converted.

In both cases, the conversion does not leave the rest of the shims totally unaffected when there are common files with a specified shim. However, it guarantees to leave the previous dedicated files under src/(main|test)/${buildver} in place for shims outside the list. This is useful when developers of a certain shim would like to continue working on it without adapting the new method. However, for the simplicity of future refactoring the full transition is preferred.

Evolving shims without automatic conversion

Suppose a bulk-conversion of existing shims is not an option whereas the next shimming issue requires difficult refactoring of version ranges with adding more directories with exceptions. Now it can be resolved easily by placing just the affected files to owner shim directories and adding shim JSON lines comments by hand.

Adding a new shim

Shimplify can clone an existing shim based as a basis of the new shim. For example when adding support for a new maintenance version of Spark, say 3.2.4, it’s expected to be similar to 3.2.3.

If just 3.2.3 or all shims after the full transition have already been converted you can execute

mvn generate-sources -Dshimplify=true \
    -Dshimplify.move=true -Dshimplify.overwrite=true \
    -Dshimplify.add.shim=324 -Dshimplify.add.base=323

to clone 323 as 324. This will add {"spark": "324"} to every shared file constituting the 323 shim. Moreover, it will create

a copy of dedicated 323 files with spark323 under spark324 shim directory
substitute spark324 for spark323 in the package name and path,
and modify the comment from {"spark": "323"} to {"spark": "324"}

Review the new repo state, e.g., using git grep '{"spark": "324"}'. Besides having to add the release324 profile to various pom.xml as before, this alone is likely to be insufficient to complete the work on 324. It is expected you will need to work on resolving potential compilation failures manually.

Deleting a Shim

Every Spark build is de-supported eventually. To drop a build say 320 you can run

mvn generate-sources -Dshimplify=true -Dshimplify.move=true \
    -Dshimplify.remove.shim=320

This command will remove the comment line {"spark": "320"} from all source files contributing to the 320 shim. If a file belongs exclusively to 320 it will be removed.

After adding or deleting shims you should sanity-check the diff in the local git repo and run the integration tests above.

Symlinks & IDE

IDEs may or may not reveal whether a file is accessed via a symlink. IntelliJ IDEA treats the original file path and a path via a symlink to the same file as two independent files by default.

In the context of shimplify, only the generated symlink path is part of the project because the owner shim path is not add-sourced during build and therefore during IDEA Project Import. The user can install the Resolve Symlinks plugin to prevent IDEA from opening multiple windows for the same physical source file. As of the time of this writing, it works seamlessly with the exception when the file is open via a Debugger either on a breakpoint hit or subsequent clicking on the affected stack frame in which case you will see an extra editor tab being added.

No matter whether or not you use the Resolve Symlinks plugin, IDEA is able to add a breakpoint set directly via the original physical file or a symlink path.

Reducing Code Duplication

You can help reducing code complexity by consolidating copy-and-pasted shim code accumulated because it had been hard to fit it into a less flexible shim inheritance hierarchy based on versions with exceptions.

You can use the CPD tool that is integrated into our Maven build to find duplicate code in the shim and in the regular code base. It is not ready for automation and has to invoked manually, separately for Java and Scala, e.g.:

mvn antrun:run@duplicate-code-detector \
    -Dcpd.argLine='--minimum-tokens 50 --language scala --skip-blocks-pattern /*|*/' \
    -Dcpd.sourceType='main' \
    > target/cpd.scala.txt

Delete duplicate methods and move a single copy into an object such as SomethingShim and annotate its file with the list of buildvers.

See CPD user doc for more details about the options you can pass inside cpd.argLine.