BioNeMo test data management

This library manages the downloading and caching of large or binary data files used in the documentation or test suite. These files should not be committed directly to the repo, and instead should be loaded at test-time when they are needed.

We currently support two locations for test data or saved models:

SwiftStack

SwiftStack, also referred to as pbss, is an NVIDIA-internal, S3-compatible object store that supports very large files and fast, parallel reads and writes. Most critically, data can be uploaded to pbss without legal approval for dataset redistribution. These files are not accessible to external collaborators.

NGC

NGC hosts containers, models, and resources, some of which require authentication while others are generally available. This library uses the model and resource types to save test data and reference model weights. These items are accessible to external collaborators, but test data requires legal approval before it can be redistributed.

Loading test or example data

Test data are specified via yaml files in sub-packages/bionemo-testing/src/bionemo/testing/data/resources. As an example, in esm2.yaml:

- tag: nv_650m:1.0
  ngc: "nvidia/clara/esm2nv650m:1.0"
  ngc_registry: model
  pbss: "s3://bionemo-ci/models/esm2nv_650M_converted.nemo"
  sha256: 1e38063cafa808306329428dd17ea6df78c9e5d6b3d2caf04237c555a1f131b7
  owner: Farhad Ramezanghorbani <farhadr@nvidia.com>
  description: >
    A pretrained 650M parameter ESM-2 model.
    See https://ngc.nvidia.com/catalog/models/nvidia:clara:esm2nv650m.

To load these model weights during a test, use the load function with the filename and tag of the desired asset. It returns a path to the specified file:

path_to_my_checkpoint = load("esm2/nv_650m:1.0")
config = ESM2Config(nemo1_ckpt_path=path_to_my_checkpoint)

If this function is called when the data is not available on the local machine, the data is fetched from the default source (currently pbss). Otherwise, load returns the cached path. To download from NGC instead, pass source="ngc" to load.
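The cache-then-fetch behavior described above can be illustrated with a minimal, self-contained sketch. This is not the library's implementation: load_cached, fake_fetch, and the cache file naming are hypothetical names chosen for illustration, and the real download from pbss or NGC is replaced by a stand-in that writes placeholder bytes.

```python
import tempfile
from pathlib import Path
from typing import Callable, List


def load_cached(key: str, cache_dir: Path, fetch: Callable[[str, Path], None]) -> Path:
    """Return the cached path for `key`, calling `fetch` only on a cache miss."""
    # Hypothetical cache layout: one file per asset key, with "/" and ":" replaced.
    target = cache_dir / key.replace("/", "__").replace(":", "_")
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        fetch(key, target)  # stand-in for a download from pbss or NGC
    return target


calls: List[str] = []


def fake_fetch(key: str, dest: Path) -> None:
    # Stand-in for a real download: records the call and writes placeholder bytes.
    calls.append(key)
    dest.write_bytes(b"weights")


with tempfile.TemporaryDirectory() as tmp:
    first = load_cached("esm2/nv_650m:1.0", Path(tmp), fake_fetch)
    second = load_cached("esm2/nv_650m:1.0", Path(tmp), fake_fetch)
    same_path = first == second
```

In this sketch the second call finds the file already on disk, so fake_fetch runs only once and both calls return the same path.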

File unpacking and/or decompression

All test artifacts are individual files. If a zip or tar archive is specified, it is unpacked automatically, and load returns the path to the unpacked directory. Compressed files ('gzip', 'bz2', or 'xz') are automatically decompressed before their path is returned. The file's compression and/or archive format is determined from the filename specified in the pbss URL.
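As a rough illustration of filename-based detection, the sketch below infers the archive and compression formats from the suffixes of a URL's final path component. The function name detect_formats, the suffix tables, and the example URLs other than the esm2 one are assumptions made for this sketch; the library's actual detection logic may differ.

```python
from pathlib import PurePosixPath
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumed mapping from filename suffixes to the formats named in the text.
ARCHIVE_SUFFIXES = {".zip": "zip", ".tar": "tar"}
COMPRESSION_SUFFIXES = {".gz": "gzip", ".bz2": "bz2", ".xz": "xz"}


def detect_formats(url: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (archive, compression) inferred from the URL's filename suffixes."""
    suffixes = PurePosixPath(urlparse(url).path).suffixes
    archive = None
    compression = None
    # The outermost suffix may name a compression format, e.g. ".gz" in ".tar.gz".
    if suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        compression = COMPRESSION_SUFFIXES[suffixes.pop()]
    # The next suffix (or the only one) may name an archive format.
    if suffixes and suffixes[-1] in ARCHIVE_SUFFIXES:
        archive = ARCHIVE_SUFFIXES[suffixes[-1]]
    return archive, compression


# The esm2 URL comes from the yaml example; the others are hypothetical.
plain = detect_formats("s3://bionemo-ci/models/esm2nv_650M_converted.nemo")
tarball = detect_formats("s3://example-bucket/data/example.tar.gz")
compressed = detect_formats("s3://example-bucket/data/table.csv.xz")
zipped = detect_formats("s3://example-bucket/data/example.zip")
```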

Files in NGC resources

NGC resources are folders, i.e., they may contain multiple files per resource. load will only download the file whose name matches the stem of the pbss URL. The same NGC resource can therefore be used to host multiple test assets that are used independently.

Adding new test assets

To add new data, first ensure that the data is available from either NGC or pbss. Next, extend or create a new yaml file in sub-packages/bionemo-testing/src/bionemo/testing/data/resources with the required information. Owner emails must be provided for all assets. The description and ngc fields are currently optional. If the sha256 is left unspecified, pooch will report the downloaded file's sha when loaded.
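Before committing a new entry, it can help to check it against the requirements listed above. The following sketch is a hypothetical pre-commit check, not part of the library: validate_entry, the set of required fields (tag, pbss, owner), and the email pattern are assumptions based on the description in this section, and the entry is represented as a plain dictionary rather than parsed from yaml.

```python
import re
from typing import Dict, List

# Assumption: tag, pbss, and owner are required; ngc and description are optional.
REQUIRED_FIELDS = ("tag", "pbss", "owner")
# Assumption: owners are written as "Name <address@domain>".
EMAIL_PATTERN = re.compile(r"<[^<>@\s]+@[^<>@\s]+>")


def validate_entry(entry: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems found in one asset entry."""
    problems = [
        f"missing required field: {name}"
        for name in REQUIRED_FIELDS
        if not entry.get(name)
    ]
    owner = entry.get("owner", "")
    if owner and not EMAIL_PATTERN.search(owner):
        problems.append("owner must include an email address")
    if not entry.get("sha256"):
        problems.append("sha256 is not set")
    return problems


# Entry copied from the esm2.yaml example above.
good = {
    "tag": "nv_650m:1.0",
    "pbss": "s3://bionemo-ci/models/esm2nv_650M_converted.nemo",
    "sha256": "1e38063cafa808306329428dd17ea6df78c9e5d6b3d2caf04237c555a1f131b7",
    "owner": "Farhad Ramezanghorbani <farhadr@nvidia.com>",
}
# Hypothetical incomplete entry: owner has no email and sha256 is absent.
bad = {"tag": "example:1.0", "pbss": "s3://example-bucket/example.bin", "owner": "Someone"}

good_problems = validate_entry(good)
bad_problems = validate_entry(bad)
```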

Warning

SHAs should be provided for all files to ensure the download completes correctly, and to invalidate caches if the files change.
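One way to obtain the value for the sha256 field before uploading a file is to hash it locally with Python's standard hashlib module. The helper name sha256_of_file and the temporary sample file are illustrative only; in practice the path would point at the asset being added.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    # Small sample file standing in for a real test asset.
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"abc")
    checksum = sha256_of_file(sample)
```

The resulting 64-character hex string is in the same form as the sha256 value shown in the esm2.yaml example.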