aselmdb

ASE LMDB data source for atomic/molecular pipelines.

Reads .aselmdb files and yields AtomicData objects for use in curator pipelines.

This source is designed for datasets stored in the ASE LMDB format, such as the Open Molecules 2025 (OMol25) dataset from Meta FAIR. OMol25 contains over 100 million DFT calculations at the ωB97M-V/def2-TZVPD level of theory, covering 83 elements and ~83M unique molecular systems including small molecules, biomolecules, metal complexes, and electrolytes (systems up to 350 atoms).

Each source index corresponds to one .aselmdb file. The generator returned by __getitem__() iterates over every row in that database file, converting each ASE Atoms entry to an AtomicData instance.
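The one-file-per-index, generator-per-file behaviour described above can be illustrated with a small stand-in. This is a minimal sketch rather than the actual ASELMDBSource implementation: ToyFileSource is a hypothetical class, the "files" are in-memory lists, and plain dicts stand in for AtomicData.

```python
from collections.abc import Iterator


class ToyFileSource:
    """Hypothetical stand-in: each index is one 'file' holding many rows."""

    def __init__(self, files: list[list[dict]]) -> None:
        # In the real source these would be paths to .aselmdb files.
        self._files = files

    def __len__(self) -> int:
        return len(self._files)

    def __getitem__(self, index: int) -> Iterator[dict]:
        # Validate eagerly, but read rows lazily through a nested generator.
        if not 0 <= index < len(self._files):
            raise IndexError(index)

        def rows() -> Iterator[dict]:
            for row in self._files[index]:
                yield row  # the real source would yield AtomicData here

        return rows()


source = ToyFileSource([[{"id": 1}, {"id": 2}], [{"id": 3}]])
first = next(source[0])
ids = [row["id"] for i in range(len(source)) for row in source[i]]
```

Because rows are produced lazily, asking for next(source[0]) touches only the first row of the first file.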

Two read backends are supported:

  • python (default): Uses a pure-Python reader (lmdb + zlib + json) to open the database, decompress entries, and convert __ndarray__ markers to NumPy arrays, then constructs AtomicData directly from the raw row dicts.

  • rust: Uses a native Rust reader (physicsnemo_curator._lib.lmdb.read_lmdb()) for I/O, zlib decompression, and JSON parsing, then constructs AtomicData directly from the raw row dicts. Avoids the ase.Atoms intermediate and can be significantly faster for large datasets. Falls back to python if the Rust extension is not available.
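The decompress-and-decode step shared by both backends can be sketched as follows. This is an illustration under stated assumptions, not the library's reader: the marker layout {"__ndarray__": [shape, dtype, flat_values]} is modelled on ASE's JSON encoding and may differ from what a given .aselmdb file contains, decode_row is a hypothetical helper, and the lmdb lookup itself is replaced by an in-memory blob.

```python
import json
import zlib

import numpy as np


def _decode_markers(obj: dict):
    """json object_hook: rebuild arrays from assumed __ndarray__ markers."""
    if "__ndarray__" in obj:
        shape, dtype, flat = obj["__ndarray__"]
        return np.asarray(flat, dtype=dtype).reshape(shape)
    return obj


def decode_row(blob: bytes) -> dict:
    """Decompress one stored value and parse it into a row dict."""
    text = zlib.decompress(blob).decode("utf-8")
    return json.loads(text, object_hook=_decode_markers)


# Build a fake stored value the same way, then round-trip it.
raw_row = {
    "numbers": {"__ndarray__": [[3], "int64", [8, 1, 1]]},
    "positions": {"__ndarray__": [[3, 3], "float64", [0.0] * 9]},
    "energy": -76.4,
}
blob = zlib.compress(json.dumps(raw_row).encode("utf-8"))
row = decode_row(blob)
```

In a real reader the blob would come from an lmdb transaction rather than being built in memory.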

References

  • OMol25 dataset: https://huggingface.co/facebook/OMol25

  • OMol25 paper: Levine et al., “The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models”, arXiv:2505.08762 (2025). https://arxiv.org/abs/2505.08762

  • OPoly26 extension: Levine et al., “The Open Polymers 2026 (OPoly26) Dataset and Evaluations”, arXiv:2512.23117 (2025). https://arxiv.org/abs/2512.23117

  • fairchem toolkit: https://github.com/facebookresearch/fairchem

Attributes

Classes

ASELMDBSource

Read atomic data from ASE LMDB (.aselmdb) database files.

Module Contents

class physicsnemo_curator.domains.atm.sources.aselmdb.ASELMDBSource(
data_dir: str,
file_pattern: str = '**/*.aselmdb',
metadata_path: str = '',
backend: Literal['python', 'rust'] = 'python',
)

Bases: physicsnemo_curator.core.base.Source[nvalchemi.data.AtomicData]

Read atomic data from ASE LMDB (.aselmdb) database files.

The source discovers all .aselmdb files under data_dir matching file_pattern, sorted lexicographically. Each file is treated as one source index containing many atomic structures. Calling source[i] returns a generator that opens the i-th database file and yields one AtomicData per row.
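File discovery as described above can be approximated with pathlib. This is a minimal sketch, assuming that "sorted lexicographically" means sorting on the POSIX form of each path relative to data_dir; discover is a hypothetical helper, not part of the library.

```python
import tempfile
from pathlib import Path


def discover(data_dir: str, file_pattern: str = "**/*.aselmdb") -> list[Path]:
    """Return matching files under data_dir in a deterministic order."""
    root = Path(data_dir)
    # Assumption: order by the relative POSIX path string.
    return sorted(
        root.glob(file_pattern),
        key=lambda p: p.relative_to(root).as_posix(),
    )


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("b.aselmdb", "a.aselmdb", "sub/c.aselmdb", "notes.txt"):
        (root / name).touch()
    recursive = [p.relative_to(root).as_posix() for p in discover(tmp)]
    top_level = [p.relative_to(root).as_posix() for p in discover(tmp, "*.aselmdb")]
```

The default recursive pattern picks up files in subdirectories, while "*.aselmdb" stays at the top level.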

This source is compatible with any dataset stored in the .aselmdb format, including local extracts of the OMol25 and OPoly26 datasets.

An optional metadata.npz file (same directory or explicit path) is loaded eagerly if present. It is not required for operation.

Parameters:
  • data_dir (str) – Directory containing .aselmdb files.

  • file_pattern (str) – Glob pattern for file discovery. Defaults to "**/*.aselmdb" which recursively finds all .aselmdb files. Use "*.aselmdb" to restrict to the top-level directory.

  • metadata_path (str) – Optional path to a metadata.npz file. Empty string (default) means auto-detect <data_dir>/metadata.npz.

  • backend (str) – Read backend: "python" (default) uses the ASE database API, "rust" uses the native Rust reader for faster I/O. Falls back to "python" if the Rust extension is unavailable.
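The fallback behaviour of the backend parameter can be sketched with a guarded import. This is an assumption about how such a fallback might be written, not the library's code: resolve_backend is a hypothetical helper, and only the module path physicsnemo_curator._lib.lmdb is taken from the documentation above.

```python
import importlib


def resolve_backend(requested: str) -> str:
    """Return the backend that would actually be used."""
    if requested not in ("python", "rust"):
        raise ValueError(f"unknown backend: {requested!r}")
    if requested == "rust":
        try:
            # Import fails if the Rust extension (or the package) is missing.
            importlib.import_module("physicsnemo_curator._lib.lmdb")
        except ImportError:
            return "python"
    return requested


active = resolve_backend("rust")
```

Comparing the result with the requested value shows whether a fallback took place, which is the kind of information the backend property reports.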

Examples

>>> source = ASELMDBSource(data_dir="./val/")
>>> len(source)
80
>>> atomic_data = next(source[0])

classmethod params() → list[physicsnemo_curator.core.base.Param]

Return configurable parameters for this source.

Returns:

Parameter list for CLI configuration.

Return type:

list[Param]

relative_path(index: int) → str

Return the relative path of the file that contains the structure at index.

This is used by sinks (e.g. AtomicDataZarrSink) to resolve {relpath} and {stem} naming placeholders, enabling output directory layouts that mirror the input.

Parameters:

index (int) – Zero-based structure index (global across all files).

Returns:

POSIX-style relative path (e.g. "subdir/data.aselmdb").

Return type:

str
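How a sink might consume this value can be sketched with plain string formatting. The exact meaning of the placeholders is not specified here, so this sketch assumes that {relpath} expands to the returned relative path verbatim and {stem} to the file name without its final suffix; expand_output_name is a hypothetical helper, not part of AtomicDataZarrSink.

```python
from pathlib import PurePosixPath


def expand_output_name(template: str, relative_path: str) -> str:
    """Fill {relpath} and {stem} from a POSIX-style relative path."""
    rel = PurePosixPath(relative_path)
    # Assumed semantics: relpath is the path as given, stem drops the suffix.
    return template.format(relpath=rel.as_posix(), stem=rel.stem)


flat = expand_output_name("{stem}.zarr", "subdir/data.aselmdb")
mirrored = expand_output_name("{relpath}.zarr", "subdir/data.aselmdb")
```

Under these assumptions a {stem} template flattens the output layout, while a {relpath} template keeps the input subdirectories.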

property backend: str

Return the active read backend name.

property data_dir: pathlib.Path

Return the data directory path.

property db_files: list[pathlib.Path]

Return the list of discovered database file paths.

description: ClassVar[str] = 'Read atomic data from ASE LMDB (.aselmdb) files'

Short description shown in the interactive CLI.

property metadata: dict[str, numpy.ndarray]

Return loaded metadata arrays, if any.

name: ClassVar[str] = 'ASE LMDB'

Human-readable display name for the interactive CLI.

property num_files: int

Return the number of .aselmdb files discovered.

property root: pathlib.Path

Return the root directory of this source.

Returns:

The resolved root directory containing the discovered files.

Return type:

pathlib.Path

property row_counts: list[int]

Return the number of structures in each file.

Returns:

List where row_counts[i] is the structure count in the i-th file.

Return type:

list[int]
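Per-file counts like these are what make it possible to translate a global structure index into a file and a row within that file. The following is a minimal sketch of one way to do that lookup with cumulative offsets and binary search; locate is a hypothetical helper, and the library may resolve indices differently.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(row_counts: list[int], index: int) -> tuple[int, int]:
    """Map a global structure index to (file_index, row_within_file)."""
    ends = list(accumulate(row_counts))  # exclusive end offset of each file
    if not ends or not 0 <= index < ends[-1]:
        raise IndexError(index)
    # First file whose end offset lies strictly beyond the index.
    file_index = bisect_right(ends, index)
    start = ends[file_index - 1] if file_index else 0
    return file_index, index - start


counts = [3, 2, 4]
first_of_second_file = locate(counts, 3)
last_structure = locate(counts, 8)
```

Files with a count of zero are skipped naturally, because their end offset equals that of the preceding file.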

physicsnemo_curator.domains.atm.sources.aselmdb.logger