aselmdb#
ASE LMDB data source for atomic/molecular pipelines.
Reads .aselmdb files and yields
AtomicData objects for use in curator pipelines.
This source is designed for datasets stored in the ASE LMDB format, such as the Open Molecules 2025 (OMol25) dataset from Meta FAIR. OMol25 contains over 100 million DFT calculations at the ωB97M-V/def2-TZVPD level of theory, covering 83 elements and ~83M unique molecular systems including small molecules, biomolecules, metal complexes, and electrolytes (systems up to 350 atoms).
Each source index corresponds to one .aselmdb file. The generator
returned by __getitem__() iterates over every row in that database
file, converting each ASE Atoms entry to an
AtomicData instance.
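The one-file-per-index contract can be pictured with a toy stand-in. This is a minimal sketch, not the ASELMDBSource implementation: ToyFileSource and its in-memory "files" are hypothetical, and plain dicts stand in for AtomicData objects.

```python
from collections.abc import Iterator


class ToyFileSource:
    """Hypothetical stand-in: one index per 'file', rows yielded lazily."""

    def __init__(self, files: dict[str, list[dict]]) -> None:
        # Sort the names so that index -> file is stable across runs.
        self._names = sorted(files)
        self._files = files
        self.opened: list[str] = []  # records which files were actually read

    def __len__(self) -> int:
        # The length counts files, not individual structures.
        return len(self._names)

    def __getitem__(self, index: int) -> Iterator[dict]:
        name = self._names[index]

        def rows() -> Iterator[dict]:
            self.opened.append(name)  # runs only once iteration starts
            yield from self._files[name]

        return rows()


source = ToyFileSource({
    "b.aselmdb": [{"natoms": 3}],
    "a.aselmdb": [{"natoms": 2}, {"natoms": 5}],
})
gen = source[0]    # nothing has been read yet
first = next(gen)  # the first row of a.aselmdb is produced on demand
```

Because indexing returns a generator, a pipeline can hand out file indices cheaply and pay the read cost only when a worker starts iterating.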
Two read backends are supported:
python (default): Uses a pure-Python reader (lmdb + zlib + json) to open the database, decompress entries, and convert __ndarray__ markers to NumPy arrays. Then constructs AtomicData directly from the raw row dicts.
rust: Uses a native Rust reader (physicsnemo_curator._lib.lmdb.read_lmdb()) for I/O, zlib decompression, and JSON parsing, then constructs AtomicData directly from the raw row dicts. Avoids the ase.Atoms intermediate and can be significantly faster for large datasets. Falls back to python if the Rust extension is not available.
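The decompress, parse, and convert steps shared by both backends can be sketched as follows. This is an illustration under stated assumptions, not the library's reader: it assumes each stored value is a zlib-compressed JSON document and that arrays are encoded as {"__ndarray__": [shape, dtype, flat_values]}, which is the layout ASE's JSON encoder uses as far as I know. The LMDB lookup itself is omitted, and decode_ndarrays, decode_row, and the sample row are hypothetical.

```python
import json
import zlib

import numpy as np


def decode_ndarrays(obj):
    """Recursively replace {"__ndarray__": [shape, dtype, flat]} with arrays."""
    if isinstance(obj, dict):
        if "__ndarray__" in obj:
            shape, dtype, flat = obj["__ndarray__"]
            return np.array(flat, dtype=dtype).reshape(shape)
        return {key: decode_ndarrays(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [decode_ndarrays(item) for item in obj]
    return obj


def decode_row(blob: bytes) -> dict:
    """Decompress a zlib blob, parse the JSON, and restore NumPy arrays."""
    return decode_ndarrays(json.loads(zlib.decompress(blob).decode("utf-8")))


# Synthetic row (a water-like molecule) standing in for a value read from LMDB.
raw = {
    "numbers": {"__ndarray__": [[3], "int64", [8, 1, 1]]},
    "positions": {
        "__ndarray__": [
            [3, 3],
            "float64",
            [0.0, 0.0, 0.0, 0.0, 0.0, 0.96, 0.93, 0.0, -0.24],
        ]
    },
    "charge": 0,
}
blob = zlib.compress(json.dumps(raw).encode("utf-8"))
row = decode_row(blob)
```

After decoding, row["positions"] is a (3, 3) float array and scalar fields such as row["charge"] pass through unchanged.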
References
OMol25 dataset: https://huggingface.co/facebook/OMol25
OMol25 paper: Levine et al., “The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models”, arXiv:2505.08762 (2025). https://arxiv.org/abs/2505.08762
OPoly26 extension: Levine et al., “The Open Polymers 2026 (OPoly26) Dataset and Evaluations”, arXiv:2512.23117 (2025). https://arxiv.org/abs/2512.23117
fairchem toolkit: facebookresearch/fairchem
Attributes#
Classes#
ASELMDBSource | Read atomic data from ASE LMDB (.aselmdb) database files.
Module Contents#
- class physicsnemo_curator.domains.atm.sources.aselmdb.ASELMDBSource(
- data_dir: str,
- file_pattern: str = '**/*.aselmdb',
- metadata_path: str = '',
- backend: Literal['python', 'rust'] = 'python',
)
Bases: physicsnemo_curator.core.base.Source[nvalchemi.data.AtomicData]
Read atomic data from ASE LMDB (.aselmdb) database files.
The source discovers all .aselmdb files under data_dir matching file_pattern, sorted lexicographically. Each file is treated as one source index containing many atomic structures. Calling source[i] returns a generator that opens the i-th database file and yields one AtomicData per row.
This source is compatible with any dataset stored in the .aselmdb format, including local extracts of the OMol25 and OPoly26 datasets.
An optional metadata.npz file (same directory or explicit path) is loaded eagerly if present. It is not required for operation.
- Parameters:
data_dir (str) – Directory containing .aselmdb files.
file_pattern (str) – Glob pattern for file discovery. Defaults to "**/*.aselmdb", which recursively finds all .aselmdb files. Use "*.aselmdb" to restrict to the top-level directory.
metadata_path (str) – Optional path to a metadata.npz file. Empty string (default) means auto-detect <data_dir>/metadata.npz.
backend (str) – Read backend: "python" (default) uses the ASE database API, "rust" uses the native Rust reader for faster I/O. Falls back to "python" if the Rust extension is unavailable.
Note
Dataset: OMol25
License: CC-BY-4.0 (dataset), FAIR Chemistry License (models)
Paper: arXiv:2505.08762
Examples
>>> from physicsnemo_curator.domains.atm.sources.aselmdb import ASELMDBSource
>>> source = ASELMDBSource(data_dir="./val/")
>>> len(source)
80
>>> atomic_data = next(source[0])
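The effect of file_pattern on discovery can be tried without any database files. The sketch below only mimics the behaviour described for data_dir and file_pattern using pathlib; the discover helper and the file names are hypothetical, and the real source may differ in details such as how symlinks are handled.

```python
import tempfile
from pathlib import Path


def discover(data_dir: str, file_pattern: str = "**/*.aselmdb") -> list[Path]:
    """Return matching files under data_dir, sorted by their path string."""
    root = Path(data_dir)
    matches = (p for p in root.glob(file_pattern) if p.is_file())
    return sorted(matches, key=lambda p: p.as_posix())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train").mkdir()
    # Empty placeholder files; only the names matter for discovery.
    for rel in ("b.aselmdb", "a.aselmdb", "train/c.aselmdb", "notes.txt"):
        (root / rel).touch()

    # Default pattern: recurse into subdirectories.
    recursive = [p.relative_to(root).as_posix() for p in discover(tmp)]
    # Top-level pattern: ignore subdirectories.
    top_level = [
        p.relative_to(root).as_posix() for p in discover(tmp, "*.aselmdb")
    ]
```

With the recursive default, index 2 would refer to train/c.aselmdb; with "*.aselmdb" that file is not part of the source at all, so len(source) changes accordingly.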
- classmethod params() → list[physicsnemo_curator.core.base.Param]#
Return configurable parameters for this source.
- relative_path(index: int) → str#
Return the relative path of the file containing structure index.
This is used by sinks (e.g. AtomicDataZarrSink) to resolve {relpath} and {stem} naming placeholders, enabling output directory layouts that mirror the input.
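How a sink might expand these placeholders can be sketched as below. This is an assumption-laden illustration, not AtomicDataZarrSink's code: render_output_name is hypothetical, and it assumes {relpath} is the relative path without its extension and {stem} is the bare file name without its extension.

```python
from pathlib import PurePosixPath


def render_output_name(template: str, relative_path: str) -> str:
    """Fill {relpath} and {stem} from an input file's relative path."""
    rel = PurePosixPath(relative_path)
    return template.format(
        relpath=rel.with_suffix("").as_posix(),  # assumed: extension dropped
        stem=rel.stem,  # file name without directories or extension
    )


# Mirror the input layout under out/.
mirrored = render_output_name("out/{relpath}.zarr", "train/part_00.aselmdb")
# Flatten everything into a single directory.
flat = render_output_name("out/{stem}.zarr", "train/part_00.aselmdb")
```

Under these assumptions, {relpath} keeps the train/ subdirectory in the output path, while {stem} discards it, which risks name collisions when different subdirectories contain files with the same name.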
- property data_dir: pathlib.Path#
Return the data directory path.
- property db_files: list[pathlib.Path]#
Return the list of discovered database file paths.
- description: ClassVar[str] = 'Read atomic data from ASE LMDB (.aselmdb) files'#
Short description shown in the interactive CLI.
- property metadata: dict[str, numpy.ndarray]#
Return loaded metadata arrays, if any.
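The auto-detect rule for metadata.npz can be sketched with NumPy alone. This is a hedged illustration rather than the source's actual loader: load_metadata is hypothetical, the natoms key is made up for the demonstration, and returning an empty dict when no file exists is an assumption.

```python
import tempfile
from pathlib import Path

import numpy as np


def load_metadata(data_dir: str, metadata_path: str = "") -> dict[str, np.ndarray]:
    """Load metadata arrays eagerly, or return {} if no file is found."""
    # Empty string means: look for <data_dir>/metadata.npz.
    path = Path(metadata_path) if metadata_path else Path(data_dir) / "metadata.npz"
    if not path.is_file():
        return {}
    with np.load(path) as npz:
        # Indexing reads each array into memory before the file is closed.
        return {name: npz[name] for name in npz.files}


with tempfile.TemporaryDirectory() as tmp:
    missing = load_metadata(tmp)  # no metadata file yet
    np.savez(Path(tmp) / "metadata.npz", natoms=np.array([3, 12, 7]))
    found = load_metadata(tmp)    # picked up via auto-detect
```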
- property root: pathlib.Path#
Return the root directory of this source.
- Returns:
The resolved root directory containing the discovered files.
- Return type:
pathlib.Path
- physicsnemo_curator.domains.atm.sources.aselmdb.logger#