BioNeMo - Geneformer inferencing for single cell downstream tasks¶
This tutorial showcases how to run the BioNeMo container, pre-train a Geneformer model, and use it for inference on downstream single-cell tasks. By the end of this tutorial, a user will learn how to:
- Launch the BioNeMo container
- Download data from CZI CELLxGENE to use for pre-training and inference
- Convert AnnData files into the sparse SCDL memmap format used by BioNeMo
- Kick off pre-training with a custom single-cell dataset
- Restore the pre-trained model and perform inference with the same CZI dataset
Prerequisites:¶
- BioNeMo Framework container is running (refer to the Getting Started section)
Running the BioNeMo container¶
This example was built by launching the container on a local machine with 2x NVIDIA RTX A6000 GPUs. Refer to the specific instructions for [remote and multi-node launch].
Once the container is launched, navigate to http://0.0.0.0:8888, http://localhost:8888, or the IP address of the workstation/node. A JupyterLab instance should appear.
Copy this code and input files into JupyterLab¶
In the launched JupyterLab, run the code cells below in a Jupyter notebook.
Getting example single cell data and setting it up for inference¶
First, we must acquire single-cell data for pre-training and inference. To do this, we will install the cellxgene-census API and download a small dataset, following the example from the CZI API examples page to download a single h5ad file. Generally, our workflow expects a collection of h5ad files for pre-training; here we restrict ourselves to 100k cells from a single dataset to keep download and training time short.
!pip install cellxgene-census
(pip output trimmed: cellxgene-census and its dependencies are already satisfied)
# Below are the paths required for setting up pre-training and inference.
from bionemo.core import BIONEMO_CACHE_DIR

single_cell_workdir = BIONEMO_CACHE_DIR / "singlecell_tutorial"

# Raw AnnData (h5ad) downloads, split into train/val/test
tutorial_data_dir = single_cell_workdir / "download_anndata"
train_tutorial_data_dir = tutorial_data_dir / "train"
val_tutorial_data_dir = tutorial_data_dir / "val"
test_tutorial_data_dir = tutorial_data_dir / "test"

# SCDL memmap outputs produced by convert_h5ad_to_scdl
train_tutorial_processed_dir = single_cell_workdir / "processed_data/train"
val_tutorial_processed_dir = single_cell_workdir / "processed_data/val"
test_tutorial_processed_dir = single_cell_workdir / "processed_data/test"

# Inference outputs
tutorial_output_dir = single_cell_workdir / "inference_output"
tutorial_output_inference_pickle = tutorial_output_dir / "human_covid19_bcells_from_scratch.pkl"

# Destination files for the downloaded AnnData splits
demo_data_train_download_path = train_tutorial_data_dir / "human_covid19_bcells.h5ad"
demo_data_val_download_path = val_tutorial_data_dir / "human_covid19_bcells.h5ad"
demo_data_test_download_path = test_tutorial_data_dir / "human_covid19_bcells.h5ad"
!mkdir -p {train_tutorial_data_dir}
!mkdir -p {val_tutorial_data_dir}
!mkdir -p {test_tutorial_data_dir}
!mkdir -p {train_tutorial_processed_dir}
!mkdir -p {val_tutorial_processed_dir}
!mkdir -p {test_tutorial_processed_dir}
!mkdir -p {tutorial_output_dir}
import cellxgene_census

frac_train = 0.8
frac_val = 0.1
frac_test = 0.1

with cellxgene_census.open_soma(census_version="2023-12-15") as census:
    filter1 = (
        "cell_type == 'B cell' and tissue_general == 'lung' and disease == 'COVID-19' and is_primary_data == True"
    )
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=filter1,
    )

n_train = int(adata.shape[0] * frac_train)
n_val = int(adata.shape[0] * frac_val)
n_test = adata.shape[0] - n_train - n_val

# Create splits from contiguous ranges. This is bad practice, since row ordering
# may carry structure, but it keeps the demo short; a randomized alternative is
# sketched after this cell.
adata_train = adata[0:n_train].copy()
adata_val = adata[n_train : (n_train + n_val)].copy()
adata_test = adata[(n_train + n_val) :].copy()

adata_train.write(demo_data_train_download_path)
adata_val.write(demo_data_val_download_path)
adata_test.write(demo_data_test_download_path)
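Because contiguous slices can inherit structure from the row ordering of the download, a randomized split is usually safer. Below is a minimal sketch of that alternative (not part of the original tutorial; the seed value is arbitrary):

import numpy as np

# Shuffle cell indices once, then carve out train/val/test by fraction.
rng = np.random.default_rng(seed=42)  # arbitrary seed, for reproducibility only
perm = rng.permutation(adata.shape[0])

n_train = int(adata.shape[0] * frac_train)
n_val = int(adata.shape[0] * frac_val)

adata_train = adata[perm[:n_train]].copy()
adata_val = adata[perm[n_train : n_train + n_val]].copy()
adata_test = adata[perm[n_train + n_val :]].copy()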
!rm -rf {train_tutorial_processed_dir}
!rm -rf {val_tutorial_processed_dir}
!rm -rf {test_tutorial_processed_dir}
# Create training data processed directory
!convert_h5ad_to_scdl \
--data-path {train_tutorial_data_dir} \
--save-path {train_tutorial_processed_dir}
# Create validation data processed directory
!convert_h5ad_to_scdl \
--data-path {val_tutorial_data_dir} \
--save-path {val_tutorial_processed_dir}
# Create test data processed directory
!convert_h5ad_to_scdl \
--data-path {test_tutorial_data_dir} \
--save-path {test_tutorial_processed_dir}
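To sanity-check the conversion, the SCDL dataset can be reloaded directly. A minimal sketch, assuming the SingleCellMemMapDataset API from the bionemo-scdl package (method names may differ across versions):

from bionemo.scdl.io.single_cell_memmap_dataset import SingleCellMemMapDataset

# Reload the converted training split and report its size.
train_ds = SingleCellMemMapDataset(str(train_tutorial_processed_dir))
print(train_ds.number_of_rows())        # number of cells in the training split
print(train_ds.number_nonzero_values()) # number of stored (sparse) expression values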
!ls -laht {train_tutorial_processed_dir}
total 12M
drwxr-xr-x 5 ubuntu ubuntu 4.0K May 21 01:43 ..
-rw-r--r-- 1 ubuntu ubuntu   18 May 21 01:43 metadata.json
drwxr-xr-x 2 ubuntu ubuntu 4.0K May 21 01:43 features
drwxr-xr-x 3 ubuntu ubuntu 4.0K May 21 01:43 .
-rw-r--r-- 1 ubuntu ubuntu 5.9M May 21 01:43 col_ptr.npy
-rw-r--r-- 1 ubuntu ubuntu  15K May 21 01:43 row_ptr.npy
-rw-r--r-- 1 ubuntu ubuntu 5.9M May 21 01:43 data.npy
-rw-r--r-- 1 ubuntu ubuntu    7 May 21 01:43 version.json
Pretraining¶
Now that we have converted the h5ad files to SCDL memmap format, we can kick off training.
Check the full recipe/config file in pretrain-recipe-short.yaml
for a complete list of arguments and config parameters.
# See where the processed data is stored
train_tutorial_processed_dir
PosixPath('/home/ubuntu/.cache/bionemo/singlecell_tutorial/processed_data/train')
# Create the recipe file
single_cell_workdir = BIONEMO_CACHE_DIR / "singlecell_tutorial"
!bionemo-geneformer-recipe --recipe geneformer_10m_shortpretrain_recipe --dest pretrain-recipe-short.yaml --result-dir {single_cell_workdir}/results --data-path {single_cell_workdir}/processed_data/
Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-05-21 01:45:41 nemo_logging:393] Saved configuration to args.dest='pretrain-recipe-short.yaml'
!cat pretrain-recipe-short.yaml
bionemo_model_config:
  activation_func: gelu
  apply_query_key_layer_scaling: false
  apply_residual_connection_post_layernorm: false
  attention_dropout: 0.1
  autocast_dtype: bf16-mixed
  bias_activation_fusion: true
  bias_dropout_fusion: true
  biobert_spec_option: bert_layer_with_transformer_engine_spec
  enable_autocast: false
  ffn_hidden_size: 512
  fp16_lm_cross_entropy: false
  fp32_residual_connection: false
  get_attention_mask_from_fusion: true
  gradient_accumulation_fusion: false
  hidden_dropout: 0.02
  hidden_size: 256
  init_method_std: 0.02
  initial_ckpt_path: null
  initial_ckpt_skip_keys_with_these_prefixes: []
  kv_channels: null
  layernorm_epsilon: 1.0e-12
  layernorm_zero_centered_gamma: false
  make_vocab_size_divisible_by: 128
  masked_softmax_fusion: true
  nemo1_ckpt_path: null
  num_attention_heads: 4
  num_layers: 6
  params_dtype: bf16-mixed
  pipeline_dtype: bf16-mixed
  qk_layernorm: false
  seq_length: 2048
  share_embeddings_and_output_weights: true
data_config:
  data_dir: /home/ubuntu/.cache/bionemo/singlecell_tutorial/processed_data/
  micro_batch_size: 8
  num_dataset_workers: 0
  result_dir: ./results
  seq_length: 2048
experiment_config:
  create_checkpoint_callback: true
  create_tensorboard_logger: false
  experiment_name: geneformer-10m
  metric_to_monitor_for_checkpoints: reduced_train_loss
  restore_from_checkpoint_path: null
  result_dir: /home/ubuntu/.cache/bionemo/singlecell_tutorial/results
  save_every_n_steps: 100
  save_last_checkpoint: true
  save_top_k: 2
optim_config:
  adam_eps: 1.0e-08
  cosine_hold_frac: 0.05
  cosine_rampup_frac: 0.01
  interval: step
  lr: 0.001
  lr_scheduler: cosine
  max_steps: null
  monitor: val_loss
  optimizer: adam
  sgd_momentum: 0.9
  use_distributed_optimizer: true
  warmup_steps: 0
  weight_decay: 0.01
parallel_config:
  accumulate_grad_batches: 1
  ddp: megatron
  num_devices: 1
  num_nodes: 1
  pipeline_model_parallel_size: 1
  remove_unused_parameters: true
  tensor_model_parallel_size: 1
  use_distributed_optimizer: true
training_config:
  accelerator: gpu
  create_tflops_callback: false
  enable_checkpointing: true
  gc_interval: 0
  limit_val_batches: 8
  log_train_ppl: false
  log_val_ppl: true
  max_steps: 500
  precision: bf16-mixed
  val_check_interval: 100
wandb_config:
  anonymous: true
  entity: geneformer-10m_pretraining
  group: geneformer-10m
  id: '1'
  job_type: null
  log_model: false
  name: null
  offline: true
  project: geneformer-10m_pretraining
  tags:
  - geneformer-10m
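Because the recipe is plain YAML, it can also be adjusted programmatically before training. A minimal sketch using PyYAML (the training_config.max_steps field is taken from the recipe above; changing it is just an example):

import yaml

# Load the generated recipe, tweak a training parameter, and write it back.
with open("pretrain-recipe-short.yaml") as f:
    recipe = yaml.safe_load(f)

print(recipe["training_config"]["max_steps"])  # 500 in this short recipe
recipe["training_config"]["max_steps"] = 1000  # e.g., train for longer

with open("pretrain-recipe-short.yaml", "w") as f:
    yaml.safe_dump(recipe, f)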
# Run pretraining using the short recipe
!bionemo-geneformer-train --config pretrain-recipe-short.yaml
Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so')
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[NeMo I 2025-05-21 01:46:02 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_name_id_dict_gc30M.pkl?download=true
[NeMo I 2025-05-21 01:46:02 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-05-21 01:46:02 nemo_logging:393] Downloading resource: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_median_dictionary_gc30M.pkl?download=true
[NeMo I 2025-05-21 01:46:02 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete.
[NeMo I 2025-05-21 01:46:03 nemo_logging:393] *************** Preprocessing Finished ************
GPU available: True (cuda), used: True
[NeMo I 2025-05-21 01:46:03 nemo_logging:393] Experiments will be logged at /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev
...
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
[NeMo I 2025-05-21 01:46:05 nemo_logging:393] Padded vocab_size: 25472, original vocab_size: 25429, dummy tokens: 43.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[NeMo I 2025-05-21 01:46:05 nemo_logging:393] Copying Trainer's 'max_steps' (500) to LR scheduler's 'max_steps'.
[NeMo I 2025-05-21 01:46:05 nemo_logging:393] > number of parameters on (tensor, pipeline) model parallel rank (0 ,0): 10300032
...
┃   ┃ Name                              ┃ Type                ┃ Params ┃ Mode  ┃
│ 0 │ module                            │ DDP                 │ 10.3 M │ train │
│ 1 │ module.module                     │ Float16Module       │ 10.3 M │ train │
│ 2 │ module.module.module              │ MegatronBioBertMod… │ 10.3 M │ train │
│ 3 │ module.module.module.embedding    │ LanguageModelEmbed… │  7.0 M │ train │
│ 4 │ module.module.module.encoder      │ TransformerBlock    │  3.2 M │ train │
│ 5 │ module.module.module.lm_head      │ BertLMHead          │ 66.3 K │ train │
│ 6 │ module.module.module.output_layer │ ColumnParallelLine… │ 25.5 K │ train │
Trainable params: 10.3 M
Non-trainable params: 0
Total params: 10.3 M
Total estimated model params size (MB): 41
Sanity checking Validation: iteration 1/2
Sanity checking Validation: iteration 2/2
2025-05-21 01:46:07 - root - INFO - Instantiating MegatronPretrainingSampler with total_samples: 4000 and consumed_samples: 0
Training epoch 0, iteration 0/499 | lr: 0 | global_batch_size: 8 | global_step: 0 | reduced_train_loss: 10.19 | train_step_timing in s: 0.9984
Training epoch 0, iteration 1/499 | lr: 0.0002 | global_batch_size: 8 | global_step: 1 | reduced_train_loss: 10.22 | train_step_timing in s: 0.1136 | consumed_samples: 16
Training epoch 0, iteration 2/499 | lr: 0.0004 | global_batch_size: 8 | global_step: 2 | reduced_train_loss: 10.16 | train_step_timing in s: 0.1174 | consumed_samples: 24
...
Training epoch 0, iteration 99/499 | lr: 0.0009055 | global_batch_size: 8 | global_step: 99 | reduced_train_loss: 8.835 | train_step_timing in s: 0.1166 | consumed_samples: 800
Epoch 0, global step 99: 'reduced_train_loss' reached 8.83520 (best 8.83520), saving model to '/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=0.00-step=99-consumed_samples=800.0.ckpt' as top 2
[NeMo I 2025-05-21 01:46:27 nemo_logging:393] Successfully saved checkpoint from iteration 99 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=0.00-step=99-consumed_samples=800.0.ckpt
Validation: iteration 1/2
...
Validation: iteration 8/2
Training epoch 0, iteration 100/499 | lr: 0.0009035 | global_batch_size: 8 | global_step: 100 | reduced_train_loss: 8.888 | train_step_timing in s: 0.1193 | consumed_samples: 808 | val_loss: 9.014
...
Training epoch 0, iteration 151/499 | lr: 0.0007824 | global_batch_size: 8 | global_step: 151 | reduced_train_loss: 8.61 | train_step_timing in s: 0.1034 | consumed_samples: 1216 | val_loss: 9.014
Training epoch 0, iteration 152/499 | lr: 0.0007797 | global_batch_size: 8 | global_step:
152 | reduced_train_loss: 8.881 | train_step_timing in s: 0.1169 | consumed_samples: 1224 | val_loss: 9.014 Training epoch 0, iteration 153/499 | lr: 0.0007769 | global_batch_size: 8 | global_step: 153 | reduced_train_loss: 8.781 | train_step_timing in s: 0.1065 | consumed_samples: 1232 | val_loss: 9.014 Training epoch 0, iteration 154/499 | lr: 0.0007741 | global_batch_size: 8 | global_step: 154 | reduced_train_loss: 8.805 | train_step_timing in s: 0.1083 | consumed_samples: 1240 | val_loss: 9.014 Training epoch 0, iteration 155/499 | lr: 0.0007714 | global_batch_size: 8 | global_step: 155 | reduced_train_loss: 8.833 | train_step_timing in s: 0.1037 | consumed_samples: 1248 | val_loss: 9.014 Training epoch 0, iteration 156/499 | lr: 0.0007686 | global_batch_size: 8 | global_step: 156 | reduced_train_loss: 8.879 | train_step_timing in s: 0.1168 | consumed_samples: 1256 | val_loss: 9.014 Training epoch 0, iteration 157/499 | lr: 0.0007657 | global_batch_size: 8 | global_step: 157 | reduced_train_loss: 8.866 | train_step_timing in s: 0.1037 | consumed_samples: 1264 | val_loss: 9.014 Training epoch 0, iteration 158/499 | lr: 0.0007629 | global_batch_size: 8 | global_step: 158 | reduced_train_loss: 8.746 | train_step_timing in s: 0.1156 | consumed_samples: 1272 | val_loss: 9.014 Training epoch 0, iteration 159/499 | lr: 0.0007601 | global_batch_size: 8 | global_step: 159 | reduced_train_loss: 8.627 | train_step_timing in s: 0.1059 | consumed_samples: 1280 | val_loss: 9.014 Training epoch 0, iteration 160/499 | lr: 0.0007573 | global_batch_size: 8 | global_step: 160 | reduced_train_loss: 8.833 | train_step_timing in s: 0.11 | consumed_samples: 1288 | val_loss: 9.014 Training epoch 0, iteration 161/499 | lr: 0.0007544 | global_batch_size: 8 | global_step: 161 | reduced_train_loss: 8.821 | train_step_timing in s: 0.1137 | consumed_samples: 1296 | val_loss: 9.014 Training epoch 0, iteration 162/499 | lr: 0.0007515 | global_batch_size: 8 | global_step: 162 | reduced_train_loss: 8.724 | train_step_timing in s: 0.1154 | consumed_samples: 1304 | val_loss: 9.014 Training epoch 0, iteration 163/499 | lr: 0.0007487 | global_batch_size: 8 | global_step: 163 | reduced_train_loss: 8.732 | train_step_timing in s: 0.1177 | consumed_samples: 1312 | val_loss: 9.014 Training epoch 0, iteration 164/499 | lr: 0.0007458 | global_batch_size: 8 | global_step: 164 | reduced_train_loss: 8.867 | train_step_timing in s: 0.1158 | consumed_samples: 1320 | val_loss: 9.014 Training epoch 0, iteration 165/499 | lr: 0.0007429 | global_batch_size: 8 | global_step: 165 | reduced_train_loss: 8.788 | train_step_timing in s: 0.1126 | consumed_samples: 1328 | val_loss: 9.014 Training epoch 0, iteration 166/499 | lr: 0.00074 | global_batch_size: 8 | global_step: 166 | reduced_train_loss: 8.794 | train_step_timing in s: 0.1094 | consumed_samples: 1336 | val_loss: 9.014 Training epoch 0, iteration 167/499 | lr: 0.0007371 | global_batch_size: 8 | global_step: 167 | reduced_train_loss: 8.664 | train_step_timing in s: 0.1146 | consumed_samples: 1344 | val_loss: 9.014 Training epoch 0, iteration 168/499 | lr: 0.0007341 | global_batch_size: 8 | global_step: 168 | reduced_train_loss: 8.951 | train_step_timing in s: 0.1214 | consumed_samples: 1352 | val_loss: 9.014 Training epoch 0, iteration 169/499 | lr: 0.0007312 | global_batch_size: 8 | global_step: 169 | reduced_train_loss: 8.739 | train_step_timing in s: 0.1166 | consumed_samples: 1360 | val_loss: 9.014 Training epoch 0, iteration 170/499 | lr: 0.0007283 | global_batch_size: 8 | 
global_step: 170 | reduced_train_loss: 8.925 | train_step_timing in s: 0.1161 | consumed_samples: 1368 | val_loss: 9.014 Training epoch 0, iteration 171/499 | lr: 0.0007253 | global_batch_size: 8 | global_step: 171 | reduced_train_loss: 8.909 | train_step_timing in s: 0.1156 | consumed_samples: 1376 | val_loss: 9.014 Training epoch 0, iteration 172/499 | lr: 0.0007223 | global_batch_size: 8 | global_step: 172 | reduced_train_loss: 8.812 | train_step_timing in s: 0.1257 | consumed_samples: 1384 | val_loss: 9.014 Training epoch 0, iteration 173/499 | lr: 0.0007193 | global_batch_size: 8 | global_step: 173 | reduced_train_loss: 8.773 | train_step_timing in s: 0.115 | consumed_samples: 1392 | val_loss: 9.014 Training epoch 0, iteration 174/499 | lr: 0.0007164 | global_batch_size: 8 | global_step: 174 | reduced_train_loss: 8.935 | train_step_timing in s: 0.1116 | consumed_samples: 1400 | val_loss: 9.014 Training epoch 0, iteration 175/499 | lr: 0.0007134 | global_batch_size: 8 | global_step: 175 | reduced_train_loss: 8.857 | train_step_timing in s: 0.1122 | consumed_samples: 1408 | val_loss: 9.014 Training epoch 0, iteration 176/499 | lr: 0.0007104 | global_batch_size: 8 | global_step: 176 | reduced_train_loss: 8.509 | train_step_timing in s: 0.101 | consumed_samples: 1416 | val_loss: 9.014 Training epoch 0, iteration 177/499 | lr: 0.0007073 | global_batch_size: 8 | global_step: 177 | reduced_train_loss: 8.584 | train_step_timing in s: 0.1053 | consumed_samples: 1424 | val_loss: 9.014 Training epoch 0, iteration 178/499 | lr: 0.0007043 | global_batch_size: 8 | global_step: 178 | reduced_train_loss: 8.735 | train_step_timing in s: 0.1095 | consumed_samples: 1432 | val_loss: 9.014 Training epoch 0, iteration 179/499 | lr: 0.0007013 | global_batch_size: 8 | global_step: 179 | reduced_train_loss: 8.703 | train_step_timing in s: 0.1121 | consumed_samples: 1440 | val_loss: 9.014 Training epoch 0, iteration 180/499 | lr: 0.0006982 | global_batch_size: 8 | global_step: 180 | reduced_train_loss: 8.875 | train_step_timing in s: 0.1168 | consumed_samples: 1448 | val_loss: 9.014 Training epoch 0, iteration 181/499 | lr: 0.0006952 | global_batch_size: 8 | global_step: 181 | reduced_train_loss: 8.736 | train_step_timing in s: 0.1166 | consumed_samples: 1456 | val_loss: 9.014 Training epoch 0, iteration 182/499 | lr: 0.0006921 | global_batch_size: 8 | global_step: 182 | reduced_train_loss: 8.841 | train_step_timing in s: 0.1127 | consumed_samples: 1464 | val_loss: 9.014 Training epoch 0, iteration 183/499 | lr: 0.0006891 | global_batch_size: 8 | global_step: 183 | reduced_train_loss: 8.872 | train_step_timing in s: 0.1193 | consumed_samples: 1472 | val_loss: 9.014 Training epoch 0, iteration 184/499 | lr: 0.000686 | global_batch_size: 8 | global_step: 184 | reduced_train_loss: 8.472 | train_step_timing in s: 0.1118 | consumed_samples: 1480 | val_loss: 9.014 Training epoch 0, iteration 185/499 | lr: 0.0006829 | global_batch_size: 8 | global_step: 185 | reduced_train_loss: 8.666 | train_step_timing in s: 0.1097 | consumed_samples: 1488 | val_loss: 9.014 Training epoch 0, iteration 186/499 | lr: 0.0006798 | global_batch_size: 8 | global_step: 186 | reduced_train_loss: 8.78 | train_step_timing in s: 0.1165 | consumed_samples: 1496 | val_loss: 9.014 Training epoch 0, iteration 187/499 | lr: 0.0006767 | global_batch_size: 8 | global_step: 187 | reduced_train_loss: 8.865 | train_step_timing in s: 0.1172 | consumed_samples: 1504 | val_loss: 9.014 Training epoch 0, iteration 188/499 | lr: 0.0006736 | 
global_batch_size: 8 | global_step: 188 | reduced_train_loss: 8.545 | train_step_timing in s: 0.1121 | consumed_samples: 1512 | val_loss: 9.014 Training epoch 0, iteration 189/499 | lr: 0.0006705 | global_batch_size: 8 | global_step: 189 | reduced_train_loss: 8.549 | train_step_timing in s: 0.1087 | consumed_samples: 1520 | val_loss: 9.014 Training epoch 0, iteration 190/499 | lr: 0.0006674 | global_batch_size: 8 | global_step: 190 | reduced_train_loss: 8.605 | train_step_timing in s: 0.1069 | consumed_samples: 1528 | val_loss: 9.014 Training epoch 0, iteration 191/499 | lr: 0.0006642 | global_batch_size: 8 | global_step: 191 | reduced_train_loss: 8.784 | train_step_timing in s: 0.1092 | consumed_samples: 1536 | val_loss: 9.014 Training epoch 0, iteration 192/499 | lr: 0.0006611 | global_batch_size: 8 | global_step: 192 | reduced_train_loss: 8.796 | train_step_timing in s: 0.1117 | consumed_samples: 1544 | val_loss: 9.014 Training epoch 0, iteration 193/499 | lr: 0.000658 | global_batch_size: 8 | global_step: 193 | reduced_train_loss: 8.726 | train_step_timing in s: 0.1094 | consumed_samples: 1552 | val_loss: 9.014 Training epoch 0, iteration 194/499 | lr: 0.0006548 | global_batch_size: 8 | global_step: 194 | reduced_train_loss: 8.722 | train_step_timing in s: 0.1158 | consumed_samples: 1560 | val_loss: 9.014 Training epoch 0, iteration 195/499 | lr: 0.0006517 | global_batch_size: 8 | global_step: 195 | reduced_train_loss: 8.716 | train_step_timing in s: 0.1119 | consumed_samples: 1568 | val_loss: 9.014 Training epoch 0, iteration 196/499 | lr: 0.0006485 | global_batch_size: 8 | global_step: 196 | reduced_train_loss: 8.68 | train_step_timing in s: 0.1064 | consumed_samples: 1576 | val_loss: 9.014 Training epoch 0, iteration 197/499 | lr: 0.0006453 | global_batch_size: 8 | global_step: 197 | reduced_train_loss: 8.737 | train_step_timing in s: 0.1124 | consumed_samples: 1584 | val_loss: 9.014 Training epoch 0, iteration 198/499 | lr: 0.0006421 | global_batch_size: 8 | global_step: 198 | reduced_train_loss: 8.627 | train_step_timing in s: 0.107 | consumed_samples: 1592 | val_loss: 9.014 Training epoch 0, iteration 199/499 | lr: 0.000639 | global_batch_size: 8 | global_step: 199 | reduced_train_loss: 8.877 | train_step_timing in s: 0.1135 | consumed_samples: 1600 | val_loss: 9.014 Epoch 0, global step 199: 'reduced_train_loss' reached 8.87650 (best 8.83520), saving model to '/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0.ckpt' as top 2 [NeMo I 2025-05-21 01:46:39 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 199 : Start time: 1747791999.580s : Save duration: 0.064s [NeMo I 2025-05-21 01:46:42 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0.ckpt [NeMo I 2025-05-21 01:46:43 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 199 : Start time: 1747792002.995s : Save duration: 0.061s [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0-last.ckpt [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Successfully saved checkpoint from iteration 199 to 
/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0.ckpt [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Async checkpoint save for step 200 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0.ckpt) finalized successfully. [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Successfully saved checkpoint from iteration 199 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0-last.ckpt [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Async checkpoint save for step 200 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=9.01-step=199-consumed_samples=1600.0-last.ckpt) finalized successfully. [NeMo I 2025-05-21 01:46:46 nemo_logging:393] Async finalization time took 0.096 s Validation: iteration 1/2 Validation: iteration 2/2 Validation: iteration 3/2 Validation: iteration 4/2 Validation: iteration 5/2 Validation: iteration 6/2 Validation: iteration 7/2 Validation: iteration 8/2 Training epoch 0, iteration 200/499 | lr: 0.0006358 | global_batch_size: 8 | global_step: 200 | reduced_train_loss: 8.678 | train_step_timing in s: 0.114 | consumed_samples: 1608 | val_loss: 8.925 Training epoch 0, iteration 201/499 | lr: 0.0006326 | global_batch_size: 8 | global_step: 201 | reduced_train_loss: 8.769 | train_step_timing in s: 0.1169 | consumed_samples: 1616 | val_loss: 8.925 Training epoch 0, iteration 202/499 | lr: 0.0006294 | global_batch_size: 8 | global_step: 202 | reduced_train_loss: 8.507 | train_step_timing in s: 0.1118 | consumed_samples: 1624 | val_loss: 8.925 Training epoch 0, iteration 203/499 | lr: 0.0006262 | global_batch_size: 8 | global_step: 203 | reduced_train_loss: 8.641 | train_step_timing in s: 0.1179 | consumed_samples: 1632 | val_loss: 8.925 Training epoch 0, iteration 204/499 | lr: 0.000623 | global_batch_size: 8 | global_step: 204 | reduced_train_loss: 8.697 | train_step_timing in s: 0.1194 | consumed_samples: 1640 | val_loss: 8.925 Training epoch 0, iteration 205/499 | lr: 0.0006198 | global_batch_size: 8 | global_step: 205 | reduced_train_loss: 8.614 | train_step_timing in s: 0.1133 | consumed_samples: 1648 | val_loss: 8.925 Training epoch 0, iteration 206/499 | lr: 0.0006165 | global_batch_size: 8 | global_step: 206 | reduced_train_loss: 8.622 | train_step_timing in s: 0.1148 | consumed_samples: 1656 | val_loss: 8.925 Training epoch 0, iteration 207/499 | lr: 0.0006133 | global_batch_size: 8 | global_step: 207 | reduced_train_loss: 8.829 | train_step_timing in s: 0.1203 | consumed_samples: 1664 | val_loss: 8.925 Training epoch 0, iteration 208/499 | lr: 0.0006101 | global_batch_size: 8 | global_step: 208 | reduced_train_loss: 8.684 | train_step_timing in s: 0.1166 | consumed_samples: 1672 | val_loss: 8.925 Training epoch 0, iteration 209/499 | lr: 0.0006068 | global_batch_size: 8 | global_step: 209 | reduced_train_loss: 8.651 | train_step_timing in s: 0.1087 | consumed_samples: 1680 | val_loss: 8.925 Training epoch 0, iteration 210/499 | lr: 0.0006036 | global_batch_size: 8 | global_step: 210 | reduced_train_loss: 8.861 | train_step_timing in s: 0.1207 | consumed_samples: 1688 | val_loss: 8.925 Training epoch 0, iteration 211/499 | lr: 0.0006004 | global_batch_size: 8 | global_step: 211 | reduced_train_loss: 8.94 | train_step_timing in s: 0.1213 | consumed_samples: 
1696 | val_loss: 8.925 Training epoch 0, iteration 212/499 | lr: 0.0005971 | global_batch_size: 8 | global_step: 212 | reduced_train_loss: 8.724 | train_step_timing in s: 0.1144 | consumed_samples: 1704 | val_loss: 8.925 Training epoch 0, iteration 213/499 | lr: 0.0005939 | global_batch_size: 8 | global_step: 213 | reduced_train_loss: 8.655 | train_step_timing in s: 0.1153 | consumed_samples: 1712 | val_loss: 8.925 Training epoch 0, iteration 214/499 | lr: 0.0005906 | global_batch_size: 8 | global_step: 214 | reduced_train_loss: 8.581 | train_step_timing in s: 0.1174 | consumed_samples: 1720 | val_loss: 8.925 Training epoch 0, iteration 215/499 | lr: 0.0005873 | global_batch_size: 8 | global_step: 215 | reduced_train_loss: 8.775 | train_step_timing in s: 0.1151 | consumed_samples: 1728 | val_loss: 8.925 Training epoch 0, iteration 216/499 | lr: 0.0005841 | global_batch_size: 8 | global_step: 216 | reduced_train_loss: 8.709 | train_step_timing in s: 0.1147 | consumed_samples: 1736 | val_loss: 8.925 Training epoch 0, iteration 217/499 | lr: 0.0005808 | global_batch_size: 8 | global_step: 217 | reduced_train_loss: 8.716 | train_step_timing in s: 0.1183 | consumed_samples: 1744 | val_loss: 8.925 Training epoch 0, iteration 218/499 | lr: 0.0005775 | global_batch_size: 8 | global_step: 218 | reduced_train_loss: 8.654 | train_step_timing in s: 0.1192 | consumed_samples: 1752 | val_loss: 8.925 Training epoch 0, iteration 219/499 | lr: 0.0005743 | global_batch_size: 8 | global_step: 219 | reduced_train_loss: 8.678 | train_step_timing in s: 0.1206 | consumed_samples: 1760 | val_loss: 8.925 Training epoch 0, iteration 220/499 | lr: 0.000571 | global_batch_size: 8 | global_step: 220 | reduced_train_loss: 8.733 | train_step_timing in s: 0.1174 | consumed_samples: 1768 | val_loss: 8.925 Training epoch 0, iteration 221/499 | lr: 0.0005677 | global_batch_size: 8 | global_step: 221 | reduced_train_loss: 8.871 | train_step_timing in s: 0.1172 | consumed_samples: 1776 | val_loss: 8.925 Training epoch 0, iteration 222/499 | lr: 0.0005644 | global_batch_size: 8 | global_step: 222 | reduced_train_loss: 8.709 | train_step_timing in s: 0.1152 | consumed_samples: 1784 | val_loss: 8.925 Training epoch 0, iteration 223/499 | lr: 0.0005611 | global_batch_size: 8 | global_step: 223 | reduced_train_loss: 8.729 | train_step_timing in s: 0.1156 | consumed_samples: 1792 | val_loss: 8.925 Training epoch 0, iteration 224/499 | lr: 0.0005578 | global_batch_size: 8 | global_step: 224 | reduced_train_loss: 8.701 | train_step_timing in s: 0.1197 | consumed_samples: 1800 | val_loss: 8.925 Training epoch 0, iteration 225/499 | lr: 0.0005545 | global_batch_size: 8 | global_step: 225 | reduced_train_loss: 8.654 | train_step_timing in s: 0.1162 | consumed_samples: 1808 | val_loss: 8.925 Training epoch 0, iteration 226/499 | lr: 0.0005513 | global_batch_size: 8 | global_step: 226 | reduced_train_loss: 8.715 | train_step_timing in s: 0.1165 | consumed_samples: 1816 | val_loss: 8.925 Training epoch 0, iteration 227/499 | lr: 0.000548 | global_batch_size: 8 | global_step: 227 | reduced_train_loss: 8.435 | train_step_timing in s: 0.1037 | consumed_samples: 1824 | val_loss: 8.925 Training epoch 0, iteration 228/499 | lr: 0.0005447 | global_batch_size: 8 | global_step: 228 | reduced_train_loss: 8.721 | train_step_timing in s: 0.11 | consumed_samples: 1832 | val_loss: 8.925 Training epoch 0, iteration 229/499 | lr: 0.0005414 | global_batch_size: 8 | global_step: 229 | reduced_train_loss: 8.808 | train_step_timing in s: 0.1106 | 
consumed_samples: 1840 | val_loss: 8.925 Training epoch 0, iteration 230/499 | lr: 0.0005381 | global_batch_size: 8 | global_step: 230 | reduced_train_loss: 8.517 | train_step_timing in s: 0.1157 | consumed_samples: 1848 | val_loss: 8.925 Training epoch 0, iteration 231/499 | lr: 0.0005348 | global_batch_size: 8 | global_step: 231 | reduced_train_loss: 8.784 | train_step_timing in s: 0.1179 | consumed_samples: 1856 | val_loss: 8.925 Training epoch 0, iteration 232/499 | lr: 0.0005315 | global_batch_size: 8 | global_step: 232 | reduced_train_loss: 8.761 | train_step_timing in s: 0.1223 | consumed_samples: 1864 | val_loss: 8.925 Training epoch 0, iteration 233/499 | lr: 0.0005282 | global_batch_size: 8 | global_step: 233 | reduced_train_loss: 8.653 | train_step_timing in s: 0.1163 | consumed_samples: 1872 | val_loss: 8.925 Training epoch 0, iteration 234/499 | lr: 0.0005248 | global_batch_size: 8 | global_step: 234 | reduced_train_loss: 8.598 | train_step_timing in s: 0.1116 | consumed_samples: 1880 | val_loss: 8.925 Training epoch 0, iteration 235/499 | lr: 0.0005215 | global_batch_size: 8 | global_step: 235 | reduced_train_loss: 8.525 | train_step_timing in s: 0.1155 | consumed_samples: 1888 | val_loss: 8.925 Training epoch 0, iteration 236/499 | lr: 0.0005182 | global_batch_size: 8 | global_step: 236 | reduced_train_loss: 8.68 | train_step_timing in s: 0.1134 | consumed_samples: 1896 | val_loss: 8.925 Training epoch 0, iteration 237/499 | lr: 0.0005149 | global_batch_size: 8 | global_step: 237 | reduced_train_loss: 8.713 | train_step_timing in s: 0.1103 | consumed_samples: 1904 | val_loss: 8.925 Training epoch 0, iteration 238/499 | lr: 0.0005116 | global_batch_size: 8 | global_step: 238 | reduced_train_loss: 8.534 | train_step_timing in s: 0.1163 | consumed_samples: 1912 | val_loss: 8.925 Training epoch 0, iteration 239/499 | lr: 0.0005083 | global_batch_size: 8 | global_step: 239 | reduced_train_loss: 8.563 | train_step_timing in s: 0.1134 | consumed_samples: 1920 | val_loss: 8.925 Training epoch 0, iteration 240/499 | lr: 0.000505 | global_batch_size: 8 | global_step: 240 | reduced_train_loss: 8.676 | train_step_timing in s: 0.1167 | consumed_samples: 1928 | val_loss: 8.925 Training epoch 0, iteration 241/499 | lr: 0.0005017 | global_batch_size: 8 | global_step: 241 | reduced_train_loss: 8.578 | train_step_timing in s: 0.1183 | consumed_samples: 1936 | val_loss: 8.925 Training epoch 0, iteration 242/499 | lr: 0.0004984 | global_batch_size: 8 | global_step: 242 | reduced_train_loss: 8.664 | train_step_timing in s: 0.1193 | consumed_samples: 1944 | val_loss: 8.925 Training epoch 0, iteration 243/499 | lr: 0.0004951 | global_batch_size: 8 | global_step: 243 | reduced_train_loss: 8.618 | train_step_timing in s: 0.1192 | consumed_samples: 1952 | val_loss: 8.925 Training epoch 0, iteration 244/499 | lr: 0.0004918 | global_batch_size: 8 | global_step: 244 | reduced_train_loss: 8.773 | train_step_timing in s: 0.1087 | consumed_samples: 1960 | val_loss: 8.925 Training epoch 0, iteration 245/499 | lr: 0.0004885 | global_batch_size: 8 | global_step: 245 | reduced_train_loss: 8.672 | train_step_timing in s: 0.1079 | consumed_samples: 1968 | val_loss: 8.925 Training epoch 0, iteration 246/499 | lr: 0.0004852 | global_batch_size: 8 | global_step: 246 | reduced_train_loss: 8.69 | train_step_timing in s: 0.1189 | consumed_samples: 1976 | val_loss: 8.925 Training epoch 0, iteration 247/499 | lr: 0.0004818 | global_batch_size: 8 | global_step: 247 | reduced_train_loss: 8.639 | train_step_timing in s: 
0.1105 | consumed_samples: 1984 | val_loss: 8.925 Training epoch 0, iteration 248/499 | lr: 0.0004785 | global_batch_size: 8 | global_step: 248 | reduced_train_loss: 8.686 | train_step_timing in s: 0.1109 | consumed_samples: 1992 | val_loss: 8.925 Training epoch 0, iteration 249/499 | lr: 0.0004752 | global_batch_size: 8 | global_step: 249 | reduced_train_loss: 8.678 | train_step_timing in s: 0.1169 | consumed_samples: 2000 | val_loss: 8.925 Training epoch 0, iteration 250/499 | lr: 0.0004719 | global_batch_size: 8 | global_step: 250 | reduced_train_loss: 8.613 | train_step_timing in s: 0.1113 | consumed_samples: 2008 | val_loss: 8.925 Training epoch 0, iteration 251/499 | lr: 0.0004686 | global_batch_size: 8 | global_step: 251 | reduced_train_loss: 8.893 | train_step_timing in s: 0.1154 | consumed_samples: 2016 | val_loss: 8.925 Training epoch 0, iteration 252/499 | lr: 0.0004653 | global_batch_size: 8 | global_step: 252 | reduced_train_loss: 8.581 | train_step_timing in s: 0.1155 | consumed_samples: 2024 | val_loss: 8.925 Training epoch 0, iteration 253/499 | lr: 0.000462 | global_batch_size: 8 | global_step: 253 | reduced_train_loss: 8.584 | train_step_timing in s: 0.119 | consumed_samples: 2032 | val_loss: 8.925 Training epoch 0, iteration 254/499 | lr: 0.0004587 | global_batch_size: 8 | global_step: 254 | reduced_train_loss: 8.49 | train_step_timing in s: 0.1107 | consumed_samples: 2040 | val_loss: 8.925 Training epoch 0, iteration 255/499 | lr: 0.0004555 | global_batch_size: 8 | global_step: 255 | reduced_train_loss: 8.737 | train_step_timing in s: 0.1188 | consumed_samples: 2048 | val_loss: 8.925 Training epoch 0, iteration 256/499 | lr: 0.0004522 | global_batch_size: 8 | global_step: 256 | reduced_train_loss: 8.565 | train_step_timing in s: 0.1146 | consumed_samples: 2056 | val_loss: 8.925 Training epoch 0, iteration 257/499 | lr: 0.0004489 | global_batch_size: 8 | global_step: 257 | reduced_train_loss: 8.608 | train_step_timing in s: 0.1146 | consumed_samples: 2064 | val_loss: 8.925 Training epoch 0, iteration 258/499 | lr: 0.0004456 | global_batch_size: 8 | global_step: 258 | reduced_train_loss: 8.68 | train_step_timing in s: 0.1124 | consumed_samples: 2072 | val_loss: 8.925 Training epoch 0, iteration 259/499 | lr: 0.0004423 | global_batch_size: 8 | global_step: 259 | reduced_train_loss: 8.688 | train_step_timing in s: 0.1152 | consumed_samples: 2080 | val_loss: 8.925 Training epoch 0, iteration 260/499 | lr: 0.000439 | global_batch_size: 8 | global_step: 260 | reduced_train_loss: 8.662 | train_step_timing in s: 0.1189 | consumed_samples: 2088 | val_loss: 8.925 Training epoch 0, iteration 261/499 | lr: 0.0004357 | global_batch_size: 8 | global_step: 261 | reduced_train_loss: 8.84 | train_step_timing in s: 0.1195 | consumed_samples: 2096 | val_loss: 8.925 Training epoch 0, iteration 262/499 | lr: 0.0004325 | global_batch_size: 8 | global_step: 262 | reduced_train_loss: 8.479 | train_step_timing in s: 0.1086 | consumed_samples: 2104 | val_loss: 8.925 Training epoch 0, iteration 263/499 | lr: 0.0004292 | global_batch_size: 8 | global_step: 263 | reduced_train_loss: 8.783 | train_step_timing in s: 0.1175 | consumed_samples: 2112 | val_loss: 8.925 Training epoch 0, iteration 264/499 | lr: 0.0004259 | global_batch_size: 8 | global_step: 264 | reduced_train_loss: 8.737 | train_step_timing in s: 0.1075 | consumed_samples: 2120 | val_loss: 8.925 Training epoch 0, iteration 265/499 | lr: 0.0004227 | global_batch_size: 8 | global_step: 265 | reduced_train_loss: 8.687 | train_step_timing in 
s: 0.1152 | consumed_samples: 2128 | val_loss: 8.925 Training epoch 0, iteration 266/499 | lr: 0.0004194 | global_batch_size: 8 | global_step: 266 | reduced_train_loss: 8.595 | train_step_timing in s: 0.1023 | consumed_samples: 2136 | val_loss: 8.925 Training epoch 0, iteration 267/499 | lr: 0.0004161 | global_batch_size: 8 | global_step: 267 | reduced_train_loss: 8.835 | train_step_timing in s: 0.1116 | consumed_samples: 2144 | val_loss: 8.925 Training epoch 0, iteration 268/499 | lr: 0.0004129 | global_batch_size: 8 | global_step: 268 | reduced_train_loss: 8.863 | train_step_timing in s: 0.1194 | consumed_samples: 2152 | val_loss: 8.925 Training epoch 0, iteration 269/499 | lr: 0.0004096 | global_batch_size: 8 | global_step: 269 | reduced_train_loss: 8.71 | train_step_timing in s: 0.1207 | consumed_samples: 2160 | val_loss: 8.925 Training epoch 0, iteration 270/499 | lr: 0.0004064 | global_batch_size: 8 | global_step: 270 | reduced_train_loss: 8.617 | train_step_timing in s: 0.1218 | consumed_samples: 2168 | val_loss: 8.925 Training epoch 0, iteration 271/499 | lr: 0.0004032 | global_batch_size: 8 | global_step: 271 | reduced_train_loss: 8.662 | train_step_timing in s: 0.1185 | consumed_samples: 2176 | val_loss: 8.925 Training epoch 0, iteration 272/499 | lr: 0.0003999 | global_batch_size: 8 | global_step: 272 | reduced_train_loss: 8.729 | train_step_timing in s: 0.1173 | consumed_samples: 2184 | val_loss: 8.925 Training epoch 0, iteration 273/499 | lr: 0.0003967 | global_batch_size: 8 | global_step: 273 | reduced_train_loss: 8.512 | train_step_timing in s: 0.1194 | consumed_samples: 2192 | val_loss: 8.925 Training epoch 0, iteration 274/499 | lr: 0.0003935 | global_batch_size: 8 | global_step: 274 | reduced_train_loss: 8.603 | train_step_timing in s: 0.1064 | consumed_samples: 2200 | val_loss: 8.925 Training epoch 0, iteration 275/499 | lr: 0.0003902 | global_batch_size: 8 | global_step: 275 | reduced_train_loss: 8.645 | train_step_timing in s: 0.1099 | consumed_samples: 2208 | val_loss: 8.925 Training epoch 0, iteration 276/499 | lr: 0.000387 | global_batch_size: 8 | global_step: 276 | reduced_train_loss: 8.726 | train_step_timing in s: 0.1133 | consumed_samples: 2216 | val_loss: 8.925 Training epoch 0, iteration 277/499 | lr: 0.0003838 | global_batch_size: 8 | global_step: 277 | reduced_train_loss: 8.75 | train_step_timing in s: 0.1185 | consumed_samples: 2224 | val_loss: 8.925 Training epoch 0, iteration 278/499 | lr: 0.0003806 | global_batch_size: 8 | global_step: 278 | reduced_train_loss: 8.828 | train_step_timing in s: 0.1147 | consumed_samples: 2232 | val_loss: 8.925 Training epoch 0, iteration 279/499 | lr: 0.0003774 | global_batch_size: 8 | global_step: 279 | reduced_train_loss: 8.372 | train_step_timing in s: 0.1134 | consumed_samples: 2240 | val_loss: 8.925 Training epoch 0, iteration 280/499 | lr: 0.0003742 | global_batch_size: 8 | global_step: 280 | reduced_train_loss: 8.684 | train_step_timing in s: 0.1178 | consumed_samples: 2248 | val_loss: 8.925 Training epoch 0, iteration 281/499 | lr: 0.000371 | global_batch_size: 8 | global_step: 281 | reduced_train_loss: 8.629 | train_step_timing in s: 0.1249 | consumed_samples: 2256 | val_loss: 8.925 Training epoch 0, iteration 282/499 | lr: 0.0003679 | global_batch_size: 8 | global_step: 282 | reduced_train_loss: 8.446 | train_step_timing in s: 0.1169 | consumed_samples: 2264 | val_loss: 8.925 Training epoch 0, iteration 283/499 | lr: 0.0003647 | global_batch_size: 8 | global_step: 283 | reduced_train_loss: 8.571 | 
train_step_timing in s: 0.116 | consumed_samples: 2272 | val_loss: 8.925 Training epoch 0, iteration 284/499 | lr: 0.0003615 | global_batch_size: 8 | global_step: 284 | reduced_train_loss: 8.629 | train_step_timing in s: 0.1186 | consumed_samples: 2280 | val_loss: 8.925 Training epoch 0, iteration 285/499 | lr: 0.0003583 | global_batch_size: 8 | global_step: 285 | reduced_train_loss: 8.448 | train_step_timing in s: 0.1189 | consumed_samples: 2288 | val_loss: 8.925 Training epoch 0, iteration 286/499 | lr: 0.0003552 | global_batch_size: 8 | global_step: 286 | reduced_train_loss: 8.482 | train_step_timing in s: 0.1189 | consumed_samples: 2296 | val_loss: 8.925 Training epoch 0, iteration 287/499 | lr: 0.000352 | global_batch_size: 8 | global_step: 287 | reduced_train_loss: 8.758 | train_step_timing in s: 0.1163 | consumed_samples: 2304 | val_loss: 8.925 Training epoch 0, iteration 288/499 | lr: 0.0003489 | global_batch_size: 8 | global_step: 288 | reduced_train_loss: 8.529 | train_step_timing in s: 0.1175 | consumed_samples: 2312 | val_loss: 8.925 Training epoch 0, iteration 289/499 | lr: 0.0003458 | global_batch_size: 8 | global_step: 289 | reduced_train_loss: 8.458 | train_step_timing in s: 0.1141 | consumed_samples: 2320 | val_loss: 8.925 Training epoch 0, iteration 290/499 | lr: 0.0003426 | global_batch_size: 8 | global_step: 290 | reduced_train_loss: 8.573 | train_step_timing in s: 0.1191 | consumed_samples: 2328 | val_loss: 8.925 Training epoch 0, iteration 291/499 | lr: 0.0003395 | global_batch_size: 8 | global_step: 291 | reduced_train_loss: 8.582 | train_step_timing in s: 0.1145 | consumed_samples: 2336 | val_loss: 8.925 Training epoch 0, iteration 292/499 | lr: 0.0003364 | global_batch_size: 8 | global_step: 292 | reduced_train_loss: 8.395 | train_step_timing in s: 0.1146 | consumed_samples: 2344 | val_loss: 8.925 Training epoch 0, iteration 293/499 | lr: 0.0003333 | global_batch_size: 8 | global_step: 293 | reduced_train_loss: 8.556 | train_step_timing in s: 0.1184 | consumed_samples: 2352 | val_loss: 8.925 Training epoch 0, iteration 294/499 | lr: 0.0003302 | global_batch_size: 8 | global_step: 294 | reduced_train_loss: 8.441 | train_step_timing in s: 0.1181 | consumed_samples: 2360 | val_loss: 8.925 Training epoch 0, iteration 295/499 | lr: 0.0003271 | global_batch_size: 8 | global_step: 295 | reduced_train_loss: 8.631 | train_step_timing in s: 0.1173 | consumed_samples: 2368 | val_loss: 8.925 Training epoch 0, iteration 296/499 | lr: 0.000324 | global_batch_size: 8 | global_step: 296 | reduced_train_loss: 8.502 | train_step_timing in s: 0.1184 | consumed_samples: 2376 | val_loss: 8.925 Training epoch 0, iteration 297/499 | lr: 0.0003209 | global_batch_size: 8 | global_step: 297 | reduced_train_loss: 8.543 | train_step_timing in s: 0.1156 | consumed_samples: 2384 | val_loss: 8.925 Training epoch 0, iteration 298/499 | lr: 0.0003179 | global_batch_size: 8 | global_step: 298 | reduced_train_loss: 8.474 | train_step_timing in s: 0.1171 | consumed_samples: 2392 | val_loss: 8.925 Training epoch 0, iteration 299/499 | lr: 0.0003148 | global_batch_size: 8 | global_step: 299 | reduced_train_loss: 8.449 | train_step_timing in s: 0.1172 | consumed_samples: 2400 | val_loss: 8.925 Epoch 0, global step 299: 'reduced_train_loss' reached 8.44928 (best 8.44928), saving model to '/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0.ckpt' as top 2 [NeMo I 2025-05-21 01:46:59 nemo_logging:393] Global 
Checkpoint Save : Rank: 0 : Iteration: 299 : Start time: 1747792019.231s : Save duration: 0.063s [NeMo I 2025-05-21 01:47:02 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0.ckpt [NeMo I 2025-05-21 01:47:02 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 299 : Start time: 1747792022.719s : Save duration: 0.079s [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0-last.ckpt [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Successfully saved checkpoint from iteration 299 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0.ckpt [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Async checkpoint save for step 300 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0.ckpt) finalized successfully. [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Successfully saved checkpoint from iteration 299 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0-last.ckpt [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Async checkpoint save for step 300 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.93-step=299-consumed_samples=2400.0-last.ckpt) finalized successfully. [NeMo I 2025-05-21 01:47:06 nemo_logging:393] Async finalization time took 0.107 s Validation: iteration 1/2 Validation: iteration 2/2 Validation: iteration 3/2 Validation: iteration 4/2 Validation: iteration 5/2 Validation: iteration 6/2 Validation: iteration 7/2 Validation: iteration 8/2 Training epoch 0, iteration 300/499 | lr: 0.0003118 | global_batch_size: 8 | global_step: 300 | reduced_train_loss: 8.7 | train_step_timing in s: 0.1214 | consumed_samples: 2408 | val_loss: 8.757 Training epoch 0, iteration 301/499 | lr: 0.0003087 | global_batch_size: 8 | global_step: 301 | reduced_train_loss: 8.425 | train_step_timing in s: 0.1128 | consumed_samples: 2416 | val_loss: 8.757 Training epoch 0, iteration 302/499 | lr: 0.0003057 | global_batch_size: 8 | global_step: 302 | reduced_train_loss: 8.473 | train_step_timing in s: 0.1103 | consumed_samples: 2424 | val_loss: 8.757 Training epoch 0, iteration 303/499 | lr: 0.0003027 | global_batch_size: 8 | global_step: 303 | reduced_train_loss: 8.522 | train_step_timing in s: 0.1165 | consumed_samples: 2432 | val_loss: 8.757 Training epoch 0, iteration 304/499 | lr: 0.0002996 | global_batch_size: 8 | global_step: 304 | reduced_train_loss: 8.547 | train_step_timing in s: 0.1211 | consumed_samples: 2440 | val_loss: 8.757 Training epoch 0, iteration 305/499 | lr: 0.0002966 | global_batch_size: 8 | global_step: 305 | reduced_train_loss: 8.475 | train_step_timing in s: 0.118 | consumed_samples: 2448 | val_loss: 8.757 Training epoch 0, iteration 306/499 | lr: 0.0002936 | global_batch_size: 8 | global_step: 306 | reduced_train_loss: 8.435 | train_step_timing in s: 0.1121 | consumed_samples: 2456 | val_loss: 8.757 Training epoch 0, iteration 307/499 | lr: 0.0002907 | global_batch_size: 8 | global_step: 307 | reduced_train_loss: 8.573 | train_step_timing in s: 0.1233 | 
consumed_samples: 2464 | val_loss: 8.757 Training epoch 0, iteration 308/499 | lr: 0.0002877 | global_batch_size: 8 | global_step: 308 | reduced_train_loss: 8.439 | train_step_timing in s: 0.114 | consumed_samples: 2472 | val_loss: 8.757 Training epoch 0, iteration 309/499 | lr: 0.0002847 | global_batch_size: 8 | global_step: 309 | reduced_train_loss: 8.217 | train_step_timing in s: 0.1129 | consumed_samples: 2480 | val_loss: 8.757 Training epoch 0, iteration 310/499 | lr: 0.0002817 | global_batch_size: 8 | global_step: 310 | reduced_train_loss: 8.522 | train_step_timing in s: 0.1059 | consumed_samples: 2488 | val_loss: 8.757 Training epoch 0, iteration 311/499 | lr: 0.0002788 | global_batch_size: 8 | global_step: 311 | reduced_train_loss: 8.585 | train_step_timing in s: 0.1179 | consumed_samples: 2496 | val_loss: 8.757 Training epoch 0, iteration 312/499 | lr: 0.0002759 | global_batch_size: 8 | global_step: 312 | reduced_train_loss: 8.467 | train_step_timing in s: 0.1036 | consumed_samples: 2504 | val_loss: 8.757 Training epoch 0, iteration 313/499 | lr: 0.0002729 | global_batch_size: 8 | global_step: 313 | reduced_train_loss: 8.599 | train_step_timing in s: 0.1145 | consumed_samples: 2512 | val_loss: 8.757 Training epoch 0, iteration 314/499 | lr: 0.00027 | global_batch_size: 8 | global_step: 314 | reduced_train_loss: 8.61 | train_step_timing in s: 0.119 | consumed_samples: 2520 | val_loss: 8.757 Training epoch 0, iteration 315/499 | lr: 0.0002671 | global_batch_size: 8 | global_step: 315 | reduced_train_loss: 8.446 | train_step_timing in s: 0.1135 | consumed_samples: 2528 | val_loss: 8.757 Training epoch 0, iteration 316/499 | lr: 0.0002642 | global_batch_size: 8 | global_step: 316 | reduced_train_loss: 8.569 | train_step_timing in s: 0.1155 | consumed_samples: 2536 | val_loss: 8.757 Training epoch 0, iteration 317/499 | lr: 0.0002613 | global_batch_size: 8 | global_step: 317 | reduced_train_loss: 8.626 | train_step_timing in s: 0.1187 | consumed_samples: 2544 | val_loss: 8.757 Training epoch 0, iteration 318/499 | lr: 0.0002585 | global_batch_size: 8 | global_step: 318 | reduced_train_loss: 8.373 | train_step_timing in s: 0.1106 | consumed_samples: 2552 | val_loss: 8.757 Training epoch 0, iteration 319/499 | lr: 0.0002556 | global_batch_size: 8 | global_step: 319 | reduced_train_loss: 8.311 | train_step_timing in s: 0.1171 | consumed_samples: 2560 | val_loss: 8.757 Training epoch 0, iteration 320/499 | lr: 0.0002527 | global_batch_size: 8 | global_step: 320 | reduced_train_loss: 8.824 | train_step_timing in s: 0.1214 | consumed_samples: 2568 | val_loss: 8.757 Training epoch 0, iteration 321/499 | lr: 0.0002499 | global_batch_size: 8 | global_step: 321 | reduced_train_loss: 8.556 | train_step_timing in s: 0.1174 | consumed_samples: 2576 | val_loss: 8.757 Training epoch 0, iteration 322/499 | lr: 0.0002471 | global_batch_size: 8 | global_step: 322 | reduced_train_loss: 8.662 | train_step_timing in s: 0.1177 | consumed_samples: 2584 | val_loss: 8.757 Training epoch 0, iteration 323/499 | lr: 0.0002443 | global_batch_size: 8 | global_step: 323 | reduced_train_loss: 8.473 | train_step_timing in s: 0.1153 | consumed_samples: 2592 | val_loss: 8.757 Training epoch 0, iteration 324/499 | lr: 0.0002414 | global_batch_size: 8 | global_step: 324 | reduced_train_loss: 8.657 | train_step_timing in s: 0.1228 | consumed_samples: 2600 | val_loss: 8.757 Training epoch 0, iteration 325/499 | lr: 0.0002386 | global_batch_size: 8 | global_step: 325 | reduced_train_loss: 8.487 | train_step_timing in s: 
0.1188 | consumed_samples: 2608 | val_loss: 8.757 Training epoch 0, iteration 326/499 | lr: 0.0002359 | global_batch_size: 8 | global_step: 326 | reduced_train_loss: 8.536 | train_step_timing in s: 0.1191 | consumed_samples: 2616 | val_loss: 8.757 Training epoch 0, iteration 327/499 | lr: 0.0002331 | global_batch_size: 8 | global_step: 327 | reduced_train_loss: 8.559 | train_step_timing in s: 0.1196 | consumed_samples: 2624 | val_loss: 8.757 Training epoch 0, iteration 328/499 | lr: 0.0002303 | global_batch_size: 8 | global_step: 328 | reduced_train_loss: 8.542 | train_step_timing in s: 0.1177 | consumed_samples: 2632 | val_loss: 8.757 Training epoch 0, iteration 329/499 | lr: 0.0002276 | global_batch_size: 8 | global_step: 329 | reduced_train_loss: 8.485 | train_step_timing in s: 0.1164 | consumed_samples: 2640 | val_loss: 8.757 Training epoch 0, iteration 330/499 | lr: 0.0002249 | global_batch_size: 8 | global_step: 330 | reduced_train_loss: 8.542 | train_step_timing in s: 0.1203 | consumed_samples: 2648 | val_loss: 8.757 Training epoch 0, iteration 331/499 | lr: 0.0002221 | global_batch_size: 8 | global_step: 331 | reduced_train_loss: 8.583 | train_step_timing in s: 0.12 | consumed_samples: 2656 | val_loss: 8.757 Training epoch 0, iteration 332/499 | lr: 0.0002194 | global_batch_size: 8 | global_step: 332 | reduced_train_loss: 8.52 | train_step_timing in s: 0.1205 | consumed_samples: 2664 | val_loss: 8.757 Training epoch 0, iteration 333/499 | lr: 0.0002167 | global_batch_size: 8 | global_step: 333 | reduced_train_loss: 8.258 | train_step_timing in s: 0.1194 | consumed_samples: 2672 | val_loss: 8.757 Training epoch 0, iteration 334/499 | lr: 0.000214 | global_batch_size: 8 | global_step: 334 | reduced_train_loss: 8.522 | train_step_timing in s: 0.1142 | consumed_samples: 2680 | val_loss: 8.757 Training epoch 0, iteration 335/499 | lr: 0.0002114 | global_batch_size: 8 | global_step: 335 | reduced_train_loss: 8.402 | train_step_timing in s: 0.1149 | consumed_samples: 2688 | val_loss: 8.757 Training epoch 0, iteration 336/499 | lr: 0.0002087 | global_batch_size: 8 | global_step: 336 | reduced_train_loss: 8.811 | train_step_timing in s: 0.1199 | consumed_samples: 2696 | val_loss: 8.757 Training epoch 0, iteration 337/499 | lr: 0.0002061 | global_batch_size: 8 | global_step: 337 | reduced_train_loss: 8.524 | train_step_timing in s: 0.1182 | consumed_samples: 2704 | val_loss: 8.757 Training epoch 0, iteration 338/499 | lr: 0.0002034 | global_batch_size: 8 | global_step: 338 | reduced_train_loss: 8.287 | train_step_timing in s: 0.1185 | consumed_samples: 2712 | val_loss: 8.757 Training epoch 0, iteration 339/499 | lr: 0.0002008 | global_batch_size: 8 | global_step: 339 | reduced_train_loss: 8.512 | train_step_timing in s: 0.1201 | consumed_samples: 2720 | val_loss: 8.757 Training epoch 0, iteration 340/499 | lr: 0.0001982 | global_batch_size: 8 | global_step: 340 | reduced_train_loss: 8.567 | train_step_timing in s: 0.117 | consumed_samples: 2728 | val_loss: 8.757 Training epoch 0, iteration 341/499 | lr: 0.0001956 | global_batch_size: 8 | global_step: 341 | reduced_train_loss: 8.621 | train_step_timing in s: 0.1262 | consumed_samples: 2736 | val_loss: 8.757 Training epoch 0, iteration 342/499 | lr: 0.0001931 | global_batch_size: 8 | global_step: 342 | reduced_train_loss: 8.651 | train_step_timing in s: 0.1171 | consumed_samples: 2744 | val_loss: 8.757 Training epoch 0, iteration 343/499 | lr: 0.0001905 | global_batch_size: 8 | global_step: 343 | reduced_train_loss: 8.499 | train_step_timing 
in s: 0.1183 | consumed_samples: 2752 | val_loss: 8.757 Training epoch 0, iteration 344/499 | lr: 0.0001879 | global_batch_size: 8 | global_step: 344 | reduced_train_loss: 8.429 | train_step_timing in s: 0.1144 | consumed_samples: 2760 | val_loss: 8.757 Training epoch 0, iteration 345/499 | lr: 0.0001854 | global_batch_size: 8 | global_step: 345 | reduced_train_loss: 8.615 | train_step_timing in s: 0.1134 | consumed_samples: 2768 | val_loss: 8.757 Training epoch 0, iteration 346/499 | lr: 0.0001829 | global_batch_size: 8 | global_step: 346 | reduced_train_loss: 8.583 | train_step_timing in s: 0.1137 | consumed_samples: 2776 | val_loss: 8.757 Training epoch 0, iteration 347/499 | lr: 0.0001804 | global_batch_size: 8 | global_step: 347 | reduced_train_loss: 8.61 | train_step_timing in s: 0.1202 | consumed_samples: 2784 | val_loss: 8.757 Training epoch 0, iteration 348/499 | lr: 0.0001779 | global_batch_size: 8 | global_step: 348 | reduced_train_loss: 8.532 | train_step_timing in s: 0.1168 | consumed_samples: 2792 | val_loss: 8.757 Training epoch 0, iteration 349/499 | lr: 0.0001754 | global_batch_size: 8 | global_step: 349 | reduced_train_loss: 8.616 | train_step_timing in s: 0.1197 | consumed_samples: 2800 | val_loss: 8.757 Training epoch 0, iteration 350/499 | lr: 0.000173 | global_batch_size: 8 | global_step: 350 | reduced_train_loss: 8.645 | train_step_timing in s: 0.1183 | consumed_samples: 2808 | val_loss: 8.757 Training epoch 0, iteration 351/499 | lr: 0.0001705 | global_batch_size: 8 | global_step: 351 | reduced_train_loss: 8.582 | train_step_timing in s: 0.118 | consumed_samples: 2816 | val_loss: 8.757 Training epoch 0, iteration 352/499 | lr: 0.0001681 | global_batch_size: 8 | global_step: 352 | reduced_train_loss: 8.361 | train_step_timing in s: 0.118 | consumed_samples: 2824 | val_loss: 8.757 Training epoch 0, iteration 353/499 | lr: 0.0001657 | global_batch_size: 8 | global_step: 353 | reduced_train_loss: 8.478 | train_step_timing in s: 0.1129 | consumed_samples: 2832 | val_loss: 8.757 Training epoch 0, iteration 354/499 | lr: 0.0001633 | global_batch_size: 8 | global_step: 354 | reduced_train_loss: 8.534 | train_step_timing in s: 0.1164 | consumed_samples: 2840 | val_loss: 8.757 Training epoch 0, iteration 355/499 | lr: 0.0001609 | global_batch_size: 8 | global_step: 355 | reduced_train_loss: 8.65 | train_step_timing in s: 0.12 | consumed_samples: 2848 | val_loss: 8.757 Training epoch 0, iteration 356/499 | lr: 0.0001585 | global_batch_size: 8 | global_step: 356 | reduced_train_loss: 8.618 | train_step_timing in s: 0.1194 | consumed_samples: 2856 | val_loss: 8.757 Training epoch 0, iteration 357/499 | lr: 0.0001562 | global_batch_size: 8 | global_step: 357 | reduced_train_loss: 8.491 | train_step_timing in s: 0.1192 | consumed_samples: 2864 | val_loss: 8.757 Training epoch 0, iteration 358/499 | lr: 0.0001538 | global_batch_size: 8 | global_step: 358 | reduced_train_loss: 8.491 | train_step_timing in s: 0.1175 | consumed_samples: 2872 | val_loss: 8.757 Training epoch 0, iteration 359/499 | lr: 0.0001515 | global_batch_size: 8 | global_step: 359 | reduced_train_loss: 8.317 | train_step_timing in s: 0.1164 | consumed_samples: 2880 | val_loss: 8.757 Training epoch 0, iteration 360/499 | lr: 0.0001492 | global_batch_size: 8 | global_step: 360 | reduced_train_loss: 8.594 | train_step_timing in s: 0.1183 | consumed_samples: 2888 | val_loss: 8.757 Training epoch 0, iteration 361/499 | lr: 0.0001469 | global_batch_size: 8 | global_step: 361 | reduced_train_loss: 8.342 | 
train_step_timing in s: 0.1181 | consumed_samples: 2896 | val_loss: 8.757 Training epoch 0, iteration 362/499 | lr: 0.0001446 | global_batch_size: 8 | global_step: 362 | reduced_train_loss: 8.449 | train_step_timing in s: 0.1162 | consumed_samples: 2904 | val_loss: 8.757 Training epoch 0, iteration 363/499 | lr: 0.0001424 | global_batch_size: 8 | global_step: 363 | reduced_train_loss: 8.426 | train_step_timing in s: 0.1242 | consumed_samples: 2912 | val_loss: 8.757 Training epoch 0, iteration 364/499 | lr: 0.0001401 | global_batch_size: 8 | global_step: 364 | reduced_train_loss: 8.567 | train_step_timing in s: 0.1125 | consumed_samples: 2920 | val_loss: 8.757 Training epoch 0, iteration 365/499 | lr: 0.0001379 | global_batch_size: 8 | global_step: 365 | reduced_train_loss: 8.567 | train_step_timing in s: 0.118 | consumed_samples: 2928 | val_loss: 8.757 Training epoch 0, iteration 366/499 | lr: 0.0001357 | global_batch_size: 8 | global_step: 366 | reduced_train_loss: 8.56 | train_step_timing in s: 0.1213 | consumed_samples: 2936 | val_loss: 8.757 Training epoch 0, iteration 367/499 | lr: 0.0001335 | global_batch_size: 8 | global_step: 367 | reduced_train_loss: 8.39 | train_step_timing in s: 0.1023 | consumed_samples: 2944 | val_loss: 8.757 Training epoch 0, iteration 368/499 | lr: 0.0001313 | global_batch_size: 8 | global_step: 368 | reduced_train_loss: 8.724 | train_step_timing in s: 0.116 | consumed_samples: 2952 | val_loss: 8.757 Training epoch 0, iteration 369/499 | lr: 0.0001291 | global_batch_size: 8 | global_step: 369 | reduced_train_loss: 8.522 | train_step_timing in s: 0.1181 | consumed_samples: 2960 | val_loss: 8.757 Training epoch 0, iteration 370/499 | lr: 0.000127 | global_batch_size: 8 | global_step: 370 | reduced_train_loss: 8.615 | train_step_timing in s: 0.1192 | consumed_samples: 2968 | val_loss: 8.757 Training epoch 0, iteration 371/499 | lr: 0.0001249 | global_batch_size: 8 | global_step: 371 | reduced_train_loss: 8.482 | train_step_timing in s: 0.1157 | consumed_samples: 2976 | val_loss: 8.757 Training epoch 0, iteration 372/499 | lr: 0.0001228 | global_batch_size: 8 | global_step: 372 | reduced_train_loss: 8.278 | train_step_timing in s: 0.1132 | consumed_samples: 2984 | val_loss: 8.757 Training epoch 0, iteration 373/499 | lr: 0.0001207 | global_batch_size: 8 | global_step: 373 | reduced_train_loss: 8.443 | train_step_timing in s: 0.1175 | consumed_samples: 2992 | val_loss: 8.757 Training epoch 0, iteration 374/499 | lr: 0.0001186 | global_batch_size: 8 | global_step: 374 | reduced_train_loss: 8.672 | train_step_timing in s: 0.12 | consumed_samples: 3000 | val_loss: 8.757 Training epoch 0, iteration 375/499 | lr: 0.0001165 | global_batch_size: 8 | global_step: 375 | reduced_train_loss: 8.343 | train_step_timing in s: 0.1152 | consumed_samples: 3008 | val_loss: 8.757 Training epoch 0, iteration 376/499 | lr: 0.0001145 | global_batch_size: 8 | global_step: 376 | reduced_train_loss: 8.431 | train_step_timing in s: 0.1177 | consumed_samples: 3016 | val_loss: 8.757 Training epoch 0, iteration 377/499 | lr: 0.0001125 | global_batch_size: 8 | global_step: 377 | reduced_train_loss: 8.483 | train_step_timing in s: 0.1133 | consumed_samples: 3024 | val_loss: 8.757 Training epoch 0, iteration 378/499 | lr: 0.0001105 | global_batch_size: 8 | global_step: 378 | reduced_train_loss: 8.545 | train_step_timing in s: 0.1174 | consumed_samples: 3032 | val_loss: 8.757 Training epoch 0, iteration 379/499 | lr: 0.0001085 | global_batch_size: 8 | global_step: 379 | reduced_train_loss: 8.425 
| train_step_timing in s: 0.1184 | consumed_samples: 3040 | val_loss: 8.757 Training epoch 0, iteration 380/499 | lr: 0.0001065 | global_batch_size: 8 | global_step: 380 | reduced_train_loss: 8.567 | train_step_timing in s: 0.1222 | consumed_samples: 3048 | val_loss: 8.757 Training epoch 0, iteration 381/499 | lr: 0.0001045 | global_batch_size: 8 | global_step: 381 | reduced_train_loss: 8.365 | train_step_timing in s: 0.1173 | consumed_samples: 3056 | val_loss: 8.757 Training epoch 0, iteration 382/499 | lr: 0.0001026 | global_batch_size: 8 | global_step: 382 | reduced_train_loss: 8.538 | train_step_timing in s: 0.1206 | consumed_samples: 3064 | val_loss: 8.757 Training epoch 0, iteration 383/499 | lr: 0.0001007 | global_batch_size: 8 | global_step: 383 | reduced_train_loss: 8.421 | train_step_timing in s: 0.1184 | consumed_samples: 3072 | val_loss: 8.757 Training epoch 0, iteration 384/499 | lr: 9.878e-05 | global_batch_size: 8 | global_step: 384 | reduced_train_loss: 8.446 | train_step_timing in s: 0.1206 | consumed_samples: 3080 | val_loss: 8.757 Training epoch 0, iteration 385/499 | lr: 9.69e-05 | global_batch_size: 8 | global_step: 385 | reduced_train_loss: 8.738 | train_step_timing in s: 0.1562 | consumed_samples: 3088 | val_loss: 8.757 Training epoch 0, iteration 386/499 | lr: 9.504e-05 | global_batch_size: 8 | global_step: 386 | reduced_train_loss: 8.504 | train_step_timing in s: 0.1205 | consumed_samples: 3096 | val_loss: 8.757 Training epoch 0, iteration 387/499 | lr: 9.319e-05 | global_batch_size: 8 | global_step: 387 | reduced_train_loss: 8.532 | train_step_timing in s: 0.1168 | consumed_samples: 3104 | val_loss: 8.757 Training epoch 0, iteration 388/499 | lr: 9.137e-05 | global_batch_size: 8 | global_step: 388 | reduced_train_loss: 8.406 | train_step_timing in s: 0.1201 | consumed_samples: 3112 | val_loss: 8.757 Training epoch 0, iteration 389/499 | lr: 8.956e-05 | global_batch_size: 8 | global_step: 389 | reduced_train_loss: 8.377 | train_step_timing in s: 0.1174 | consumed_samples: 3120 | val_loss: 8.757 Training epoch 0, iteration 390/499 | lr: 8.777e-05 | global_batch_size: 8 | global_step: 390 | reduced_train_loss: 8.515 | train_step_timing in s: 0.1207 | consumed_samples: 3128 | val_loss: 8.757 Training epoch 0, iteration 391/499 | lr: 8.6e-05 | global_batch_size: 8 | global_step: 391 | reduced_train_loss: 8.617 | train_step_timing in s: 0.1245 | consumed_samples: 3136 | val_loss: 8.757 Training epoch 0, iteration 392/499 | lr: 8.425e-05 | global_batch_size: 8 | global_step: 392 | reduced_train_loss: 8.385 | train_step_timing in s: 0.1152 | consumed_samples: 3144 | val_loss: 8.757 Training epoch 0, iteration 393/499 | lr: 8.251e-05 | global_batch_size: 8 | global_step: 393 | reduced_train_loss: 8.395 | train_step_timing in s: 0.1194 | consumed_samples: 3152 | val_loss: 8.757 Training epoch 0, iteration 394/499 | lr: 8.08e-05 | global_batch_size: 8 | global_step: 394 | reduced_train_loss: 8.317 | train_step_timing in s: 0.1148 | consumed_samples: 3160 | val_loss: 8.757 Training epoch 0, iteration 395/499 | lr: 7.91e-05 | global_batch_size: 8 | global_step: 395 | reduced_train_loss: 8.427 | train_step_timing in s: 0.116 | consumed_samples: 3168 | val_loss: 8.757 Training epoch 0, iteration 396/499 | lr: 7.742e-05 | global_batch_size: 8 | global_step: 396 | reduced_train_loss: 8.646 | train_step_timing in s: 0.124 | consumed_samples: 3176 | val_loss: 8.757 Training epoch 0, iteration 397/499 | lr: 7.577e-05 | global_batch_size: 8 | global_step: 397 | reduced_train_loss: 
8.504 | train_step_timing in s: 0.1147 | consumed_samples: 3184 | val_loss: 8.757 Training epoch 0, iteration 398/499 | lr: 7.413e-05 | global_batch_size: 8 | global_step: 398 | reduced_train_loss: 8.501 | train_step_timing in s: 0.1141 | consumed_samples: 3192 | val_loss: 8.757 Training epoch 0, iteration 399/499 | lr: 7.251e-05 | global_batch_size: 8 | global_step: 399 | reduced_train_loss: 8.582 | train_step_timing in s: 0.1214 | consumed_samples: 3200 | val_loss: 8.757 Epoch 0, global step 399: 'reduced_train_loss' reached 8.58156 (best 8.44928), saving model to '/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0.ckpt' as top 2 [NeMo I 2025-05-21 01:47:18 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 399 : Start time: 1747792038.671s : Save duration: 0.083s [NeMo I 2025-05-21 01:47:22 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0.ckpt [NeMo I 2025-05-21 01:47:22 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 399 : Start time: 1747792042.561s : Save duration: 0.062s [NeMo I 2025-05-21 01:47:25 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0-last.ckpt [NeMo I 2025-05-21 01:47:26 nemo_logging:393] Successfully saved checkpoint from iteration 399 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0.ckpt [NeMo I 2025-05-21 01:47:26 nemo_logging:393] Async checkpoint save for step 400 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0.ckpt) finalized successfully. [NeMo I 2025-05-21 01:47:26 nemo_logging:393] Successfully saved checkpoint from iteration 399 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0-last.ckpt [NeMo I 2025-05-21 01:47:26 nemo_logging:393] Async checkpoint save for step 400 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.76-step=399-consumed_samples=3200.0-last.ckpt) finalized successfully. 
[NeMo I 2025-05-21 01:47:26 nemo_logging:393] Async finalization time took 0.096 s Validation: iteration 1/2 Validation: iteration 2/2 Validation: iteration 3/2 Validation: iteration 4/2 Validation: iteration 5/2 Validation: iteration 6/2 Validation: iteration 7/2 Validation: iteration 8/2 Training epoch 0, iteration 400/499 | lr: 7.091e-05 | global_batch_size: 8 | global_step: 400 | reduced_train_loss: 8.355 | train_step_timing in s: 0.1124 | consumed_samples: 3208 | val_loss: 8.562 Training epoch 0, iteration 401/499 | lr: 6.933e-05 | global_batch_size: 8 | global_step: 401 | reduced_train_loss: 8.572 | train_step_timing in s: 0.1087 | consumed_samples: 3216 | val_loss: 8.562 Training epoch 0, iteration 402/499 | lr: 6.777e-05 | global_batch_size: 8 | global_step: 402 | reduced_train_loss: 8.561 | train_step_timing in s: 0.1087 | consumed_samples: 3224 | val_loss: 8.562 Training epoch 0, iteration 403/499 | lr: 6.623e-05 | global_batch_size: 8 | global_step: 403 | reduced_train_loss: 8.578 | train_step_timing in s: 0.1062 | consumed_samples: 3232 | val_loss: 8.562 Training epoch 0, iteration 404/499 | lr: 6.471e-05 | global_batch_size: 8 | global_step: 404 | reduced_train_loss: 8.328 | train_step_timing in s: 0.1071 | consumed_samples: 3240 | val_loss: 8.562 Training epoch 0, iteration 405/499 | lr: 6.32e-05 | global_batch_size: 8 | global_step: 405 | reduced_train_loss: 8.471 | train_step_timing in s: 0.1057 | consumed_samples: 3248 | val_loss: 8.562 Training epoch 0, iteration 406/499 | lr: 6.172e-05 | global_batch_size: 8 | global_step: 406 | reduced_train_loss: 8.447 | train_step_timing in s: 0.1083 | consumed_samples: 3256 | val_loss: 8.562 Training epoch 0, iteration 407/499 | lr: 6.026e-05 | global_batch_size: 8 | global_step: 407 | reduced_train_loss: 8.537 | train_step_timing in s: 0.1118 | consumed_samples: 3264 | val_loss: 8.562 Training epoch 0, iteration 408/499 | lr: 5.882e-05 | global_batch_size: 8 | global_step: 408 | reduced_train_loss: 8.538 | train_step_timing in s: 0.1134 | consumed_samples: 3272 | val_loss: 8.562 Training epoch 0, iteration 409/499 | lr: 5.739e-05 | global_batch_size: 8 | global_step: 409 | reduced_train_loss: 8.372 | train_step_timing in s: 0.1165 | consumed_samples: 3280 | val_loss: 8.562 Training epoch 0, iteration 410/499 | lr: 5.599e-05 | global_batch_size: 8 | global_step: 410 | reduced_train_loss: 8.549 | train_step_timing in s: 0.1188 | consumed_samples: 3288 | val_loss: 8.562 Training epoch 0, iteration 411/499 | lr: 5.461e-05 | global_batch_size: 8 | global_step: 411 | reduced_train_loss: 8.624 | train_step_timing in s: 0.1189 | consumed_samples: 3296 | val_loss: 8.562 Training epoch 0, iteration 412/499 | lr: 5.324e-05 | global_batch_size: 8 | global_step: 412 | reduced_train_loss: 8.308 | train_step_timing in s: 0.1091 | consumed_samples: 3304 | val_loss: 8.562 Training epoch 0, iteration 413/499 | lr: 5.19e-05 | global_batch_size: 8 | global_step: 413 | reduced_train_loss: 8.398 | train_step_timing in s: 0.111 | consumed_samples: 3312 | val_loss: 8.562 Training epoch 0, iteration 414/499 | lr: 5.058e-05 | global_batch_size: 8 | global_step: 414 | reduced_train_loss: 8.237 | train_step_timing in s: 0.112 | consumed_samples: 3320 | val_loss: 8.562 Training epoch 0, iteration 415/499 | lr: 4.928e-05 | global_batch_size: 8 | global_step: 415 | reduced_train_loss: 8.525 | train_step_timing in s: 0.1177 | consumed_samples: 3328 | val_loss: 8.562 Training epoch 0, iteration 416/499 | lr: 4.8e-05 | global_batch_size: 8 | global_step: 416 | 
reduced_train_loss: 8.479 | train_step_timing in s: 0.1122 | consumed_samples: 3336 | val_loss: 8.562 Training epoch 0, iteration 417/499 | lr: 4.674e-05 | global_batch_size: 8 | global_step: 417 | reduced_train_loss: 8.471 | train_step_timing in s: 0.118 | consumed_samples: 3344 | val_loss: 8.562 Training epoch 0, iteration 418/499 | lr: 4.55e-05 | global_batch_size: 8 | global_step: 418 | reduced_train_loss: 8.266 | train_step_timing in s: 0.1094 | consumed_samples: 3352 | val_loss: 8.562 Training epoch 0, iteration 419/499 | lr: 4.428e-05 | global_batch_size: 8 | global_step: 419 | reduced_train_loss: 8.357 | train_step_timing in s: 0.1165 | consumed_samples: 3360 | val_loss: 8.562 Training epoch 0, iteration 420/499 | lr: 4.308e-05 | global_batch_size: 8 | global_step: 420 | reduced_train_loss: 8.359 | train_step_timing in s: 0.1127 | consumed_samples: 3368 | val_loss: 8.562 Training epoch 0, iteration 421/499 | lr: 4.19e-05 | global_batch_size: 8 | global_step: 421 | reduced_train_loss: 8.526 | train_step_timing in s: 0.1211 | consumed_samples: 3376 | val_loss: 8.562 Training epoch 0, iteration 422/499 | lr: 4.074e-05 | global_batch_size: 8 | global_step: 422 | reduced_train_loss: 8.485 | train_step_timing in s: 0.1174 | consumed_samples: 3384 | val_loss: 8.562 Training epoch 0, iteration 423/499 | lr: 3.96e-05 | global_batch_size: 8 | global_step: 423 | reduced_train_loss: 8.598 | train_step_timing in s: 0.119 | consumed_samples: 3392 | val_loss: 8.562 Training epoch 0, iteration 424/499 | lr: 3.848e-05 | global_batch_size: 8 | global_step: 424 | reduced_train_loss: 8.338 | train_step_timing in s: 0.111 | consumed_samples: 3400 | val_loss: 8.562 Training epoch 0, iteration 425/499 | lr: 3.739e-05 | global_batch_size: 8 | global_step: 425 | reduced_train_loss: 8.533 | train_step_timing in s: 0.1194 | consumed_samples: 3408 | val_loss: 8.562 Training epoch 0, iteration 426/499 | lr: 3.631e-05 | global_batch_size: 8 | global_step: 426 | reduced_train_loss: 8.424 | train_step_timing in s: 0.118 | consumed_samples: 3416 | val_loss: 8.562 Training epoch 0, iteration 427/499 | lr: 3.526e-05 | global_batch_size: 8 | global_step: 427 | reduced_train_loss: 8.647 | train_step_timing in s: 0.1212 | consumed_samples: 3424 | val_loss: 8.562 Training epoch 0, iteration 428/499 | lr: 3.423e-05 | global_batch_size: 8 | global_step: 428 | reduced_train_loss: 8.412 | train_step_timing in s: 0.1149 | consumed_samples: 3432 | val_loss: 8.562 Training epoch 0, iteration 429/499 | lr: 3.322e-05 | global_batch_size: 8 | global_step: 429 | reduced_train_loss: 8.489 | train_step_timing in s: 0.117 | consumed_samples: 3440 | val_loss: 8.562 Training epoch 0, iteration 430/499 | lr: 3.222e-05 | global_batch_size: 8 | global_step: 430 | reduced_train_loss: 8.392 | train_step_timing in s: 0.1142 | consumed_samples: 3448 | val_loss: 8.562 Training epoch 0, iteration 431/499 | lr: 3.125e-05 | global_batch_size: 8 | global_step: 431 | reduced_train_loss: 8.275 | train_step_timing in s: 0.1131 | consumed_samples: 3456 | val_loss: 8.562 Training epoch 0, iteration 432/499 | lr: 3.031e-05 | global_batch_size: 8 | global_step: 432 | reduced_train_loss: 8.362 | train_step_timing in s: 0.117 | consumed_samples: 3464 | val_loss: 8.562 Training epoch 0, iteration 433/499 | lr: 2.938e-05 | global_batch_size: 8 | global_step: 433 | reduced_train_loss: 8.298 | train_step_timing in s: 0.1171 | consumed_samples: 3472 | val_loss: 8.562 Training epoch 0, iteration 434/499 | lr: 2.847e-05 | global_batch_size: 8 | global_step: 434 | 
reduced_train_loss: 8.578 | train_step_timing in s: 0.1295 | consumed_samples: 3480 | val_loss: 8.562 Training epoch 0, iteration 435/499 | lr: 2.759e-05 | global_batch_size: 8 | global_step: 435 | reduced_train_loss: 8.354 | train_step_timing in s: 0.1012 | consumed_samples: 3488 | val_loss: 8.562 Training epoch 0, iteration 436/499 | lr: 2.672e-05 | global_batch_size: 8 | global_step: 436 | reduced_train_loss: 8.415 | train_step_timing in s: 0.1162 | consumed_samples: 3496 | val_loss: 8.562 Training epoch 0, iteration 437/499 | lr: 2.588e-05 | global_batch_size: 8 | global_step: 437 | reduced_train_loss: 8.521 | train_step_timing in s: 0.1112 | consumed_samples: 3504 | val_loss: 8.562 Training epoch 0, iteration 438/499 | lr: 2.506e-05 | global_batch_size: 8 | global_step: 438 | reduced_train_loss: 8.327 | train_step_timing in s: 0.107 | consumed_samples: 3512 | val_loss: 8.562 Training epoch 0, iteration 439/499 | lr: 2.426e-05 | global_batch_size: 8 | global_step: 439 | reduced_train_loss: 8.566 | train_step_timing in s: 0.1125 | consumed_samples: 3520 | val_loss: 8.562 Training epoch 0, iteration 440/499 | lr: 2.348e-05 | global_batch_size: 8 | global_step: 440 | reduced_train_loss: 8.473 | train_step_timing in s: 0.111 | consumed_samples: 3528 | val_loss: 8.562 Training epoch 0, iteration 441/499 | lr: 2.273e-05 | global_batch_size: 8 | global_step: 441 | reduced_train_loss: 8.637 | train_step_timing in s: 0.115 | consumed_samples: 3536 | val_loss: 8.562 Training epoch 0, iteration 442/499 | lr: 2.199e-05 | global_batch_size: 8 | global_step: 442 | reduced_train_loss: 8.59 | train_step_timing in s: 0.105 | consumed_samples: 3544 | val_loss: 8.562 Training epoch 0, iteration 443/499 | lr: 2.128e-05 | global_batch_size: 8 | global_step: 443 | reduced_train_loss: 8.497 | train_step_timing in s: 0.1052 | consumed_samples: 3552 | val_loss: 8.562 Training epoch 0, iteration 444/499 | lr: 2.059e-05 | global_batch_size: 8 | global_step: 444 | reduced_train_loss: 8.498 | train_step_timing in s: 0.1156 | consumed_samples: 3560 | val_loss: 8.562 Training epoch 0, iteration 445/499 | lr: 1.992e-05 | global_batch_size: 8 | global_step: 445 | reduced_train_loss: 8.585 | train_step_timing in s: 0.115 | consumed_samples: 3568 | val_loss: 8.562 Training epoch 0, iteration 446/499 | lr: 1.927e-05 | global_batch_size: 8 | global_step: 446 | reduced_train_loss: 8.457 | train_step_timing in s: 0.1127 | consumed_samples: 3576 | val_loss: 8.562 Training epoch 0, iteration 447/499 | lr: 1.864e-05 | global_batch_size: 8 | global_step: 447 | reduced_train_loss: 8.428 | train_step_timing in s: 0.1119 | consumed_samples: 3584 | val_loss: 8.562 Training epoch 0, iteration 448/499 | lr: 1.804e-05 | global_batch_size: 8 | global_step: 448 | reduced_train_loss: 8.33 | train_step_timing in s: 0.1104 | consumed_samples: 3592 | val_loss: 8.562 Training epoch 0, iteration 449/499 | lr: 1.746e-05 | global_batch_size: 8 | global_step: 449 | reduced_train_loss: 8.4 | train_step_timing in s: 0.1054 | consumed_samples: 3600 | val_loss: 8.562 Training epoch 0, iteration 450/499 | lr: 1.69e-05 | global_batch_size: 8 | global_step: 450 | reduced_train_loss: 8.348 | train_step_timing in s: 0.1014 | consumed_samples: 3608 | val_loss: 8.562 Training epoch 0, iteration 451/499 | lr: 1.636e-05 | global_batch_size: 8 | global_step: 451 | reduced_train_loss: 8.459 | train_step_timing in s: 0.1073 | consumed_samples: 3616 | val_loss: 8.562 Training epoch 0, iteration 452/499 | lr: 1.584e-05 | global_batch_size: 8 | global_step: 452 | 
reduced_train_loss: 8.396 | train_step_timing in s: 0.1066 | consumed_samples: 3624 | val_loss: 8.562 Training epoch 0, iteration 453/499 | lr: 1.534e-05 | global_batch_size: 8 | global_step: 453 | reduced_train_loss: 8.143 | train_step_timing in s: 0.1092 | consumed_samples: 3632 | val_loss: 8.562 Training epoch 0, iteration 454/499 | lr: 1.487e-05 | global_batch_size: 8 | global_step: 454 | reduced_train_loss: 8.48 | train_step_timing in s: 0.1143 | consumed_samples: 3640 | val_loss: 8.562 Training epoch 0, iteration 455/499 | lr: 1.442e-05 | global_batch_size: 8 | global_step: 455 | reduced_train_loss: 8.45 | train_step_timing in s: 0.1095 | consumed_samples: 3648 | val_loss: 8.562 Training epoch 0, iteration 456/499 | lr: 1.399e-05 | global_batch_size: 8 | global_step: 456 | reduced_train_loss: 8.68 | train_step_timing in s: 0.1157 | consumed_samples: 3656 | val_loss: 8.562 Training epoch 0, iteration 457/499 | lr: 1.358e-05 | global_batch_size: 8 | global_step: 457 | reduced_train_loss: 8.359 | train_step_timing in s: 0.1064 | consumed_samples: 3664 | val_loss: 8.562 Training epoch 0, iteration 458/499 | lr: 1.319e-05 | global_batch_size: 8 | global_step: 458 | reduced_train_loss: 8.31 | train_step_timing in s: 0.1135 | consumed_samples: 3672 | val_loss: 8.562 Training epoch 0, iteration 459/499 | lr: 1.283e-05 | global_batch_size: 8 | global_step: 459 | reduced_train_loss: 8.439 | train_step_timing in s: 0.1108 | consumed_samples: 3680 | val_loss: 8.562 Training epoch 0, iteration 460/499 | lr: 1.249e-05 | global_batch_size: 8 | global_step: 460 | reduced_train_loss: 8.484 | train_step_timing in s: 0.1151 | consumed_samples: 3688 | val_loss: 8.562 Training epoch 0, iteration 461/499 | lr: 1.217e-05 | global_batch_size: 8 | global_step: 461 | reduced_train_loss: 8.456 | train_step_timing in s: 0.1078 | consumed_samples: 3696 | val_loss: 8.562 Training epoch 0, iteration 462/499 | lr: 1.187e-05 | global_batch_size: 8 | global_step: 462 | reduced_train_loss: 8.491 | train_step_timing in s: 0.1226 | consumed_samples: 3704 | val_loss: 8.562 Training epoch 0, iteration 463/499 | lr: 1.159e-05 | global_batch_size: 8 | global_step: 463 | reduced_train_loss: 8.365 | train_step_timing in s: 0.1122 | consumed_samples: 3712 | val_loss: 8.562 Training epoch 0, iteration 464/499 | lr: 1.134e-05 | global_batch_size: 8 | global_step: 464 | reduced_train_loss: 8.569 | train_step_timing in s: 0.1185 | consumed_samples: 3720 | val_loss: 8.562 Training epoch 0, iteration 465/499 | lr: 1.111e-05 | global_batch_size: 8 | global_step: 465 | reduced_train_loss: 8.444 | train_step_timing in s: 0.1181 | consumed_samples: 3728 | val_loss: 8.562 Training epoch 0, iteration 466/499 | lr: 1.09e-05 | global_batch_size: 8 | global_step: 466 | reduced_train_loss: 8.536 | train_step_timing in s: 0.1179 | consumed_samples: 3736 | val_loss: 8.562 Training epoch 0, iteration 467/499 | lr: 1.071e-05 | global_batch_size: 8 | global_step: 467 | reduced_train_loss: 8.355 | train_step_timing in s: 0.1109 | consumed_samples: 3744 | val_loss: 8.562 Training epoch 0, iteration 468/499 | lr: 1.054e-05 | global_batch_size: 8 | global_step: 468 | reduced_train_loss: 8.512 | train_step_timing in s: 0.1178 | consumed_samples: 3752 | val_loss: 8.562 Training epoch 0, iteration 469/499 | lr: 1.04e-05 | global_batch_size: 8 | global_step: 469 | reduced_train_loss: 8.599 | train_step_timing in s: 0.1161 | consumed_samples: 3760 | val_loss: 8.562 Training epoch 0, iteration 470/499 | lr: 1.028e-05 | global_batch_size: 8 | global_step: 
470 | reduced_train_loss: 8.49 | train_step_timing in s: 0.11 | consumed_samples: 3768 | val_loss: 8.562 Training epoch 0, iteration 471/499 | lr: 1.018e-05 | global_batch_size: 8 | global_step: 471 | reduced_train_loss: 8.403 | train_step_timing in s: 0.1079 | consumed_samples: 3776 | val_loss: 8.562 Training epoch 0, iteration 472/499 | lr: 1.01e-05 | global_batch_size: 8 | global_step: 472 | reduced_train_loss: 8.467 | train_step_timing in s: 0.1154 | consumed_samples: 3784 | val_loss: 8.562 Training epoch 0, iteration 473/499 | lr: 1.004e-05 | global_batch_size: 8 | global_step: 473 | reduced_train_loss: 8.587 | train_step_timing in s: 0.1179 | consumed_samples: 3792 | val_loss: 8.562 Training epoch 0, iteration 474/499 | lr: 1.001e-05 | global_batch_size: 8 | global_step: 474 | reduced_train_loss: 8.59 | train_step_timing in s: 0.1116 | consumed_samples: 3800 | val_loss: 8.562 Training epoch 0, iteration 475/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 475 | reduced_train_loss: 8.273 | train_step_timing in s: 0.1167 | consumed_samples: 3808 | val_loss: 8.562 Training epoch 0, iteration 476/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 476 | reduced_train_loss: 8.507 | train_step_timing in s: 0.1204 | consumed_samples: 3816 | val_loss: 8.562 Training epoch 0, iteration 477/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 477 | reduced_train_loss: 8.538 | train_step_timing in s: 0.1179 | consumed_samples: 3824 | val_loss: 8.562 Training epoch 0, iteration 478/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 478 | reduced_train_loss: 8.322 | train_step_timing in s: 0.1191 | consumed_samples: 3832 | val_loss: 8.562 Training epoch 0, iteration 479/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 479 | reduced_train_loss: 8.473 | train_step_timing in s: 0.1197 | consumed_samples: 3840 | val_loss: 8.562 Training epoch 0, iteration 480/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 480 | reduced_train_loss: 8.406 | train_step_timing in s: 0.1115 | consumed_samples: 3848 | val_loss: 8.562 Training epoch 0, iteration 481/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 481 | reduced_train_loss: 8.419 | train_step_timing in s: 0.1174 | consumed_samples: 3856 | val_loss: 8.562 Training epoch 0, iteration 482/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 482 | reduced_train_loss: 8.382 | train_step_timing in s: 0.1189 | consumed_samples: 3864 | val_loss: 8.562 Training epoch 0, iteration 483/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 483 | reduced_train_loss: 8.416 | train_step_timing in s: 0.1183 | consumed_samples: 3872 | val_loss: 8.562 Training epoch 0, iteration 484/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 484 | reduced_train_loss: 8.458 | train_step_timing in s: 0.1197 | consumed_samples: 3880 | val_loss: 8.562 Training epoch 0, iteration 485/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 485 | reduced_train_loss: 8.42 | train_step_timing in s: 0.1193 | consumed_samples: 3888 | val_loss: 8.562 Training epoch 0, iteration 486/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 486 | reduced_train_loss: 8.551 | train_step_timing in s: 0.1199 | consumed_samples: 3896 | val_loss: 8.562 Training epoch 0, iteration 487/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 487 | reduced_train_loss: 8.333 | train_step_timing in s: 0.1187 | consumed_samples: 3904 | val_loss: 8.562 Training epoch 0, iteration 488/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 488 | reduced_train_loss: 8.548 | train_step_timing 
in s: 0.1191 | consumed_samples: 3912 | val_loss: 8.562 Training epoch 0, iteration 489/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 489 | reduced_train_loss: 8.564 | train_step_timing in s: 0.1177 | consumed_samples: 3920 | val_loss: 8.562 Training epoch 0, iteration 490/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 490 | reduced_train_loss: 8.4 | train_step_timing in s: 0.1185 | consumed_samples: 3928 | val_loss: 8.562 Training epoch 0, iteration 491/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 491 | reduced_train_loss: 8.59 | train_step_timing in s: 0.1188 | consumed_samples: 3936 | val_loss: 8.562 Training epoch 0, iteration 492/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 492 | reduced_train_loss: 8.352 | train_step_timing in s: 0.1137 | consumed_samples: 3944 | val_loss: 8.562 Training epoch 0, iteration 493/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 493 | reduced_train_loss: 8.476 | train_step_timing in s: 0.1149 | consumed_samples: 3952 | val_loss: 8.562 Training epoch 0, iteration 494/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 494 | reduced_train_loss: 8.577 | train_step_timing in s: 0.1129 | consumed_samples: 3960 | val_loss: 8.562 Training epoch 0, iteration 495/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 495 | reduced_train_loss: 8.425 | train_step_timing in s: 0.1078 | consumed_samples: 3968 | val_loss: 8.562 Training epoch 0, iteration 496/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 496 | reduced_train_loss: 8.423 | train_step_timing in s: 0.1185 | consumed_samples: 3976 | val_loss: 8.562 Training epoch 0, iteration 497/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 497 | reduced_train_loss: 8.28 | train_step_timing in s: 0.1109 | consumed_samples: 3984 | val_loss: 8.562 Training epoch 0, iteration 498/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 498 | reduced_train_loss: 8.478 | train_step_timing in s: 0.1142 | consumed_samples: 3992 | val_loss: 8.562 Training epoch 0, iteration 499/499 | lr: 1e-05 | global_batch_size: 8 | global_step: 499 | reduced_train_loss: 8.211 | train_step_timing in s: 0.1081 | consumed_samples: 4000 | val_loss: 8.562 Epoch 0, global step 499: 'reduced_train_loss' reached 8.21144 (best 8.21144), saving model to '/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0.ckpt' as top 2 [NeMo I 2025-05-21 01:47:38 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 499 : Start time: 1747792058.154s : Save duration: 0.035s [NeMo I 2025-05-21 01:47:41 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0.ckpt [NeMo I 2025-05-21 01:47:41 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 499 : Start time: 1747792061.891s : Save duration: 0.017s [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Scheduled async checkpoint save for /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0-last.ckpt [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Successfully saved checkpoint from iteration 499 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0.ckpt [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Async checkpoint save for step 500 
(/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0.ckpt) finalized successfully. [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Successfully saved checkpoint from iteration 499 to /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0-last.ckpt [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Async checkpoint save for step 500 (/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.56-step=499-consumed_samples=4000.0-last.ckpt) finalized successfully. [NeMo I 2025-05-21 01:47:45 nemo_logging:393] Async finalization time took 0.080 s Validation: iteration 1/2 Validation: iteration 2/2 Validation: iteration 3/2 Validation: iteration 4/2 Validation: iteration 5/2 Validation: iteration 6/2 Validation: iteration 7/2 Validation: iteration 8/2 `Trainer.fit` stopped: `max_steps=500` reached. wandb: wandb: You can sync this run to the cloud by running: wandb: wandb sync /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/wandb/offline-run-20250521_014604-1 wandb: Find logs at: ../../../../../home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/wandb/offline-run-20250521_014604-1/logs
# Find the pretraining checkpoint. NeMo writes checkpoints as directories,
# and the most recent one is suffixed with "-last".
import os

chkpt_dir = f"{single_cell_workdir}/results/geneformer-10m/dev/checkpoints"
last_ckpt = next(d for d in os.listdir(chkpt_dir) if d.endswith("-last"))
pretrained_checkpoint_path = os.path.join(chkpt_dir, last_ckpt)
print(pretrained_checkpoint_path)
/home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=7.94-step=499-consumed_samples=4000.0-last
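Note that next(...) above returns the first matching entry. If several runs have left -last checkpoints in the same results directory, a small variation (a sketch, reusing the chkpt_dir defined above) selects the most recently modified one instead:
# Sketch: pick the most recently modified "-last" checkpoint when several
# runs share the same checkpoints directory.
import os

candidates = [d for d in os.listdir(chkpt_dir) if d.endswith("-last")]
last_ckpt = max(candidates, key=lambda d: os.path.getmtime(os.path.join(chkpt_dir, d)))
pretrained_checkpoint_path = os.path.join(chkpt_dir, last_ckpt)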
Running inference.¶
We can see from the training job above that the model was trained for 500 steps. At the end of training, the experiment manager logs where the resulting .ckpt
checkpoint is written. This checkpoint can be used for fine-tuning, inference, or continuing training from an existing set of model weights. See the example produced below from our run:
[NeMo I 2025-03-11 20:32:11 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 499 : Start time: 1741725131.041s : Save duration: 0.014s
[NeMo I 2025-03-11 20:32:14 nemo_logging:393] Scheduled async checkpoint save for /workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0.ckpt
[NeMo I 2025-03-11 20:32:14 nemo_logging:393] Global Checkpoint Save : Rank: 0 : Iteration: 499 : Start time: 1741725134.016s : Save duration: 0.013s
[NeMo I 2025-03-11 20:32:16 nemo_logging:393] Scheduled async checkpoint save for /workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0-last.ckpt
[NeMo I 2025-03-11 20:32:17 nemo_logging:393] Successfully saved checkpoint from iteration 499 to /workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0.ckpt
[NeMo I 2025-03-11 20:32:17 nemo_logging:393] Async checkpoint save for step 500 (/workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0.ckpt) finalized successfully.
[NeMo I 2025-03-11 20:32:17 nemo_logging:393] Successfully saved checkpoint from iteration 499 to /workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0-last.ckpt
[NeMo I 2025-03-11 20:32:17 nemo_logging:393] Async checkpoint save for step 500 (/workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0-last.ckpt) finalized successfully.
[NeMo I 2025-03-11 20:32:17 nemo_logging:393] Async finalization time took 0.090 s
We take the -last checkpoint logged at the end of the run:
/workspace/bionemo2/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=8.28-step=499-consumed_samples=4000.0-last
and use it for inference.
!infer_geneformer \
--data-dir {test_tutorial_processed_dir} \
--checkpoint-path {pretrained_checkpoint_path} \
--results-path {tutorial_output_dir}
Could not find the bitsandbytes CUDA binary at PosixPath('/usr/local/lib/python3.12/dist-packages/bitsandbytes/libbitsandbytes_cuda129.so') The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable. [NeMo W 2025-05-21 01:48:08 nemo_logging:405] Tokenizer vocab file: /home/ubuntu/.cache/bionemo/d8e3ea569bc43768c24aa651aff77722df202078415528497c22394046b08cc3-singlecell-scdltestdata-20241203.tar.gz.untar/cellxgene_2023-12-15_small_processed_scdl/train/geneformer.vocab already exists. Overwriting... [NeMo I 2025-05-21 01:48:08 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete. [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Resource already exists, skipping download: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_name_id_dict_gc30M.pkl?download=true [NeMo I 2025-05-21 01:48:08 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete. [NeMo I 2025-05-21 01:48:08 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete. [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Resource already exists, skipping download: https://huggingface.co/ctheodoris/Geneformer/resolve/main/geneformer/gene_dictionaries_30m/gene_median_dictionary_gc30M.pkl?download=true [NeMo I 2025-05-21 01:48:08 nemo_logging:393] No checksum provided, filename exists. Assuming it is complete. [NeMo I 2025-05-21 01:48:08 nemo_logging:393] *************** Preprocessing Finished ************ GPU available: True (cuda), used: True TPU available: False, using: 0 TPU cores HPU available: False, using: 0 HPUs [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Fixing mis-match between ddp-config & mcore-optimizer config [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has data parallel group : [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has combined group of data parallel and context parallel : [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All data parallel group ranks with context parallel combined: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Ranks 0 has data parallel rank: 0 [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has context parallel group: [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All context parallel group ranks: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Ranks 0 has context parallel rank: 0 [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has model parallel group: [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All model parallel group ranks: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has tensor model parallel group: [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All tensor model parallel group ranks: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has tensor model parallel rank: 0 [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has pipeline model parallel group: [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has embedding group: [0] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All pipeline model parallel group ranks: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has pipeline model parallel rank 0 [NeMo I 2025-05-21 01:48:08 nemo_logging:393] All embedding group ranks: [[0]] [NeMo I 2025-05-21 01:48:08 nemo_logging:393] Rank 0 has embedding rank: 0 2025-05-21 01:48:08 - nemo.lightning.pytorch.strategies.megatron_strategy - INFO - Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1 
---------------------------------------------------------------------------------------------------- distributed_backend=nccl All distributed processes registered. Starting with 1 processes ---------------------------------------------------------------------------------------------------- 2025-05-21 01:48:09 - /workspaces/bionemo-framework/sub-packages/bionemo-llm/src/bionemo/llm/model/config.py - WARNING - Loading /home/ubuntu/.cache/bionemo/singlecell_tutorial/results/geneformer-10m/dev/checkpoints/epoch=0-val_loss=7.94-step=499-consumed_samples=4000.0-last [NeMo I 2025-05-21 01:48:09 nemo_logging:393] Padded vocab_size: 25472, original vocab_size: 25429, dummy tokens: 43. LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1] [NeMo W 2025-05-21 01:48:09 nemo_logging:405] Could not copy Trainer's 'max_steps' to LR scheduler's 'max_steps'. If you are not using an LR scheduler, this warning can safely be ignored. 2025-05-21 01:48:09 - megatron.core.num_microbatches_calculator - INFO - setting number of microbatches to constant 1 [NeMo I 2025-05-21 01:48:09 nemo_logging:393] > number of parameters on (tensor, pipeline) model parallel rank (0 ,0): 10300032 2025-05-21 01:48:09 - megatron.core.distributed.distributed_data_parallel - INFO - Setting up DistributedDataParallel with config DistributedDataParallelConfig(grad_reduce_in_fp32=True, overlap_grad_reduce=False, overlap_param_gather=False, align_param_gather=False, use_distributed_optimizer=True, num_distributed_optimizer_instances=1, check_for_nan_in_grad=True, check_for_large_grads=False, bucket_size=None, pad_buckets_for_high_nccl_busbw=False, average_in_collective=False, fp8_param_gather=False, use_custom_fsdp=False, data_parallel_sharding_strategy='no_shard', gradient_reduce_div_fusion=True, suggested_communication_unit_size=None, preserve_fp32_weights=True, keep_fp8_transpose_cache_when_using_custom_fsdp=False) 2025-05-21 01:48:09 - megatron.core.distributed.param_and_grad_buffer - INFO - Number of buckets for gradient all-reduce / reduce-scatter: 1 Params for bucket 1 (10300032 elements, 10300032 padded size): module.encoder.layers.5.mlp.linear_fc1.weight module.encoder.layers.3.self_attention.linear_qkv.bias module.encoder.layers.1.self_attention.linear_qkv.bias module.encoder.layers.0.mlp.linear_fc1.bias module.encoder.layers.2.mlp.linear_fc1.layer_norm_bias module.lm_head.dense.bias module.embedding.position_embeddings.weight module.encoder.layers.5.mlp.linear_fc1.layer_norm_bias module.encoder.layers.0.mlp.linear_fc1.layer_norm_weight module.lm_head.layer_norm.bias module.encoder.layers.5.self_attention.linear_proj.bias module.encoder.layers.4.mlp.linear_fc1.layer_norm_weight module.encoder.layers.3.mlp.linear_fc1.layer_norm_bias module.encoder.layers.1.self_attention.linear_proj.weight module.encoder.layers.4.self_attention.linear_qkv.weight module.encoder.layers.0.mlp.linear_fc1.weight module.encoder.layers.1.mlp.linear_fc2.bias module.encoder.layers.0.mlp.linear_fc2.weight module.encoder.layers.5.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.4.mlp.linear_fc2.bias module.encoder.layers.0.mlp.linear_fc1.layer_norm_bias module.encoder.layers.4.self_attention.linear_proj.weight module.encoder.layers.1.self_attention.linear_qkv.weight module.encoder.layers.0.self_attention.linear_qkv.weight module.encoder.layers.2.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.1.self_attention.linear_proj.bias module.encoder.layers.4.self_attention.linear_qkv.bias 
module.encoder.layers.2.mlp.linear_fc1.layer_norm_weight module.encoder.layers.1.mlp.linear_fc2.weight module.encoder.layers.3.self_attention.linear_qkv.weight module.encoder.layers.2.mlp.linear_fc2.bias module.encoder.layers.0.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.5.mlp.linear_fc2.weight module.encoder.layers.2.mlp.linear_fc1.bias module.encoder.layers.1.self_attention.linear_qkv.layer_norm_bias module.encoder.layers.4.self_attention.linear_proj.bias module.encoder.layers.3.mlp.linear_fc1.weight module.encoder.layers.4.mlp.linear_fc1.bias module.encoder.layers.2.mlp.linear_fc2.weight module.encoder.layers.0.self_attention.linear_proj.bias module.encoder.layers.4.self_attention.linear_qkv.layer_norm_bias module.encoder.layers.1.mlp.linear_fc1.bias module.encoder.layers.3.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.5.self_attention.linear_qkv.layer_norm_bias module.encoder.layers.2.mlp.linear_fc1.weight module.lm_head.dense.weight module.encoder.layers.3.mlp.linear_fc1.bias module.encoder.layers.0.mlp.linear_fc2.bias module.encoder.layers.4.mlp.linear_fc1.weight module.encoder.layers.3.self_attention.linear_proj.bias module.encoder.layers.2.self_attention.linear_qkv.layer_norm_bias module.lm_head.layer_norm.weight module.encoder.layers.4.mlp.linear_fc2.weight module.encoder.layers.1.mlp.linear_fc1.weight module.encoder.layers.3.mlp.linear_fc2.weight module.encoder.layers.2.self_attention.linear_proj.bias module.encoder.layers.5.self_attention.linear_qkv.bias module.encoder.layers.3.self_attention.linear_proj.weight module.encoder.layers.0.self_attention.linear_qkv.layer_norm_bias module.encoder.layers.5.mlp.linear_fc2.bias module.encoder.layers.5.self_attention.linear_proj.weight module.encoder.layers.2.self_attention.linear_qkv.bias module.output_layer.bias module.encoder.layers.4.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.2.self_attention.linear_proj.weight module.encoder.layers.5.mlp.linear_fc1.bias module.encoder.layers.3.self_attention.linear_qkv.layer_norm_bias module.encoder.layers.1.self_attention.linear_qkv.layer_norm_weight module.encoder.layers.0.self_attention.linear_qkv.bias module.encoder.layers.5.self_attention.linear_qkv.weight module.encoder.layers.3.mlp.linear_fc1.layer_norm_weight module.encoder.final_layernorm.bias module.encoder.layers.1.mlp.linear_fc1.layer_norm_weight module.embedding.word_embeddings.weight module.encoder.layers.5.mlp.linear_fc1.layer_norm_weight module.encoder.layers.2.self_attention.linear_qkv.weight module.encoder.layers.1.mlp.linear_fc1.layer_norm_bias module.encoder.layers.0.self_attention.linear_proj.weight module.encoder.final_layernorm.weight module.encoder.layers.4.mlp.linear_fc1.layer_norm_bias module.encoder.layers.3.mlp.linear_fc2.bias 2025-05-21 01:48:09 - megatron.core.optimizer - INFO - Setting up optimizer with config OptimizerConfig(optimizer='adam', lr=0.0001, min_lr=None, decoupled_lr=None, decoupled_min_lr=None, weight_decay=0.01, fp16=False, bf16=True, params_dtype=torch.float32, use_precision_aware_optimizer=False, store_param_remainders=True, main_grads_dtype=torch.float32, main_params_dtype=torch.float32, exp_avg_dtype=torch.float32, exp_avg_sq_dtype=torch.float32, loss_scale=None, initial_loss_scale=4294967296, min_loss_scale=1.0, loss_scale_window=1000, hysteresis=2, adam_beta1=0.9, adam_beta2=0.999, adam_eps=1e-08, sgd_momentum=0.9, use_distributed_optimizer=True, overlap_param_gather_with_optimizer_step=False, optimizer_cpu_offload=False, 
optimizer_offload_fraction=0.0, use_torch_optimizer_for_cpu_offload=False, overlap_cpu_optimizer_d2h_h2d=False, pin_cpu_grads=True, pin_cpu_params=True, clip_grad=1.0, log_num_zeros_in_grad=False, barrier_with_L1_time=False, timers=None, config_logger_dir='') 2025-05-21 01:48:09 - root - INFO - Instantiating MegatronPretrainingSampler with total_samples: 232 and consumed_samples: 0 [NeMo W 2025-05-21 01:48:09 nemo_logging:405] /usr/local/lib/python3.12/dist-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=19` in the `DataLoader` to improve performance. 2025-05-21 01:48:13 - root - INFO - Inference predictions are stored in /home/ubuntu/.cache/bionemo/singlecell_tutorial/inference_output/predictions__rank_0.pt dict_keys(['token_logits', 'binary_logits', 'embeddings'])
!ls -altrh {tutorial_output_dir}/
tutorial_output_inference_pickle = f"{tutorial_output_dir}/predictions__rank_0.pt"
!ls -altrh {tutorial_output_inference_pickle}
total 128K drwxr-xr-x 7 ubuntu ubuntu 4.0K May 20 20:50 .. drwxr-xr-x 2 ubuntu ubuntu 4.0K May 20 23:09 . -rw-r--r-- 1 ubuntu ubuntu 118K May 21 01:48 predictions__rank_0.pt -rw-r--r-- 1 ubuntu ubuntu 118K May 21 01:48 /home/ubuntu/.cache/bionemo/singlecell_tutorial/inference_output/predictions__rank_0.pt
Load inference result and cluster with UMAP.¶
Now we will inspect our results. First, we expect one prediction for each cell, so we can compare the shape of the AnnData object to the shape of the predictions produced by our model. After that, we simply pass our embeddings into UMAP and view the result. In this case it's a very poorly trained model on very few cells, so keep expectations low!
The inference results .pt file contains one set of outputs for each cell (see the keys printed below). The token-level outputs (token_logits) contain one entry per input token, whereas embeddings contains the mean embedding over all gene tokens, with special tokens (CLS, MASK, etc.) removed.
# Load inference results with torch load
import torch
inference_results = torch.load(tutorial_output_inference_pickle)
print(inference_results.keys())
print(inference_results["embeddings"].shape)
dict_keys(['token_logits', 'binary_logits', 'embeddings']) torch.Size([232, 256])
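To see at a glance everything the predictions file contains, one can iterate over the dictionary; this is a small sketch (key names are taken from the printed output above, and shapes will differ between runs):
# Print shape and dtype for every tensor in the prediction dictionary;
# some entries may be None depending on the inference configuration.
for key, value in inference_results.items():
    if value is not None:
        print(key, tuple(value.shape), value.dtype)
    else:
        print(key, None)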
# Project the 256-dimensional cell embeddings down to 2D with UMAP.
import umap

reducer = umap.UMAP()
embedding = reducer.fit_transform(inference_results["embeddings"].float())
/usr/local/lib/python3.12/dist-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html from .autonotebook import tqdm as notebook_tqdm /usr/local/lib/python3.12/dist-packages/sklearn/utils/deprecation.py:151: FutureWarning: 'force_all_finite' was renamed to 'ensure_all_finite' in 1.6 and will be removed in 1.8. warnings.warn(
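UMAP is stochastic, so the layout will vary between runs. As a quick deterministic cross-check, the same embeddings can be projected with PCA instead (a sketch using scikit-learn, which is present in the container, as the warning above shows):
# Deterministic 2D projection of the same embeddings for comparison.
from sklearn.decomposition import PCA

embedding_pca = PCA(n_components=2).fit_transform(inference_results["embeddings"].float().numpy())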
print("embedding.shape: ", embedding.shape)
print("adata_test.obs.shape[0]: ", adata_test.obs.shape[0])
assert adata_test.obs.shape[0] == inference_results["embeddings"].shape[0]
embedding.shape: (232, 2) adata_test.obs.shape[0]: 232
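Because the prediction rows line up with the rows of adata_test (the assertion above checks that the counts match; we assume here that the inference dataloader preserves row order), the embeddings can also be stored on the AnnData object so standard single-cell tools can use them. A sketch; the key name "X_geneformer" is an arbitrary choice:
# Store the model embeddings in .obsm for downstream scanpy-style analysis.
adata_test.obsm["X_geneformer"] = inference_results["embeddings"].float().numpy()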
from matplotlib import pyplot as plt

# Attach the 2D UMAP coordinates to the cell metadata so the scatter
# plots can be colored by covariate.
results = adata_test.obs.copy()
results["x"] = embedding[:, 0]
results["y"] = embedding[:, 1]

covariates = ["assay", "development_stage", "dataset_id", "sex"]
fig, axes = plt.subplots(nrows=2, ncols=2, sharex=True, sharey=True, figsize=(10, 10))
for ax, covar in zip(axes.flat, covariates):
    for cov, cov_df in results.groupby(covar):
        ax.scatter(
            cov_df.x,
            cov_df.y,
            s=3,
            alpha=0.75,
            label=cov,
        )
    # Only draw a legend when the number of categories is small enough to fit.
    if len(results[covar].unique()) < 8:
        ax.legend()
    ax.set_title(f"Embeddings by {covar}")
/tmp/ipykernel_518599/3584793474.py:12: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning. for cov, cov_df in results.groupby(covar):
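The pandas FutureWarning above comes from grouping on categorical columns. Passing observed explicitly, as the warning message suggests, keeps today's behavior and silences it; the inner loop would then read:
# Explicit observed=False retains the current grouping behavior and
# silences the FutureWarning.
for cov, cov_df in results.groupby(covar, observed=False):
    ax.scatter(cov_df.x, cov_df.y, s=3, alpha=0.75, label=cov)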
adata_test.obs.columns
Index(['soma_joinid', 'dataset_id', 'assay', 'assay_ontology_term_id', 'cell_type', 'cell_type_ontology_term_id', 'development_stage', 'development_stage_ontology_term_id', 'disease', 'disease_ontology_term_id', 'donor_id', 'is_primary_data', 'self_reported_ethnicity', 'self_reported_ethnicity_ontology_term_id', 'sex', 'sex_ontology_term_id', 'suspension_type', 'tissue', 'tissue_ontology_term_id', 'tissue_general', 'tissue_general_ontology_term_id', 'raw_sum', 'nnz', 'raw_mean_nnz', 'raw_variance_nnz', 'n_measured_vars'], dtype='object')