Don’t Yet Trust the Model, Test the Physics

AI physics models are advancing quickly and are beginning to prove their value in enterprise engineering workflows. But a critical bottleneck remains: rigorous, repeatable evaluation. Comparing a new model against the current state of the art still too often means stitching together datasets, metrics, scripts, and baselines by hand. That keeps evaluation behind a skill curtain and slows down both model development and domain expert adoption. To move the state of the art forward quickly, we need to make evaluation easier for the people who understand the physics, the data, and the edge cases best. Their feedback lets AI researchers target specific failure modes and build better models. This post highlights the new and improved PhysicsNeMo CFD module, which addresses these gaps and strengthens the loop between model developers and model evaluators.

The Consistency Gap

SciML researchers building the next generation of AI physics models need a consistent way to compare against current baselines, but today's evaluation process is often fragmented. Teams scrape different repositories, reproduce inconsistent setups, and hand-build metric pipelines before they can make an apples-to-apples comparison. The target engineering community has high standards for reliability, physical consistency, and generalization; without transparent benchmarks, domain experts are left in a wait-and-see posture.

Validation is important

Holistic benchmarking: Metrics should capture physical consistency and engineering relevance, not just pointwise L2 error.
Consistent benchmarking: Models should be evaluated on shared datasets, metrics, and reporting formats so comparisons are fair.
Leaderboards: A current snapshot of state-of-the-art performance helps guide model development, ablations, and ideation.
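To make the first point concrete, here is a minimal sketch of why pointwise L2 error alone can mislead. It is not taken from PhysicsNeMo CFD; the function names, the four-panel "surface", and the numbers are all illustrative assumptions. Two predictions have the same relative L2 error, yet only one of them gets the integrated force right.

```python
import math

def relative_l2_error(pred, ref):
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den

def integrated_force(pressure, panel_areas):
    """Toy integrated quantity: area-weighted sum of panel pressures."""
    return sum(p * a for p, a in zip(pressure, panel_areas))

# Hypothetical reference pressures on four equal-area panels.
ref = [1.0, 2.0, 3.0, 4.0]
areas = [0.25, 0.25, 0.25, 0.25]

# Same error magnitude on every panel, different sign patterns.
pred_cancel = [r + s * 0.1 for r, s in zip(ref, [1, -1, 1, -1])]  # errors cancel when integrated
pred_bias = [r + 0.1 for r in ref]                                # errors accumulate when integrated

l2_cancel = relative_l2_error(pred_cancel, ref)
l2_bias = relative_l2_error(pred_bias, ref)

force_ref = integrated_force(ref, areas)
force_err_cancel = integrated_force(pred_cancel, areas) - force_ref
force_err_bias = integrated_force(pred_bias, areas) - force_ref

print(f"relative L2: cancel={l2_cancel:.4f}, bias={l2_bias:.4f}")
print(f"force error: cancel={force_err_cancel:.4f}, bias={force_err_bias:.4f}")
```

A leaderboard that ranked these two predictions by L2 error alone would call them a tie, while an engineer interested in drag or lift would not.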

The Expertise Gap

Evaluating an AI physics surrogate for a CFD use case, such as aerodynamic flow over a vehicle or conjugate heat transfer over a chip, often requires bespoke code and manual data plumbing. That creates a barrier for domain experts: the engineers who have the physical intuition to recognize non-physical behavior that a standard loss function would not capture.

Accessibility Matters

Verification of physical consistency: A model can achieve a low mean squared error (MSE) while still violating important physical constraints. Domain experts need intuitive diagnostics and visualizations that surface those silent failures.
Stress-testing edge cases: Domain experts know the breaking points of simulation workflows, from extreme turbulence to unusual operating regimes. Accessible evaluation lets them probe models where failures are most likely.
Trust and adoption: For AI to be used in real-world engineering, the black box must be paired with transparent benchmarks that experts can inspect, extend, and discuss in the language of their discipline.
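As a rough illustration of a silent failure, the sketch below compares two predictions of a 2D velocity field whose reference is divergence-free, as in incompressible flow. Everything here is an assumption made for the example: the toy field u = x, v = -y, the grid size, the perturbations, and the forward-difference divergence check. Both predictions have the same MSE, but only one still conserves mass.

```python
def mse(pred, ref):
    """Mean squared error over two equally shaped 2D grids."""
    diffs = [(p - r) ** 2 for prow, rrow in zip(pred, ref) for p, r in zip(prow, rrow)]
    return sum(diffs) / len(diffs)

def max_abs_divergence(u, v, h):
    """Largest |du/dx + dv/dy| using forward differences; u[i][j] sits at (i*h, j*h)."""
    n = len(u)
    worst = 0.0
    for i in range(n - 1):
        for j in range(n - 1):
            div = (u[i + 1][j] - u[i][j]) / h + (v[i][j + 1] - v[i][j]) / h
            worst = max(worst, abs(div))
    return worst

n, h, eps = 11, 0.1, 0.01

# Reference field u = x, v = -y, which has zero divergence.
u_ref = [[i * h for j in range(n)] for i in range(n)]
v_ref = [[-j * h for j in range(n)] for i in range(n)]

# Prediction A: constant offset in u (still divergence-free).
u_offset = [[u_ref[i][j] + eps for j in range(n)] for i in range(n)]
# Prediction B: same error magnitude, but the sign alternates along x.
u_checker = [[u_ref[i][j] + eps * (-1) ** i for j in range(n)] for i in range(n)]

mse_offset = mse(u_offset, u_ref)
mse_checker = mse(u_checker, u_ref)
div_offset = max_abs_divergence(u_offset, v_ref, h)
div_checker = max_abs_divergence(u_checker, v_ref, h)

print(f"MSE:        offset={mse_offset:.2e}, checker={mse_checker:.2e}")
print(f"max |div|:  offset={div_offset:.2e}, checker={div_checker:.2e}")
```

The loss value cannot tell these two predictions apart; the divergence diagnostic can, which is the kind of check a domain expert would reach for first.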

Path Toward Democratization

To bring more experts into the fold, the field needs evaluation workflows that are:

Representative: Pretrained models and datasets should cover meaningful CFD use cases so experts can explore realistic behavior quickly.
Low-code: Engineers should be able to evaluate models on their own data by writing minimal wrappers and adapters.
Collaborative: Open libraries should make it easy to contribute challenge cases, custom metrics, and diagnostics that reveal where current models struggle.

The Bottom Line: If we want AI to solve the next generation of physics problems, the referees of these models should not have to become AI experts first. By making evaluation accessible, we turn domain experts from skeptical observers into active architects of more robust AI.

Introducing PhysicsNeMo CFD

PhysicsNeMo CFD is a sub-module of the NVIDIA PhysicsNeMo framework that provides tools for integrating pretrained AI models into engineering and CFD workflows. It includes config-driven workflows for surrogate model evaluation and benchmarking, along with utilities for model wrappers, dataset adapters, metric computation, and postprocessing.

This makes evaluation simpler for engineers and domain experts: many experiments require only a few YAML configuration changes. Advanced users can extend the same workflow with custom datasets, custom model wrappers, and new metrics while still reporting results through a consistent harness.

Let's look at GeoTransolver as an example. After installing the module like any other Python package, you can use the benchmarking workflow to evaluate a pretrained GeoTransolver checkpoint on DrivAerML or your own compatible CFD dataset. The workflow reports metrics such as L2 pressure, L2 turbulent viscosity, L2 velocity, and integrated quantities such as drag and lift coefficient errors. For the default volume benchmark, evaluation comes down to running one script with the right configuration file.

# Volume benchmark
python main.py --config-name=config_volume

The configuration file is designed to be easy to customize. You can swap the model checkpoint, dataset path, inference domain, or metric list from the built-in wrappers, and advanced users can register their own models, datasets, metrics, and visuals.

benchmark:
  ...
  models:
    - name: "geotransolver"
      inference_domain: volume
      checkpoint: /path/to/checkpoint
      ...
  datasets:
    - name: "drivaerml"
      root: /path/to/dataset
      ...

metrics:
  - l2_pressure
  - l2_turbulent_viscosity
  - l2_velocity
...
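One way to picture what a config-driven harness does with a file like this is to expand it into a list of evaluation runs, one per model and dataset pair. The sketch below is an assumption about the general pattern, not a description of PhysicsNeMo CFD internals. The `plan_runs` helper is hypothetical, and the dictionary simply mirrors the keys shown in the YAML above, with the elided fields left out.

```python
from itertools import product

# Plain-dict mirror of the YAML snippet; elided ("...") fields are omitted.
config = {
    "benchmark": {
        "models": [
            {
                "name": "geotransolver",
                "inference_domain": "volume",
                "checkpoint": "/path/to/checkpoint",
            }
        ],
        "datasets": [{"name": "drivaerml", "root": "/path/to/dataset"}],
    },
    "metrics": ["l2_pressure", "l2_turbulent_viscosity", "l2_velocity"],
}

def plan_runs(cfg):
    """Hypothetical helper: pair every model with every dataset and attach the metric list."""
    bench = cfg["benchmark"]
    return [
        {"model": m["name"], "dataset": d["name"], "metrics": list(cfg["metrics"])}
        for m, d in product(bench["models"], bench["datasets"])
    ]

runs = plan_runs(config)
for run in runs:
    print(run)
```

Under this pattern, adding a second model entry to the config doubles the number of runs without any change to the evaluation code, which is what keeps comparisons consistent.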

To make exploration easier, the NVIDIA PhysicsNeMo team has also published pretrained checkpoints on Hugging Face for model architectures such as GeoTransolver, DoMINO, MeshGraphNet, and FigConvNet. These checkpoints give teams a starting point for reproducing benchmark results before bringing in their own models or data. The figure and the table below show the kind of validation and analysis that domain experts can perform using the benchmarking workflow with custom models, datasets, or metrics.

This is just the beginning. We will continue building PhysicsNeMo CFD to help model builders and domain experts meet on common evaluation ground. A leaderboard can provide a living snapshot of state-of-the-art performance, helping researchers understand where current models excel, where they fail, and what they need to eclipse. Join the conversation on GitHub by sharing issues, datasets, metrics, models, and challenge cases.

Fig 1: Outputs from the benchmarking notebook using the DrivAerML dataset.

Getting Started

Start with the PhysicsNeMo CFD GitHub repository. Install the package in your Python environment, or clone the repository if you want to run the example workflows locally. From there, choose the benchmark assets you want to evaluate against and point the YAML configuration to the relevant checkpoint, dataset, and metrics.

For the GeoTransolver example above, the workflow is intentionally lightweight: run the benchmark script from the benchmarking workflow with the volume configuration, then swap in a different checkpoint, dataset path, or metric list as needed. Advanced users can register custom model wrappers, dataset adapters, metrics, and visuals without rewriting the full harness.
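For readers unfamiliar with the registration pattern, the sketch below shows one common way a harness can let users plug in custom metrics by name. It is a generic illustration only: `register_metric`, `METRICS`, and `evaluate` are hypothetical names and not the PhysicsNeMo CFD API, so check the repository documentation for the actual extension points.

```python
from typing import Callable, Dict, Sequence

MetricFn = Callable[[Sequence[float], Sequence[float]], float]

# Hypothetical registry mapping metric names (as they might appear in a config) to functions.
METRICS: Dict[str, MetricFn] = {}

def register_metric(name: str):
    """Decorator that adds a metric function to the registry under `name`."""
    def decorator(fn: MetricFn) -> MetricFn:
        if name in METRICS:
            raise ValueError(f"metric '{name}' is already registered")
        METRICS[name] = fn
        return fn
    return decorator

@register_metric("max_abs_error")
def max_abs_error(pred, ref):
    """Worst-case pointwise error, useful for spotting localized failures."""
    return max(abs(p - r) for p, r in zip(pred, ref))

@register_metric("mean_abs_error")
def mean_abs_error(pred, ref):
    """Average pointwise error magnitude."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)

def evaluate(pred, ref, metric_names):
    """Look up each requested metric by name and compute it."""
    return {name: METRICS[name](pred, ref) for name in metric_names}

results = evaluate([1.0, 2.5], [1.0, 2.0], ["max_abs_error", "mean_abs_error"])
print(results)
```

The appeal of this design is that a new metric only needs a name and a function; the harness that reads the metric list from the config does not change.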

The output should give both model builders and domain experts a shared readout of model behavior: field-level errors, integrated engineering quantities, physics-aware diagnostics, and visual comparisons that make failure modes easier to inspect.

How can you contribute?

PhysicsNeMo CFD is intended to grow with the community. Contributions can be as small as a bug report or documentation fix, or as substantial as a new benchmark dataset, model wrapper, metric, visualization, or challenge case.

If you are building AI physics models, contribute workflows that make your evaluation reproducible. If you are a domain expert, contribute edge cases and diagnostics that reveal failure modes standard loss curves miss. The goal is to make the benchmark harness a shared meeting ground where model builders and engineering experts can test, compare, and improve CFD surrogates together.

To get involved, open an issue or pull request in the PhysicsNeMo CFD GitHub repository and share what you are evaluating, what data or metrics matter, and where current models need to improve.
