Troubleshooting#
ONNX Runtime Error when binding input#
When running an ONNX-based model, such as FengWu or Pangu, you may see a runtime error where the model fails to bind input data when using a GPU. The error message may look like:
RuntimeError: Error when binding input: There’s no data transfer registered for copying tensors from Device:[DeviceType:1 MemoryType:0 DeviceId:0]
or
onnxruntime::Provider& onnxruntime::ProviderLibrary::Get() [ONNXRuntimeError] : 1 : FAIL : Failed to load library libonnxruntime_providers_cuda.so with error: libcublasLt.so.11: cannot open shared object file: No such file or directory.
This error indicates that ONNX Runtime is not installed correctly. If you are using CUDA 12, make sure you install it manually with pip following the instructions in the ONNX Runtime documentation. You may also need to manually link the needed libraries; see this GitHub issue for reference.
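As a quick diagnostic (a sketch, assuming `onnxruntime` is importable), you can list the execution providers ONNX Runtime was able to load. If `CUDAExecutionProvider` is absent, the CUDA provider library failed to load and GPU input binding will fail as shown above:

```python
# Check which execution providers ONNX Runtime can actually use.
# If "CUDAExecutionProvider" is missing, the GPU provider library
# (e.g. libonnxruntime_providers_cuda.so) did not load correctly.
try:
    import onnxruntime as ort

    providers = ort.get_available_providers()
except ImportError:
    # onnxruntime itself is not installed
    providers = []

print("CUDA provider available:", "CUDAExecutionProvider" in providers)
```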
ImportError: object is not installed, install manually or using pip#
This error typically arises when the required optional dependencies are not installed on the system. For example:
>>> from earth2studio.data import CDS
>>> CDS()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/earth2studio/earth2studio/data/cds.py", line 90, in __init__
raise ImportError(
ImportError: cdsapi is not installed, install manually or using `pip install earth2studio[data]`
The error message indicates which install group needs to be added. In the above example, running one of the following commands:
uv pip install earth2studio[data]
# Or with pip
pip install earth2studio[data]
# Or if you are a developer
uv sync --extra data
will fix the problem. For additional information refer to the Optional Dependencies section.
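The pattern behind this error can be sketched as a lazy, guarded import: the optional package is imported inside a try/except, and a helpful `ImportError` naming the install group is raised only when the feature is actually used. The helper name below is hypothetical; only `cdsapi` and the error text come from the example above:

```python
# Sketch of the guarded-import pattern for optional dependencies.
try:
    import cdsapi  # optional dependency from the "data" install group
except ImportError:
    cdsapi = None


def make_cds_client():
    """Hypothetical helper illustrating the check done at construction time."""
    if cdsapi is None:
        raise ImportError(
            "cdsapi is not installed, install manually or using "
            "`pip install earth2studio[data]`"
        )
    return cdsapi.Client()
```

This keeps the base install light: the import failure surfaces as an actionable message at the point of use rather than at package import time.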
Flash Attention has a long build time for AIFS#
This is a known issue with the library, with several GitHub issues on the subject. There are a few options to try beyond simply waiting for the build to complete.
If using a docker container is possible, the PyTorch docker container on NGC has flash attention already built inside of it. See PyTorch Docker Container for details on how to install Earth2Studio inside a container.
Speed up compilation by increasing the number of jobs used during the build process. The upper limit depends on the system's memory; setting it too high may crash the build:
# Ninja build jobs, increase depending on system memory
export MAX_JOBS=8
Disable features of the library that are not needed for inference:
# https://github.com/Dao-AILab/flash-attention/issues/1486
export FLASH_ATTENTION_DISABLE_HDIM128=FALSE
export FLASH_ATTENTION_DISABLE_CLUSTER=FALSE
export FLASH_ATTENTION_DISABLE_BACKWARD=TRUE
export FLASH_ATTENTION_DISABLE_SPLIT=TRUE
export FLASH_ATTENTION_DISABLE_LOCAL=TRUE
export FLASH_ATTENTION_DISABLE_PAGEDKV=TRUE
export FLASH_ATTENTION_DISABLE_FP16=TRUE
export FLASH_ATTENTION_DISABLE_FP8=TRUE
export FLASH_ATTENTION_DISABLE_APPENDKV=TRUE
export FLASH_ATTENTION_DISABLE_VARLEN=TRUE
export FLASH_ATTENTION_DISABLE_PACKGQA=TRUE
export FLASH_ATTENTION_DISABLE_SOFTCAP=TRUE
export FLASH_ATTENTION_DISABLE_HDIM64=TRUE
export FLASH_ATTENTION_DISABLE_HDIM96=TRUE
export FLASH_ATTENTION_DISABLE_HDIM192=TRUE
export FLASH_ATTENTION_DISABLE_HDIM256=TRUE