Common Issues¶
1. Compilation failure due to incorrect CUDA_HOME¶
If your default CUDA directory is linked to an old CUDA version (MinkowskiEngine requires CUDA >= 10.0), compilation may fail with a segmentation fault:
NVCC ...
Segmentation fault
To confirm, you should check your paths.
$ echo $CUDA_HOME
/usr/local/cuda
$ ls -al $CUDA_HOME
..... /usr/local/cuda -> /usr/local/cuda-10.2
$ ls /usr/local/
bin cuda cuda-10.2 cuda-11.0 ...
In this case, set the environment variable CUDA_HOME to the correct path and then install MinkowskiEngine.
export CUDA_HOME=/usr/local/cuda-10.2; python setup.py install
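The manual path check above can also be scripted. The following is a minimal sketch, assuming the toolkit directories follow the `cuda-<major>.<minor>` naming shown in the listing. The helper names `cuda_dir_version` and `check_cuda_home` are hypothetical and are not part of MinkowskiEngine.

```python
import os
import re

MIN_CUDA = (10, 0)  # minimum version stated in this section


def cuda_dir_version(path):
    """Extract (major, minor) from a directory name such as cuda-10.2.

    Returns None when the name carries no version (assumed naming scheme).
    """
    match = re.search(r"cuda-(\d+)\.(\d+)$", path.rstrip("/"))
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def check_cuda_home(cuda_home):
    """Resolve symlinks in cuda_home and report whether it meets MIN_CUDA.

    Returns (resolved_path, ok), where ok is None if no version was found.
    """
    resolved = os.path.realpath(cuda_home)
    version = cuda_dir_version(resolved)
    if version is None:
        return resolved, None
    return resolved, version >= MIN_CUDA


if __name__ == "__main__":
    home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    resolved, ok = check_cuda_home(home)
    print(f"CUDA_HOME={home} -> {resolved}, meets minimum: {ok}")
```

If the resolved directory name has no version suffix, the sketch reports `None` rather than guessing; in that case `nvcc --version` is the more reliable check.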
2. Compilation failure due to incorrect CUDA_HOME¶
Some applications modify the environment variable CUDA_HOME in your .bashrc (see #12). This causes the PyTorch C++ extension module to fail, leading to errors such as src/common.hpp:40:10: fatal error: cublas_v2.h: No such file or directory.
If you encounter this issue, try setting CUDA_HOME explicitly.
export CUDA_HOME=/usr/local/cuda; python setup.py install
Alternatively, you can derive CUDA_HOME automatically from the path to nvcc.
export CUDA_HOME=$(dirname $(dirname $(which nvcc))); python setup.py install
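The same derivation can be checked from Python before starting a build. This is a minimal sketch that mirrors the shell one-liner above. It assumes nvcc lives at `<root>/bin/nvcc` and that the cuBLAS header sits under `<root>/include`, which may not hold on every installation. The function names are hypothetical.

```python
import os
import shutil


def cuda_home_from_nvcc(nvcc_path):
    """Return the CUDA root two levels above the nvcc binary (<root>/bin/nvcc)."""
    return os.path.dirname(os.path.dirname(nvcc_path))


def has_cublas_header(cuda_home):
    """Check for the header named in the error message (assumed location)."""
    return os.path.isfile(os.path.join(cuda_home, "include", "cublas_v2.h"))


if __name__ == "__main__":
    nvcc = shutil.which("nvcc")  # None when nvcc is not on PATH
    if nvcc is None:
        print("nvcc not found on PATH; set CUDA_HOME manually")
    else:
        home = cuda_home_from_nvcc(nvcc)
        print(f"CUDA_HOME candidate: {home}")
        print(f"cublas_v2.h present: {has_cublas_header(home)}")
```

If the header is reported missing, the derived directory is probably not a complete toolkit installation, and setting CUDA_HOME by hand is the safer route.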
Compilation failure due to Out Of Memory (OOM)¶
setup.py queries the number of CPUs and uses it for multi-threaded parallel compilation. However, when installing MinkowskiEngine on a cluster, the compilation can fail due to excessive memory usage. Allocate enough memory to the job if you want fast parallel compilation. If memory is limited, another option is to compile without parallelism.
cd /path/to/MinkowskiEngine
make # single threaded compilation
python setup.py install
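When sizing a cluster job, a rough estimate of how many compiler processes fit in memory can help. The sketch below is illustrative only: the per-process memory figure is an assumed placeholder rather than a measured value, and `max_parallel_jobs` is a hypothetical helper, not something setup.py provides.

```python
import os


def max_parallel_jobs(mem_bytes, per_job_bytes, cpu_count=None):
    """Largest job count that fits in memory, capped by the CPU count."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_memory = mem_bytes // per_job_bytes
    # Always allow at least one job so the build can proceed serially.
    return max(1, min(cpu_count, by_memory))


if __name__ == "__main__":
    GIB = 1024 ** 3
    # 4 GiB per compiler process is an illustrative guess, not a measurement.
    print(max_parallel_jobs(mem_bytes=16 * GIB, per_job_bytes=4 * GIB))
```

If the estimate comes out well below the number of CPUs on the node, either request more memory for the job or fall back to the single-threaded build shown above.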
Compilation issues after an upgrade¶
In rare cases, you might face a compilation issue after you upgrade MinkowskiEngine, PyTorch, or CUDA. In general, when you get an undefined symbol error (e.g., _ZNK13CoordsManagerILh5EiE8toStringB5cxx11Ev) or thrust::system::system_error, try to compile the entire library again using one of the following methods.
Force compiling all object files¶
cd /path/to/MinkowskiEngine
make clean
python setup.py install --force
From a new conda virtual environment¶
If the above method does not work, try creating a new conda environment. We found that this sometimes resolves the compilation issues.
conda create -n py3-mink-2 python=3.7 anaconda
conda activate py3-mink-2
conda install openblas numpy
conda install pytorch torchvision -c pytorch
Then,
cd /path/to/MinkowskiEngine
conda activate py3-mink-2
make clean
python setup.py install --force
CUDA Version mismatch: undefined symbol and invalid device function¶
When the conda PyTorch package was built against a different CUDA version than the one used to compile MinkowskiEngine, you might get an undefined symbol error or CUDA error: invalid device function.
Try reinstalling PyTorch with the same CUDA version that you use to compile MinkowskiEngine.
To find out your CUDA version, run nvcc --version.
To install the correct CUDA libraries for Anaconda PyTorch, install cudatoolkit=x.x along with PyTorch. For example,
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch
This example assumes CUDA 10.1; make sure you install the version that matches your setup. Then use the following commands to create a new conda environment and install MinkowskiEngine.
conda create -n py3-mink-2 python=3.7 anaconda
conda activate py3-mink-2
conda install openblas numpy
conda install pytorch torchvision cudatoolkit=10.1 -c pytorch # Make sure to use the correct cudatoolkit version
cd /path/to/MinkowskiEngine
conda activate py3-mink-2
make clean
python setup.py install --force
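To confirm that the two versions agree before rebuilding, the output of `nvcc --version` can be compared with `torch.version.cuda`. The sketch below is one way to do that, assuming the nvcc output contains a line of the form `release <major>.<minor>`. The helper names are hypothetical.

```python
import re
import subprocess


def parse_nvcc_release(output):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def parse_major_minor(version):
    """Turn a version string such as '10.1' into (10, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(nvcc_output, torch_cuda):
    """True when nvcc and PyTorch report the same CUDA major.minor."""
    nvcc_version = parse_nvcc_release(nvcc_output)
    if nvcc_version is None or torch_cuda is None:
        return False  # unknown nvcc output or a CPU-only PyTorch build
    return nvcc_version == parse_major_minor(torch_cuda)


if __name__ == "__main__":
    try:
        import torch  # torch.version.cuda is None on CPU-only builds

        out = subprocess.run(
            ["nvcc", "--version"], capture_output=True, text=True, check=True
        ).stdout
        print("CUDA versions match:", versions_match(out, torch.version.cuda))
    except (ImportError, OSError, subprocess.CalledProcessError) as exc:
        print("could not check:", exc)
```

A `False` result suggests reinstalling PyTorch with the matching cudatoolkit version, as described above, before compiling again.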
GPU Out-Of-Memory during training¶
Unlike neural networks with dense tensors, where every input batch requires the same number of bytes, sparse tensors have a different number of non-zero elements (length) in each batch. Whenever the current batch is larger than the memory already allocated, new memory must be allocated. Such repeated allocation can lead to an Out-Of-Memory error, so you should clear the GPU cache at regular intervals.
def training(...):
    ...
    sinput = ME.SparseTensor(...)
    loss = criterion(...)
    loss.backward()
    optimizer.step()
    ...
    torch.cuda.empty_cache()
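If clearing the cache on every iteration turns out to be too slow, the call can be made every N steps instead. The following is a minimal sketch of that pattern. The function `train`, its parameters, and the default interval of 100 are hypothetical choices for illustration, and the cache-clearing function is passed in so the loop itself does not depend on a GPU.

```python
def train(batches, step_fn, empty_cache, clear_every=100):
    """Run step_fn on each batch and call empty_cache every clear_every steps.

    Returns the number of times the cache was cleared.
    """
    cleared = 0
    for step, batch in enumerate(batches, start=1):
        step_fn(batch)  # forward pass, backward pass, and optimizer step
        if step % clear_every == 0:
            empty_cache()  # e.g. torch.cuda.empty_cache
            cleared += 1
    return cleared
```

In a real training script this would be called as `train(loader, step_fn, torch.cuda.empty_cache)`. A smaller interval keeps peak cached memory lower, while a larger one avoids releasing memory that the next batches would reuse, so the right value likely depends on how much batch sizes vary.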
Issues not listed¶
If you have trouble installing MinkowskiEngine, please feel free to submit an issue on the MinkowskiEngine GitHub page.