Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 12:10 pm
by simtkolos
I am trying to install OpenMM with conda on a machine that has multiple CUDA installations in /usr/local, where /usr/local/cuda is a symlink to one of them:

Code:

lrwxrwxrwx  1 root root   19 Sep 11 13:57 cuda -> /usr/local/cuda-8.0/
drwxr-xr-x 19 root root 4096 Sep 11 13:45 cuda-10.0/
drwxr-xr-x 17 root root 4096 Jun  5  2017 cuda-8.0/
drwxr-xr-x 18 root root 4096 Jan  8  2019 cuda-9.1/

The /usr/local directory is read-only for me, so I cannot change the symlink. When I install OpenMM with conda via

Code:

conda install -c omnia/label/cuda100 -c conda-forge openmm

the GPU platform won't work because it tries to link with CUDA 8.0 libraries. Is there a way to modify the conda command to tell it explicitly to use the /usr/local/cuda-10.0 directory instead of the default /usr/local/cuda?

Thanks,

Istvan

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 12:18 pm
by peastman
There are two distinct issues. The first is what libraries it links to. If it's finding the libraries under /usr/local/cuda, that means you probably have LD_LIBRARY_PATH set to include it. Check what the current value is with

Code:

printenv LD_LIBRARY_PATH

Then you can set it to an appropriate value with, for example:

Code:

export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64

The second issue is that OpenMM needs to find the CUDA compiler at runtime. You can use the OPENMM_CUDA_COMPILER environment variable to control that:

Code:

export OPENMM_CUDA_COMPILER=/usr/local/cuda-10.0/bin/nvcc
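
To confirm that these settings are visible to the Python process that will import OpenMM, something like the following may help. It is only a minimal sketch: the helper name `cuda_env_report` is hypothetical and not part of OpenMM, and it does nothing beyond reading the two variables mentioned above and checking that the compiler path points at an existing file.

```python
import os


def cuda_env_report(env=None):
    """Summarize the CUDA-related variables seen by this Python process.

    Hypothetical helper for illustration; it is not part of OpenMM.
    """
    env = os.environ if env is None else env
    compiler = env.get("OPENMM_CUDA_COMPILER")
    # Split LD_LIBRARY_PATH into its entries, dropping empty ones.
    lib_dirs = [d for d in env.get("LD_LIBRARY_PATH", "").split(":") if d]
    return {
        "OPENMM_CUDA_COMPILER": compiler,
        "compiler_exists": bool(compiler) and os.path.isfile(compiler),
        "LD_LIBRARY_PATH": lib_dirs,
    }


if __name__ == "__main__":
    # Run this with the same interpreter (conda environment) that runs OpenMM.
    for key, value in cuda_env_report().items():
        print(f"{key}: {value}")
```

If `OPENMM_CUDA_COMPILER` prints as `None` here, the variable was not exported in the shell that launched Python.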

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 12:45 pm
by simtkolos
I had already tried this, but I get the error message below. My guess was that CUDA 8.0 was still referenced somehow. Does OpenMM link to CUDA only at runtime? How can I tell whether the Anaconda environment where I installed OpenMM sees the values of my Linux environment variables?

Code:

$ python -m simtk.testInstallation

OpenMM Version: 7.4.1
Git Revision: 068f120206160d5151c9af0baf810384bba8d052

There are 4 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Error computing forces with CUDA platform
4 OpenCL - Successfully computed forces

CUDA platform error: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218)

Median difference in forces between platforms:

Reference vs. CPU: 6.29717e-06
Reference vs. OpenCL: 6.75312e-06
CPU vs. OpenCL: 8.07169e-07

All differences are within tolerance.

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 1:02 pm
by peastman
It's possible this is just caused by a cached kernel that was compiled on an earlier run with the 8.0 compiler. OpenMM saves compiled kernels in /tmp to speed up context creation. They all get deleted when you reboot, or you can just delete them yourself. Look in /tmp, and you should see a lot of files whose names are a 20-character hash, followed by the GPU's compute capability and either _32 or _64.
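
A rough way to look for such files is sketched below. The regular expression is an assumption that loosely follows the description above (a long hexadecimal run, then the compute capability, then _32 or _64); the exact naming may differ between OpenMM versions, and `find_cached_kernels` is a hypothetical helper, not an OpenMM function.

```python
import os
import re
import tempfile

# Assumed name shape, loosely following the description above: a run of at
# least 20 hex characters, then the compute capability, then _32 or _64.
CACHE_NAME = re.compile(r"^[0-9a-f]{20,}_?\d+_(32|64)$")


def find_cached_kernels(directory=None):
    """Return sorted paths in `directory` whose names look like cached kernels."""
    # tempfile.gettempdir() honors TMPDIR; on Linux it typically falls back to /tmp.
    directory = tempfile.gettempdir() if directory is None else directory
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if CACHE_NAME.match(name)
        and os.path.isfile(os.path.join(directory, name))
    )


if __name__ == "__main__":
    for path in find_cached_kernels():
        print(path)
```

The sketch only lists candidates; it seems safer to inspect the list before deleting anything.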

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 1:51 pm
by simtkolos
The machine is a cluster front end, so I can't reboot it, and I couldn't find any compiled kernels in /tmp. I tried starting with a clean slate: I removed the entire anaconda3 directory, started a new console, and then ran

Code:

$ export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64
$ export PATH=/usr/local/cuda-10.0/bin/:$PATH
$ export OPENMM_CUDA_COMPILER=/usr/local/cuda-10.0/bin/nvcc
$ bash Anaconda3-2019.10-Linux-x86_64.sh
$ source .bashrc
# make sure envars are still set
(base)$ printenv | grep OPENMM
(base)$ printenv | grep PATH
# install OpenMM
(base)$ conda install -c omnia/label/cuda100 -c conda-forge openmm
(base)$ python -m simtk.testInstallation

OpenMM Version: 7.4.1
Git Revision: 068f120206160d5151c9af0baf810384bba8d052

There are 4 Platforms available:

1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Error computing forces with CUDA platform
4 OpenCL - Successfully computed forces

CUDA platform error: Error loading CUDA module: CUDA_ERROR_INVALID_PTX (218)

Median difference in forces between platforms:

Reference vs. CPU: 6.29856e-06
Reference vs. OpenCL: 6.75312e-06
CPU vs. OpenCL: 8.10363e-07

All differences are within tolerance.

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 1:59 pm
by peastman
Reinstalling wouldn't have any effect on the cached files. Is the environment variable TMPDIR set? If so, that's the directory where they're being created. If not, try setting it to point to a directory you create inside your home directory.
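
One way to try this from Python is to create the directory and set TMPDIR before OpenMM is imported. This is a minimal sketch that assumes, as described above, that OpenMM consults TMPDIR when deciding where to write cached kernels; the helper name `use_private_tmpdir` and the `~/openmm_tmp` path are hypothetical.

```python
import os


def use_private_tmpdir(path="~/openmm_tmp"):
    """Create `path` if needed and point TMPDIR at it for this process."""
    path = os.path.abspath(os.path.expanduser(path))
    os.makedirs(path, exist_ok=True)
    # Assigning to os.environ also updates the process environment seen by
    # native libraries loaded later, so call this before importing OpenMM.
    os.environ["TMPDIR"] = path
    return path


if __name__ == "__main__":
    print("TMPDIR set to", use_private_tmpdir())
```

Exporting TMPDIR in the shell before starting Python should have the same effect.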

Of course there's another option. We provide prebuilt OpenMM libraries for all CUDA releases since 7.5. If the cluster administrators really want you to be using 8.0, you could just install that one by specifying cuda80 instead of cuda100.

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 2:12 pm
by simtkolos
TMPDIR is not set, and setting it to ~/tmp didn't make any difference. CUDA 8.0 will go away from the cluster soon, but there will still be multiple CUDA installations available for different applications, and there is no guarantee that /usr/local/cuda will point to CUDA 10.0 or CUDA 10.1. I'd like to use OpenMM with the latest CUDA, though, and that may not be the default on the cluster.

Re: Conda install with explicit reference to CUDA location

Posted: Thu Jan 30, 2020 3:26 pm
by simtkolos
I wonder if this might be a CUDA driver issue. The cluster has a rather old driver, version 390.87.

Re: Conda install with explicit reference to CUDA location

Posted: Mon Feb 03, 2020 9:38 am
by simtkolos
Yes, it works with a more recent driver. Thanks.
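
The driver explanation is consistent with NVIDIA's minimum driver requirements: each CUDA toolkit release needs a driver at least as new as a given version, and 390.87 is older than the minimum for CUDA 10.0. A minimal sketch of that check follows. The table values are transcribed from NVIDIA's CUDA release notes for Linux and should be verified there for your exact toolkit build; the function names are hypothetical.

```python
# Minimum Linux driver version per CUDA toolkit release (assumed values,
# transcribed from NVIDIA's CUDA release notes; verify before relying on them).
MIN_LINUX_DRIVER = {
    "8.0": (375, 26),
    "9.0": (384, 81),
    "9.1": (390, 46),
    "9.2": (396, 26),
    "10.0": (410, 48),
    "10.1": (418, 39),
}


def parse_version(text):
    """Turn a dotted version string such as '390.87' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_supports(cuda_version, driver_version):
    """Return True if the driver meets the minimum for the given toolkit."""
    return parse_version(driver_version) >= MIN_LINUX_DRIVER[cuda_version]


def supported_toolkits(driver_version):
    """List the toolkit versions in the table that the driver can run."""
    return [v for v in MIN_LINUX_DRIVER if driver_supports(v, driver_version)]


if __name__ == "__main__":
    print(driver_supports("10.0", "390.87"))  # False
    print(supported_toolkits("390.87"))       # ['8.0', '9.0', '9.1']
```

The installed driver version can be read with `nvidia-smi --query-gpu=driver_version --format=csv,noheader`.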