CUDA_ERROR_LAUNCH_FAILED Error

Saurabh Belsare
Posts: 32
Joined: Sat Aug 14, 2010 8:43 am

CUDA_ERROR_LAUNCH_FAILED Error

Post by Saurabh Belsare » Fri Jun 17, 2016 10:09 am

Hi,

I'm running a protein-in-water-box system with the AMOEBA force field on an iMac NVIDIA GPU, using the CUDA platform of OpenMM 6.3 compiled from source. The system is ~30k atoms including waters. I'm running an NVT simulation with the Langevin integrator. However, I'm seeing an error that appears at random times during the simulation and kills it. The error is below:
Traceback (most recent call last):
  File "equilibrate_nvt.py", line 26, in <module>
    simulation.step(1)
  File "/usr/local/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 106, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/usr/local/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 172, in _simulate
    self.integrator.step(stepsToGo)
  File "/usr/local/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 13622, in step
    return _openmm.LangevinIntegrator_step(self, steps)
Exception: Error downloading array diisMatrix: CUDA_ERROR_LAUNCH_FAILED (719)
Exception Exception: 'Error deleting array bondParams: CUDA_ERROR_LAUNCH_FAILED (719)' in <built-in function delete_Context> ignored
Exception Exception: 'Error deleting array langevinParams: CUDA_ERROR_LAUNCH_FAILED (719)' in <built-in function delete_LangevinIntegrator> ignored
python equilibrate_nvt.py 21.69s user 7.23s system 0% cpu 3:00:21.30 total

The first time, the error occurred after ~10k steps of 1 fs. I deleted the simulation and restarted from the beginning using a bin checkpoint file, and it ran ~30k steps before dying with the same error. On another restart, it died within ~5k steps. An NPT simulation of ~50k steps on the same structure completed before this, and its output is the starting point for the NVT run, via a bin checkpoint file. So the failure point varies from run to run. Any idea what might be happening and how to fix it?

Thank you.

Saurabh
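
Because the failure strikes at unpredictable step counts, one way to limit lost work while it is being diagnosed is to advance the simulation in short chunks and write a checkpoint after each one. The following is a minimal sketch, not taken from the original script. It assumes `simulation` behaves like an OpenMM `app.Simulation` and uses only its `step` and `saveCheckpoint` methods; the function name and its parameters are hypothetical.

```python
def run_in_chunks(simulation, total_steps, chunk_steps, checkpoint_path):
    """Advance the simulation in chunks, checkpointing after each chunk.

    Returns the number of steps completed. If step() raises (for example
    with CUDA_ERROR_LAUNCH_FAILED), the exception propagates and the last
    checkpoint written holds the state at the end of the previous chunk.
    """
    if chunk_steps <= 0:
        raise ValueError("chunk_steps must be positive")
    completed = 0
    while completed < total_steps:
        # The final chunk may be shorter than chunk_steps.
        n = min(chunk_steps, total_steps - completed)
        simulation.step(n)
        # Assumption: saveCheckpoint accepts a file name, as in app.Simulation.
        simulation.saveCheckpoint(checkpoint_path)
        completed += n
    return completed
```

With hypothetical values such as `run_in_chunks(simulation, 50000, 1000, "nvt.chk")`, a crash would cost at most 1000 steps. After a launch failure the CUDA context is no longer usable, so resuming means starting a fresh process that loads the checkpoint.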

Peter Eastman
Posts: 2553
Joined: Thu Aug 09, 2007 1:25 pm

Re: CUDA_ERROR_LAUNCH_FAILED Error

Post by Peter Eastman » Fri Jun 17, 2016 10:17 am

Hi Saurabh,

Could you try updating to the most recent version of OpenMM (7.0.1)? I don't have any specific reason to think that will fix the problem, but let's start by ruling out the possibility that you're hitting a bug that's already been fixed in a newer version.

If that doesn't fix it, could you post your files so I can try to reproduce the problem?

Peter
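
Before rerunning, it can help to confirm which OpenMM version the Python environment actually imports, since a source build and a conda install can coexist on one machine. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, the `simtk` import path is the package layout seen in the traceback above, and `Platform.getOpenMMVersion()` is used to read the version string.

```python
def parse_version(text):
    """Turn a version string such as '7.0.1' into a tuple of ints, (7, 0, 1).

    Only the leading numeric components are kept, so a trailing
    non-numeric component such as 'dev' is ignored.
    """
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def is_at_least(installed, minimum):
    """Return True if version string `installed` is >= `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def installed_openmm_version():
    """Return the OpenMM version string, or None if OpenMM is not importable."""
    try:
        from simtk import openmm  # layout assumed from the traceback paths
    except ImportError:
        return None
    return openmm.Platform.getOpenMMVersion()


version = installed_openmm_version()
if version is None:
    print("OpenMM not found in this Python environment")
else:
    status = "ok" if is_at_least(version, "7.0.1") else "older than 7.0.1"
    print("%s: %s" % (version, status))
```

Comparing tuples of integers rather than raw strings avoids ordering mistakes such as "10.0" sorting before "7.0".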

Saurabh Belsare
Posts: 32
Joined: Sat Aug 14, 2010 8:43 am

Re: CUDA_ERROR_LAUNCH_FAILED Error

Post by Saurabh Belsare » Fri Jun 17, 2016 10:51 am

Hi Dr. Eastman,

I'm now running the same structure with OpenMM 7.0.1, installed using conda, on OS X 10.11 with CUDA 7.5. It will take some time to run, and I'll let you know if it still runs into errors.

Thank you.

Saurabh

Saurabh Belsare
Posts: 32
Joined: Sat Aug 14, 2010 8:43 am

Re: CUDA_ERROR_LAUNCH_FAILED Error

Post by Saurabh Belsare » Sun Jun 19, 2016 3:09 pm

Hi Dr. Eastman,

The job completed without any error in OpenMM 7.0.1. Maybe the problem is specific to version 6.3. Would you recommend moving over to 7.0.1 completely and avoiding 6.3 altogether?

Thank you.

Saurabh
