Hi OpenMM,
I am trying to run an AMOEBA simulation of a rather large system (81,000 atoms or so). I minimized and equilibrated the solvent just fine, then did more minimizations and that was fine too. But during the temperature annealing / heating step it first of all runs WAY too long for the amount of integration time and then four days in it crashes with the following error: openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700). The whole traceback is below. I'm not sure what to do. Is the system just too big for the GPU memory perhaps with this force field?
Traceback (most recent call last):
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 384, in <module>
    heat("min.pdb", "heat.pdb")
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 265, in heat
    sim.step(1)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 134, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 204, in _simulate
    self.integrator.step(stepsToGo)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/openmm.py", line 8405, in step
    return _openmm.LangevinIntegrator_step(self, steps)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
terminate called after throwing an instance of 'OpenMM::OpenMMException'
  what(): Error deleting array param1: CUDA_ERROR_ILLEGAL_ADDRESS (700)
/var/spool/slurmd/job16345443/slurm_script: line 24: 31261 Aborted ${SCRIPT} Ab_1.pdb /home/dk758/palmer_scratch/Ab_1_amoeba_10ns_dt3fs.ncdf 10000
error with AMOEBA simulation
- Peter Eastman
- Posts: 2593
- Joined: Thu Aug 09, 2007 1:25 pm
Re: error with AMOEBA simulation
> it first of all runs WAY too long for the amount of integration time
How many steps did it integrate before crashing?
> Is the system just too big for the GPU memory perhaps with this force field?
That's possible. You can use nvidia-smi to check how much memory it's using and how that compares to the total amount on your GPU. It also reports the GPU utilization, so you can tell whether it's actually computing anything. While you're at it, I would use top to check the CPU utilization and host memory use. That will give you some sense of what's going on.
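The nvidia-smi check suggested above can also be scripted so the numbers get logged alongside the job. The following is only a rough sketch: it assumes `nvidia-smi` is on the PATH, and the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration. It relies on the `--query-gpu` and `--format=csv,noheader,nounits` options, which report memory in MiB and utilization in percent.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (no header, no units) into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total, util = (float(field) for field in line.split(","))
        rows.append({
            "used_mib": used,
            "total_mib": total,
            "util_pct": util,
            "mem_fraction": used / total,  # fraction of GPU memory in use
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed readings (hypothetical helper)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` every few minutes from a side process would show whether memory use creeps up or utilization drops before a crash. Host CPU and memory still need a separate look with top.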
- Daniel Konstantinovsky
- Posts: 77
- Joined: Tue Jun 11, 2019 12:21 pm
It says GPU utilization is 98-100% and the memory usage is only 343 MiB / 8192 MiB. The CPU memory usage from top is 0.3% (negligible). It runs for thousands or even tens of thousands of steps before suddenly crashing. It's definitely integrating; I made sure of that. Do you have any idea what could be causing the problem in this case?
- Peter Eastman
Memory space is clearly not the problem. That error probably indicates data corruption of some sort. It's basically a segfault on the GPU. It's hard to guess at the cause. Is this problem reproducible, or did it only happen once? Try rebooting and then run the simulation again.
> it first of all runs WAY too long for the amount of integration time
Is that because each time step is too slow, or did it start out running at a normal rate and then hang?
- Daniel Konstantinovsky
It crashes every time (this is the third or fourth time today). It's a heating / annealing step. It crashes at 170 K for some reason. I think it crashes at different times for different runs. Could it be an AMOEBA implementation bug?
- Peter Eastman
It's hard to speculate about the cause. Can you post all the files needed to reproduce it?
- Daniel Konstantinovsky
I've determined the (proximate) cause to be the simulation blowing up: the coordinates eventually go wild spontaneously whether I do NPT, NVT, or NPT heating. I don't know what causes it. I am starting from a minimized structure. I tried to attach the PDB and script but it wouldn't let me. Is there another way I could send the files?
- Peter Eastman
The FAQ has advice on debugging simulations that blow up. It might be useful here.
How about opening an issue on GitHub for this problem and attaching the files there?
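For a simulation that blows up, one common diagnostic (not necessarily the one the FAQ describes) is to look at which atoms carry the largest forces shortly before the coordinates go wild. The sketch below is plain Python and makes no OpenMM calls itself; the function name `largest_forces` is hypothetical. The comments show how the force array could be pulled from the `sim` object seen in the traceback, assuming it is an `openmm.app.Simulation`.

```python
import math

# Assumed way to obtain the input from OpenMM (not executed here):
#   from openmm import unit
#   state = sim.context.getState(getForces=True)
#   forces = state.getForces(asNumpy=True).value_in_unit(
#       unit.kilojoules_per_mole / unit.nanometer)


def largest_forces(forces, n=5):
    """Return up to n (atom_index, magnitude) pairs, worst offenders first.

    Atoms whose force magnitude is NaN or infinite are listed before all
    finite ones; finite magnitudes are sorted from largest to smallest.
    """
    ranked = []
    for index, (fx, fy, fz) in enumerate(forces):
        ranked.append((index, math.hypot(fx, fy, fz)))

    def sort_key(item):
        magnitude = item[1]
        if not math.isfinite(magnitude):
            return (0, 0.0)        # non-finite values sort first
        return (1, -magnitude)     # then largest finite magnitude

    ranked.sort(key=sort_key)
    return ranked[:n]
```

Running this every few hundred steps and printing the result gives a list of atom indices to inspect in the PDB. If the same few atoms show up each time, the starting structure or parameters near them are a reasonable place to look.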