
error with AMOEBA simulation

Posted: Tue Mar 07, 2023 8:04 am
by dkonstan
Hi OpenMM,

I am trying to run an AMOEBA simulation of a rather large system (81,000 atoms or so). I minimized and equilibrated the solvent just fine, then did more minimizations, and that was fine too. But during the temperature annealing / heating step, it first of all runs WAY too long for the amount of integration time, and then, four days in, it crashes with the following error: openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700). The whole traceback is below. I'm not sure what to do. Is the system perhaps just too big for the GPU memory with this force field?

Traceback (most recent call last):
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 384, in <module>
    heat("min.pdb", "heat.pdb")
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 265, in heat
    sim.step(1)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 134, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 204, in _simulate
    self.integrator.step(stepsToGo)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/openmm.py", line 8405, in step
    return _openmm.LangevinIntegrator_step(self, steps)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
terminate called after throwing an instance of 'OpenMM::OpenMMException'
what(): Error deleting array param1: CUDA_ERROR_ILLEGAL_ADDRESS (700)
/var/spool/slurmd/job16345443/slurm_script: line 24: 31261 Aborted ${SCRIPT} Ab_1.pdb /home/dk758/palmer_scratch/Ab_1_amoeba_10ns_dt3fs.ncdf 10000

Re: error with AMOEBA simulation

Posted: Tue Mar 07, 2023 9:48 am
by peastman
Quote: "it first of all runs WAY too long for the amount of integration time"
How many steps did it integrate before crashing?
Quote: "Is the system just too big for the GPU memory perhaps with this force field?"
That's possible. You can use nvidia-smi to check how much memory it's using, and how that compares to the total amount in your GPU. It also tells you the GPU utilization, so you can tell whether it's actually computing anything. And while you're at it, I would use top to check the CPU utilization and host memory use. That will give you some sense of what's going on.
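
A minimal sketch of how the nvidia-smi check could be scripted, for instance to log memory use and utilization alongside a long run. The helper names (`parse_gpu_stats`, `query_gpu_stats`) are made up for illustration; the only external piece is the `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` call, which needs an NVIDIA driver on the machine.

```python
import subprocess

# Fields requested from nvidia-smi (memory in MiB, utilization in percent).
QUERY_FIELDS = "memory.used,memory.total,utilization.gpu"


def parse_gpu_stats(csv_text):
    """Parse 'used, total, util' lines into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (float(field) for field in line.split(","))
        stats.append({
            "memory_used_mib": used,
            "memory_total_mib": total,
            "memory_fraction": used / total,
            "gpu_util_percent": util,
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats for every GPU."""
    output = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(output)
```

Calling `query_gpu_stats()` every few minutes from a separate process would show whether memory use is anywhere near the card's total and whether the GPU is busy or idle. Host-side CPU and memory use still need top or a similar tool.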

Re: error with AMOEBA simulation

Posted: Fri Mar 10, 2023 7:45 am
by dkonstan
It says GPU util is 98-100% and the memory usage is only 343MiB / 8192MiB. The CPU memory usage from top is 0.3% (negligible). It runs for 1000s or even 10000s of steps before suddenly crashing. It's definitely integrating; I made sure of that. Do you have any idea what could be causing the problem in this case?

Re: error with AMOEBA simulation

Posted: Tue Mar 14, 2023 11:18 am
by peastman
Memory space is clearly not the problem. That error probably indicates data corruption of some sort. It's basically a segfault on the GPU. It's hard to guess at the cause. Is this problem reproducible, or did it only happen once? Try rebooting and then run the simulation again.
Quote: "it first of all runs WAY too long for the amount of integration time"
Is that because each time step is too slow, or did it start out running at a normal rate and then hang?

Re: error with AMOEBA simulation

Posted: Mon Mar 20, 2023 4:24 pm
by dkonstan
It crashes every time (this is the third or fourth time today). It's a heating / annealing step. It crashes at 170 K for some reason. I think it crashes at different times for different runs. Could it be an AMOEBA implementation bug?
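
One way to pin down where a heating run fails is to drive the ramp stage by stage and record the temperature at which the exception is raised, so that separate runs can be compared. The sketch below is an assumption about how such a loop might look, not the script from this thread; `temperature_ramp`, `run_ramp`, and the `run_stage` callback are hypothetical names.

```python
def temperature_ramp(start_k, end_k, increment_k):
    """Yield temperatures from start_k up to and including end_k."""
    temperature = start_k
    while temperature < end_k:
        yield temperature
        temperature += increment_k
    yield end_k


def run_ramp(run_stage, start_k, end_k, increment_k):
    """Call run_stage(T) for each stage; return (completed_stages, failure).

    failure is None if every stage finished, otherwise a
    (temperature, exception) pair for the stage that raised.
    """
    completed = []
    for temperature in temperature_ramp(start_k, end_k, increment_k):
        try:
            run_stage(temperature)
        except Exception as error:  # e.g. openmm.OpenMMException
            return completed, (temperature, error)
        completed.append(temperature)
    return completed, None
```

With OpenMM, `run_stage` could call `integrator.setTemperature(...)` on the LangevinIntegrator and then `simulation.step(...)` for a fixed number of steps. The goal is only to log the failing stage; after a CUDA error the context should not be expected to keep running.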

Re: error with AMOEBA simulation

Posted: Mon Mar 20, 2023 5:23 pm
by peastman
It's hard to speculate about the cause. Can you post all the files needed to reproduce it?

Re: error with AMOEBA simulation

Posted: Thu Mar 23, 2023 9:32 am
by dkonstan
I've determined the (proximal) cause to be the simulation blowing up (the coordinates eventually go wild spontaneously, whether I do NPT, NVT, or NPT heating). I don't know what causes it. I am starting from a minimized structure. I tried to attach the pdb and script but it wouldn't let me. Is there another way I could send the files?
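
A blow-up like this can usually be caught before the GPU kernel fails by checking the state every so often for non-finite energies or runaway coordinates. The following is a minimal sketch under stated assumptions: `find_instability` is a hypothetical helper, and `max_abs_coordinate` is a sanity bound the user has to choose (for example a few box lengths), not a value taken from this thread.

```python
import math


def find_instability(potential_energy, positions, max_abs_coordinate):
    """Return a short reason string if the state looks blown up, else None.

    potential_energy: float, in any consistent unit.
    positions: iterable of (x, y, z) floats.
    max_abs_coordinate: user-chosen sanity bound, same unit as positions.
    """
    if not math.isfinite(potential_energy):
        return "non-finite potential energy"
    for index, xyz in enumerate(positions):
        if not all(math.isfinite(c) for c in xyz):
            return f"non-finite coordinate on atom {index}"
        if any(abs(c) > max_abs_coordinate for c in xyz):
            return f"atom {index} is outside the sanity bound"
    return None
```

With OpenMM, the inputs could come from `simulation.context.getState(getEnergy=True, getPositions=True)`, converting with `state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole)` and `state.getPositions(asNumpy=True).value_in_unit(unit.nanometer)`. Running the check every few hundred steps and saving the last good frame gives a structure to inspect for the atoms that move first.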

Re: error with AMOEBA simulation

Posted: Thu Mar 23, 2023 10:09 am
by peastman
The FAQ has advice on debugging simulations that blow up. It might be useful here.

How about opening an issue on GitHub for this problem and attaching the files there?