error with AMOEBA simulation

Daniel Konstantinovsky · Post by **Daniel Konstantinovsky** » Tue Mar 07, 2023 8:04 am

Hi OpenMM,

I am trying to run an AMOEBA simulation of a rather large system (about 81,000 atoms). I minimized and equilibrated the solvent just fine, then did more minimizations, and those were fine too. But during the temperature annealing / heating step it first of all runs WAY too long for the amount of integration time, and then, four days in, it crashes with the following error: openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700). The whole traceback is below. I'm not sure what to do. Is the system just too big for the GPU memory perhaps with this force field?

Traceback (most recent call last):
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 384, in <module>
    heat("min.pdb", "heat.pdb")
  File "/home/dk758/project/myTools/MD/eq_md_amoeba_full.py", line 265, in heat
    sim.step(1)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 134, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/app/simulation.py", line 204, in _simulate
    self.integrator.step(stepsToGo)
  File "/gpfs/gibbs/project/hammes_schiffer/dk758/conda_envs/sfg/lib/python3.9/site-packages/openmm/openmm.py", line 8405, in step
    return _openmm.LangevinIntegrator_step(self, steps)
openmm.OpenMMException: Error invoking kernel: CUDA_ERROR_ILLEGAL_ADDRESS (700)
terminate called after throwing an instance of 'OpenMM::OpenMMException'
  what(): Error deleting array param1: CUDA_ERROR_ILLEGAL_ADDRESS (700)
/var/spool/slurmd/job16345443/slurm_script: line 24: 31261 Aborted ${SCRIPT} Ab_1.pdb /home/dk758/palmer_scratch/Ab_1_amoeba_10ns_dt3fs.ncdf 10000

Peter Eastman · Post by **Peter Eastman** » Tue Mar 07, 2023 9:48 am

> it first of all runs WAY too long for the amount of integration time

How many steps did it integrate before crashing?

> Is the system just too big for the GPU memory perhaps with this force field?

That's possible. You can use nvidia-smi to check how much memory it's using, and how that compares to the total amount in your GPU. It also tells you the GPU utilization, so you can tell whether it's actually computing anything. And while you're at it, I would use top to check the CPU utilization and host memory use. That will give you some sense of what's going on.
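The nvidia-smi check can also be scripted so the numbers get logged while the job runs. This is only a minimal sketch: the helper names `parse_gpu_line` and `query_gpus` are made up for illustration, and it assumes every queried field comes back as a plain number (some GPUs report `[N/A]` for certain fields, which this parser does not handle).

```python
import subprocess

# Fields to request from nvidia-smi, in this order.
QUERY = "memory.used,memory.total,utilization.gpu"


def parse_gpu_line(line):
    """Parse one CSV line such as '343, 8192, 99' (MiB, MiB, percent)."""
    used, total, util = (float(field) for field in line.split(","))
    return {
        "used_mib": used,
        "total_mib": total,
        "util_pct": util,
        "mem_fraction": used / total,
    }


def query_gpus():
    """Run nvidia-smi and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [parse_gpu_line(line) for line in out.splitlines() if line.strip()]
```

Calling `query_gpus()` every few minutes from a side process and printing the result would show whether memory use creeps up toward the card's total before the crash or stays flat.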

Daniel Konstantinovsky · Post by **Daniel Konstantinovsky** » Fri Mar 10, 2023 7:45 am

nvidia-smi says GPU utilization is 98-100% and the memory usage is only 343 MiB / 8192 MiB. The CPU memory usage from top is 0.3% (negligible). It runs for thousands or even tens of thousands of steps before suddenly crashing. It's definitely integrating; I made sure of that. Do you have any idea what could be causing the problem in this case?

Peter Eastman · Post by **Peter Eastman** » Tue Mar 14, 2023 11:18 am

Memory space is clearly not the problem. That error probably indicates data corruption of some sort. It's basically a segfault on the GPU. It's hard to guess at the cause. Is this problem reproducible, or did it only happen once? Try rebooting and then run the simulation again.

> it first of all runs WAY too long for the amount of integration time

Is that because each time step is too slow, or did it start out running at a normal rate and then hang?
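One way to tell those two cases apart is to time the run in fixed-size chunks and compare the cost per step from chunk to chunk. The sketch below is an illustration only: `time_chunks` is a hypothetical helper, and it assumes the simulation exposes a `step(n)` callable like the `sim.step` seen in the traceback.

```python
import time


def time_chunks(step_fn, n_chunks, steps_per_chunk, clock=time.perf_counter):
    """Call step_fn(steps_per_chunk) n_chunks times.

    Returns the wall-clock seconds per step measured for each chunk.
    """
    seconds_per_step = []
    for _ in range(n_chunks):
        start = clock()
        step_fn(steps_per_chunk)
        seconds_per_step.append((clock() - start) / steps_per_chunk)
    return seconds_per_step
```

For example, `time_chunks(sim.step, 10, 100)` would give ten measurements. Roughly constant values point to a uniformly slow time step, while a sudden jump in a later chunk points to a run that started normally and then stalled.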

Daniel Konstantinovsky · Post by **Daniel Konstantinovsky** » Mon Mar 20, 2023 4:24 pm

It crashes every time (this is the third or fourth time today). It's a heating / annealing step. It crashes at 170 K for some reason. I think it crashes at different times for different runs. Could it be an AMOEBA implementation bug?

Peter Eastman · Post by **Peter Eastman** » Mon Mar 20, 2023 5:23 pm

It's hard to speculate about the cause. Can you post all the files needed to reproduce it?

Daniel Konstantinovsky · Post by **Daniel Konstantinovsky** » Thu Mar 23, 2023 9:32 am

I've determined the proximate cause: the simulation is blowing up (the coordinates eventually go wild spontaneously whether I run NPT, NVT, or NPT heating). I don't know what causes it. I am starting from a minimized structure. I tried to attach the PDB file and script but it wouldn't let me. Is there another way I could send the files?

Peter Eastman · Post by **Peter Eastman** » Thu Mar 23, 2023 10:09 am

The FAQ has advice on debugging simulations that blow up. It might be useful here.
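A common first check for a simulation that blows up is to look at the potential energy and the largest per-atom forces every few steps, to catch the moment things go wrong and see which atoms are involved. The following is a rough sketch, not a recipe taken from the FAQ: the helper names and the `max_force` threshold are hypothetical, and the functions take plain numbers so they can be tried without a GPU.

```python
import math


def largest_forces(forces, n=5):
    """Return (atom_index, magnitude) for the n largest force vectors.

    `forces` is a sequence of (fx, fy, fz) values in one consistent unit.
    Non-finite magnitudes (NaN or inf) are listed first, because they
    mark a state that is already corrupted.
    """
    ranked = []
    for index, (fx, fy, fz) in enumerate(forces):
        ranked.append((index, math.sqrt(fx * fx + fy * fy + fz * fz)))

    def sort_key(item):
        magnitude = item[1]
        if not math.isfinite(magnitude):
            return (0, 0.0)
        return (1, -magnitude)

    ranked.sort(key=sort_key)
    return ranked[:n]


def is_blown_up(potential_energy, forces, max_force=1e5):
    """True if the energy is non-finite or any force exceeds max_force.

    max_force is an arbitrary, hypothetical threshold; tune it per system.
    """
    if not math.isfinite(potential_energy):
        return True
    top = largest_forces(forces, n=1)
    if not top:
        return False
    magnitude = top[0][1]
    return not math.isfinite(magnitude) or magnitude > max_force
```

With OpenMM, the inputs could come from `state = sim.context.getState(getEnergy=True, getForces=True)`, followed by `state.getPotentialEnergy().value_in_unit(unit.kilojoules_per_mole)` and `state.getForces().value_in_unit(unit.kilojoules_per_mole / unit.nanometer)`. The atom indices returned by `largest_forces` can then be looked up in the topology to see where the trouble starts.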

How about opening an issue on GitHub for this problem and attaching the files there?
