Particle position is NAN error

**Brian Geiss** » Tue Jun 18, 2013 1:07 pm

I recently upgraded to OpenMM 5.1 and am running it on my Windows 7 PC with an Nvidia GTX 580. I have been running implicit solvent simulations (amber99sb) of a 265 amino acid protein using a script modified from the OpenMM script builder (OpenCL or CUDA, single precision). I can start the simulation, but about 1,600,000 steps into the run it aborts with the following error:

Minimizing...
Equilibrating...
Running Production...
Traceback (most recent call last):
  File "wt.py", line 49, in <module>
    simulation.step(100000000)
  File "C:\Python26\lib\site-packages\simtk\openmm\app\simulation.py", line 127, in step
    reporter.report(self, state)
  File "C:\Python26\lib\site-packages\simtk\openmm\app\pdbreporter.py", line 78, in report
    PDBFile.writeModel(simulation.topology, state.getPositions(), self._out, self._nextModel)
  File "C:\Python26\lib\site-packages\simtk\openmm\app\pdbfile.py", line 258, in writeModel
    raise ValueError('Particle position is NaN')
ValueError: Particle position is NaN

I ran similar simulations with these input files using OpenMM 4.1 with no issues. Do you have any ideas about what may be happening?

Thank you for your time.

-Brian

**Peter Eastman** » Thu Jun 20, 2013 11:30 am

Hi Brian,

This means your simulation has blown up, but it's hard to say why. There are many different things that can cause that to happen, such as using a time step that's too large or having inappropriate parameter values for the system you're simulating. Since it ran for quite a while before this happened, it's likely to be something subtle.

Is this problem reproducible? Does it always fail like this, or did it just happen once? If it's reproducible, how much variation is there in the number of time steps it gets through before the error occurs?

Can you post your files (your script and, if possible, your input files) so I can take a look at them and try to reproduce the problem?

Peter
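One way to answer the reproducibility question is to advance the simulation in small chunks and check the potential energy after each one, so the step count at which the values first become non-finite is recorded instead of surfacing only when a reporter tries to write a frame. The following is a minimal sketch, not part of OpenMM: `run_until_nan`, `step_fn`, `energy_fn`, and the chunk size are hypothetical names and values chosen for illustration.

```python
import math


def run_until_nan(step_fn, energy_fn, total_steps, chunk=1000):
    """Advance in chunks and return the step count at which the energy
    first became NaN or infinite, or None if the run completed cleanly.

    step_fn(n)  -- advances the simulation by n steps (hypothetical hook)
    energy_fn() -- returns the current potential energy as a plain float
    """
    done = 0
    while done < total_steps:
        n = min(chunk, total_steps - done)
        step_fn(n)
        done += n
        energy = energy_fn()
        if math.isnan(energy) or math.isinf(energy):
            return done
    return None


# Possible wiring for an OpenMM Simulation object (assumes `simulation`
# already exists and that simtk.unit has been imported as in the scripts
# in this thread):
#
#   failed_at = run_until_nan(
#       simulation.step,
#       lambda: simulation.context.getState(getEnergy=True)
#                   .getPotentialEnergy()
#                   .value_in_unit(kilojoules_per_mole),
#       total_steps=10000000,
#       chunk=1000,
#   )
```

The reported value is only accurate to within one chunk, so a smaller chunk narrows down the failing step at the cost of more frequent energy queries.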

**Brian Geiss** » Sat Jun 22, 2013 12:36 pm

Hi Peter,

I've been having this problem for the last month when trying to run simulations, and I haven't been able to run past ~1.5 ns on any simulation. Below are the scripts I use to set up the system and to run the simulation. Both were derived from the user guide.

Another possible problem may be the GPU itself. Since posting I've run the testinstallation.py script a number of times. Sometimes I get a positive result for CUDA and OpenCL, and other times I can only get a result for the reference platform. I've also seen clFlush exceptions that abort the simulation. The GTX 580 I'm using has been working fairly hard for the last two years, so it's possible it's crapping out after running a simulation for a certain period of time due to memory errors or an inability to clear GPU memory effectively. I have ordered a new card and will be testing it next week.

#Setup Script
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *

print('Loading...')
pdb = PDBFile('WT.pdb')
forcefield = ForceField('amber99sb.xml', 'amber99sb_obc.xml')
modeller = Modeller(pdb.topology, pdb.positions)
print('Adding hydrogens...')
modeller.addHydrogens(forcefield, pH=7.0)
print('Minimizing...')
system = forcefield.createSystem(modeller.topology, nonbondedMethod=CutoffNonPeriodic)
integrator = VerletIntegrator(0.001*picoseconds)
simulation = Simulation(modeller.topology, system, integrator)
simulation.context.setPositions(modeller.positions)
simulation.minimizeEnergy(maxIterations=100)
print('Saving...')
positions = simulation.context.getState(getPositions=True).getPositions()
PDBFile.writeFile(simulation.topology, positions, open('WT_Fixed.pdb', 'w'))
print('Done')

#Production Script
from __future__ import print_function
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout

pdb = PDBFile('WT_Fixed.pdb')
forcefield = ForceField('amber99sb.xml', 'amber99sb_obc.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME, nonbondedCutoff=1*nanometer,
     constraints=HBonds, rigidWater=True)

integrator = LangevinIntegrator(310*kelvin, 1.0/picoseconds, 2.0*femtoseconds)
integrator.setConstraintTolerance(0.00001)
platform = Platform.getPlatformByName('OpenCL')
properties = {'OpenCLPrecision': 'single'}
simulation = Simulation(pdb.topology, system, integrator, platform, properties)
simulation.context.setPositions(pdb.positions)

print('Minimizing...')
simulation.minimizeEnergy()

simulation.context.setVelocitiesToTemperature(310*kelvin)
print('Equilibrating...')
simulation.step(100000)

simulation.reporters.append(PDBReporter('WT_Output.pdb', 10000))
simulation.reporters.append(StateDataReporter(stdout, 10000, step=True, potentialEnergy=True, temperature=True))
simulation.reporters.append(StateDataReporter('WT_Data', 10000, step=True,
    time=True, potentialEnergy=True, kineticEnergy=True, totalEnergy=True,
    temperature=True, volume=True, density=True))

print('Running Production...')
simulation.step(10000000)
print('Done!')

**Peter Eastman** » Sat Jun 22, 2013 5:45 pm

Hi Brian,

I wonder if your GPU is overheating? Does it always fail after exactly the same number of time steps? That would suggest it's something deterministic. Or does it vary, but always take roughly the same amount of time before failing? That would sound like overheating. Or is it just as likely to fail on the first step as on the 1 millionth step? That would suggest a completely non-deterministic error.

It might just be a coincidence that you began having problems around the start of summer, but then again it might not.

Peter
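The three cases described above can be told apart by logging, for each failed run, how many steps completed and how much wall-clock time elapsed. The sketch below is an illustration of that reasoning only: `classify_failures`, the `rel_tol` threshold of 10%, and the label strings are assumptions, not anything provided by OpenMM.

```python
import statistics


def classify_failures(runs, rel_tol=0.1):
    """Classify a failure pattern from several failed runs.

    runs    -- list of (steps_completed, wall_seconds) tuples, one per run
    rel_tol -- assumed cutoff: relative spread in wall time below which the
               failures count as happening after "roughly the same" time
    """
    if len(runs) < 2:
        return "need more runs"

    steps = [s for s, _ in runs]
    seconds = [t for _, t in runs]

    # Identical step counts every time point to a deterministic cause
    # (for example the time step or the force field parameters).
    if len(set(steps)) == 1:
        return "deterministic"

    # Step counts differ but the elapsed time is similar: consistent with
    # something that builds up over time, such as the GPU overheating.
    mean_t = statistics.mean(seconds)
    if mean_t > 0 and statistics.pstdev(seconds) / mean_t <= rel_tol:
        return "time-dependent"

    # Failures scattered anywhere from the first step onward.
    return "random"
```

With only a handful of runs this is a rough guide, not a diagnosis; a failing card can also produce memory errors that look random.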

**Brian Geiss** » Wed Jun 26, 2013 7:05 pm

Peter,

I installed an Nvidia GTX 780 and I am no longer having these issues. I ran a 50 ns implicit solvent simulation with no errors and am currently running a 10 ns explicit solvent simulation where everything looks to be going well. I think the old GPU was either overheating or its memory was failing.

Thanks for your help.

Best,

-Brian

**Peter Eastman** » Thu Jun 27, 2013 10:15 am

Ok, great. Glad to hear it's working now.

Peter
