General debugging strategies for "ValueError: Energy is NaN"

Mark Williamson
Posts: 31
Joined: Tue Feb 10, 2009 5:06 pm

General debugging strategies for "ValueError: Energy is NaN"

Post by Mark Williamson » Wed Mar 09, 2016 7:15 am

Dear all,

When observing non-deterministic "ValueError: Energy is NaN" from statedatareporter.py's _checkForErrors in standard simulations, are there any general strategies for debugging this? For instance, is there a way to increase the verbosity of _checkForErrors to localise the term that is causing the NaN or is it possible to compile OpenMM in a "debug" mode for more information?

Regards,

Mark

Peter Eastman
Posts: 2553
Joined: Thu Aug 09, 2007 1:25 pm

Re: General debugging strategies for "ValueError: Energy is NaN"

Post by Peter Eastman » Wed Mar 09, 2016 11:12 am

By the time StateDataReporter notices the problem, the root cause is long past. A good first step is to put each Force into its own force group, then query the energy of each one individually, so you can figure out which Force is causing the problem. You'll probably need to do that at every time step: once one Force starts generating bad forces, the system will likely blow up on the next step, and after that every Force will produce NaNs or infinities. This also lets you see how the problem appears. Does the energy abruptly jump to NaN in a single time step? Or can you see it gradually blowing up over several time steps?
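A sketch of that per-group bookkeeping might look like the following. The helper names are mine, not OpenMM API; the `simulation` object is assumed to be an `app.Simulation` built in the usual way:

```python
# Hypothetical helpers for localising a NaN: give each Force its own force
# group, then query the potential energy of each group separately at every
# step. Written against OpenMM's Python API.

def tag_forces_with_groups(system):
    """Assign force group i to the i-th Force in the System.

    Call this before creating the Simulation/Context (or reinitialize
    the Context afterwards) so the grouping takes effect.
    """
    for i in range(system.getNumForces()):
        system.getForce(i).setForceGroup(i)

def report_group_energies(simulation):
    """Print each Force's energy; `groups` is a bitmask of force groups."""
    system = simulation.system
    for i in range(system.getNumForces()):
        state = simulation.context.getState(getEnergy=True, groups=1 << i)
        name = type(system.getForce(i)).__name__
        print(name, state.getPotentialEnergy())

def step_with_energy_watch(simulation, n_steps):
    """Advance one step at a time, reporting per-group energies after each."""
    for _ in range(n_steps):
        simulation.step(1)
        report_group_energies(simulation)
```

Stepping one step at a time like this is slow, but it shows exactly which Force first goes bad and whether the energy jumps or drifts.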

A few other questions that might be relevant:

What platform are you using?
What version of OpenMM are you using?
What sort of system is this (size, force field, etc.)?

Peter

Mark Williamson

Re: General debugging strategies for "ValueError: Energy is NaN"

Post by Mark Williamson » Fri Mar 11, 2016 7:30 am

Dear Peter,

Thank you for these useful suggestions; the force grouping sounds great for localising the problem. I will take these on board for the future.

My problem turned out to be a bug in a custom file reporter that I had written. It was meant to write out a minimised structure, but it was actually writing the structure from before the minimisation. When I started dynamics from that structure, which had two very close atoms, the overlap naturally created a massive potential and generated a NaN exception within a few steps of MD. This was all my fault.
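A cheap pre-flight check for that failure mode (a hypothetical helper, not an OpenMM feature) is to scan the starting coordinates for near-overlapping atoms before starting dynamics:

```python
# Hypothetical sanity check: find near-overlapping atom pairs in the
# starting coordinates, the kind of overlap that produces a huge potential
# and a NaN within a few MD steps.
import itertools
import math

def close_contacts(positions, cutoff=0.05):
    """Return (i, j, distance) for atom pairs closer than `cutoff`.

    positions: sequence of (x, y, z) coordinates, e.g. in nm.
    O(N^2) brute force, so it is slow on large systems, but adequate
    as a one-off check before launching a run.
    """
    bad = []
    for (i, a), (j, b) in itertools.combinations(enumerate(positions), 2):
        d = math.dist(a, b)
        if d < cutoff:
            bad.append((i, j, d))
    return bad
```

If this returns anything, the input structure (or, as in my case, the code that wrote it) is worth inspecting before blaming the dynamics.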

For reference, it was the CUDA platform with OpenMM 6.3.1; the system was about 20k atoms, using the amber99sb force field.

Thanks again for the pointers,

Mark
