
General debugging strategies for "ValueError: Energy is NaN"

Posted: Wed Mar 09, 2016 7:15 am
by mjw
Dear all,

When a standard simulation non-deterministically raises "ValueError: Energy is NaN" from _checkForErrors in statedatareporter.py, are there any general strategies for debugging it? For instance, is there a way to increase the verbosity of _checkForErrors to localise the term that is causing the NaN, or is it possible to compile OpenMM in a "debug" mode for more information?

Regards,

Mark

Re: General debugging strategies for "ValueError: Energy is NaN"

Posted: Wed Mar 09, 2016 11:12 am
by peastman
By the time StateDataReporter notices the problem, the root cause is long past. A good first step is to put each Force into its own force group, then query the energy of each one individually. That way you can figure out which one is causing the problem. You'll probably need to do that at every time step; when one Force starts generating bad forces, that probably means on the next time step the system will blow up and every Force will produce NaNs or infinities. This also lets you see how the problem is appearing. Does the energy abruptly jump to NaN in a single time step? Or can you see it gradually blowing up over several time steps?
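For example, something along these lines should work (a rough, untested sketch; it assumes the groups are assigned before the Context is created, and that system, topology, integrator and positions already exist in your script):

    import math
    from simtk.openmm.app import Simulation
    from simtk import unit

    # Give each Force its own group before creating the Context.
    for i, force in enumerate(system.getForces()):
        force.setForceGroup(i)

    simulation = Simulation(topology, system, integrator)
    simulation.context.setPositions(positions)

    for step in range(1000):
        simulation.step(1)
        # Query the potential energy of each group individually.
        for i, force in enumerate(system.getForces()):
            state = simulation.context.getState(getEnergy=True, groups=1 << i)
            energy = state.getPotentialEnergy().value_in_unit(unit.kilojoule_per_mole)
            if math.isnan(energy):
                print('Step %d: %s energy is NaN' % (step, force.__class__.__name__))

Note that the groups argument is a bitmask, so you're limited to 32 force groups, and stepping one step at a time while querying every group is slow; only do this while hunting the problem.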

A few other questions that might be relevant:

What platform are you using?
What version of OpenMM are you using?
What sort of system is this (size, force field, etc.)?

Peter

Re: General debugging strategies for "ValueError: Energy is NaN"

Posted: Fri Mar 11, 2016 7:30 am
by mjw
Dear Peter,

Thank you for these useful suggestions; the force grouping sounds like a great way to localise the problem. I will take these on board for the future.

My problem turned out to be a bug in a custom file reporter that I had written. It was meant to write out a minimised structure, but it was actually writing the structure from before the minimisation. When I started dynamics from that structure, which contained two very close atoms, this naturally produced a massive potential energy and then a NaN exception within a few steps of MD. This was all my fault.
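For anyone who hits the same pitfall: the minimised coordinates live in the Context, so they have to be re-queried after minimising rather than reusing a cached positions array. A rough sketch (assuming an existing Simulation object called simulation):

    from simtk.openmm.app import PDBFile

    simulation.minimizeEnergy()
    # Fetch the updated coordinates from the Context after minimisation.
    state = simulation.context.getState(getPositions=True)
    with open('minimised.pdb', 'w') as f:
        PDBFile.writeFile(simulation.topology, state.getPositions(), f)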

For reference, it was the CUDA platform with OpenMM 6.3.1; the system was about 20k atoms, using the amber99sb force field.

Thanks again for the pointers,

Mark