Non-Amber FFs/explicit water (gromacs)


Non-Amber FFs/explicit water (gromacs)

Post by David Koes » Wed Mar 03, 2010 8:45 am

I'm testing out OpenMM on an NVIDIA GeForce GTS 250 with CUDA enabled. Here is my experience attempting to simulate a 51-residue protein. Is the following normal?

I tried a few GROMOS96 force fields and got NaN in the results. I've only had success with the AMBER force field.

With explicit solvent, GROMACS simulates 500 steps in 9.7 seconds, while OpenMM with CUDA takes 39 seconds.

Using Zephyr, I simulate the same protein in implicit solvent with the md integrator for 5000 steps. This takes 8 seconds using CUDA, 20 seconds using GROMACS in vacuum, and more than 20 minutes using the OpenMM CPU reference implementation.

This is on 64-bit Linux, and the CPU results are for a single core. Are these in the ballpark of what is expected? The explicit-water result is very disappointing (4-5x slower than a single CPU core), and the implicit-water result is only about 10x faster than a single CPU core.

I'm wondering if there is some tweak I can make to the mdp file to get better performance with OpenMM. Are there some example GROMACS input files that are known to work well with OpenMM that I could try out (particularly with explicit solvent)?

How would I go about understanding and fixing these performance problems?

Thanks,
-Dave


RE: Non-Amber FFs/explicit water (gromacs)

Post by Peter Eastman » Wed Mar 03, 2010 11:23 am

> I tried a few GROMOS96 force fields and got NaN
> in the results. I've only had success with the
> AMBER force field.

That is expected. We don't currently support all the terms in the GROMOS force field, so it won't work correctly.

> Simulating explicit solvent, gromacs simulates
> 500 steps in 9.7 seconds. OpenMM with CUDA takes
> 39 seconds.

That is definitely not right. CUDA should be quite a bit faster than standard Gromacs. First make sure the results are correct (e.g. you aren't getting NaNs), that you're using identical settings for each run, and that you are only using settings supported by OpenMM. Also make sure it really is using CUDA. That is, when you run mdrun-openmm, it should print out the message, "OpenMM Platform: Cuda". If you still find CUDA to be slower than Gromacs, post your mdp file and I'll take a look at it.
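
As a side check, here is a minimal sketch that lists the platforms a given OpenMM build can see. It uses the OpenMM Python API directly rather than mdrun-openmm, and the import path and calls assume a standard OpenMM installation:

import openmm

# Print every platform this OpenMM build registered, plus its estimated
# relative speed. If no CUDA platform appears in this list, the CUDA plugin
# typically could not be loaded (e.g. missing driver or libraries).
for i in range(openmm.Platform.getNumPlatforms()):
    platform = openmm.Platform.getPlatform(i)
    print(platform.getName(), platform.getSpeed())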

> This takes 8 seconds using CUDA, 20 seconds
> when using gromacs in a vacuum, and more than
> 20 minutes using the openmm cpu reference
> implementation.

That sounds reasonable. Vacuum is a lot faster to simulate than implicit solvent, so those numbers can't really be compared. The reference platform is expected to be slow, since it's written for simplicity and clarity, not speed.

Also, what GPU and CPU are you running on?

Peter


RE: Non-Amber FFs/explicit water (gromacs)

Post by David Koes » Wed Mar 03, 2010 11:57 am

Hi Peter,

Thanks for your response. mdrun-openmm is reporting that it is using CUDA (device 0, to be precise). The CPU is an Intel Core i7 975 at 3.33 GHz, and the graphics card is a 1 GB GeForce GTS 250 clocked at 1.78 GHz, running CUDA driver 2.3.

Below is the full mdp file. Most settings are just at their defaults.

Thanks,
-Dave

;
; File 'out.mdp' was generated
; By user: dkoes (1000)
; On host: quasar
; At date: Wed Mar 3 13:56:37 2010
;

; VARIOUS PREPROCESSING OPTIONS
; Preprocessor information: use cpp syntax.
; e.g.: -I/home/joe/doe -I/home/mary/hoe
include =
; e.g.: -DI_Want_Cookies -DMe_Too
define =

; RUN CONTROL PARAMETERS
integrator = md
; Start time and timestep in ps
tinit = 0
dt = 0.002
nsteps = 500
; For exact run continuation or redoing part of a run
; Part index is updated automatically on checkpointing (keeps files separate)
simulation_part = 1
init_step = 0
; mode for center of mass motion removal
comm-mode = Linear
; number of steps for center of mass motion removal
nstcomm = 1
; group(s) for center of mass motion removal
comm-grps =

; LANGEVIN DYNAMICS OPTIONS
; Friction coefficient (amu/ps) and random seed
bd-fric = 0
ld-seed = 1993

; ENERGY MINIMIZATION OPTIONS
; Force tolerance and initial step-size
emtol = 10
emstep = 0.01
; Max number of iterations in relax_shells
niter = 20
; Step size (ps^2) for minimization of flexible constraints
fcstep = 0
; Frequency of steepest descents steps when doing CG
nstcgsteep = 1000
nbfgscorr = 10

; TEST PARTICLE INSERTION OPTIONS
rtpi = 0.05

; OUTPUT CONTROL OPTIONS
; Output frequency for coords (x), velocities (v) and forces (f)
nstxout = 1
nstvout = 1000
nstfout = 0
; Output frequency for energies to log file and energy file
nstlog = 10
nstenergy = 100
; Output frequency and precision for xtc file
nstxtcout = 0
xtc-precision = 1000
; This selects the subset of atoms for the xtc file. You can
; select multiple groups. By default all atoms will be written.
xtc-grps =
; Selection of energy groups
energygrps = Protein

; NEIGHBORSEARCHING PARAMETERS
; nblist update frequency
nstlist = 10
; ns algorithm (simple or grid)
ns_type = grid
; Periodic boundary conditions: xyz, no, xy
pbc = xyz
periodic_molecules = no
; nblist cut-off
rlist = 1.0

; OPTIONS FOR ELECTROSTATICS AND VDW
; Method for doing electrostatics
coulombtype = pme
rcoulomb-switch = 0
rcoulomb = 1.0
; Relative dielectric constant for the medium and the reaction field
epsilon_r = 1
epsilon_rf = 1
; Method for doing Van der Waals
vdw-type = Cut-off
; cut-off lengths
rvdw-switch = 0
rvdw = 1.0
; Apply long range dispersion corrections for Energy and Pressure
DispCorr = No
; Extension of the potential lookup tables beyond the cut-off
table-extension = 1
; Separate tables between energy group pairs
energygrp_table =
; Spacing for the PME/PPPM FFT grid
fourierspacing = 0.12
; FFT grid size, when a value is 0 fourierspacing will be used
fourier_nx = 0
fourier_ny = 0
fourier_nz = 0
; EWALD/PME/PPPM parameters
pme_order = 4
ewald_rtol = 1e-05
ewald_geometry = 3d
epsilon_surface = 0
optimize_fft = no

; IMPLICIT SOLVENT ALGORITHM
implicit_solvent = No

; GENERALIZED BORN ELECTROSTATICS
; Algorithm for calculating Born radii
gb_algorithm = Still
; Frequency of calculating the Born radii inside rlist
nstgbradii = 1
; Cutoff for Born radii calculation; the contribution from atoms
; between rlist and rgbradii is updated every nstlist steps
rgbradii = 2
; Dielectric coefficient of the implicit solvent
gb_epsilon_solvent = 80
; Salt concentration in M for Generalized Born models
gb_saltconc = 0
; Scaling factors used in the OBC GB model. Default values are OBC(II)
gb_obc_alpha = 1
gb_obc_beta = 0.8
gb_obc_gamma = 4.85
; Surface tension (kJ/mol/nm^2) for the SA (nonpolar surface) part of GBSA
; The default value (2.092) corresponds to 0.005 kcal/mol/Angstrom^2.
sa_surface_tension = 2.092

; OPTIONS FOR WEAK COUPLING ALGORITHMS
; Temperature coupling
tcoupl = No
; Groups to couple separately
tc-grps =
; Time constant (ps) and reference temperature (K)
tau-t =
ref-t =
; Pressure coupling
Pcoupl = No
Pcoupltype = Isotropic
; Time constant (ps), compressibility (1/bar) and reference P (bar)
tau-p = 1
compressibility =
ref-p =
; Scaling of reference coordinates, No, All or COM
refcoord_scaling = No
; Random seed for Andersen thermostat
andersen_seed = 815131

; OPTIONS FOR QMMM calculations
QMMM = no
; Groups treated Quantum Mechanically
QMMM-grps =
; QM method
QMmethod =
; QMMM scheme
QMMMscheme = normal
; QM basisset
QMbasis =
; QM charge
QMcharge =
; QM multiplicity
QMmult =
; Surface Hopping
SH =
; CAS space options
CASorbitals =
CASelectrons =
SAon =
SAoff =
SAsteps =
; Scale factor for MM charges
MMChargeScaleFactor = 1
; Optimization of QM subsystem
bOPT =
bTS =

; SIMULATED ANNEALING
; Type of annealing for each temperature group (no/single/periodic)
annealing =
; Number of time points to use for specifying annealing in each group
annealing_npoints =
; List of times at the annealing points for each group
annealing_time =
; Temp. at each annealing point, for each group.
annealing_temp =

; GENERATE VELOCITIES FOR STARTUP RUN
gen_vel = yes
gen_temp = 300.0
gen_seed = 3334

; OPTIONS FOR BONDS
constraints = all-bonds
; Type of constraint algorithm
constraint-algorithm = Lincs
; Do not constrain the start configuration
continuation = no
; Use successive overrelaxation to reduce the number of shake iterations
Shake-SOR = no
; Relative tolerance of shake
shake-tol = 0.0001
; Highest order in the expansion of the constraint coupling matrix
lincs-order = 4
; Number of iterations in the final step of LINCS. 1 is fine for
; normal simulations, but use 2 to conserve energy in NVE runs.
; For energy minimization with constraints it should be 4 to 8.
lincs-iter = 1
; Lincs will write a warning to the stderr if in one step a bond
; rotates over more degrees than
lincs-warnangle = 30
; Convert harmonic bonds to morse potentials
morse = no

; ENERGY GROUP EXCLUSIONS
; Pairs of energy groups for which all non-bonded interactions are excluded
energygrp_excl =

; WALLS
; Number of walls, type, atom types, densities and box-z scale factor for Ewald
nwall = 0
wall_type = 9-3
wall_r_linpot = -1
wall_atomtype =
wall_density =
wall_ewald_zfac = 3

; COM PULLING
; Pull type: no, umbrella, constraint or constant_force
pull = no

; NMR refinement stuff
; Distance restraints type: No, Simple or Ensemble
disre = No
; Force weighting of pairs in one distance restraint: Conservative or Equal
disre-weighting = Conservative
; Use sqrt of the time averaged times the instantaneous violation
disre-mixed = no
disre-fc = 1000
disre-tau = 0
; Output frequency for pair distances to energy file
nstdisreout = 100
; Orientation restraints: No or Yes
orire = no
; Orientation restraints force constant and tau for time averaging
orire-fc = 0
orire-tau = 0
orire-fitgrp =
; Output frequency for trace(SD) and S to energy file
nstorireout = 100
; Dihedral angle restraints: No or Yes
dihre = no
dihre-fc = 1000

; Free energy control stuff
free-energy = no
init-lambda = 0
delta-lambda = 0
sc-alpha = 0
sc-power = 0
sc-sigma = 0.3
couple-moltype =
couple-lambda0 = vdw-q
couple-lambda1 = vdw-q
couple-intramol = no

; Non-equilibrium MD stuff
acc-grps =
accelerate =
freezegrps =
freezedim =
cos-acceleration = 0
deform =

; Electric fields
; Format is number of terms (int) and for all terms an amplitude (real)
; and a phase angle (real)
E-x =
E-xt =
E-y =
E-yt =
E-z =
E-zt =

; User defined thingies
user1-grps =
user2-grps =
userint1 = 0
userint2 = 0
userint3 = 0
userint4 = 0
userreal1 = 0
userreal2 = 0
userreal3 = 0
userreal4 = 0


RE: Non-Amber FFs/explicit water (gromacs)

Post by John Chodera » Wed Mar 03, 2010 12:05 pm

Could these output options be a problem?

nstxout = 1
nstlog = 10

I'm not sure whether the gromacs-openmm interface honors those options, but if it does, the coordinates would be retrieved from the GPU *every* timestep and the energies computed every 10 timesteps. If that is actually the case, setting nstxout = nstlog = nsteps = 500 might help cut down on the overhead of memory transfers to/from the GPU.

Disclaimer: I've never actually used the gromacs-openmm interface. Peter will know for certain. I just noticed that these settings (if actually honored) might be problematic.
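
To make the memory-transfer concern concrete, here is a rough sketch against the plain OpenMM Python API (not the gromacs-openmm wrapper, so whether it reflects what mdrun-openmm actually does is exactly the open question above). The toy two-particle system and import paths are assumptions; the point is only that each getState call copies data back from the GPU, so asking for coordinates every step means one transfer per step:

import openmm
import openmm.unit as unit

# A trivial two-particle system, purely for illustration.
system = openmm.System()
system.addParticle(1.0)
system.addParticle(1.0)
bond = openmm.HarmonicBondForce()
bond.addBond(0, 1, 0.1, 1000.0)
system.addForce(bond)

integrator = openmm.VerletIntegrator(0.002 * unit.picoseconds)
context = openmm.Context(system, integrator)
context.setPositions([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]] * unit.nanometers)

# Analogue of nstxout = 1: coordinates are copied back from the GPU every step.
for step in range(500):
    integrator.step(1)
    state = context.getState(getPositions=True)

# Analogue of nstxout = nsteps: the run stays on the GPU, with one copy at the end.
integrator.step(500)
state = context.getState(getPositions=True)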


RE: Non-Amber FFs/explicit water (gromacs)

Post by Peter Eastman » Wed Mar 03, 2010 12:38 pm

John is completely right that retrieving data from the GPU at every time step will hurt your performance. What happens when you increase those values?

I also notice that you're using a high-end CPU and a low-end GPU, which isn't really a valid comparison. The difference between a low-end and a high-end GPU can be huge: easily a factor of 10 in performance, and sometimes much more.

What happens if you switch the coulombtype from PME to Reaction-Field? PME involves irregular memory access, which is really hurt by the restricted coalescing rules on G80-series GPUs (like yours).

Peter
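
For reference, the same PME versus reaction-field choice, expressed through the OpenMM application layer instead of the mdp file, would look roughly like the sketch below. The input file name protein.pdb is a placeholder, and amber99sb.xml / tip3p.xml are simply force-field files shipped with OpenMM; this illustrates the two nonbonded methods, not the gromacs-openmm code path:

import openmm.app as app
import openmm.unit as unit

pdb = app.PDBFile('protein.pdb')  # placeholder input structure
forcefield = app.ForceField('amber99sb.xml', 'tip3p.xml')

# PME, matching coulombtype = pme and rcoulomb = 1.0 in the mdp file above.
system_pme = forcefield.createSystem(pdb.topology,
                                     nonbondedMethod=app.PME,
                                     nonbondedCutoff=1.0 * unit.nanometer,
                                     constraints=app.AllBonds)

# Cutoff electrostatics with a reaction-field correction, the analogue of
# coulombtype = Reaction-Field.
system_rf = forcefield.createSystem(pdb.topology,
                                    nonbondedMethod=app.CutoffPeriodic,
                                    nonbondedCutoff=1.0 * unit.nanometer,
                                    constraints=app.AllBonds)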


RE: Non-Amber FFs/explicit water (gromacs)

Post by David Koes » Wed Mar 03, 2010 1:44 pm

Changing nstxout and nstlog doesn't seem to have any effect. Switching to Reaction-Field can have a significant impact, though (it can be as fast as the CPU).

However, as I investigate this further, I think I have problems unrelated to OpenMM, since I am getting a wide range of performance numbers. I suspect the second video card in the system may be causing them. I'll look into it more when I have more time.

Thanks for the help.
-Dave


RE: Non-Amber FFs/explicit water (gromacs)

Post by David Koes » Thu Mar 04, 2010 9:28 am

The problem (or at least part of the problem) seems to have been with my setup. The GeForce 260 was a second video card in my system and was not actually driving any monitor (I was just going to use it for CUDA). Now that I have set it up to drive a monitor, I get much more consistent results.

With explicit water, the OpenMM run is about twice as fast as a single CPU core. With implicit water, it is about 10x faster than a single CPU core running with explicit water.

-Dave
