Relative performance of various GPUs

Post by **Maxim Imakaev** » Fri Apr 03, 2015 11:03 am

Hi Peter,

Do you have any data on the relative performance of different GPUs when using the most recent version of OpenMM?

I'm debating an upgrade from a GTX 680 to a modern GPU. Is there any way to estimate what speedup we would get on the CUDA platform (50k particles; single precision; simple LJ and harmonic interactions)?

Thanks,
Max

Post by **Peter Eastman** » Fri Apr 03, 2015 12:17 pm

I don't have any precise numbers for you, but I can definitely say that upgrading from a 680 to a 980 would give a significant speed boost. The difference could easily be 50% or more, depending on what you're doing.

Peter

Post by **Mark Williamson** » Tue Apr 07, 2015 1:39 am

Whilst not your system or hardware per se, these benchmarks *may* be of use. As part of some internal testing, I essentially ported the JAC benchmark to OpenMM. There are some results at the bottom of this file https://github.com/mjw99/OpenMMJACBench ... cOpenmm.py

As always, scrutinise; I may have a latent error somewhere in the script.

Post by **Lee-Ping Wang** » Tue Apr 07, 2015 8:42 am

Hi Mark,

Your benchmark reported performance almost 40% lower than the benchmark included with the OpenMM source, which also uses the JAC system. I tracked down the difference: it comes mainly from the PME settings and the reporters.

This is running on a machine that uses GTX 970s, with OpenMM 6.2.0 and CUDA 6.5.

The benchmark from OpenMM's "examples" folder gives 107.6 ns/day:

leeping@fire-08-03:~/src/OpenMM-6.2.0-Source/examples$ python benchmark.py --platform=CUDA --test=pme --precision=mixed
Platform: CUDA
Precision: mixed

Test: pme (cutoff=0.9)
Step Size: 2 fs
Integrated 36294 steps in 58.2649 seconds
107.639 ns/day

Single precision performance is 116.7 ns/day. I'm not sure which is the more appropriate comparison with AMBER.

Your benchmark code gives 66.1 ns/day:

leeping@fire-08-03:~/src/OpenMM-6.2.0-Source/OpenMMJACBench$ ./jacOpenmm.py
OpenMM version: 6.2
Platform: CUDA
CudaDeviceIndex: 0
CudaDeviceName: GeForce GTX 970
CudaUseBlockingSync: true
CudaPrecision: mixed
CudaUseCpuPme: false
CudaCompiler: /opt/CUDA/cuda-6.5/bin/nvcc
CudaTempDirectory: /tmp
CudaHostCompiler:

Number of atoms 23558
Number of velocities 23558
#"Step","Time (ps)","Potential Energy (kJ/mole)","Kinetic Energy (kJ/mole)","Total Energy (kJ/mole)","Temperature (K)"
1000,2.0,-299922.81262,60023.4063231,-239899.406296,298.429047812
2000,4.0,-299347.967107,59443.2684173,-239904.69869,295.544672975
3000,6.0,-300064.261097,60157.432007,-239906.82909,299.095407149
4000,8.0,-299617.266031,59718.89729,-239898.368741,296.915066078
5000,10.0,-299270.550608,59365.7670788,-239904.78353,295.159345783
6000,12.0,-299807.276316,59912.6062649,-239894.670052,297.878163451
7000,14.0,-299610.756854,59719.3148251,-239891.442029,296.917142011
8000,16.0,-299462.023189,59576.6332019,-239885.389987,296.207746401
9000,18.0,-299320.968856,59408.6687406,-239912.300116,295.372647607
10000,20.0,-299474.300643,59561.0426143,-239913.258029,296.130231904
26.1318368912 seconds
1306.59184456 is the time needed to run 1 ns
66.126235488 nS/day

After setting nonbondedCutoff = 0.9 nm: 76.8 ns/day
With the above change and the PDBReporter turned off: 102.7 ns/day
With the above changes and the StateDataReporter also turned off: 106.7 ns/day

Thus, I would recommend making the above changes to make things consistent with OpenMM's benchmark.
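
For readers who want to check the arithmetic behind these figures, here is a small sketch that converts a step count, step size and wall-clock time into ns/day, and then works out the relative gain from each settings change listed above. The function names are illustrative and not part of OpenMM; the input numbers are copied from the two runs quoted in this post.

```python
SECONDS_PER_DAY = 86400.0


def ns_per_day(steps, step_size_fs, elapsed_seconds):
    """Simulated nanoseconds per day of wall-clock time."""
    simulated_ns = steps * step_size_fs * 1e-6  # 1 fs = 1e-6 ns
    return simulated_ns / elapsed_seconds * SECONDS_PER_DAY


def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# benchmark.py: 36294 steps of 2 fs in 58.2649 s
openmm_benchmark = ns_per_day(36294, 2.0, 58.2649)
# jacOpenmm.py: 10000 steps of 2 fs (20 ps) in 26.1318368912 s
jac_script = ns_per_day(10000, 2.0, 26.1318368912)

print(f"benchmark.py: {openmm_benchmark:.1f} ns/day")
print(f"jacOpenmm.py: {jac_script:.1f} ns/day")
print(f"difference:   {percent_change(openmm_benchmark, jac_script):.1f}%")

# Reported throughput after each successive change to jacOpenmm.py.
stages = [
    ("original script", 66.1),
    ("nonbondedCutoff = 0.9 nm", 76.8),
    ("PDBReporter off", 102.7),
    ("StateDataReporter off", 106.7),
]
for (_, previous), (name, current) in zip(stages, stages[1:]):
    print(f"{name}: {percent_change(previous, current):+.1f}% over the previous stage")
```

On these numbers the largest single gain comes from dropping the PDBReporter, roughly a third more throughput than the stage before it.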

Thanks,

- Lee-Ping

Post by **Peter Eastman** » Tue Apr 07, 2015 9:51 am

Also note that the default value OpenMM uses for ewaldErrorTolerance is significantly more accurate than what Amber uses in their published benchmarks. Creating benchmarks that are truly comparable between codes is really, really hard. I've been trying to convince various people in the community to come up with a more rigorously defined set of standard benchmarks, but I haven't gotten a lot of interest from them.
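
To give a feel for why ewaldErrorTolerance matters for speed, the sketch below follows the rule described in the OpenMM user guide for turning the tolerance and cutoff into an Ewald splitting parameter and a PME grid size. Treat it as an approximation under stated assumptions: the 6.2 nm box edge and the looser tolerance of 1e-3 are illustrative values, not settings taken from either benchmark, and the real code also rounds the grid size up to a size that suits the FFT.

```python
import math


def ewald_alpha(cutoff_nm, tolerance):
    """Ewald splitting parameter (1/nm) for a given cutoff and error tolerance."""
    return math.sqrt(-math.log(2.0 * tolerance)) / cutoff_nm


def pme_grid_points(box_edge_nm, cutoff_nm, tolerance):
    """Approximate number of PME grid points along one box edge."""
    alpha = ewald_alpha(cutoff_nm, tolerance)
    return math.ceil(2.0 * alpha * box_edge_nm / (3.0 * tolerance ** 0.2))


BOX_EDGE_NM = 6.2  # assumed box size, for illustration only
CUTOFF_NM = 0.9

for tolerance in (5e-4, 1e-3):  # OpenMM default vs. an assumed looser value
    alpha = ewald_alpha(CUTOFF_NM, tolerance)
    grid = pme_grid_points(BOX_EDGE_NM, CUTOFF_NM, tolerance)
    print(f"tolerance {tolerance:g}: alpha = {alpha:.3f} /nm, about {grid} grid points per edge")
```

A tighter tolerance means a finer grid, and the reciprocal-space cost grows with the cube of the points per edge, so two codes run at different tolerances are not doing the same amount of work.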

Peter

Post by **Lee-Ping Wang** » Tue Apr 07, 2015 10:08 am

Hi Peter,

I know this is getting a bit off topic, but which other settings need to be consistent between codes in order to obtain truly comparable performance (to within 1%, for example)?

Thanks,

- Lee-Ping

Post by **Peter Eastman** » Tue Apr 07, 2015 10:41 am

This is a really complicated question. To begin with, how do you even define "comparable", keeping in mind that two codes may be using different integration algorithms, different constraint algorithms, different precision levels at various points in the computation, different accuracy tradeoffs at various places, etc.? To give one simple example: there are four different parameters that affect the accuracy of PME: the direct space cutoff, grid dimensions, b-spline order, and blending parameter. There is no "right" choice for these. By adjusting two or more at once, you can trade off work between different parts of the calculation while keeping the accuracy of the results constant. And since the relative speeds of those different parts of the calculation may vary between programs, the "optimal" choice will vary.

So what you want to do is define a required accuracy, then use whatever settings give the best performance while achieving that level of accuracy. But how do you define "accuracy"? There are many relevant aspects to it: the relative and absolute error in the forces, the absolute error in the energy, the energy drift, etc. And which aspects of accuracy are most important depends on exactly what sort of problem you're working on.

Peter

Post by **Lee-Ping Wang** » Tue Apr 07, 2015 11:38 am

I can see why it's important to use accuracy as a requirement, but at the same time it's difficult to come up with a single definition.

If we use the total energy and force to define accuracy, that introduces the question of what's the "correct value" that everyone should agree upon. Should we expect two different codes to produce the same energy and force (to within some threshold)? And if so, can we calculate the most precise possible energies / forces and write them to a file to use as the reference?
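
One way to picture such a reference comparison is sketched below. The force values are made-up numbers standing in for a high-precision reference file and a test run, and normalising the RMS error by the RMS reference force is only one of several reasonable definitions.

```python
import math


def rms_relative_force_error(reference, test):
    """RMS of |F_test - F_ref| over all atoms, divided by the RMS of |F_ref|."""
    if len(reference) != len(test):
        raise ValueError("force arrays must cover the same atoms")
    error_sq = sum(
        (t - r) ** 2
        for f_ref, f_test in zip(reference, test)
        for r, t in zip(f_ref, f_test)
    )
    reference_sq = sum(r * r for f_ref in reference for r in f_ref)
    return math.sqrt(error_sq / reference_sq)


# Hypothetical forces in kJ/mol/nm for three atoms (x, y, z components).
reference_forces = [(120.0, -45.0, 10.0), (-80.0, 30.0, 5.0), (-40.0, 15.0, -15.0)]
test_forces = [(120.1, -45.0, 10.0), (-80.0, 29.9, 5.0), (-40.0, 15.0, -15.1)]

print(f"relative RMS force error: {rms_relative_force_error(reference_forces, test_forces):.2e}")
```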

Since we're testing the overall performance of the code for MD simulations, I agree it makes sense to include energy conservation in the definition for accuracy. NVE simulations that truly need to conserve energy have a higher requirement for accurate energies and forces, whereas NVT and NPT simulations still need to have approximate energy conservation ("underneath" the friction and random forces) to prevent unphysical heating or the simulation blowing up. Granted, it's possible for the energies and forces to be wrong but consistent with each other, but that's what the energy / force test is good for.

If my suggestions make sense, then we could measure the absolute / relative force error, absolute energy error, and energy drift per picosecond for single, mixed, and double precision. These could serve as candidate values for the accuracy requirement. What do you think?
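
As a rough illustration of the energy drift measurement, the sketch below fits a straight line to the total energies printed by the StateDataReporter earlier in this thread (the mixed precision run of jacOpenmm.py). Ten samples over 20 ps are far too few for a meaningful result, and the slope only counts as drift if that run used no thermostat, which is an assumption here; the point is only to show the calculation.

```python
def linear_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Time (ps) and total energy (kJ/mol) copied from the reporter output above.
time_ps = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
total_energy = [
    -239899.406296, -239904.69869, -239906.82909, -239898.368741,
    -239904.78353, -239894.670052, -239891.442029, -239885.389987,
    -239912.300116, -239913.258029,
]
N_ATOMS = 23558

drift = linear_slope(time_ps, total_energy)
print(f"fitted slope: {drift:.3f} kJ/mol/ps")
print(f"per atom:     {drift / N_ATOMS:.2e} kJ/mol/ps")
```

With fluctuations of several kJ/mol between samples, a slope fitted to so few points is dominated by noise, which is one reason a benchmark definition would need to fix the run length as well as the threshold.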

In addition, should there be rules for the simulation settings (e.g. time step)? I could imagine fine-tuning the time step to maximize "performance" while staying within an accuracy limit, but I'm not sure that's the right approach.

Post by **Lee-Ping Wang** » Tue Apr 07, 2015 11:41 am

Also: here are some performance comparisons from running on different GPUs. The results come from different machines (separated by ---). On the last machine, you can see the performance increases that come from turning ECC off on the Tesla cards, and from compiling / running with CUDA 6.5 (vs. 5.5).

OpenMM 6.2, single precision
---
CUDA 6.5, Tesla S2050, ECC on: 25.3 ns/day
---
CUDA 6.5, GTX 970: 116.7 ns/day
---
CUDA 6.5, GTX Titan: 122.5 ns/day
---
CUDA 6.5, GTX Titan: 125.4 ns/day
CUDA 6.5, Tesla K20c, ECC off: 88.6 ns/day
CUDA 6.5, GTX 580: 37.1 ns/day
CUDA 5.5, GTX Titan: 115.3 ns/day
CUDA 5.5, Tesla K20c, ECC on: 77.6 ns/day
CUDA 5.5, Tesla K20c, ECC off: 81.2 ns/day
CUDA 5.5, GTX 580: 36.9 ns/day
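
To put the last machine's numbers on a common scale, this short sketch expresses a few of them as ratios. The values are copied from the list above; the choice of which pairs to compare is illustrative, and results from different machines are left out because they are not strictly comparable.

```python
# ns/day on the last machine in the list (OpenMM 6.2, single precision),
# keyed by (CUDA version, GPU description).
last_machine = {
    ("6.5", "GTX Titan"): 125.4,
    ("6.5", "Tesla K20c, ECC off"): 88.6,
    ("6.5", "GTX 580"): 37.1,
    ("5.5", "GTX Titan"): 115.3,
    ("5.5", "Tesla K20c, ECC on"): 77.6,
    ("5.5", "Tesla K20c, ECC off"): 81.2,
    ("5.5", "GTX 580"): 36.9,
}


def ratio(new_key, old_key, table=last_machine):
    """Throughput of new_key divided by throughput of old_key."""
    return table[new_key] / table[old_key]


ecc_off_gain = ratio(("5.5", "Tesla K20c, ECC off"), ("5.5", "Tesla K20c, ECC on"))
cuda_upgrade_gain = {
    gpu: ratio(("6.5", gpu), ("5.5", gpu))
    for gpu in ("GTX Titan", "Tesla K20c, ECC off", "GTX 580")
}
titan_vs_580 = ratio(("6.5", "GTX Titan"), ("6.5", "GTX 580"))

print(f"K20c, ECC off vs on (CUDA 5.5): {ecc_off_gain:.3f}x")
for gpu, gain in cuda_upgrade_gain.items():
    print(f"{gpu}, CUDA 6.5 vs 5.5: {gain:.3f}x")
print(f"GTX Titan vs GTX 580 (CUDA 6.5): {titan_vs_580:.2f}x")
```

On these numbers, turning ECC off and moving to CUDA 6.5 each help the Tesla K20c by several percent, while the GTX 580 barely changes between CUDA versions.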

Post by **Peter Eastman** » Tue Apr 07, 2015 11:57 am

I've attached a writeup I did on this subject some months ago. I've been trying to convince people to help create a new set of benchmarks, but without much success so far. Without this, most MD benchmarks verge on meaningless.

Peter
