CUDA Performance with newer Driver
- Dennis Della Corte
- Posts: 10
- Joined: Wed Apr 17, 2019 7:49 am

Hello,
I attempted to benchmark OpenMM on a new cluster with 8 NVIDIA Quadro RTX 5000 GPUs and CUDA 10.2 installed.
For a test protein of 120 AA in explicit solvent (~50,000 atoms in total), I get a performance of ~5 ns/day on 4 GPUs.
Checking utilization with nvidia-smi shows ~95% GPU usage.
I found a post in this forum stating that I need to install the OpenMM conda package compiled for my specific CUDA version.
Installing omnia/label/cuda101 results in a platform error.
Installing omnia/label/cuda100 allows me to run the benchmark, but with sub-par performance.
My cluster admins insist that CUDA is backward compatible. Could you confirm whether downgrading from CUDA 10.2 to CUDA 10.1 would affect performance?
Thank you,
Dennis
- Peter Eastman
- Posts: 2591
- Joined: Thu Aug 09, 2007 1:25 pm
Re: CUDA Performance with newer Driver
The first thing to check is that it's able to use CUDA. Execute this command:

Code:
python -m simtk.testInstallation

It will list all available platforms and try to do a computation with each one. Does it show that CUDA is working?

Assuming it is, CUDA should be selected by default if you don't explicitly specify a platform to use. But you can check to be sure:

Code:
print(simulation.context.getPlatform().getName())

Assuming it's using CUDA, is 5 ns/day a reasonable speed? That depends on your simulation. If you're using a typical point charge force field and a typical integration step size, that would be very slow, more like what you would expect to see on the CPU. On the other hand, if you're using AMOEBA, that might be a reasonable speed. You can also easily make a simulation very slow by, for example, writing a trajectory frame to disk after every time step. Can you post your script to show exactly what you're doing?

Also make sure nothing else is running at the same time. If some other program is keeping the GPU or CPU busy, that will slow the simulation down a lot.
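If you want to rule out any ambiguity about what the CUDA platform is doing, here is a minimal, self-contained sketch (the settings are only an example, not a prescription) that requests CUDA explicitly and then queries which device and precision mode the context actually got:

Code:
from simtk.openmm import System, VerletIntegrator, Platform, Context
from simtk.unit import picoseconds

# Trivial two-particle system, only so that a Context can be created.
system = System()
system.addParticle(1.0)
system.addParticle(1.0)
integrator = VerletIntegrator(0.002*picoseconds)

# Request the CUDA platform explicitly instead of relying on the default choice.
platform = Platform.getPlatformByName('CUDA')
properties = {'DeviceIndex': '0', 'Precision': 'mixed'}
context = Context(system, integrator, platform, properties)

# Confirm which platform, device, and precision mode were actually selected.
print(context.getPlatform().getName())
print(platform.getPropertyValue(context, 'DeviceIndex'))
print(platform.getPropertyValue(context, 'Precision'))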
- Dennis Della Corte
- Posts: 10
- Joined: Wed Apr 17, 2019 7:49 am
Re: CUDA Performance with newer Driver
Thank you for the speedy reply. Here is a lengthy, but hopefully complete, description of the situation:
python -m simtk.testInstallation:
OpenMM Version: 7.4.1
Git Revision: 068f120206160d5151c9af0baf810384bba8d052
There are 4 Platforms available:
1 Reference - Successfully computed forces
2 CPU - Successfully computed forces
3 CUDA - Successfully computed forces
4 OpenCL - Successfully computed forces
Median difference in forces between platforms:
Reference vs. CPU: 6.29679e-06
Reference vs. CUDA: 6.73139e-06
CPU vs. CUDA: 7.71297e-07
Reference vs. OpenCL: 6.75312e-06
CPU vs. OpenCL: 8.10037e-07
CUDA vs. OpenCL: 2.22903e-07
All differences are within tolerance.
---------
print(simulation.context.getPlatform().getName())
CUDA
---------
# my test script
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout
import time

# make this an explicit water simulation
pdb = PDBFile('./test.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
modeller = Modeller(pdb.topology, pdb.positions)

# specify a box size
modeller.addSolvent(forcefield, padding=1.0*nanometers)
system = forcefield.createSystem(modeller.topology, nonbondedMethod=NoCutoff,
                                 nonbondedCutoff=1*nanometer, constraints=HBonds)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)

# putting multiple GPUs in place
# currently we shall only use 4 GPUs until power is in place
num_gpu = 4
gpu_string = ''
for i in range(num_gpu):
    gpu_string += str(i) + ','
gpu_string = gpu_string[:-1]
platform = Platform.getPlatformByName('CUDA')
properties = {'DeviceIndex': gpu_string, 'Precision': 'double'}
simulation = Simulation(modeller.topology, system, integrator, platform, properties)
simulation.context.setPositions(modeller.positions)

print('Start energy minimization')
simulation.minimizeEnergy()
print('End energy minimization')

simulation.reporters.append(PDBReporter('output.pdb', 10000))
simulation.reporters.append(StateDataReporter(stdout, 10, step=True,
                                              potentialEnergy=True, temperature=True))

# running 2 fs steps
# 1 ns / 0.002 ps = 500,000 steps
start = time.time()
simulation.step(50000)  # 0.1 ns
end = time.time()
with open('./performance_' + str(num_gpu) + '.txt', 'w') as f:
    f.write(str(start) + '\n')
    f.write(str(end) + '\n')
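For reference, the ns/day numbers quoted later in this thread follow from these two timestamps as (nanoseconds simulated) / (wall-clock seconds) × 86,400. A small post-processing sketch (not part of the original script; the file name is assumed from the num_gpu=4 run above):

Code:
# Convert the raw start/end timestamps written above into ns/day.
# The run above simulates 50,000 steps x 2 fs = 0.1 ns.
with open('./performance_4.txt') as f:
    start, end = (float(line) for line in f)

elapsed_seconds = end - start
ns_per_day = 0.1 / elapsed_seconds * 86400
print('ns/day =', ns_per_day)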
------
test.pdb is a hydrolyzed globular protein with 163 AA and 2,497 atoms.
The solvated system written to output.pdb has a total size of 50,128 atoms.
Here is the GPU usage:
(base) edendc@ESCP-GPU01:/data/OpenMM/tests$ nvidia-smi
Tue Mar 10 21:17:04 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.44       Driver Version: 440.44       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro RTX 5000      On  | 00000000:1A:00.0 Off |                  Off |
| 35%   57C    P2    91W / 230W |    146MiB / 16125MiB |     65%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro RTX 5000      On  | 00000000:1C:00.0 Off |                  Off |
| 38%   62C    P2   102W / 230W |    138MiB / 16125MiB |     94%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Quadro RTX 5000      On  | 00000000:1D:00.0 Off |                  Off |
| 34%   59C    P2   121W / 230W |    138MiB / 16125MiB |     91%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Quadro RTX 5000      On  | 00000000:1E:00.0 Off |                  Off |
| 39%   63C    P2   115W / 230W |    138MiB / 16125MiB |     95%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Quadro RTX 5000      On  | 00000000:3D:00.0 Off |                  Off |
| 33%   33C    P8    16W / 230W |     11MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Quadro RTX 5000      On  | 00000000:3F:00.0 Off |                  Off |
| 33%   33C    P8     8W / 230W |     11MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Quadro RTX 5000      On  | 00000000:40:00.0 Off |                  Off |
| 33%   37C    P8    17W / 230W |     11MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Quadro RTX 5000      On  | 00000000:41:00.0 Off |                  Off |
| 33%   35C    P8    17W / 230W |     11MiB / 16125MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     20176      C   python                                       135MiB |
|    1     20176      C   python                                       127MiB |
|    2     20176      C   python                                       127MiB |
|    3     20176      C   python                                       127MiB |
+-----------------------------------------------------------------------------+
------------
The cluster is dedicated to OpenMM simulations, and nothing else is running in parallel.
Any advice would be appreciated.
Thank you,
Dennis
- Peter Eastman
- Posts: 2591
- Joined: Thu Aug 09, 2007 1:25 pm
Re: CUDA Performance with newer Driver
There are a few different things contributing to slow performance. Here's a really big one:

Code:
system = forcefield.createSystem(modeller.topology, nonbondedMethod=NoCutoff,
                                 nonbondedCutoff=1*nanometer, constraints=HBonds)

You don't really mean NoCutoff, do you? That is simulating a nonperiodic system (not a box of water, just a droplet of water floating in space, with nothing to prevent molecules from diffusing away). And since there's no cutoff, it's computing the full matrix of 50,000 x 50,000 interactions at every step. That's really slow.

Another big issue is that you're using double precision. The Quadro RTX 5000 has a 32:1 ratio of single- to double-precision performance, so doing everything in double is really slow. Try switching the precision mode to 'mixed'. That should be a lot faster.

One other thing to note is that scaling with the number of GPUs is very poor. Using two GPUs tends to be significantly faster than using one, but not twice as fast. Anything beyond that is as likely to slow the simulation down as to speed it up. It's usually better to just run a separate simulation on each GPU.
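For example, something along these lines (a sketch reusing the forcefield and modeller objects from your script; PME is one common choice for a periodic explicit-solvent box, and the exact settings here are only illustrative, not a prescription):

Code:
# Periodic electrostatics with a 1 nm cutoff instead of NoCutoff.
system = forcefield.createSystem(modeller.topology, nonbondedMethod=PME,
                                 nonbondedCutoff=1*nanometer, constraints=HBonds)
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)

# Mixed precision on a single GPU; running one independent simulation per GPU
# usually scales better than spreading one simulation across several devices.
platform = Platform.getPlatformByName('CUDA')
properties = {'DeviceIndex': '0', 'Precision': 'mixed'}
simulation = Simulation(modeller.topology, system, integrator, platform, properties)
simulation.context.setPositions(modeller.positions)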
- Peter Eastman
- Posts: 2591
- Joined: Thu Aug 09, 2007 1:25 pm
Re: CUDA Performance with newer Driver
Oh, one other thing. Try increasing the reporting interval on the StateDataReporter. Asking it to report the energy every 10 steps is going to affect performance a bit, though not as much as the other things above.
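For example (the interval here is just an illustration; simulation and stdout are as in your script):

Code:
# Report every 10,000 steps (20 ps at a 2 fs time step) instead of every 10 steps.
simulation.reporters.append(StateDataReporter(stdout, 10000, step=True,
                                              potentialEnergy=True, temperature=True))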
- Dennis Della Corte
- Posts: 10
- Joined: Wed Apr 17, 2019 7:49 am
Re: CUDA Performance with newer Driver
Thank you, Peter, this was very helpful. We now have CUDA 10.0 installed, and I get the following performance, in agreement with your description:
Number GPU 1, ns / day = 158.2310109781112
Number GPU 2, ns / day = 214.0911588615499
Number GPU 3, ns / day = 195.70180466219216
Number GPU 4, ns / day = 174.58727532667982
The issue is solved on my end. Thanks again!