OpenMM on multiple GPUs in one node

OpenMM on multiple GPUs in one node

Post by Michael Schauperl » Fri May 02, 2014 1:07 am

Hi Peter,

I tried to run an OpenMM calculation on two GPUs in one node. Reserving the whole node in the queueing system seems to work, but I get an error message after a few seconds. Is my piece of Python code to start OpenMM correct?

Input file:
from __future__ import print_function
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout
import os
import time
import datetime
import sys
from progressreporter import ProgressReporter
from restartreporter import *
import pickle
Platform.loadPluginsFromDirectory(Platform.getDefaultPluginsDirectory())
pdb=PDBFile('ice-water.pdb')
forcefield = ForceField('amoeba2009.xml')
platform = Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed','CudaDeviceIndex':'0,1'}


Error Message:
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
terminate called after throwing an instance of 'OpenMM::OpenMMException'
what(): Error uploading array posq: CUDA_ERROR_INVALID_CONTEXT (201)
0,1
Equilibrating
terminate called recursively
Abort (core dumped)

or

Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
terminate called recursively
Equilibrating
terminate called after throwing an instance of 'OpenMM::OpenMMException'
Abort (core dumped)

Any idea what could be wrong? Thanks for your help again.

Michael

Re: OpenMM on multiple GPUs in one node

Post by Peter Eastman » Fri May 02, 2014 10:20 am

Hi Michael,

I'm not sure what's causing that. Just to make sure I understand: if you specify only one GPU (either 0 or 1) it works correctly. The error only occurs when you tell it to use both at once. Can you confirm that?

Clusters and queuing systems treat GPUs in a wide variety of ways. Sometimes they just give all jobs direct access to all the GPUs. Sometimes they try to virtualize access to them - if you request two GPUs for example, you will always see exactly two GPUs with indices 0 and 1, even though the machine might have eight GPUs and the particular two you're accessing are actually numbers 4 and 7. Sometimes they run the GPUs in exclusive mode so as soon as a job starts running on one, no other process can access that GPU until it finishes. Sometimes they don't.
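If you're not sure what your job actually sees, a quick check from inside the job (just a sketch; it assumes your scheduler exposes the assigned GPUs through CUDA_VISIBLE_DEVICES, which many schedulers do) would be something like:

Code: Select all

import os
import subprocess

# Many schedulers restrict a job to its assigned GPUs via this variable; when it
# is set, the CudaDeviceIndex values you pass to OpenMM are relative to that list.
print(os.environ.get('CUDA_VISIBLE_DEVICES', 'not set'))

# List the GPUs the node physically has (requires the NVIDIA driver tools).
print(subprocess.check_output(['nvidia-smi', '-L']))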

Do you have any information about how this machine manages GPU resources? And how are you requesting those resources in your job?

Peter

Re: OpenMM on multiple GPUs in one node

Post by Michael Schauperl » Mon May 05, 2014 6:04 am

Hi Peter,

We have 2 GPUs per node. Normally you can run two different jobs, one on each GPU. While a job is running on a GPU, no other job can access that GPU. We have already tested using both GPUs with other programs (e.g. Amber) and it worked. It seems to me that the option 'CudaDeviceIndex':'0,1' is not working properly. I can also set CudaDeviceIndex to 79 and the program still works, which makes absolutely no sense to me, because I do not have 80 GPUs in my node.
Do you have any idea what might be wrong?

Thanks,

Michael

Re: OpenMM on multiple GPUs in one node

Post by Peter Eastman » Mon May 05, 2014 11:11 am

If you specify a nonexistent device index, it just ignores the value you passed and selects one automatically. If you want to determine which device or devices it's really using, you can query the property after your context is created:

Code: Select all

context.getPlatform().getPropertyValue(context, "CudaDeviceIndex")
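In your script that would look something like this (a sketch, using the Simulation object from your input file):

Code: Select all

simulation = Simulation(pdb.topology, system, integrator, platform, properties)
# Ask the platform which device(s) the context actually ended up on.
print(simulation.context.getPlatform().getPropertyValue(simulation.context, "CudaDeviceIndex"))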
Peter

Re: OpenMM on multiple GPUs in one node

Post by Michael Schauperl » Tue Jul 15, 2014 5:02 am

Hi Peter,

I still have a problem with running on multiple GPUs in one node. Actually, the energies in the calculation are computed wrong; they are way too high. I checked my input file by running the same calculation on one GPU, which worked fine.
I am copying my input file in here, so maybe you can give me a hint.

from __future__ import print_function
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout
import os
import time
import datetime
import sys
from progressreporter import ProgressReporter
from restartreporter import *
import pickle
Platform.loadPluginsFromDirectory(Platform.getDefaultPluginsDirectory())
pdb=PDBFile('water.pdb')
forcefield = ForceField('amber99sbildn.xml','tip3p.xml')
platform = Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed','CudaDeviceIndex':'0,1'}
print (properties['CudaDeviceIndex'])
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME , constraints=None, rigidWater=False, polarization='mutual',mutualInducedTargetEpsilon=1e-05, vdwCutoff=0.9*nanometer,nonbondedCutoff=0.7*nanometer)
pressure=Vec3(1*bar,1*bar,1*bar)
system.addForce(MonteCarloAnisotropicBarostat(pressure,230*kelvin))
integrator = LangevinIntegrator(230*kelvin, 1/picosecond, 0.5*femtoseconds)
simulation = Simulation(pdb.topology, system, integrator,platform, properties)
simulation.context.setPositions(pdb.positions)
simulation.context.setPositions(pdb.positions)
simulation.reporters.append(ProgressReporter('e0.log', 400,200000))
simulation.reporters.append(StateDataReporter('e0.dat', 400, step=True, time=True, potentialEnergy=True, kineticEnergy=True, totalEnergy=True, temperature=True, volume=True, density=True))
simulation.reporters.append(DCDReporter('e0.dcd', 400))
simulation.reporters.append(RestartReporter('e0.restart',400))

print ('Equilibrating')
for i in range(500):
    integrator.setTemperature(230*kelvin)
    simulation.step(400)

coordinates=simulation.context.getState(getPositions=True).getPositions()
PDBFile.writeFile(simulation.topology, coordinates, open('e0r.pdb','w'))
state = simulation.context.getState(getPositions=True)

Re: OpenMM on multiple GPUs in one node

Post by Jason Swails » Tue Jul 15, 2014 9:44 am

What version of OpenMM are you using? The 6.0.1 release introduced a fix for multiple GPUs. If you fire up a Python interpreter, you should be able to find the version you're using:

Code: Select all

Python 2.7.6 (default, May 27 2014, 08:22:48) 
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simtk.openmm as mm
>>> mm.version.full_version
'6.0.1.dev-Unknown'
If you have version 6.0, it will look like:

Code: Select all

Python 2.7.6 (default, May 27 2014, 08:22:48) 
[GCC 4.7.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import simtk.openmm as mm
>>> mm.version.full_version
'6.0.0.dev-Unknown'
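If you would rather not start an interactive session, the same check works as a one-liner from the shell (sketch):

Code: Select all

python -c "import simtk.openmm as mm; print(mm.version.full_version)"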

Re: OpenMM on multiple GPUs in one node

Post by Michael Schauperl » Thu Jul 17, 2014 12:50 am

Hi Jason,

It seems that I am running 6.0.0, so I will update my version and try again. Thanks for your help.

Re: OpenMM on multiple GPUs in one node

Post by Michael Schauperl » Wed Aug 20, 2014 5:35 am

Hi again,

I am now running the latest version of OpenMM, but I still have problems with running calculations on multiple GPUs.
My input file:
from __future__ import print_function
from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout
import os
import time
import datetime
import sys
from progressreporter import ProgressReporter
from restartreporter import *
import pickle
Platform.loadPluginsFromDirectory(Platform.getDefaultPluginsDirectory())
print (version.full_version)
pdb=PDBFile('test.pdb')
forcefield = ForceField('amber99sbildn.xml','tip3p.xml')
platform = Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'double','CudaDeviceIndex':'0,1'}
print (properties['CudaDeviceIndex'])
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME , constraints=None, rigidWater=False, polarization='mutual',mutualInducedTargetEpsilon=1e-05, vdwCutoff=0.9*nanometer,nonbondedCutoff=0.7*nanometer)
pressure=Vec3(1*bar,1*bar,1*bar)
system.addForce(MonteCarloAnisotropicBarostat(pressure,230*kelvin))
integrator = LangevinIntegrator(230*kelvin, 1/picosecond, 0.5*femtoseconds)
simulation = Simulation(pdb.topology, system, integrator,platform, properties)
simulation.context.setPositions(pdb.positions)
simulation.context.setPositions(pdb.positions)
simulation.reporters.append(ProgressReporter('e0.log', 400,200000))
simulation.reporters.append(StateDataReporter('e0.dat', 400, step=True, time=True, potentialEnergy=True, kineticEnergy=True, totalEnergy=True, temperature=True, volume=True, density=True))
simulation.reporters.append(DCDReporter('e0.dcd', 400))
simulation.reporters.append(RestartReporter('e0.restart',400))
f1=open('./multipoles0.dat', 'w')
f2=open('./coordinates0.dat', 'w')
print ('Equilibrating')
#print(simulation.context.getPlatform().getPropertyValue(context, "CudaDeviceIndex"))
print(simulation.context.getPlatform())
print(simulation.context.getPlatform().getPropertyValue(simulation.context, "CudaDeviceIndex"))
print (properties['CudaDeviceIndex'])
for i in range(500):
    integrator.setTemperature(230*kelvin)
    simulation.step(400)
    state=simulation.context.getState(getPositions=True,getVelocities=True)
    positions=state.getPositions()
    print(positions,file=f2)

f2.close()
f1.close()
coordinates=simulation.context.getState(getPositions=True).getPositions()
PDBFile.writeFile(simulation.topology, coordinates, open('e0r.pdb','w'))
state = simulation.context.getState(getPositions=True)


It seems that my box is exploding as the simulation progresses, and the energy is orders of magnitude higher than in the calculation on one GPU. Any guesses what could still be wrong?

Thanks,

Michael

Re: OpenMM on multiple GPUs in one node

Post by Peter Eastman » Sat Aug 23, 2014 10:17 am

Hi Michael,

We recently fixed a bug that I think may be the cause of this problem. Can you try the latest code from the GitHub repository and see if it now works correctly for you?

By the way, in your script you're using an AMBER force field, but when you call createSystem() you specify some options that are only relevant to AMOEBA force fields (mutualInducedTargetEpsilon, vdwCutoff). Those options are just being ignored, which isn't necessarily a problem. But I don't think you really want to be using a cutoff of only 0.7 nm with an AMBER force field.
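For comparison, a more conventional setup for this force field might look something like the following (just a sketch; the 0.9 nm cutoff is only an example value, not a recommendation for your particular system):

Code: Select all

# Sketch of an AMBER/TIP3P system without the AMOEBA-only options;
# pick the nonbonded cutoff you actually intend to use.
system = forcefield.createSystem(pdb.topology,
                                 nonbondedMethod=PME,
                                 nonbondedCutoff=0.9*nanometer,
                                 constraints=None,
                                 rigidWater=False)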

Peter
