problem with multi-gpu on CUDA & OpenCL

Andrish Reddy
Posts: 13
Joined: Sun Jan 18, 2015 5:22 pm


Post by Andrish Reddy » Tue Feb 24, 2015 10:54 am

Hi,

I am testing a dual-GPU setup on OpenMM 6.2 using the precompiled binaries for linux64. I am using two stock-clocked GTX 970 GPUs in PCIe x16 slots.
Any simulation I run with CUDA (6.5) or OpenCL, including the ones in the examples folder, runs fine when I select either gpu0 or gpu1. However, when I set the device index to use both GPUs, the kinetic energy of the system rises uncontrollably. This happens in single or mixed precision and with CpuPME on or off. Also, the multi-GPU simulation is much slower than the single-GPU simulation. Below is the simulatePdb.py example modified to use OpenCL.


from simtk.openmm.app import *
from simtk.openmm import *
from simtk.unit import *
from sys import stdout

pdb = PDBFile('input.pdb')
forcefield = ForceField('amber99sb.xml', 'tip3p.xml')
system = forcefield.createSystem(pdb.topology, nonbondedMethod=PME, nonbondedCutoff=1*nanometer, constraints=HBonds)
platform = Platform.getPlatformByName('OpenCL')
# '0,1' requests both GPUs; '0' or '1' alone selects a single device
properties = {'OpenCLDeviceIndex': '0,1'}
integrator = LangevinIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)
simulation = Simulation(pdb.topology, system, integrator, platform, properties)
simulation.context.setPositions(pdb.positions)
simulation.minimizeEnergy()
simulation.reporters.append(StateDataReporter(stdout, 50, step=True, potentialEnergy=True, temperature=True))
simulation.step(10000)

Below is the log for this run:

#"Step","Potential Energy (kJ/mole)","Temperature (K)"
50,-143289.496095,245.909874775
100,-139958.111899,272.110301775
150,-114215.561385,316.700758196
200,-107336.857917,384.065929372
250,-104466.09974,423.93340308
300,-98868.0196085,466.339940266
350,-97775.0420195,471.56849486
400,-96959.4877038,521.399823597
450,-95044.0019795,520.631562494
500,-90366.1079388,537.39512434
550,-87967.6253451,587.143930602
600,-85784.9493586,638.835755803
650,-83266.0865824,650.249549937
700,-82771.6383787,673.545471937
750,-83708.5099203,648.471772098
800,-82218.7013378,696.999031019
850,-79660.4578973,701.902414653
900,-78027.1815105,730.205523121
950,-81111.2690258,717.382583402
1000,-78305.0151408,761.715969507
1050,-78424.8226178,775.684179493
1100,-76297.2681665,766.471859579
1150,-74328.9892095,790.375758979
1200,-75652.9245279,771.45014148
1250,-76100.4589809,794.456547983
1300,-74242.3057737,786.647721285
1350,-73308.8197386,794.000280236
1400,-91139.9807079,798.15475238
1450,-73693.4997774,823.836095576
1500,-73564.1600582,801.659993215
1550,-71963.4262446,832.644767762
1600,-82088.7358149,842.411864668
1650,-71027.1576225,875.615962399
1700,-71045.3981724,835.436132064
1750,-73115.6100949,821.712583083
1800,-72885.9820441,824.316882226
1850,-71299.1280814,839.802124269
1900,-73749.3635795,847.46541413
1950,-70593.5485047,874.994929085
2000,-71076.6289913,860.427923095
2050,-69250.9215332,929.020551717
2100,-69688.0195882,858.468196127
2150,-70145.9495218,864.881042899
2200,-70070.6420508,855.229989759
2250,-70972.6404716,876.684647044
2300,-73368.6394847,847.456986144
2350,-70498.6653079,852.211373404
2400,-73393.7433269,841.540402402
2450,-83143.2791198,799.580575518
2500,-75310.6667288,789.212479949
2550,-72520.8406691,818.409262403
2600,-74312.7182005,803.158939944
2650,-72507.156244,865.648296731
2700,-83778.9230248,876.947116398
2750,-68841.9629012,882.184384218
2800,-71474.1287918,860.159650762
2850,-80251.9529956,871.258410504
2900,-71011.1520459,843.141483623
2950,-70853.0838868,875.437565148
3000,-74070.3863632,814.734005975
3050,-70348.2954379,845.553308979
3100,-79673.7381061,841.697885657
3150,-70795.4960605,869.421752352
3200,-71930.0650271,830.136364884
3250,-85659.6520246,862.083616621
3300,-71806.3763306,830.34934637
3350,-90160.9342012,878.953110064
3400,-73416.5672486,819.468863456
3450,-84283.4001168,840.965943964
3500,-69339.2326747,854.424228509
3550,-90164.5280023,934.617927424
3600,-70229.6272752,849.123235113
3650,-83484.7279545,857.351014656
3700,-70870.8985552,856.945681356
3750,-69656.3367639,879.103331461
3800,-69609.316914,857.730569072
3850,-68854.9046235,910.597119969
3900,-69659.9115024,871.541948181
3950,-71197.8759847,888.509047393
4000,-71716.9950954,840.165942864
4050,-88316.9326674,921.924504645
4100,-71586.1396381,846.304444105
4150,-71038.9713079,869.182113823
4200,-72078.7201034,838.25605521
4250,-71604.8033464,864.11913068
4300,-76116.0491979,795.584177477
4350,-72687.589402,857.699514705
4400,-73572.2536795,801.075240778
4450,-87161.9384555,821.515099771
4500,-73758.8648828,800.158419468
4550,-84513.7475515,820.848014074
4600,-73123.2134601,822.821363391
4650,-70376.2537894,892.164312496
4700,-70402.2929489,877.516147283
4750,-89344.8687632,943.255986442
4800,-69372.6904062,888.22516327
4850,-70329.4815555,903.251696993
4900,-69139.2543827,876.986171447
4950,-70827.7149192,869.568221994
5000,-71029.9541367,859.12976914
5050,-71210.0839912,863.165272061
5100,-69262.268479,863.491782618
5150,-72836.4768562,831.511986823
5200,-72668.0347459,810.498447277
5250,-73877.7427845,829.617598971
5300,-72686.1775976,818.567172533
5350,-73806.719405,841.040065123
5400,-71403.0778301,837.387020268
5450,-89014.8606959,826.241859412
5500,-72909.8028232,812.104810315
5550,-85333.5246464,794.629823659
5600,-74059.4217066,790.930816395
5650,-75308.7362964,808.220111028
5700,-78394.2786532,789.584946595
5750,-76855.1562189,793.692985673
5800,-74293.9862375,808.502352956
5850,-89037.606709,788.505736672
5900,-76038.9314105,775.76984851
5950,-75524.2083062,761.426843251
6000,-78226.5324346,757.209273789
6050,-78047.2205819,724.690834857
6100,-74789.4329267,760.494125784
6150,-90567.9840172,746.277569621
6200,-74587.1893491,778.450696001
6250,-73455.2479583,775.215970633
6300,-73380.5103557,793.948306551
6350,-95721.6537123,846.850942183
6400,-71956.0088345,826.900302488
6450,-72049.9516263,822.397939443
6500,-70848.8983282,849.041694517
6550,-73482.3209777,862.030748403
6600,-70479.2953698,860.625532421
6650,-85477.7938617,844.507290471
6700,-72222.3771003,837.635098417
6750,-73765.3116153,797.377246393
6800,-73638.3397141,804.790715625
6850,-76636.9507863,784.190007245
6900,-75240.7675809,798.310722097
6950,-76707.6677924,808.335424669
7000,-74942.5528233,792.808577986
7050,-88451.6503339,765.057649205
7100,-73102.322468,796.645669549
7150,-92643.6579077,847.174196692
7200,-72366.4929135,809.728585104
7250,-74933.4027077,809.959201734
7300,-87034.4458326,804.623994087
7350,-75815.2900243,801.181540709
7400,-74815.5444403,819.571673461
7450,-85554.9279682,815.733170394
7500,-70533.2135903,820.644045352
7550,-89992.6055044,819.125594946
7600,-75883.2718151,820.195136913
7650,-87605.4006845,796.095474391
7700,-70377.8972883,851.770040389
7750,-84986.7565146,832.770041721
7800,-74658.2002299,843.850803664
7850,-75083.5863518,786.04919093
7900,-76414.4294043,797.739403266
7950,-90289.1056992,768.122216763
8000,-74823.4145554,767.748739463
8050,-74602.8822004,802.58744397
8100,-84695.9747501,812.500913581
8150,-73426.370676,810.275221733
8200,-82373.5461226,880.138685803
8250,-71945.1301941,841.225059487
8300,-81966.9222907,888.201934104
8350,-71793.9551773,837.712882994
8400,-88674.1634848,873.545329264
8450,-73555.7890838,823.824245598
8500,-73787.429539,878.802716353
8550,-73758.92614,807.853000147
8600,-90407.7050133,866.238160612
8650,-73080.5317352,806.698543017
8700,-70294.2756294,890.162763837
8750,-72790.5907983,810.531875176
8800,-85071.952518,826.35759962
8850,-87315.9941307,799.303493921
8900,-89390.6509776,837.610193812
8950,-75333.3542301,788.973102745
9000,-81802.9024955,832.847465065
9050,-73787.9508068,805.129056759
9100,-71541.6428496,858.474671182
9150,-81597.7397971,797.563634226
9200,-88507.5460082,829.370002908
9250,-74556.1666884,763.640344431
9300,-73764.8718865,780.051937396
9350,-87597.8904795,791.380541878
9400,-75193.9266375,817.014901876
9450,-75578.1964298,769.030114522
9500,-72904.8180273,818.069441881
9550,-73152.5114506,803.793015288
9600,-70140.5647555,826.297013495
9650,-71152.851086,826.141065039
9700,-71952.3293094,860.246494006
9750,-71944.8677021,868.362989304
9800,-72922.5703675,849.771055715
9850,-70013.8291791,880.599524441
9900,-68920.9863379,908.022914058
9950,-69670.7628336,886.332401214
10000,-70538.7510535,889.697249218

Using either gpu0 or gpu1 gives a reasonable output:

#"Step","Potential Energy (kJ/mole)","Temperature (K)"
50,-144038.230331,29.7016278585
100,-141979.363895,54.3421587212
150,-144264.313609,76.9198547415
200,-138532.858803,98.5802927664
250,-137022.435907,117.144894862
300,-138772.048779,132.593400148
350,-141054.451733,146.766047238
400,-133143.947551,158.791028917
450,-138267.167038,171.39923687
500,-131386.796466,182.611224753
550,-131889.423329,191.552669102
600,-131656.380868,203.228856798
650,-130739.689487,211.236392348
700,-128034.986659,215.680223016
750,-127246.85335,219.475844756
800,-126535.182007,225.977608659
850,-126015.909553,231.902358623
900,-125268.690212,235.369009796
950,-125133.208429,242.955507101
1000,-124633.621744,248.253045893
1050,-123921.427156,245.144481312
1100,-125413.337368,254.01650413
1150,-123225.075615,254.049942569
1200,-122966.359534,260.609602825
1250,-122398.88067,261.641411626
1300,-121843.599713,257.859847483
1350,-121790.534663,264.759274416
1400,-120995.197344,261.509599215
1450,-125354.195058,270.087750093
1500,-120610.012462,272.501442256
1550,-120307.78734,274.28856258
1600,-122033.148111,274.790181189
1650,-120755.232246,278.173477373
1700,-119653.443631,279.795800395
1750,-119393.273863,284.899060797
1800,-118647.379447,279.957062596
1850,-118843.664528,288.678443787
1900,-118426.597834,286.49542428
1950,-117891.65691,282.759337203
2000,-116873.210532,288.886811191
2050,-117968.110865,290.64310007
2100,-119783.265175,288.733935302
2150,-117369.289447,287.413348069
2200,-117507.616832,288.6354119
2250,-117587.481323,296.873956863
2300,-117436.022975,292.469152005
2350,-117207.756921,289.382997338
2400,-117375.014582,295.024873547
2450,-119417.52416,294.195605623
2500,-117209.986148,295.807594445
2550,-118748.994192,295.659864237
2600,-116689.798313,293.028518896
2650,-116703.069377,294.981240272
2700,-116822.228703,299.464246114
2750,-116467.857525,294.22802257
2800,-116830.520484,298.075808663
2850,-116430.707662,293.045421183
2900,-118001.369655,293.632488647
2950,-117609.390319,292.776826911
3000,-116822.120177,300.078003769
3050,-118584.470858,295.134584254
3100,-116684.731993,297.932731431
3150,-122981.675178,298.555360906
3200,-116432.128948,294.500468844
3250,-117907.919013,292.018799925
3300,-116300.674196,290.571190227
3350,-116367.376027,295.456512819
3400,-116096.598651,293.233795074
3450,-118079.76743,293.080003737
3500,-116227.656935,296.39336987
3550,-118683.1218,298.406073232
3600,-117676.041288,292.470581794
3650,-117363.656628,296.689158121
3700,-115954.491296,298.195817129
3750,-116002.324959,299.603566688
3800,-115697.675229,294.941071243
3850,-120033.160028,296.711412741
3900,-115533.432842,296.524143804
3950,-115730.949765,295.705260897
4000,-115717.358506,297.195142773
4050,-115607.332071,298.67849151
4100,-116683.558748,298.91558647
4150,-115155.989015,302.34919872
4200,-115916.8065,297.946403974
4250,-115842.477997,300.024689148
4300,-115721.02429,299.084486819
4350,-115592.462946,302.329066093
4400,-115221.679957,298.394498898
4450,-115590.457487,300.267345566
4500,-115577.562796,298.04767687
4550,-115311.204462,298.583765104
4600,-119785.803493,301.294561161
4650,-115531.496738,298.630296532
4700,-117728.860508,300.017796217
4750,-115585.986797,299.078524187
4800,-115611.348933,296.768281506
4850,-116023.200624,299.933780906
4900,-115897.311817,301.52772078
4950,-119525.823465,299.535513213
5000,-115979.624905,296.130373713
5050,-116048.286814,295.351145459
5100,-115457.811572,297.989372216
5150,-117453.721071,297.193157684
5200,-120388.85024,301.485934375
5250,-115588.853853,295.813281196
5300,-115749.577651,297.333010983
5350,-115759.296119,300.431574211
5400,-115860.403017,298.935410614
5450,-117025.914,299.6594536
5500,-115786.010286,296.166086932
5550,-115104.536892,293.540213475
5600,-115999.352626,297.906270631
5650,-115917.395217,295.743690641
5700,-116116.070001,299.11814454
5750,-119565.888103,296.482351813
5800,-117309.597463,291.713724333
5850,-116125.608714,300.893699215
5900,-118068.275838,298.2837907
5950,-116264.065352,303.867449549
6000,-116106.704936,302.115467085
6050,-116045.150535,299.515851409
6100,-115441.740255,297.086649857
6150,-115440.075545,297.799504964
6200,-115570.348665,299.681283913
6250,-115180.055754,297.966216145
6300,-115780.939967,298.027853813
6350,-117969.757851,300.620423903
6400,-120161.006542,301.726658957
6450,-115789.522743,302.172920516
6500,-115714.896977,300.557437931
6550,-115474.489584,295.357548784
6600,-115609.604065,300.853184605
6650,-115746.189173,305.032695098
6700,-115604.541931,301.251513527
6750,-115516.037632,300.377321849
6800,-115696.075402,299.318929493
6850,-115609.283643,299.059238365
6900,-115455.727873,296.674342781
6950,-115177.579639,296.588669422
7000,-115616.969134,301.299154927
7050,-115283.181303,300.378703398
7100,-115549.447068,299.570043543
7150,-115244.980676,297.180087083
7200,-117481.187841,301.724383032
7250,-115448.847516,299.89082561
7300,-119262.066458,297.726813739
7350,-115717.341403,302.772202343
7400,-117585.326111,300.775597512
7450,-115524.555854,303.276651098
7500,-115335.941554,300.333912413
7550,-115411.112662,299.157376255
7600,-115644.810176,303.445642034
7650,-115382.95087,299.078048391
7700,-115360.203917,298.095040761
7750,-115570.070595,294.097268106
7800,-115919.855035,299.27934908
7850,-115871.066919,297.12245534
7900,-115963.95605,298.996674788
7950,-115891.29473,297.597330088
8000,-115888.191323,298.00115643
8050,-115670.724616,297.075977075
8100,-120673.307803,298.232185812
8150,-117645.874539,302.376079124
8200,-116963.938658,294.028235383
8250,-121350.750057,297.665019201
8300,-115297.828939,296.98496007
8350,-115465.66046,297.847136404
8400,-115569.137423,298.160253389
8450,-115667.679317,297.862209625
8500,-117476.544597,297.477908089
8550,-115420.982999,293.372578397
8600,-115561.310765,293.942340428
8650,-115294.205081,298.541749964
8700,-115448.981786,295.983655273
8750,-115620.148335,298.774256148
8800,-115636.096743,301.502043396
8850,-115472.33918,296.225803965
8900,-115512.943698,297.096575403
8950,-115472.355282,296.299449394
9000,-115325.024422,296.626857919
9050,-117086.626411,296.514182277
9100,-115383.050226,297.629508096
9150,-114262.369625,298.996309508
9200,-115699.296276,301.499924434
9250,-115328.874061,297.538793883
9300,-115786.733324,300.32706895
9350,-115496.719189,299.539848615
9400,-115441.558127,298.407783968
9450,-116740.226308,294.260296158
9500,-118886.105416,301.058058388
9550,-119505.693938,302.141812247
9600,-115303.675389,300.901786796
9650,-115276.327589,299.268576278
9700,-115543.29479,305.625341024
9750,-116847.081224,305.317224619
9800,-115255.059419,301.831340197
9850,-115088.087674,301.220768769
9900,-115268.778455,299.447166793
9950,-117364.766989,303.719075919
10000,-114820.598797,299.871495431

As I mentioned, this happens on both OpenCL & CUDA.
Suggestions welcome!

Andrish
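
The two logs in this post differ mainly in where the temperature settles, so the comparison can be scripted rather than read by eye. The sketch below is only an illustration: the function name `flag_temperature_drift`, the 25% tolerance, and the 2500-step warm-up cutoff are assumptions chosen for this example, not anything provided by OpenMM. It parses the comma-separated StateDataReporter output shown above and returns the rows whose temperature is far from the 300 K thermostat setting.

```python
import csv
import io


def flag_temperature_drift(log_text, target_kelvin=300.0, tolerance=0.25, warmup_steps=2500):
    """Return (step, temperature) pairs that lie outside target * (1 +/- tolerance).

    Rows before `warmup_steps` are ignored so the initial heating after
    minimization is not reported as drift. All thresholds are illustrative.
    """
    reader = csv.reader(io.StringIO(log_text))
    # The header looks like: #"Step","Potential Energy (kJ/mole)","Temperature (K)"
    header = [name.lstrip('#').strip('"') for name in next(reader)]
    step_col = header.index('Step')
    temp_col = header.index('Temperature (K)')

    flagged = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        step = int(row[step_col])
        temperature = float(row[temp_col])
        if step >= warmup_steps and abs(temperature - target_kelvin) > tolerance * target_kelvin:
            flagged.append((step, temperature))
    return flagged
```

With these thresholds, the dual-GPU log should have every row past the warm-up flagged, while the single-GPU log should have none.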

Peter Eastman
Posts: 2583
Joined: Thu Aug 09, 2007 1:25 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Peter Eastman » Thu Feb 26, 2015 1:04 pm

I can reproduce this with the OpenCL platform, but not the CUDA platform. I'm looking into it now. It's a little awkward since my only access to a multi-GPU box is through a queuing system, not a machine I can log into directly.

Could you confirm whether you have the same problem with CUDA?

Peter

Andrish Reddy
Posts: 13
Joined: Sun Jan 18, 2015 5:22 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Andrish Reddy » Thu Feb 26, 2015 1:36 pm

Hi Peter,

Yes, I was originally working with the CUDA platform and got the error, which prompted me to check OpenCL as well. The error is more severe with AMOEBA than with AMBER or other non-polarizable force fields. I am using a small water box of 512 molecules, and the salient parts of my script are:


forcefield = ForceField('amoeba2013.xml')
system = forcefield.createSystem(struct.topology, nonbondedMethod=PME, nonbondedCutoff=1*nanometer, ewaldErrorTolerance=0.0005,
    useDispersionCorrection=True, vdwCutoff=1*nanometer, constraints=None, rigidWater=False, polarization='mutual')
system.addForce(MonteCarloBarostat(1*bar, 298*kelvin))
integrator = LangevinIntegrator(298*kelvin, 1/picoseconds, 0.5*femtoseconds)
platform = Platform.getPlatformByName('CUDA')
properties = {'CudaPrecision': 'mixed', 'CudaUseCpuPme': 'false', 'CudaDeviceIndex': '0,1'}
sim = Simulation(struct.topology, system, integrator, platform, properties)
sim.context.setPositions(struct.positions)

This outputs:
#"Progress (%)","Step","Potential Energy (kJ/mole)","Temperature (K)","Box Volume (nm^3)","Density (g/mL)","Time Remaining"
0.5%,500,9.18804639795e+36,2.83876072046e+16,15.0880037028,0.991337447719,--
... which continues to increase in energy.

Keeping everything the same and changing 'CudaDeviceIndex':'0'
outputs:
#"Progress (%)","Step","Potential Energy (kJ/mole)","Temperature (K)","Box Volume (nm^3)","Density (g/mL)","Time Remaining"
0.5%,500,-18804.6789347,269.60088705,14.7699814018,1.01268259417,--
... which proceeds normally.

Regards,
Andrish
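
Since the only difference between the failing and working runs here is the device index string, it can help to generate the properties for each configuration from one place. This is a minimal sketch, assuming a hypothetical helper `device_properties` (not part of OpenMM). The property names are the ones used in this thread ('CudaDeviceIndex' and 'OpenCLDeviceIndex'); other OpenMM versions may name them differently.

```python
def device_properties(platform_name, device_indices, extra=None):
    """Build a platform properties dict for the given GPU indices.

    Hypothetical helper. Uses the property names that appear in this thread:
    'CudaDeviceIndex' for CUDA and 'OpenCLDeviceIndex' for OpenCL.
    """
    prefixes = {'CUDA': 'Cuda', 'OpenCL': 'OpenCL'}
    if platform_name not in prefixes:
        raise ValueError('unsupported platform: %s' % platform_name)
    properties = dict(extra or {})
    # Multiple devices are given as a comma-separated string, e.g. '0,1'
    key = prefixes[platform_name] + 'DeviceIndex'
    properties[key] = ','.join(str(i) for i in device_indices)
    return properties


# Configurations to compare: each GPU alone, then both together
configurations = [device_properties('CUDA', indices) for indices in ([0], [1], [0, 1])]
```

Each dict can then be passed as the last argument to Simulation(...), as in the script above, so that nothing else changes between runs.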

Andrish Reddy
Posts: 13
Joined: Sun Jan 18, 2015 5:22 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Andrish Reddy » Thu Feb 26, 2015 1:56 pm

I think it has to do with how PME is calculated in a multi-GPU setup. If I change my cutoff type to 'NoCutoff' or 'CutoffNonPeriodic', then the simulation proceeds normally using both GPUs (albeit slower than on a single GPU). The AMOEBA force field is only compatible with PME, which is why I cannot work around the error.

Andrish

Peter Eastman
Posts: 2583
Joined: Thu Aug 09, 2007 1:25 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Peter Eastman » Thu Feb 26, 2015 2:53 pm

AMOEBA doesn't support parallelizing across multiple GPUs. It should still work, but you won't get any speedup from it (and you're adding overhead, so it will actually be slower).

Peter

Peter Eastman
Posts: 2583
Joined: Thu Aug 09, 2007 1:25 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Peter Eastman » Tue Mar 10, 2015 3:29 pm

Got it! This turned out to be a really subtle problem (actually two different subtle problems), and it took a ton of work to track down and fix. The changes are now merged into the master repository. Try getting them and see if it works for you now.

Peter

Mary Van Vleet
Posts: 18
Joined: Wed Dec 09, 2015 11:56 am

Re: problem with multi-gpu on CUDA & OpenCL

Post by Mary Van Vleet » Fri Jun 03, 2016 7:43 am

Peter,

Do you mind explaining what this 'subtle issue' was? I'm running into a very similar problem while trying to run a polarizable simulation with the DrudeLangevinIntegrator. The following combination of options fails (i.e., both the potential and kinetic energy blow up):

FAIL: DrudeLangevinIntegrator, PME, CUDA/OpenCL platforms


Equilibrating...
#"Step"	"Potential Energy (kJ/mole)"	"Kinetic Energy (kJ/mole)"	"Temperature (K)"	"Density (g/mL)"	"Speed (ns/day)"
1	66543695860.3	1.0927313536e+12	1.25166873887e+13	0.00195788631063	0
2	1.4609232691e+11	5.10219290112e+11	5.84430503705e+12	0.00195788631063	19.5
3	95382839274.1	5.52319970816e+11	6.32654713387e+12	0.00195788631063	26.7
Traceback (most recent call last):
  File "run_water.py", line 76, in <module>
    simulation.step(1000)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 132, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 219, in _simulate
    reporter.report(self, state)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/statedatareporter.py", line 195, in report
    self._checkForErrors(simulation, state)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/statedatareporter.py", line 347, in _checkForErrors
    raise ValueError('Energy is NaN')
ValueError: Energy is NaN

while the following 'similar' options succeed (i.e., run simulations with reasonable energies):

SUCCESS: DrudeLangevinIntegrator, PME, CPU platform
SUCCESS: LangevinIntegrator (exclude DrudeForce in force field), PME, CPU/CUDA/OpenCL platforms
SUCCESS: DrudeLangevinIntegrator, CutoffPeriodic, CUDA/OpenCL platforms


Equilibrating...
#"Step"	"Potential Energy (kJ/mole)"	"Kinetic Energy (kJ/mole)"	"Temperature (K)"	"Density (g/mL)"	"Speed (ns/day)"
1	1.92531557811	13.7172976477	157.124736934	0.00195788631063	0
2	5.93153777916	8.30424119046	95.1209010691	0.00195788631063	6.59
3	7.66123501795	6.24381742006	71.519784347	0.00195788631063	7.39
4	5.35801695175	8.27854270094	94.8265378127	0.00195788631063	7.7
5	2.22156413564	10.8622071089	124.421112549	0.00195788631063	7.89
6	1.39881599638	10.6314588598	121.778007554	0.00195788631063	7.87
7	2.26718286001	8.97862802129	102.845662615	0.00195788631063	8.17
8	2.57212369066	8.01120880771	91.7643626871	0.00195788631063	8.32
9	2.42550226082	8.13274325559	93.1564785865	0.00195788631063	8.51
10	2.91199637485	7.28362208374	83.4302231549	0.00195788631063	8.7
11	3.8229607156	6.39199080697	73.2170358784	0.00195788631063	8.84
12	3.10452238941	6.78145963662	77.6782051972	0.00195788631063	8.99
13	1.05963196044	8.38527899134	96.0491482701	0.00195788631063	9.12
14	0.380114619061	9.15109072843	104.821136078	0.00195788631063	9.31
...

I am not sure whether this simulation is trying to run on multiple GPUs: the node I am on has 4 GPUs available, but I have not specified resource usage, and I am unsure what OpenMM's defaults are in this case. I'll try to track that down while I wait for your reply.

Attached is a copy of the files used to run this simulation; I've also tried to include my environment variables in case the problem lies with my computer architecture rather than with OpenMM. Let me know if you need any other information to track this issue down; I'm happy to help however possible.
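
One way to see which platform and devices a simulation actually ended up on is to ask the context's platform for its property values after the Simulation has been created. The sketch below assumes a hypothetical helper named `describe_platform`; the calls it makes (`Context.getPlatform()`, `Platform.getName()`, `Platform.getPropertyNames()`, `Platform.getPropertyValue(context, name)`) are part of the OpenMM API, but which property names are reported depends on the platform and OpenMM version.

```python
def describe_platform(context):
    """Return (platform name, {property name: value}) for an OpenMM Context.

    Hypothetical helper: it only relies on Context.getPlatform() and the
    Platform methods getName(), getPropertyNames() and getPropertyValue().
    """
    platform = context.getPlatform()
    values = {}
    for name in platform.getPropertyNames():
        # Values are returned as strings, e.g. '0,1' for a device index list
        values[name] = platform.getPropertyValue(context, name)
    return platform.getName(), values


# Example use with an existing Simulation object (not created here):
#   name, values = describe_platform(simulation.context)
#   print(name, values.get('CudaDeviceIndex'))
```

Printing this once at startup makes it clear whether a run on a multi-GPU node was using one device or several.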

Mary Van Vleet
Posts: 18
Joined: Wed Dec 09, 2015 11:56 am

Re: problem with multi-gpu on CUDA & OpenCL

Post by Mary Van Vleet » Fri Jun 03, 2016 8:18 am

(I can't actually attach the files in this forum, as "the board attachment quota has been reached." with an 11KB file; I will send the files to Peter in an email for now, and will hopefully post them if the forum attachments start working again.)

Peter Eastman
Posts: 2583
Joined: Thu Aug 09, 2007 1:25 pm

Re: problem with multi-gpu on CUDA & OpenCL

Post by Peter Eastman » Mon Jun 06, 2016 10:30 am

I don't remember very well. This was over a year ago. It looks like this was the PR that fixed it: https://github.com/pandegroup/openmm/pull/847. I think one of the problems had to do with the neighbor list not getting rebuilt when it shifted work from one GPU to another. But I'd have to spend a while studying the PR (19 modified files!) to figure out exactly.

I'll take a look at your files. Does it only fail when running on multiple GPUs?

Peter

Mary Van Vleet
Posts: 18
Joined: Wed Dec 09, 2015 11:56 am

Re: problem with multi-gpu on CUDA & OpenCL

Post by Mary Van Vleet » Mon Jun 06, 2016 12:09 pm

I get (different) errors using single and multiple GPUs:

Single GPU (relevant lines from .py file below)


...
properties = {'CudaDeviceIndex' : '0'}
simulation = app.Simulation(model_topology, system, integrator, platform, properties)
...

outputs the following error:


Equilibrating...
#"Step"	"Potential Energy (kJ/mole)"	"Kinetic Energy (kJ/mole)"	"Temperature (K)"	"Density (g/mL)"	"Speed (ns/day)"
1	66543194100.3	1.09273283534e+12	1.25167043612e+13	0.00195788631063	0
2	1.46091229183e+11	5.102212992e+11	5.84432805015e+12	0.00195788631063	20
Traceback (most recent call last):
  File "run_water.py", line 75, in <module>
    simulation.step(1000)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 132, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 219, in _simulate
    reporter.report(self, state)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/statedatareporter.py", line 195, in report
    self._checkForErrors(simulation, state)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/statedatareporter.py", line 347, in _checkForErrors
    raise ValueError('Energy is NaN')
ValueError: Energy is NaN

whereas using multiple GPUs (input below)


...
properties = {'CudaDeviceIndex' : '0,1'}
simulation = app.Simulation(model_topology, system, integrator, platform, properties)
...

results in the following error:


Equilibrating...
Traceback (most recent call last):
  File "run_water.py", line 76, in <module>
    simulation.step(1000)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 132, in step
    self._simulate(endStep=self.currentStep+steps)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/app/simulation.py", line 198, in _simulate
    self.integrator.step(stepsToGo)
  File "/srv/home/mvanvleet/anaconda2/lib/python2.7/site-packages/simtk/openmm/openmm.py", line 16442, in step
    return _openmm.DrudeLangevinIntegrator_step(self, steps)
Exception: Error invoking kernel: CUDA_ERROR_INVALID_HANDLE (400)

I tried running a non-polarizable force field using the LangevinIntegrator on multiple GPUs, and this worked fine, so the error does seem to stem from the DrudeLangevinIntegrator itself.
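
The two tracebacks in this post fail in different places: the single-GPU run is stopped by the reporter's NaN check (a ValueError), while the multi-GPU run fails inside the integrator step with a generic Exception from the kernel launch. When testing several platform and device combinations, it can be useful to record which of the two occurred. The following is a rough sketch under that assumption; `classify_run` and its labels are hypothetical names, and the only thing taken from the tracebacks is the exception types and messages.

```python
def classify_run(step, n_steps):
    """Call step(n_steps) and report how the run ended.

    Returns a (label, message) tuple where label is one of:
      'ok'        - all steps completed
      'diverged'  - a ValueError mentioning NaN, as raised by the reporter check
      'error'     - any other exception, e.g. a kernel launch failure
    """
    try:
        step(n_steps)
    except ValueError as err:
        if 'NaN' in str(err):
            return 'diverged', str(err)
        return 'error', str(err)
    except Exception as err:
        return 'error', str(err)
    return 'ok', ''


# Example use with an existing Simulation object (not created here):
#   label, message = classify_run(simulation.step, 1000)
```

Collecting these labels for each combination gives a table like the FAIL/SUCCESS list earlier in the thread without having to read each traceback.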
