CUDA_ERROR_NOT_INITIALIZED error when using Multiple GPUs with multiprocessing

Post by **Wei Chen** » Thu Jul 18, 2019 2:06 pm

Hi,

I am trying to run multiple simulations on multiple GPUs in parallel. I do this by creating a bunch of `Simulation` objects and calling the `run()` function of each of them with different parameters using `multiprocessing.Process()`, but I got the following error message:

Code: Select all

Exception: Error initializing Context: CUDA_ERROR_NOT_INITIALIZED (3) at /home/kengyangyao/.temp_softwares/openmm/platforms/cuda/src/CudaContext.cpp:179

This issue might be related to a similar one in TensorFlow: https://github.com/tensorflow/tensorflow/issues/29876, but I am not sure.

Does anyone have an idea how I might fix it?

Thank you!

Post by **Peter Eastman** » Mon Jul 22, 2019 2:28 am

Are you creating a separate Context in each process? If you create it in one process and try to access it in another, that probably won't work. If that's not it, can you post code that demonstrates the problem?

Post by **Wei Chen** » Mon Jul 22, 2019 11:00 am

Sure, basically I have defined the following class:

Code: Select all

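# Imports inferred from the names used below (OpenMM 7.x "simtk" namespace);
# "nn" is assumed to be an alias for simtk.unit.
import yaml
from simtk import unit as nn
from simtk.openmm import app, vec3, AndersenThermostat, LangevinIntegrator, Platform


class Simulation(object):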
    def __init__(self):
        pass

    def run(self, output_dcd=None, n_checkpoints=10, gpu_idx='0',
            start_xyz=None, total_steps=1000000, interval=10000, equi_steps=100000,
            config_file='config.yml', plumed_file=None):
        # safe_load avoids the missing-Loader problem of a bare yaml.load()
        with open(config_file) as config_f:
            spec = yaml.safe_load(config_f).get('simulation', {})
        molecule = spec['molecule']
        box_size = spec['box_size']    # in nm
        temperature = 300
        solvent_type = 'explicit'

        pdb = app.PDBFile(pdb_file)
        forcefield = app.ForceField('amber03.xml', 'tip3p.xml')
        # forcefield = app.ForceField('opls-aam.xml', 'tip3p.xml')
        modeller = app.Modeller(pdb.topology, start_xyz)
        modeller.addHydrogens(forcefield)
        modeller.addSolvent(forcefield, model='tip3p',
                            boxSize=vec3.Vec3(
                                box_size, box_size, box_size) * nn.nanometers,
                            ionicStrength=0 * nn.molar)
        modeller.addExtraParticles(forcefield)
        start_xyz = modeller.getPositions()    # include water and extra particles
        system = forcefield.createSystem(
            modeller.topology, nonbondedMethod=app.PME,
            nonbondedCutoff=1.0 * nn.nanometers,
            constraints=app.HBonds)
        sys_topology = modeller.topology

        system.addForce(AndersenThermostat(temperature*nn.kelvin, 1/nn.picosecond))
        integrator = LangevinIntegrator(
            temperature*nn.kelvin, 1/nn.picosecond, 0.002*nn.picoseconds)
        # Select the CUDA platform and pin this simulation to one GPU
        platform = Platform.getPlatformByName("CUDA")
        properties = {'CudaDeviceIndex': gpu_idx, 'CudaPrecision': 'mixed'}
        simulation = app.Simulation(
            sys_topology, system, integrator, platform=platform, platformProperties=properties)
        simulation.context.setPositions(start_xyz)
        simulation.minimizeEnergy()
        # Save the minimized structure next to the trajectory file
        with open(output_dcd.replace('.dcd', '.pdb'), 'w') as out_f:
            app.PDBFile.writeModel(sys_topology, simulation.context.getState(
                getPositions=True).getPositions(), out_f)
        simulation.step(equi_steps)
        simulation.reporters.append(app.DCDReporter(output_dcd, interval))
        simulation.reporters.append(app.StateDataReporter(output_dcd.replace('.dcd', '.txt'), interval,
                                                            time=True, step=True, potentialEnergy=True, kineticEnergy=True, speed=True,
                                                            temperature=True, progress=True, remainingTime=True, volume=True, density=True,
                                                            totalSteps=total_steps + equi_steps))

        return

Then I create a list of processes, one per `Simulation` object, each calling `run()` with a different `gpu_idx`, so that each of the 2 GPUs is running one simulation at any time:

Code: Select all

sim_list = []

for item in range(10):
    kwargs = {...}  # arguments
    pro = Process(target=Simulation().run, kwargs=kwargs)
    sim_list.append(pro)

The code to run the processes in parallel is:

Code: Select all

def run_parallel_processes(proc_list, num=2):
    """Run the processes in batches of `num`, waiting for each batch to finish."""
    assert isinstance(proc_list[0], Process), proc_list[0]
    n_full = len(proc_list) // num
    for item in range(n_full):
        for idx in range(num):
            proc_list[num * item + idx].start()
        for idx in range(num):
            proc_list[num * item + idx].join()
    # Leftover processes that do not fill a whole batch. The original slice,
    # proc_list[-len(proc_list) % num:], selects the wrong processes.
    proc_remain = proc_list[num * n_full:]
    for item in proc_remain:
        item.start()
    for item in proc_remain:
        item.join()


run_parallel_processes(sim_list)    # ERROR here

The code is taken from a repository I am developing, so it may not be runnable as is. Let me know if you need additional information. Thanks!
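
The batching scheme in `run_parallel_processes` (start `num` processes, wait for all of them, then move on to the next group) can be checked independently of OpenMM. The helpers below are only a minimal sketch: `make_batches` and `assign_gpus` are hypothetical names, not part of the original code, and they only group the jobs without starting anything.

```python
def make_batches(items, num=2):
    """Split items into consecutive groups of at most `num` elements."""
    if num < 1:
        raise ValueError("num must be at least 1")
    return [items[i:i + num] for i in range(0, len(items), num)]


def assign_gpus(n_jobs, n_gpus=2):
    """Pair each job index with a GPU index so that a batch never reuses a GPU.

    GPU indices are returned as strings, matching the gpu_idx='0' default
    of the run() method.
    """
    return [[(job, str(slot)) for slot, job in enumerate(batch)]
            for batch in make_batches(list(range(n_jobs)), n_gpus)]
```

With five jobs and two GPUs this gives three groups, the last one holding a single job, so each group can be started and joined in turn.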

Post by **Peter Eastman** » Tue Jul 23, 2019 10:23 am

In the github issue you linked, the poster figured out the problem:

I figured it out! This error was happening because I was importing TensorFlow, and thusly, cuDNN, from earlier on in my program.

Could that be the problem? Perhaps you just need to import openmm inside the run() method rather than earlier?

Post by **Wei Chen** » Tue Jul 23, 2019 10:38 am

I have tried that, but it did not work.

I wonder whether the issue is related to PLUMED, since all the simulations are run with PLUMED. If I run the simulations from the command line (i.e. implement a command-line API and use `subprocess.check_output()` to launch each simulation), parallelization works fine. But if I run them using `Simulation()` objects, I get the CUDA error.
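
The command-line workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions: `launch_on_gpus` and the inline child script are hypothetical stand-ins for the real simulation command, and pinning each child with the `CUDA_VISIBLE_DEVICES` environment variable is an assumption (the class above passes `CudaDeviceIndex` instead). Each child is a fresh interpreter, so it does not share any CUDA state with the parent process.

```python
import os
import subprocess
import sys

# Hypothetical stand-in for the real simulation command: the child only
# reports which GPU it was told to use.
CHILD_SCRIPT = "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"


def launch_on_gpus(gpu_ids, script=CHILD_SCRIPT):
    """Start one fresh Python interpreter per GPU and collect their output."""
    procs = []
    for gpu in gpu_ids:
        # Each child sees only its own GPU (which then appears as device 0).
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))
        procs.append(subprocess.Popen(
            [sys.executable, "-c", script],
            env=env, stdout=subprocess.PIPE, text=True))
    # All children are already running; wait for each and read its stdout.
    return [p.communicate()[0].strip() for p in procs]
```

Calling `launch_on_gpus([0, 1])` starts both children before waiting on either, which mirrors running one simulation per GPU at a time.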
