After seeing the slides “OpenMM API & Roadmap”, I have been trying to figure out how MPI (or any other inter-process communication mechanism) fits the concept of a Platform in the OpenMM framework. I'm thinking about it in the context of cluster computing, where a computing node can potentially also have one or more GPUs.
Now the natural question: if the GPU is already a Platform, is a cluster with GPUs another Platform? Does it mean that for each combination of “communication platform” (MPI, SHMEM, etc.) and “computing platform” (GPUs, Cell, Larrabee, etc.) we should implement a different OpenMM Platform?
I hope this thread will encourage some discussion about the portability of OpenMM to clusters (or clusters of GPUs). Any thoughts on the matter are appreciated.
Thanks,
Max
MPI as platform?
- Peter Eastman
- Posts: 2609
- Joined: Thu Aug 09, 2007 1:25 pm
RE: MPI as platform?
Hi Max,
In the most technical sense, a "platform" simply means an implementation of all the computational kernels. The question of what hardware a particular implementation (i.e. platform) supports is up to the author of the platform.
Right now, for example, we have a Platform that includes lots of CUDA code, and runs on Nvidia GPUs. And we have a different Platform that includes lots of Brook code and runs on AMD GPUs. In the future, we hope to have a single Platform written in OpenCL that runs on both. And that same code could potentially also run on Larrabee, or a CPU, or whatever other hardware supports OpenCL. On the other hand, we might discover that the hardware is different enough that it really needs different algorithms to get good performance. In that case, we would have different Platforms for them, even though they're written in the same language.
Similarly, we're starting to think about GPU clusters. It's not clear yet whether we'll have a single Platform for both single GPUs and GPU clusters, or whether it will turn out to be more convenient to make those separate Platforms.
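For concreteness, here is a rough sketch of how a program picks among Platforms today. The platform name string ("Cuda" below) is illustrative and depends on which Platform libraries are actually registered:

#include "OpenMM.h"
#include <iostream>
using namespace OpenMM;

int main() {
    System system;
    system.addParticle(1.0);             // a single particle, mass 1 amu
    VerletIntegrator integrator(0.001);  // step size 0.001 ps (1 fs)

    // The only line that changes between hardware targets:
    Platform& platform = Platform::getPlatformByName("Cuda");
    OpenMMContext context(system, integrator, platform);
    std::cout << "Running on: " << platform.getName() << std::endl;
    return 0;
}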
Peter
RE: MPI as platform?
Hi Peter,
Thanks for the reply.
Let me first of all congratulate you and your group on your work. I have been reading the code for several days, and it looks really nice: well implemented and, most of all, very well documented.
Exactly. Starting from the definition “a Platform is an implementation of all the computational kernels for a particular architecture”, why would someone want to implement a computational kernel over MPI? I mean, if there is already a nice GPU implementation, it would be more convenient to implement a distribution mechanism over MPI (based on some sort of domain decomposition) and to use the GPU kernels as they are, once the necessary information has been transferred before each “force” calculation step. A rough sketch of that per-step pattern follows.
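Something like this (every name here is hypothetical, not part of OpenMM; the function bodies are left as an interface sketch):

#include <vector>

struct Domain {
    std::vector<double> positions;  // particles owned by this MPI rank
    std::vector<double> halo;       // boundary copies from neighbor ranks
    std::vector<double> forces;
};

void exchangeHalo(Domain& d);        // MPI transfers of boundary particles
void computeForcesOnGpu(Domain& d);  // the existing single-GPU kernel, unchanged
void integrateLocally(Domain& d, double dt);

void timeStep(Domain& d, double dt) {
    exchangeHalo(d);          // communication happens before the force step
    computeForcesOnGpu(d);    // the GPU kernel needs no cluster awareness
    integrateLocally(d, dt);  // each node integrates the particles it owns
}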
In the same way, I think that the concept of a “System” (as it is now) would not directly apply to a cluster of GPUs. I think it needs to be distributed by some kind of underlying mechanism, maybe based on the concept of domain decomposition.
It could be useful, for instance, to think about it in terms of an underlying Partitioned Global Address Space (PGAS) library (transparent to the OpenMM API user) which automatically distributes memory allocations across the cluster nodes. In general, PGAS libraries (such as GASNet or Global Arrays) use MPI or fast access to the network hardware underneath to implement a virtually global address space. These libraries are used through basically just four operations (global_malloc, global_free, put, and get); a sketch of such an interface is below. One could implement one's own global-memory mechanism directly with MPI, but I guess that would be too much work to start with.
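Roughly like this hypothetical interface (none of these declarations come from an actual library):

#include <cstddef>

struct GlobalPtr {
    int   node;   // the rank that owns this piece of memory
    void* local;  // the address within that rank
};

class GlobalAddressSpace {
public:
    // Allocate 'bytes' spread across the nodes according to the
    // decomposition strategy chosen at initialization.
    GlobalPtr global_malloc(std::size_t bytes);
    void      global_free(GlobalPtr p);

    // One-sided transfers, typically implemented over MPI or by
    // direct access to the network hardware.
    void put(GlobalPtr dst, const void* src, std::size_t bytes);
    void get(void* dst, GlobalPtr src, std::size_t bytes);
};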
In this scheme the main program could look like this (to be launched with mpirun):

/* NEW Cluster class. This calls the PGAS library initialization (or a plain MPI_Init) and sets up the decomposition strategy. */
Cluster cluster(argc, argv, decomposition_strategy, ...);

/* If cluster is not NULL, the PGAS library allocates the system masses and constraints globally according to the decomposition strategy; otherwise a normal allocation is performed. */
System system(N, Nconstraints, cluster);

/* Automatically store the particle information on the cluster node to which particle i was assigned by the PGAS allocation. */
system.setParticleMass(i, mass);

/* If cluster is not NULL, the per-particle force properties are also allocated globally and distributed the same way the system masses were. */
Force* force = new Force(force_properties, cluster);
force->setParticleParameters(i, properties_i);

/* This checks that both allocations (for the force and for the system) were done in the same way. */
system.addForce(force);

/* Positions also get distributed in the same way. */
VerletIntegrator integrator(0.01);
OpenMMContext context(system, integrator, platform, cluster);
context.setPositions(positions);

/* Before each “force” calculation step an underlying collection mechanism localizes the particles needed for the interactions; then the normal “Platform” force methods are called. The integration is performed on the cluster node where each particle resides. From time to time, inside the integration call, particles could be redistributed by a load-balancing algorithm. */
State state = context.getState(State::Energy);
I did not think this through in full detail, but I believe that with a global address space library (or a very basic one implemented inside the new Cluster class), a first cluster implementation should not be too difficult. Most importantly, MPI would be totally transparent to the user and to the developers of any future “Platform”.
Let me know if this sort of “thinking out loud” makes any sense. Maybe there are details that I did not see in these few days that prevent this from working. As we know, the devil is always in the details.
Thanks,
Max
- Michael Sherman
- Posts: 814
- Joined: Fri Apr 01, 2005 6:05 pm
RE: MPI as platform?
This appears to have the problem that the same main program wouldn't work on a non-cluster system. The separate-platform idea could be used to shove the cluster down a level so that it would be invisible in user code.
I'm thinking that one would want to develop the program, perhaps run it on a local GPU-accelerated laptop for a while during development, and then do production runs on the cluster. It would be very nice not to have to make code changes for the final step.
Sherm
RE: MPI as platform?
Why would it not work? If the “main” program is executed with “mpirun -np num_nodes ./main”, the “Cluster” will be allocated according to the number of nodes available. If the program is executed normally, “./main”, the Cluster will get another initialization (maybe NULL) and it will run with just one process. I have written many programs that behave this way; a rough sketch of the pattern is below. Maybe I did not fully get your observation.
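For example (the Cluster machinery is hypothetical; the MPI calls are standard):

#include <mpi.h>

int main(int argc, char** argv) {
    /* MPI_Init works both under mpirun and when the binary is
       launched directly, so the same program runs either way. */
    MPI_Init(&argc, &argv);

    int numNodes = 1;
    MPI_Comm_size(MPI_COMM_WORLD, &numNodes);

    /* With "mpirun -np N ./main" numNodes is N; with "./main" it is 1. */
    if (numNodes > 1) {
        /* initialize the decomposition / PGAS layer here */
    }

    /* ... build the System, Forces, and Context as usual ... */

    MPI_Finalize();
    return 0;
}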
- Michael Sherman
- Posts: 814
- Joined: Fri Apr 01, 2005 6:05 pm
RE: MPI as platform?
I see your point. Is there some reason it is desirable to have the cluster visible in the OpenMM API rather than buried inside a Platform? It is my (vague) understanding that OpenCL is following the cluster-is-just-a-platform path, but it remains to be seen how that will work out.
Sherm
- Peter Eastman
- Posts: 2609
- Joined: Thu Aug 09, 2007 1:25 pm
RE: MPI as platform?
This is basically what we have in mind, except that there's no need for a Cluster class. The Platform subclass would take care of all that. The idea behind the OpenMM API is that it's an API for describing *what* computation to do, not *how* to do it. The Platform is then responsible for all the "how" details. So you describe the system you want to simulate, and then the Platform executes that simulation using whatever hardware and algorithms it's written for - a GPU, a cluster, etc.
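Sketched out, the user code states only the “what” (assuming here that a context constructed without an explicit Platform falls back to the fastest one available):

#include "OpenMM.h"
using namespace OpenMM;

int main() {
    /* The "what": a system description with no hardware details. */
    System system;
    system.addParticle(16.0);            // one particle of mass 16 amu
    VerletIntegrator integrator(0.002);  // 2 fs time step

    /* The "how" is left entirely to the Platform; with no Platform
       given, the fastest available one is used, which could one day
       be a cluster Platform, with no source changes here. */
    OpenMMContext context(system, integrator);
    return 0;
}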
Peter