Hi Carlos,
That is a cool rowing simulation.
I will preface my reply with two caveats: 1) I do not have nearly as much experience with Python as I do with Matlab, and 2) If your laptop processor only has 2-4 cores, as is often the case, you will be limited as to what you can do with parallelization.
If you don't specify anything, I think Moco uses all available cores by default (perhaps Nick can confirm this) and thus will solve in parallel automatically. This can be controlled, when necessary, from C++, Matlab, or Python using the "set_parallel()" property as Nick mentioned (e.g., solver.set_parallel(4)).
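For reference, here is a minimal sketch (not necessarily how you have things set up) of what that looks like from a Python script, assuming a MocoStudy named "study" has already been configured:

```python
import opensim as osim

study = osim.MocoStudy()
# ... define the problem via study.updProblem() ...

# Initialize the CasADi-based solver and control its parallelism:
solver = study.initCasADiSolver()
solver.set_parallel(4)   # 0 = no parallelism, 1 = all available cores (default), n > 1 = n threads
solution = study.solve()
```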
Running multiple instances of Moco in parallel can only be done effectively if you have lots of cores available. We have workstations with 18-36 cores for doing this. I use parfor (parallel for loop) in Matlab for this. I have not done this in Python, but it looks like there is a Python equivalent:
https://pypi.org/project/parfor/
Again, that is not likely to help on a machine that only has a few processor cores.
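If you do end up on a machine with many cores, here is a rough, untested sketch of how several independent Moco solves could be launched in parallel from Python using the standard multiprocessing tools (rather than the parfor package linked above). The build_study helper and the list of speeds are just placeholders for whatever varies between your runs:

```python
from concurrent.futures import ProcessPoolExecutor

def solve_one(speed):
    # build_study is a hypothetical helper that sets up a MocoStudy for one condition.
    study = build_study(speed)
    solver = study.initCasADiSolver()
    solver.set_parallel(4)  # limit each process to a few threads so the processes don't oversubscribe the CPU
    solution = study.solve()
    solution.write(f'solution_speed_{speed:.2f}.sto')
    return speed

if __name__ == '__main__':
    speeds = [1.0, 1.2, 1.4]
    # Run three solves at once; each is an independent process with its own solver.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for finished in pool.map(solve_one, speeds):
            print('finished speed', finished)
```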
And since you mentioned that you are doing metabolic cost minimization, I can add that I have always found those problems to be among the slowest to converge. I suspect it is because metabolic energy consumption, as an optimization objective, is a complicated function of the muscle states, compared with minimizing muscle activation directly, for example (i.e., minimizing one of the states directly).
Best,
Brian
- Carlos Gonçalves
- Posts: 136
- Joined: Wed Jun 08, 2016 4:56 am
Re: Is MOCO CPU agnostic?
Thanks a lot for the explanation, Dr. Umberger.
I will soon get access to a more powerful machine at my lab. I will give it a try and post the results here.
I will definitely look at the parfor equivalent in Python. Most of my planning nowadays is organizing multiple simulations to run overnight and evaluating them closely the next day. This feature will be vital for this and other projects.
Best regards.
- Pasha van Bijlert
- Posts: 227
- Joined: Sun May 10, 2020 3:15 am
Re: Is MOCO CPU agnostic?
Dear Prof. Umberger, dear Carlos,
I missed this interesting discussion. I'm doing two degrees simultaneously, and was focused on graduating (for my non-biomech related degree).
Since my first post I've assembled the workstation (with a Ryzen 5900X, so 12 physical cores and 24 threads). If I understand the nested parallelization scheme correctly, in my situation I should limit CasADi to 8 threads (rather than 8 physical cores), and I could then use Matlab's parfor to have three parallel instances of Moco running. This could give me a performance benefit if I'm looking for a gait across a full speed range based on a fixed (or random) initial guess.
This hinges on me interpreting set_parallel(8) as limiting the parallelization to 8 threads, i.e., 4 physical cores.
I'll be working on this in the next couple of weeks, and I will report back with my findings!
Best wishes,
Pasha
- Carlos Gonçalves
- Posts: 136
- Joined: Wed Jun 08, 2016 4:56 am
Re: Is MOCO CPU agnostic?
Hello Pasha! This is indeed a fascinating subject, especially if someone needs to buy a new computer (like myself).
To add to the mix: I was able to use another computer at my lab remotely. Here is the comparison:
Model: rowing model (upper and lower body) with 2D motion, 18 muscles, and five actuators. Contact forces for the heel and the rowing ergometer. PathSpring for the cable force.
SimTime: 0.75 s of simulation time.
Control goals: lower-limb metabolics, time, activations
Environment: OpenSim 4.2, scripting in Python
Computer 1: 4 threads = 5.6 hours
Computer 2: 12 threads = 1.6 hours
This thing is awesome!
Best regards
- Pasha van Bijlert
- Posts: 227
- Joined: Sun May 10, 2020 3:15 am
Re: Is MOCO CPU agnostic?
Hello all,
I apologize for my long silence after starting this thread myself. I just ran the promised parallelization test, and I have found essentially the same results that prof. Umberger predicted. My model has 14 DoF and 22 muscles, all with activation dynamics and tendon dynamics. This gives 28 kinematic states, 44 muscle states, and 22 controls, and since I'm using implicit tendon dynamics I get another 22 derivative variables. That's 116 variables per mesh point, on a grid of 101 collocation points, which leads to the 11717 variables that are being optimized.
I decided to set the optimizer to stop at 20 iterations, because I had no idea how long this would take when limited to a single core (about an hour, as I would find out). I ran this test on a Ryzen 5900X, which has 12 physical cores (but 24 threads), and I clocked the solver for the same problem from 1 to 24 threads. Here's a plot:
Performance starts hitting diminishing returns at about 10 threads, with the best being at 17 threads (19 seconds per iteration). At 8 threads, each iteration takes 30 seconds, which is about 50% slower, but that allows you to run three parallel optimizations; if you parallelize them using parfor as suggested, that gives roughly double the throughput of a single run at 17 threads. It might be worth running a test like this whenever you use a new model or problem; I don't know whether the results are problem-dependent. There are two interesting stretches at n*4 + 1 threads (13 and 17) where the added odd thread locally increases the speed, and each thread added after that essentially linearly reduces the speed again. I don't know if this is a pattern that would hold in a more rigorous test, though.
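In case anyone wants to repeat this on their own machine, here is a rough sketch of the kind of sweep I mean (not my exact script; build_study is a placeholder for setting up your own problem):

```python
import time

results = {}
for n_threads in range(1, 25):
    study = build_study()                  # hypothetical helper that recreates the same MocoStudy each time
    solver = study.initCasADiSolver()
    solver.set_optim_max_iterations(20)    # stop after 20 iterations, as in the test above
    solver.set_parallel(n_threads)
    start = time.perf_counter()
    study.solve()                          # the solution itself is ignored; we only time the solve
    results[n_threads] = (time.perf_counter() - start) / 20.0  # approximate seconds per iteration
    print(n_threads, 'threads:', results[n_threads], 's/iteration')
```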
Nick, I was wondering if you could confirm something for me. I was initially very confused when I looped from 1:24 in solver.set_parallel, because the second round of the loop gave a substantial increase in computation time. I later found that you can set the solver to ignore parallelism by using solver.set_parallel(0), so does that mean that set_parallel(1) is the default setting (i.e., all available cores)?
Best wishes,
Pasha
- Nicholas Bianco
- Posts: 1050
- Joined: Thu Oct 04, 2012 8:09 pm
Re: Is MOCO CPU agnostic?
Hi Pasha,
Yes, 'solver.set_parallel(1)' means MocoCasADiSolver will use all available cores, and it is the default setting.
-Nick
- Pasha van Bijlert
- Posts: 227
- Joined: Sun May 10, 2020 3:15 am
Re: Is MOCO CPU agnostic?
Thanks, Nick!
I reran the same test for 50 iterations and got the exact same results. I might try different problems and models to see if the pattern changes. For now, it's worth noting that if you're planning on doing many single optimization runs, it seems like a good idea to run a test like this. Simulation time when limited to 17 threads is 15% lower than with all 24 threads. These time savings add up if you're pushing your model through a bunch of different speeds.
Best,
Pasha