Reference blog: https://www.comsol.com/blogs/added-value-task-parallelism-batch-sweeps/
We know that parallel computing can speed up a computation, but the speedup is not unlimited, and how much we gain depends on how the algorithm is written. This post explains the limitations of parallel computing from a theoretical standpoint, and shows how to use the COMSOL batch sweep to improve performance when you reach these limits.
Amdahl’s and Gustafson-Barsis’ laws
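Algorithms can be divided into serial algorithms and parallel algorithms. Adding computational units (also called processes or threads) speeds up a parallel algorithm but has no effect on a serial one. The algorithms we write in practice are roughly a mixture of the two. Suppose the fraction of parallel code is $\varphi$, so that the serial fraction is $(1-\varphi)$. Consider the computation time $T(P)$, where $P$ is the number of processes. For $P = 1$, denote the computation time by $T_1$. With $P$ active processes, the computation time is then

$$T(P) = T_1 \left( \frac{\varphi}{P} + (1-\varphi) \right),$$

and the corresponding speedup is

$$S(P) = \frac{T_1}{T(P)} = \frac{1}{\frac{\varphi}{P} + (1-\varphi)}.$$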
Amdahl’s Law
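For 100% parallelized code, the sky is the limit. For a parallel fraction $\varphi < 1$, however, the speedup $S(P)$ approaches a limit as $P \to \infty$:

$$\lim_{P \to \infty} S(P) = \frac{1}{1-\varphi}.$$

The figure below shows an example.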
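The saturation described by Amdahl's law can also be checked numerically. The following is a minimal sketch, assuming the speedup formula above; the function names `amdahl_speedup` and `amdahl_limit` and the sample values of the parallel fraction are illustrative choices, not taken from the referenced blog.

```python
def amdahl_speedup(parallel_fraction: float, processes: int) -> float:
    """Speedup S(P) = 1 / (phi / P + (1 - phi)) for a fixed-size problem."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processes < 1:
        raise ValueError("processes must be at least 1")
    return 1.0 / (parallel_fraction / processes + (1.0 - parallel_fraction))


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound 1 / (1 - phi), approached as P goes to infinity."""
    if parallel_fraction >= 1.0:
        return float("inf")  # fully parallel code has no upper bound
    return 1.0 / (1.0 - parallel_fraction)


# Illustrative parallel fractions (assumed values, not measurements).
for phi in (0.5, 0.9, 0.95):
    row = ", ".join(
        f"P={p}: {amdahl_speedup(phi, p):.2f}" for p in (1, 4, 16, 64)
    )
    print(f"phi={phi}: {row} (limit {amdahl_limit(phi):.1f})")
```

Even with 95% of the code parallelized, the printed speedups stay below the bound returned by `amdahl_limit`, no matter how many processes are added.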
Gustafson-Barsis’ Law
Amdahl’s law assumes that the size of the problem is fixed. Yet, by assuming that the size of the problem increases with the number of added processes, then you are utilizing all the processes to an assumed level, and the speedup of the performed computations remains unbounded.
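To contrast the two laws, here is a small sketch. It assumes the common textbook form of the Gustafson-Barsis law, $S(P) = (1-\varphi) + \varphi P$, where $\varphi$ is the parallel share of the run time measured on $P$ processes; the function names and the value `phi = 0.9` are illustrative assumptions.

```python
def amdahl_speedup(phi: float, processes: int) -> float:
    """Fixed-size speedup: 1 / (phi / P + (1 - phi))."""
    return 1.0 / (phi / processes + (1.0 - phi))


def gustafson_speedup(phi: float, processes: int) -> float:
    """Scaled speedup (1 - phi) + phi * P, assuming the problem grows with P."""
    return (1.0 - phi) + phi * processes


phi = 0.9  # assumed parallel fraction, for illustration only
for p in (1, 4, 16, 64, 256):
    print(
        f"P={p:3d}  Amdahl: {amdahl_speedup(phi, p):6.2f}  "
        f"Gustafson-Barsis: {gustafson_speedup(phi, p):7.2f}"
    )
```

The Amdahl column flattens out while the Gustafson-Barsis column keeps growing linearly with the number of processes, which is the unbounded behaviour described above.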
The Cost of Communication
Gustafson-Barsis’ law implies that we are only restricted in the size of the problem we can compute, but sometimes communication is expensive. Let’s consider an overhead that is dominated by the communication and synchronization required in parallel processing, and model this as time added to the computation time.
In the case of no overhead, the result is as predicted by Amdahl’s law (see the previous figure), but once we add an overhead that grows with the number of processes, the speedup reaches a maximum and then declines as more processes are added.
For an overhead that grows quadratically with the number of processes, the result is worse and, as you might recall from our earlier blog post on distributed memory computing, the increase of communication is quadratic in the case of all-to-all communication. Due to this phenomenon, we cannot expect to have a speedup on a cluster for, say, a small time-dependent problem when adding more and more processes. The amount of communication would increase faster than any gain from added processes. Note, however, that we are considering a fixed-size problem here. In fact, as we increase the size of the problem, the “slowdown” effect introduced through communication becomes less relevant.
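The effect of communication cost can be sketched with a simple model. This is an assumption-laden illustration, not the exact model from the referenced blog: the overhead is written as `c * (P - 1) ** power`, expressed as a fraction of the single-process time, so that it vanishes for one process. The constant `c = 0.005` and the parallel fraction `phi = 0.95` are made-up values.

```python
def speedup_with_overhead(phi: float, processes: int, c: float, power: int) -> float:
    """Amdahl speedup with an added overhead term c * (P - 1) ** power.

    The overhead is measured relative to the single-process time T_1
    (hypothetical model; power=1 is linear, power=2 is quadratic).
    """
    overhead = c * (processes - 1) ** power
    return 1.0 / (phi / processes + (1.0 - phi) + overhead)


def best_process_count(phi: float, c: float, power: int, max_processes: int = 256) -> int:
    """Number of processes in 1..max_processes with the highest modelled speedup."""
    return max(
        range(1, max_processes + 1),
        key=lambda p: speedup_with_overhead(phi, p, c, power),
    )


phi, c = 0.95, 0.005  # assumed values, for illustration only
for power, label in ((1, "linear"), (2, "quadratic")):
    p_best = best_process_count(phi, c, power)
    s_best = speedup_with_overhead(phi, p_best, c, power)
    s_many = speedup_with_overhead(phi, 64, c, power)
    print(f"{label:9s} overhead: best P={p_best}, speedup {s_best:.2f}, at P=64: {s_many:.2f}")
```

Under these assumptions the quadratic overhead peaks at a smaller process count than the linear one, and in both cases the modelled speedup at 64 processes is lower than at the optimum.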
Batch Sweeps in COMSOL Multiphysics
As our example model, we will use the electrodeless lamp, which is available in the Model Gallery. This model is small, at around 80,000 degrees of freedom, but needs about 130 time steps in its solution. To make this transient model parametric as well, we will compute the model for several values of the lamp power, namely 50 W, 60 W, 70 W, and 80 W.
On my workstation, a Fujitsu® CELSIUS® equipped with an Intel® Xeon® E5-2643 quad core processor and 16 GB of RAM, the following compute times are received:
Number of Cores | Compute Time per Parameter | Compute Time for Sweep |
---|---|---|
1 | 30 mins | 120 mins |
2 | 21 mins | 82 mins |
3 | 17 mins | 68 mins |
4 | 18 mins | 72 mins |
The table shows that simply using more cores does not guarantee a faster computation: going from 3 cores to 4 actually made the computation slower.
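The sweep times in the table can be turned into speedup and parallel efficiency figures. This short sketch uses only the numbers from the table; the helper name `speedup_and_efficiency` is an illustrative choice, and efficiency is taken here to mean speedup divided by the number of cores.

```python
# Compute time for the whole sweep, in minutes, keyed by number of cores (from the table).
sweep_minutes = {1: 120, 2: 82, 3: 68, 4: 72}


def speedup_and_efficiency(times: dict) -> dict:
    """Map cores -> (speedup, efficiency) relative to the single-core time."""
    base = times[1]
    return {
        cores: (base / minutes, base / minutes / cores)
        for cores, minutes in times.items()
    }


for cores, (speedup, efficiency) in speedup_and_efficiency(sweep_minutes).items():
    print(f"{cores} core(s): speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

The speedup peaks at 3 cores and drops again at 4, while the efficiency per core falls steadily as cores are added.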
We will now use the batch sweep functionality to parallelize this problem in another way: we will switch from data parallelism to task parallelism. We will create a batch job for each parameter value and see what this does to our computation times.
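As the figure above shows, when we split the work into four jobs that run at the same time, each using one core, the computation becomes much faster.
When I tested the squareloop job on my own computer (4 cores, 16 GB of RAM), I found that setting up a batch sweep does indeed speed things up there as well. The run times were 3 min 41 s, 2 min 20 s, and 2 min 6 s, so the speedup is not very pronounced.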
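The trade-off between cores per job and concurrent jobs can be estimated with a rough calculation. This is an idealized sketch, not a measurement and not COMSOL's scheduler: it assumes that concurrent jobs do not slow each other down (no contention for memory bandwidth or cache), and it reuses the per-parameter times from the table. The function name `ideal_sweep_minutes` is hypothetical.

```python
import math

# Compute time per parameter value, in minutes, keyed by cores per job (from the table).
per_parameter_minutes = {1: 30, 2: 21, 3: 17, 4: 18}


def ideal_sweep_minutes(n_parameters: int, total_cores: int, cores_per_job: int,
                        per_parameter: dict) -> int:
    """Idealized sweep time when jobs run side by side without interfering.

    Jobs are run in rounds; each round holds as many jobs as fit on the
    machine. Real batch sweeps will take longer than this estimate because
    concurrent jobs share memory bandwidth.
    """
    concurrent_jobs = total_cores // cores_per_job
    if concurrent_jobs < 1:
        raise ValueError("cores_per_job exceeds total_cores")
    rounds = math.ceil(n_parameters / concurrent_jobs)
    return rounds * per_parameter[cores_per_job]


# Four lamp power values (50 W, 60 W, 70 W, 80 W) on a quad core machine.
for cores_per_job in (1, 2, 3, 4):
    minutes = ideal_sweep_minutes(4, 4, cores_per_job, per_parameter_minutes)
    print(f"{cores_per_job} core(s) per job: about {minutes} min (idealized)")
```

Under these assumptions, four single-core jobs in parallel give the shortest estimate (30 minutes), two dual-core jobs at a time give 42 minutes, and one job at a time on 3 or 4 cores reproduces the 68 and 72 minutes from the table.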
Conclusion
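Setting up parallel computing in COMSOL is a complicated matter, much like choosing a solver. The right setup depends heavily on both the problem being solved and the performance characteristics of the computer.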
Selecting the right parallel configuration is not always easy, and it can be hard to know beforehand how you should “hybridize” your parallel computations. But as in many other cases, experience comes from playing around and testing, and with COMSOL Multiphysics, you have the possibility to do that. Try it yourself with different configurations and different models, and you will soon know how to set the software up in order to get the best performance out of your hardware.