
NAMD benchmarks and scaling-up, enhanced

The previous tests had three problems :

Estimation of an 'ideal' scale-up : The server is a PIV at 2.6 GHz, the newest nodes are Celerons at 2.4 GHz, and the oldest nodes are PIIIs at 733 MHz. Based on that, the server and the Celerons would be expected to have comparable performance. They do not : running NAMD stand-alone on each type of machine gave the following results

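| Machine | CPU | Clocking | Days per nanosecond of simulation | Relative speed (PIII = 1) | Relative speed (server = 1) |
|---|---|---|---|---|---|
| Server | PIV | 2.6 GHz | 11.7721 | 3.02 | 1.00000 |
| New nodes | Celerons | 2.4 GHz | 17.2784 | 2.06 | 0.68132 |
| Old nodes | PIII | 733 MHz | 35.5549 | 1.00 | 0.33110 |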
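The two relative-speed columns follow directly from the days-per-nanosecond figures: a machine that needs fewer days per nanosecond is proportionally faster. A minimal sketch of that arithmetic is given here; the dictionary keys and the function name are illustrative and not part of the original benchmark scripts.

```python
# Days per nanosecond measured with stand-alone NAMD runs (values from the table).
days_per_ns = {
    "server_piv": 11.7721,
    "celeron": 17.2784,
    "piii": 35.5549,
}


def relative_speed(days, reference_days):
    """Speed relative to a reference machine (fewer days per ns means faster)."""
    return reference_days / days


# Speeds relative to the server and relative to a PIII@733.
rel_to_server = {
    name: relative_speed(d, days_per_ns["server_piv"])
    for name, d in days_per_ns.items()
}
rel_to_piii = {
    name: relative_speed(d, days_per_ns["piii"])
    for name, d in days_per_ns.items()
}

for name in days_per_ns:
    print(f"{name:12s} vs server: {rel_to_server[name]:.5f}   "
          f"vs PIII: {rel_to_piii[name]:.2f}")
```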

The obvious conclusions are (i) Celerons are lousy, (ii) the Pentium III was an excellent design for its time, (iii) the PIV is OK. The ideal scale-up can be deduced directly from these numbers by summing the relative speeds of the machines taking part in a run. For example, all machines in the cluster together (the server, 8 Celerons and 9 PIIIs) should be 1.0 + 8*0.68132 + 9*0.33110 = 9.43 times faster than the server alone, while the server plus the newest nodes should be 1.0 + 8*0.68132 = 6.45 times faster than the server alone. This is not achievable for two reasons : (i) communication overhead, and (ii) because of the FFT used for the electrostatics calculations, there are 'better' and 'worse' numbers of nodes for any given problem (e.g. powers of two (2, 4, 8, 16, ...) are expected to perform better). The results are

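| Number of nodes | Days/nsec | Observed scale-up | Expected scale-up | Observed/Expected | Performance relative to a PIII@733 |
|---|---|---|---|---|---|
| 1 | 11.772 | 1.0 | 1.0 | 1.000 | 3.02 |
| 2 | 8.0666 | 1.45935 | 1.6813 | 0.868 | 4.40 |
| 3 | 6.0598 | 1.94264 | 2.3626 | 0.822 | 5.87 |
| 4 | 4.6592 | 2.52661 | 3.0439 | 0.830 | 7.63 |
| 5 | 3.9622 | 2.97107 | 3.7253 | 0.797 | 8.97 |
| 6 | 3.7295 | 3.15645 | 4.4066 | 0.716 | 9.53 |
| 7 | 3.1031 | 3.79362 | 5.0879 | 0.745 | 11.45 |
| 8 | 2.7940 | 4.21331 | 5.7692 | 0.730 | 12.72 |
| 9 | 2.5607 | 4.59718 | 6.4505 | 0.712 | 13.88 |
| 10 | 2.5328 | 4.64782 | 6.7816 | 0.685 | 14.03 |
| 11 | 2.5490 | 4.61828 | 7.1127 | 0.649 | 13.94 |
| 12 | 2.4793 | 4.74811 | 7.4438 | 0.638 | 14.34 |
| 13 | 2.4522 | 4.80058 | 7.7749 | 0.617 | 14.50 |
| 14 | 2.2582 | 5.21300 | 8.1060 | 0.643 | 15.74 |
| 15 | 2.1695 | 5.42613 | 8.4371 | 0.643 | 16.38 |
| 16 | 2.1736 | 5.41590 | 8.7682 | 0.617 | 16.35 |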
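The 'Expected scale-up' and 'Observed/Expected' columns can be reproduced with a few lines of Python. The sketch below assumes that nodes are added in the order server, then the 8 Celerons, then the PIIIs; this ordering is inferred from the expected scale-up values in the table, not stated explicitly. The function names are illustrative, and only a few rows of the table are copied in.

```python
# Relative speeds versus the server, from the stand-alone benchmark.
SERVER, CELERON, PIII = 1.0, 0.68132, 0.33110
N_CELERONS = 8  # assumption: inferred from the expected scale-up column


def expected_scaleup(n_nodes):
    """Ideal scale-up: sum of relative speeds (server, then Celerons, then PIIIs)."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    celerons = min(n_nodes - 1, N_CELERONS)
    piiis = n_nodes - 1 - celerons
    return SERVER + celerons * CELERON + piiis * PIII


# Observed days per nanosecond for a few node counts, copied from the table.
observed_days = {1: 11.772, 2: 8.0666, 8: 2.7940, 9: 2.5607, 16: 2.1736}


def observed_scaleup(n_nodes):
    """Speed-up relative to the server running alone."""
    return observed_days[1] / observed_days[n_nodes]


def efficiency(n_nodes):
    """Ratio of observed to expected scale-up."""
    return observed_scaleup(n_nodes) / expected_scaleup(n_nodes)


for n in sorted(observed_days):
    print(f"{n:2d} nodes: observed {observed_scaleup(n):.5f}, "
          f"expected {expected_scaleup(n):.4f}, ratio {efficiency(n):.3f}")
```

The last printed digit of the ratio may differ by one from the table, depending on whether the original values were rounded or truncated.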

NAMD scale-up, corrected graph


Conclusions

The clear conclusion from the analysis above is that the most effective usage of the cluster for molecular dynamics is to run one job on the (server+celerons) using 8 or 9 nodes, and possibly another job on the older nodes (9 nodes again).

The results from a real-life example of using the cluster in this way are shown below :

| Simulation description | Nodes | Number of atoms | Days per nanosecond |
|---|---|---|---|
| FliN (artificial tetramer) | server plus 7 Celerons (8 nodes) | 67216 | 3.06 |
| T40Y mutant of HrcQb-C | PIIIs@733 (9 nodes) | 59696 | 5.10 |

Because the number of atoms in the second simulation is virtually identical to the one used for the results shown in the tables above, it is easy to calculate just how well the PIIIs perform : a stand-alone PIII needs 35.554 days for 1 nsec. For 9 PIIIs the observed scale-up is 35.554 / 5.10 = 6.971. Since the nodes are identical, the expected (ideal) scale-up is 9, so the ratio of observed to expected scale-up stands at 6.971 / 9 = 0.774 (somewhat better than the one observed with the server + Celerons).
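For a set of identical nodes the ideal scale-up is simply the node count, so the PIII calculation reduces to two divisions. A small sketch, with an illustrative function name:

```python
def homogeneous_efficiency(single_node_days, cluster_days, n_nodes):
    """Observed scale-up and observed/expected ratio for n identical nodes."""
    observed = single_node_days / cluster_days
    expected = float(n_nodes)  # identical machines: ideal scale-up equals node count
    return observed, observed / expected


# Stand-alone PIII: 35.554 days/nsec; 9 PIIIs together: 5.10 days/nsec.
piii_scaleup, piii_ratio = homogeneous_efficiency(35.554, 5.10, 9)
print(f"observed scale-up {piii_scaleup:.3f}, observed/expected {piii_ratio:.3f}")
```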

Another way to show that it is advantageous to submit two independent jobs is the following : for a 60000-atom system, using all nodes simultaneously would have allowed a 6 nanosecond simulation to be completed in approximately 2.1 * 6 = 12.6 days. In the same period (12.6 days), two independent jobs would have given 4.92 nanoseconds (from the server+Celerons) plus 2.47 nanoseconds (from the PIIIs), for a total of 7.39 nanoseconds.
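The comparison can be checked with a short throughput calculation. In the sketch below the 2.1 and 5.10 days/nsec rates come from the text; the 2.5607 days/nsec rate for the server + Celerons is an assumption, taken from the 9-node row of the scaling table because it reproduces the 4.92 nanoseconds quoted above. Variable and function names are illustrative.

```python
def nanoseconds_in(period_days, days_per_ns):
    """Simulated time (nsec) produced in a given wall-clock period."""
    return period_days / days_per_ns


ALL_NODES_RATE = 2.1            # days/nsec, one job on all nodes (approximate)
SERVER_CELERONS_RATE = 2.5607   # days/nsec, assumed: 9-node row of the scaling table
PIII_RATE = 5.10                # days/nsec, 9 PIIIs (real-life run)

period = 6 * ALL_NODES_RATE     # time for 6 nsec on all nodes: 12.6 days

single_job = nanoseconds_in(period, ALL_NODES_RATE)
two_jobs = (nanoseconds_in(period, SERVER_CELERONS_RATE)
            + nanoseconds_in(period, PIII_RATE))

print(f"one job on all nodes : {single_job:.2f} nsec in {period:.1f} days")
print(f"two independent jobs : {two_jobs:.2f} nsec in {period:.1f} days")
```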