- The assumption of an ideal linear scale-up is wrong because the cluster is not homogeneous. The quick way around this would be to assume that each node's effective speed is proportional to its CPU clock frequency. As shown below, this is also clearly wrong.
- The timings reported in the previous tests were those printed by the program fairly soon after the start-up phase of the simulation. For the tests reported here, the simulations were extended to 10000 steps (20 psec) and the time reported is the minimum observed over the whole run.
- In the previous tests, the network connection of node number 9 (aspera) involved an additional hub. This significantly changed the effective bandwidth of this node's connection to the rest of the cluster (see Throughput graphs of raw TCP). For this test, aspera was directly connected to the cluster's main switch.
Estimation of an 'ideal' scale-up: the server is a PIV at 2.6 GHz, the newest nodes are Celerons at 2.4 GHz, and the oldest nodes are PIIIs at 733 MHz. Based on clock frequency alone, the server and the Celerons would be expected to have comparable performance. But they don't: running NAMD stand-alone on these machines gave the following results
Machine | CPU | Clock speed | Days per nanosecond of simulation | Speed relative to a PIII | Speed relative to the server |
Server | PIV | 2.6 GHz | 11.7721 | 3.02 | 1.00000 |
New nodes | Celerons | 2.4 GHz | 17.2784 | 2.06 | 0.68132 |
Old nodes | PIII | 733 MHz | 35.5549 | 1.00 | 0.33110 |
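The relative speeds in the last two columns follow directly from the measured days-per-nanosecond values (speed is the inverse of the time needed per nanosecond). A minimal sketch of that calculation in Python, with the dictionary keys chosen here just for illustration:

```python
# Stand-alone NAMD timings from the table above (days per nanosecond; lower is faster).
days_per_ns = {"PIV server": 11.7721, "Celeron node": 17.2784, "PIII node": 35.5549}

for name, days in days_per_ns.items():
    rel_to_piii = days_per_ns["PIII node"] / days     # relative speed, PIII = 1.00
    rel_to_server = days_per_ns["PIV server"] / days  # relative speed, server = 1.00000
    print(f"{name:12s} {rel_to_piii:5.2f} {rel_to_server:8.5f}")
```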
The obvious conclusions are that (i) Celerons are lousy, (ii) the Pentium III was an excellent design for its time, and (iii) the PIV is OK. The ideal scale-up can be deduced directly from these numbers. For example, all machines in the cluster together should be 1.0 + 8*0.68132 + 9*0.33110 = 9.43 times faster than the server alone, while the server plus the newest nodes should be 1.0 + 8*0.68132 = 6.45 times faster than the server alone. This is not achievable for two reasons: (i) communication overhead, and (ii) because of the FFT used for the electrostatics calculations, some numbers of nodes are 'better' than others for a given problem (e.g. powers of two (2, 4, 8, 16, ...) are expected to perform better). The results are
Number of nodes | Days/nsec | Observed scale-up | Expected scale-up | Observed/Expected | Performance relative to a PIII@733 |
1 | 11.772 | 1.0 | 1.0 | 1.000 | 3.02 |
2 | 8.0666 | 1.45935 | 1.6813 | 0.868 | 4.40 |
3 | 6.0598 | 1.94264 | 2.3626 | 0.822 | 5.87 |
4 | 4.6592 | 2.52661 | 3.0439 | 0.830 | 7.63 |
5 | 3.9622 | 2.97107 | 3.7253 | 0.797 | 8.97 |
6 | 3.7295 | 3.15645 | 4.4066 | 0.716 | 9.53 |
7 | 3.1031 | 3.79362 | 5.0879 | 0.745 | 11.45 |
8 | 2.7940 | 4.21331 | 5.7692 | 0.730 | 12.72 |
9 | 2.5607 | 4.59718 | 6.4505 | 0.712 | 13.88 |
10 | 2.5328 | 4.64782 | 6.7816 | 0.685 | 14.03 |
11 | 2.5490 | 4.61828 | 7.1127 | 0.649 | 13.94 |
12 | 2.4793 | 4.74811 | 7.4438 | 0.638 | 14.34 |
13 | 2.4522 | 4.80058 | 7.7749 | 0.617 | 14.50 |
14 | 2.2582 | 5.21300 | 8.1060 | 0.643 | 15.74 |
15 | 2.1695 | 5.42613 | 8.4371 | 0.643 | 16.38 |
16 | 2.1736 | 5.41590 | 8.7682 | 0.617 | 16.35 |
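The 'Expected scale-up' column above can be reproduced directly from the relative speeds of the first table. A short sketch, assuming (as in the tests) that nodes are added in the order server first, then the eight Celerons, then the PIIIs:

```python
# Relative speed of each node (server = 1.0), in the order the nodes were added.
relative_speed = [1.0] + [0.68132] * 8 + [0.33110] * 9

def expected_scaleup(n_nodes):
    # Ideal scale-up relative to the server alone: sum of the relative speeds
    # of the first n_nodes machines, ignoring all communication overhead.
    return sum(relative_speed[:n_nodes])

for n in range(1, 17):
    print(n, round(expected_scaleup(n), 4))  # e.g. 9 nodes -> ~6.45, 16 nodes -> ~8.77
```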
Conclusions
The clear conclusion from the analysis above is that the most effective way to use the cluster for molecular dynamics is to run one job on the server plus the Celerons (8 or 9 nodes), and possibly a second job on the older nodes (again 9 nodes).
The results from a real-life example of using the cluster in this way are shown below :
Simulation description | Nodes | Number of atoms | Days per nanosecond |
FliN (artificial tetramer) | server plus 7 Celerons (8 nodes) | 67216 | 3.06 |
T40Y mutant of HrcQb-C | PIIIs@733 (9 nodes) | 59696 | 5.10 |
Because the number of atoms in the second simulation is virtually identical to the one used for the results shown in the tables above, it is easy to calculate how well the PIIIs perform: a stand-alone PIII needs 35.554 days per nanosecond, so for 9 PIIIs the observed scale-up is 35.554 / 5.10 = 6.97 and the ratio of observed to expected scale-up is 6.97 / 9 = 0.774 (somewhat better than the 0.712 observed at 9 nodes with the server + Celerons).
Another way to show that it is advantageous to submit two independent jobs is the following: for a 60000-atom system, using all nodes simultaneously would allow a 6 nanosecond simulation to finish in approximately 2.1 * 6 = 12.6 days. In the same period (12.6 days), two independent jobs would have produced 4.92 nanoseconds (from the server + Celerons) plus 2.47 nsec (from the PIIIs), giving a total of 7.39 nanoseconds.
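The same bookkeeping in a few lines, using the timings quoted above (the 2.1 days/nsec figure for the all-nodes run is taken from the 15- and 16-node rows of the results table; the other two figures are the 9-node result and the measured HrcQb-C run):

```python
# Throughput comparison for a ~60000-atom system: one job on all nodes
# versus two independent jobs running side by side for the same wall-clock time.
days_per_ns_all  = 2.1      # all nodes together
days_per_ns_fast = 2.5607   # server + 8 Celerons (9 nodes)
days_per_ns_piii = 5.10     # 9 PIIIs (measured)

wallclock = 6 * days_per_ns_all                     # ~12.6 days for 6 nsec on all nodes
one_job  = wallclock / days_per_ns_all              # 6.0 nsec
two_jobs = wallclock / days_per_ns_fast + wallclock / days_per_ns_piii  # ~4.92 + ~2.47

print(f"one job: {one_job:.2f} nsec, two jobs: {two_jobs:.2f} nsec in {wallclock:.1f} days")
```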