- The assumption of an ideal linear scale-up is wrong because the cluster is not homogeneous. The quick way around this would be to assume that each node's effective speed is proportional to its CPU clock frequency. As shown below, this is also clearly wrong.
- The timings reported in the previous tests were those printed by the program fairly soon after the start-up phase of the simulation. For the tests reported here, the simulations were extended to 10000 steps (20 psec) and the time reported is the minimum observed over the whole run.
- In the previous tests, the network connection of node number 9 (aspera) involved an additional hub. This significantly changed the effective bandwidth of this node's connection to the rest of the cluster (see Throughput graphs of raw TCP). For this test, aspera was directly connected to the cluster's main switch.
Estimation of an 'ideal' scale-up: the server is a PIV at 2.6 GHz, the newest nodes are Celerons at 2.4 GHz, and the oldest nodes are PIIIs at 733 MHz. Based on clock frequency alone, the server and the Celerons would be expected to have comparable performance. But they don't: running NAMD stand-alone on these machines gave the following results
Machine | CPU | Clock speed | Days per nanosecond of simulation | Speed relative to a PIII | Speed relative to the server |
Server | PIV | 2.6 GHz | 11.7721 | 3.02 | 1.00000 |
New nodes | Celerons | 2.4 GHz | 17.2784 | 2.06 | 0.68132 |
Old nodes | PIII | 733 MHz | 35.5549 | 1.00 | 0.33110 |
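The relative speeds in the last two columns follow directly from the measured days-per-nanosecond values (speed is the inverse of the time needed per nanosecond). A minimal sketch of that calculation in Python, with the dictionary keys chosen here just for illustration:

```python
# Stand-alone NAMD timings from the table above (days per nanosecond; lower is faster).
days_per_ns = {"PIV server": 11.7721, "Celeron node": 17.2784, "PIII node": 35.5549}

for name, days in days_per_ns.items():
    rel_to_piii = days_per_ns["PIII node"] / days     # relative speed, PIII = 1.00
    rel_to_server = days_per_ns["PIV server"] / days  # relative speed, server = 1.00000
    print(f"{name:12s} {rel_to_piii:5.2f} {rel_to_server:8.5f}")
```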
The obvious conclusions are that (i) Celerons are lousy, (ii) the Pentium III was an excellent design for its time, and (iii) the PIV is OK. The ideal scale-up can be deduced directly from these numbers. For example, all machines in the cluster together should be 1.0 + 8*0.68132 + 9*0.33110 = 9.43 times faster than the server alone, while the server plus the newest nodes should be 1.0 + 8*0.68132 = 6.45 times faster than the server alone. This is not achievable for two reasons: (i) communication overhead, and (ii) because of the FFT used for the electrostatics calculations, some numbers of nodes are 'better' than others for a given problem (e.g. powers of two (2, 4, 8, 16, ...) are expected to perform better). The results are
Number of nodes | Days/nsec | Observed scale-up | Expected scale-up | Observed/Expected | Performance relative to a PIII@733 |
1 | 11.772 | 1.0 | 1.0 | 1.000 | 3.02 |
2 | 8.0666 | 1.45935 | 1.6813 | 0.868 | 4.40 |
3 | 6.0598 | 1.94264 | 2.3626 | 0.822 | 5.87 |
4 | 4.6592 | 2.52661 | 3.0439 | 0.830 | 7.63 |
5 | 3.9622 | 2.97107 | 3.7253 | 0.797 | 8.97 |
6 | 3.7295 | 3.15645 | 4.4066 | 0.716 | 9.53 |
7 | 3.1031 | 3.79362 | 5.0879 | 0.745 | 11.45 |
8 | 2.7940 | 4.21331 | 5.7692 | 0.730 | 12.72 |
9 | 2.5607 | 4.59718 | 6.4505 | 0.712 | 13.88 |
10 | 2.5328 | 4.64782 | 6.7816 | 0.685 | 14.03 |
11 | 2.5490 | 4.61828 | 7.1127 | 0.649 | 13.94 |
12 | 2.4793 | 4.74811 | 7.4438 | 0.638 | 14.34 |
13 | 2.4522 | 4.80058 | 7.7749 | 0.617 | 14.50 |
14 | 2.2582 | 5.21300 | 8.1060 | 0.643 | 15.74 |
15 | 2.1695 | 5.42613 | 8.4371 | 0.643 | 16.38 |
16 | 2.1736 | 5.41590 | 8.7682 | 0.617 | 16.35 |
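The 'Expected scale-up' column above can be reproduced directly from the relative speeds of the first table. A short sketch, assuming (as in the tests) that nodes are added in the order server first, then the eight Celerons, then the PIIIs:

```python
# Relative speed of each node (server = 1.0), in the order the nodes were added.
relative_speed = [1.0] + [0.68132] * 8 + [0.33110] * 9

def expected_scaleup(n_nodes):
    # Ideal scale-up relative to the server alone: sum of the relative speeds
    # of the first n_nodes machines, ignoring all communication overhead.
    return sum(relative_speed[:n_nodes])

for n in range(1, 17):
    print(n, round(expected_scaleup(n), 4))  # e.g. 9 nodes -> ~6.45, 16 nodes -> ~8.77
```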
Conclusions
The clear conclusion from the analysis above is that the most effective way to use the cluster for molecular dynamics is to run one job on the server plus the Celerons (8 or 9 nodes), and possibly a second job on the older nodes (again 9 nodes).
The results from a real-life example of using the cluster in this way are shown below :
Simulation description | Nodes | Number of atoms | Days per nanosecond |
FliN (artificial tetramer) | server plus 7 Celerons (8 nodes) | 67216 | 3.06 |
T40Y mutant of HrcQb-C | PIIIs@733 (9 nodes) | 59696 | 5.10 |
Because the number of atoms in the second simulation is virtually identical to the one used for the results shown in the tables above, it is easy to calculate how well the PIIIs perform: a stand-alone PIII needs 35.554 days per nanosecond, so for 9 PIIIs the observed scale-up is 35.554 / 5.10 = 6.97 and the ratio of observed to expected scale-up is 6.97 / 9 = 0.774 (somewhat better than the 0.712 observed at 9 nodes with the server + Celerons).
Another way to show that it is advantageous to submit two independent jobs is the following: for a 60000-atom system, using all nodes simultaneously would allow a 6 nanosecond simulation to finish in approximately 2.1 * 6 = 12.6 days. In the same period (12.6 days), two independent jobs would have produced 4.92 nanoseconds (from the server + Celerons) plus 2.47 nsec (from the PIIIs), giving a total of 7.39 nanoseconds.
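The same bookkeeping in a few lines, using the timings quoted above (the 2.1 days/nsec figure for the all-nodes run is taken from the 15- and 16-node rows of the results table; the other two figures are the 9-node result and the measured HrcQb-C run):

```python
# Throughput comparison for a ~60000-atom system: one job on all nodes
# versus two independent jobs running side by side for the same wall-clock time.
days_per_ns_all  = 2.1      # all nodes together
days_per_ns_fast = 2.5607   # server + 8 Celerons (9 nodes)
days_per_ns_piii = 5.10     # 9 PIIIs (measured)

wallclock = 6 * days_per_ns_all                     # ~12.6 days for 6 nsec on all nodes
one_job  = wallclock / days_per_ns_all              # 6.0 nsec
two_jobs = wallclock / days_per_ns_fast + wallclock / days_per_ns_piii  # ~4.92 + ~2.47

print(f"one job: {one_job:.2f} nsec, two jobs: {two_jobs:.2f} nsec in {wallclock:.1f} days")
```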