Initial tests:
- icc, mpich, cblas built with gcc
- icc, mpich, Intel Math Kernel Library
- gcc 2.95.3, atlas 3.6.0 compiled locally
The last configuration used atlas 3.6.0 compiled on one of the Celeron nodes (aspera), and it looks reasonably good, reaching 8.2 Gflops.
Optimisation:
- Block size (constant problem size): NB=100 seems best. Stick to that.
- Broadcasts (constant N and NB): BlongM (BCAST=5) looks best. Stick to that.
- Re-refine block size with BCAST held constant: NB=40 or 60?
- NB 40 or 60 with swapping threshold 40 or 60: stick to NB=60, swapping threshold 60.
- NBmin=8 looks better.
- Look-ahead depth set to 0.
- Final test with respect to problem size: top speed 9.5 Gflops.
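The settings chosen in the steps above can be collected in one place. The sketch below is only an illustration: it prints them as HPL.dat-style lines, with labels that follow the comments usually found in a stock HPL.dat. The names TUNED and render_fragment are made up for this example. HPL.dat is read by position, so these lines are a fragment to be merged into the matching lines of a complete input file, not a usable file on their own.

```python
# Tuned values taken from the notes above. The labels are the usual HPL.dat
# comments; HPL itself only reads the leading value on each line.
TUNED = [
    ("60", "NBs"),
    ("8", "NBMINs (>= 1)"),
    ("5", "BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)"),
    ("0", "DEPTHs (>=0)"),
    ("60", "swapping threshold"),
]


def render_fragment(entries):
    """Format (value, label) pairs as HPL.dat-style lines."""
    return "\n".join(f"{value:<13}{label}" for value, label in entries)


if __name__ == "__main__":
    print(render_fragment(TUNED))
```

The "# of ..." count lines that precede each list in a real HPL.dat are left out here, as are N, the process grid and every parameter the notes do not mention.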
Final HPL 9-node benchmark: best performance 10.32 Gflops.
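For reference, a small sketch of how such figures can be read. It assumes the usual HPL operation count of 2/3 N^3 + 3/2 N^2 for a problem of order N, and it averages the final result over the 9 nodes. The function names are hypothetical, and the notes do not record N or run times, so none are filled in here.

```python
def hpl_gflops(n, seconds):
    """Gflops for an order-n solve, assuming 2/3*n^3 + 3/2*n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


def per_node(gflops, nodes):
    """Average rate per node; says nothing about how evenly nodes contribute."""
    return gflops / nodes


if __name__ == "__main__":
    # Final result from the notes: 10.32 Gflops on 9 nodes.
    print(round(per_node(10.32, 9), 2))  # prints 1.15
```

The per-node value is only an average; it does not show whether all nodes are equally fast.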