- The new kernel was also installed on the server (and a new initrd image was made). Testing it will have to wait for a reboot.
- A NAMD job was started on 6 new machines, excluding the server. This job (which keeps the server out of the calculation) appears to be stable.
On second thought, both this scenario and the one that follows may be correct (and we may actually have been seeing a double fault).
Previous 'best' explanation
The server uses a script to fetch motherboard temperatures from the nodes. The original version of the script used a 'cat /mfs/... | awk ...' command to do the trick. The trouble was that the reference to mfs caused a significant memory leak on the server side (due to the use of pipes?). To avoid the leak, the nodes' filesystem was instead accessed via an rsh command (in the spirit of 'rsh 190.100.100.111 cat /var/tmp/temp | awk ...').
This looked like a pretty innocent thing to do, but then again, unix sometimes does work in mysterious ways: since this script was put to use, NAMD jobs (submitted via SGE) started misbehaving. The most annoying misbehaviour was that a job would stop executing with no error message (and with the job still in the queue, SGE thinking that it was running). That the problem was with this script was confirmed by switching back to the memory-leaking version, starting a NAMD job, watching it run without problems for 5-6 hours, switching back to the rsh-based version, and seeing it die again (actually not die, just stop executing).
Then the light came in the form of the standard output of the parallel environment (a copy of which is present in almost every NAMD directory):
-catch_rsh /work/sge/default/spool/pc07/active_jobs/337.1/pe_hostfile pc07 pc04 pc14 pc16 pc12 pc15 pc11 pc05 pc13
The 'catch_rsh' it is: SGE was wrongly catching the rsh command that the server was using to fetch the temperatures, causing mayhem. My best explanation as to why it all went wrong is that SGE was feeding the temperature-related rsh sequence into the NAMD stream (but this is hand-waving). The cure was pretty straightforward. Replace:
#!/bin/bash
if (($1=="25700")); then
    /bin/cat /var/tmp/temp | /bin/awk '{print $3}'
...
with
#!/bin/bash --noprofile
if (($1=="25700")); then
    /bin/cat /var/tmp/temp | /bin/awk '{print $3}'
...
and you're done: with '--noprofile', bash will not read the SGE configuration, rsh will not be caught by SGE, and all should be back to normal. Or not?
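For reference, here is a minimal sketch of the rsh-based fetch described above, together with a quick check of which rsh a bare command name resolves to. The function names (fetch_temp, list_on_path), the REMOTE_SH override and the optional file argument are made up for this sketch. That the temperature sits in the third field is taken from the awk '{print $3}' in the script above. The idea that SGE intercepts rsh through a wrapper found earlier on PATH is an assumption, not something verified here.

```shell
#!/bin/bash
# Hypothetical helpers; names and parameters are illustrative only.

# Command used to reach a node. Defaults to a bare "rsh" (looked up on PATH);
# can be set to an absolute path, or overridden for a local dry run.
REMOTE_SH="${REMOTE_SH:-rsh}"

# Print the temperature reported by one node.
#   $1 = node name or address
#   $2 = file holding the reading on the node (default: /var/tmp/temp)
# Assumes the temperature is the third whitespace-separated field.
fetch_temp() {
    local node="$1"
    local file="${2:-/var/tmp/temp}"
    # cat runs on the node; awk runs locally, on the receiving end of the pipe.
    $REMOTE_SH "$node" cat "$file" | awk '{print $3}'
}

# List every executable named "$1" (default: rsh) found on PATH, in lookup
# order. The first line printed is what a bare command name would run.
list_on_path() {
    local name="${1:-rsh}"
    local dir
    local IFS=:
    for dir in $PATH; do
        # An empty PATH element means the current directory.
        if [ -z "$dir" ]; then
            dir=.
        fi
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
    return 0
}
```

Running 'list_on_path rsh | head -n 1' from inside a job's environment would show whether the first rsh on PATH is the system binary or something else. If a wrapper does turn up there, setting REMOTE_SH to the absolute path of the system rsh would sidestep the lookup, provided that is indeed the mechanism at work.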