
The current best explanation ...

I now think that the problem was not in the userspace applications, as implied by the discussion that follows. Instead, my current best explanation has shifted to kernel space : the server does not run the new kernel with the OOM killer enabled. The new kernel (shared by all the other nodes) is the same version and everything, but the addresses of the system calls are different. So my current hypothesis is that NAMD hits the server's different kernel address space, and that this is what causes the problems. To test the hypothesis :

On second thought, both this scenario and the one that follows may be correct (and we may actually have been seeing a double fault).


Previous 'best' explanation

The server uses a script to fetch motherboard temperatures from the nodes. The original version of the script used a 'cat /mfs/... | awk ...' command to do the trick. The trouble was that the reference to mfs caused a significant memory leak on the server side (due to the use of pipes ?). To avoid the leak, the nodes' filesystem was accessed via an rsh command instead (in the spirit of 'rsh 190.100.100.111 cat /var/tmp/temp | awk ...'). This looked like a pretty innocent thing to do, but then again, unix sometimes does work in mysterious ways : ever since this script was put to use, NAMD jobs (submitted via SGE) started misbehaving. The most annoying misbehaviour was that a job would stop executing with no error message (and with the job still in the queue, SGE thinking that it was running). That the problem was with this script was confirmed by switching back to the memory-leaking version, starting a NAMD job, watching it run without problems for 5-6 hours, switching back to the rsh-based version, and seeing it die again (actually, not die, but stop executing). Then the light came in the form of the standard output of the parallel environment (a copy of which is present in almost every NAMD directory) :

-catch_rsh /work/sge/default/spool/pc07/active_jobs/337.1/pe_hostfile
pc07
pc04
pc14
pc16
pc12
pc15
pc11
pc05
pc13
The 'catch_rsh' it is : SGE was wrongly catching the rsh command that the server was using to fetch the temperatures, causing mayhem. My best explanation as to why it all went wrong is that SGE was feeding the temperature-related rsh sequence into the NAMD stream (but this is hand-waving). The cure was pretty straightforward. Replace :
#!/bin/bash

if (($1=="25700"));
then
/bin/cat /var/tmp/temp | /bin/awk '{print $3}'
...
with
#!/bin/bash --noprofile

if (($1=="25700"));
then
/bin/cat /var/tmp/temp | /bin/awk '{print $3}'
...
and you're done : with '--noprofile', bash will not read the SGE configuration, rsh will not be caught by SGE, and all should be back to normal. Or not ?
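
One way to check whether an rsh call is still being intercepted is to look at which 'rsh' the shell would actually pick up. The sketch below assumes that the interception works by placing a wrapper named rsh ahead of the real one on PATH; that is an assumption about the mechanism, not something verified on this cluster. The helper name list_candidates is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE or of the original script):
# print every executable called NAME found on a PATH-like string,
# in lookup order. The first line printed is the one the shell would run.
list_candidates() {
    local name="$1" path="${2:-$PATH}" dir
    local IFS=':'
    for dir in $path; do
        # An empty PATH element means the current directory.
        [ -z "$dir" ] && dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
        fi
    done
}
```

Calling 'list_candidates rsh' from inside the temperature script, once with and once without '--noprofile', and comparing the first line of the two outputs should show whether a wrapper is being picked up ahead of the system rsh.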