September 2008
March 2008
- PC2 still missing in action.
- Several disks ready for last rites ? (pc6, pc13, pc15).
November 2007
- Weather getting cold, start a few jobs ,-)
- PC2 dead for good. Pooh.
June 13th, 2007
- Everything still down.
- Brand new UPS for server.
June 8th, 2007
- Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
- The water spraying exercise took its toll: server's UPS misbehaving ("replace battery" & "overload" lights on).
May 22nd - June 7th, 2007
- Electrical work in the building. Air conditioning unit never came back.
- Attempts to fix the air conditioning failed.
- Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
- Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
- Afternoon of June 7th: looking stable ?
- Server says "broken RAID". One of the disks not responding ?
- Reboot and rebuild array from within RAID BIOS. Looking Ok.
April 20th, 2007
- PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
- Make new disk visible via NFS (/public3).
February 14th, 2007
- Again, and again, and again … (this time a thunderstorm to blame).
February 4th, 2007
- Power failure again (and again). Bring everything up again …
January 26th-28th, 2007
- Massive (regional) power failures. Everything went down and stayed down for three days.
- Upon booting, server complained about filesystem corruption (fixed manually with fsck).
- Synchronise server disks.
December 10th-12th, 2006
- Everything went to hell in a handbasket :
  - On the 10th the power went down, everything gone with it.
  - Upon rebooting, server died a couple of times.
  - Tried to synchronise disks : failed.
  - Server's power supply smelled like toasted wires. Replace it with a 400W box.
  - Try again to synchronise disks : failed consistently.
  - Try to synchronise via BIOS : copy failed, RAID broken.
  - Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS) : looking good.
  - Boot server normally, and re-synchronise disks from within unix : Ok.
  - Do a 'restore -C' to look for corrupted system files. No surprises here.
  - Bring everything up again.
  - Test it : copy a 32G simulation from server to poppins via NFS : Looks Ok.
- Back to normal ?
November 26th, 2006
- PC6 replaced with a new box (Celeron 2.66, 256 MBytes).
November 1st, 2006
- PC6 dead for good. Replacement tower needed.
- A couple of power units replaced.
July 11th, 2006
- Power supply replaced on PC10.
- Power failure and air-conditioning servicing. Bring everything up again.
May 23rd, 2006
- Server went down again (upon a simple grep). Will probably start worrying soon.
May 10th, 2006
- Power supplies replaced on PC1 & PC8. Looking good.
May 9th, 2006
- All went down due to power overload (AC+boxes+monitors).
- PC1 & PC2 refuse to come back to life.
April 12th, 2006
- Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.
April 10th, 2006
- PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
- Problems expected from nodes: pc1, pc2, pc8.
March 30th, 2006
- Tiny per-node load graph added to the cluster's front HTML page.
January 23rd, 2006
- PC3 and Poppins back from the dead and looking good.
January 7th, 2006
- Server crashed again upon a large file transfer. Worrying ?
December 13th, 2005
- First tests with the connection to University network → looks good (max 1.1Mbps).
December 12th, 2005
- Someone gave us the evil eye …
- PC3 went dead. It looks as if it is dead for good. Send it away …
December 9th, 2005
- Server back from the dead.
- fsck and disk synchronisation → Ok
- Restart (software-wise) cluster and job → Ok (?)
December 7th-8th, 2005
- Server crashed violently twice or thrice.
- Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
- Send server for a check-up …
October 26th, 2005
- Last power failure damaged pc3's sensors ? Or not ? Wait for next reboot …
August 31st, 2005
- Image back-up of server (excl. /tmp & /home).
- DVD-RW apparently cooperative (in both single-session & multisession modes).
August 28th, 2005
- Power failure. All (including UPSed) went down.
- Bring everything up again. PC1 & PC3 had a difficult time restarting.
- Synchronise server's disks.
- Grab the opportunity to install a DVD-RW in the server.
August 26th, 2005
- It appears that the memory leak is indeed due to oMFS.
- Temperature monitoring now done with rsh clusterwide.
August 10th, 2005
- Power failure due to storms. All (non-UPSed) went down.
June 26th, 2005
- Slow but consistent memory leak on newer nodes ?
- Suspecting oMFS usage for temperature monitoring. Try using rsh on aspera.
June 15th, 2005
- PC2 and PC9 back from the dead.
May 27th, 2005
- Power failures took their toll : PC2 and PC9 dead. Replacement pending ?
March 28th, 2005
- Power failure, all (but UPSed) went down.
- Upon rebooting : fsck & memtest clusterwide.
- Server's RAID synchronisation
February 18th, 2005
- Clusterwide alarm system installed (!).
February 16th, 2005
- Cluster & NAMD jobs back to normal after kernel replacement on server.
- S.M.A.R.T. disk monitoring installed clusterwide [1].
February 15th, 2005
- Power failure again. Bring everything up. As a side effect :
- Server now has the new (OOM-killer enabled) kernel running.
- Enable ntpupdate clusterwide to correctly set time upon boot.
February 11th, 2005
- All sorts of NAMD stability problems : processes losing communication, inexplicable crashes, …
- The current best explanation ...
- Gave up on having windoze as the default boot choice : GNU/Linux-oM is now the default (clusterwide).
February 9th, 2005
- Power failure overnight : all went down.
- Upon rebooting : synchronise server RAID, fsck all cluster disks (it appears that they survived).
February 5th, 2005
- Rolled a kernel (oM 2.4.22-3) with the OOM killer enabled and started using it on pc08 and aspera for testing.
- Copy the new kernel in /boot on all machines (except the server) to be ready to go upon the next reboot.
February 1st, 2005
- Cluster-wide motherboard temperature monitoring kernel modules installed.
- Use MRTG to make the temperatures viewable via web interface.
January 17th, 2005
- The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[2]).
November 27th, 2004
- Incremental system dumps (excl. /work and /home)
November 13th, 2004
- Farewell to netscape : firefox with java 1.4.2 installed clusterwide.
October 23rd, 2004
- Image backup of server (excl. /work and /home)
- Image backup of aspera
- Level 0 dump of /usr/local
September 30th, 2004
- RAID : disk synchronisation.
- Grid Engine fully functional ? (excluding checkpointing, which may not even be feasible), see HOWTOs and FAQs.
September 20th, 2004
- Time synchronisation daemon (ntpd) installed clusterwide.
- SGE : tight integration with MPICH apparently working.
September 13th, 2004
- pc01's CD-RW returned (hopefully repaired).
- Incremental server back-up.
July 20th, 2004
- Scripts and daemons to watch uptimes and maximal uptimes.
- 'Documents' link (and content) added.
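
The log does not describe how the uptime scripts work. As a minimal sketch, assuming each node's uptime is read from the Linux /proc/uptime file and that a running maximum is kept per node, the bookkeeping could look like this (all function names are hypothetical):

```python
def parse_uptime(proc_uptime_text: str) -> float:
    """Return the uptime in seconds from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: seconds since boot and idle seconds.
    return float(proc_uptime_text.split()[0])


def update_max_uptime(previous_max: float, current: float) -> float:
    """Return the new record uptime given the old record and a fresh reading."""
    return max(previous_max, current)


def format_duration(seconds: float) -> str:
    """Render a number of seconds as days, hours and minutes."""
    total = int(seconds)
    days, remainder = divmod(total, 86400)
    hours, remainder = divmod(remainder, 3600)
    minutes = remainder // 60
    return f"{days}d {hours:02d}h {minutes:02d}m"
```

A cron job or daemon on each node could read /proc/uptime, pass the text to `parse_uptime`, and store the result of `update_max_uptime` in a small state file for the web pages to display.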
July 9th, 2004
- Sun Grid Engine[3] version 5.3 installed cluster-wide. MPI integration pending.
July 8th, 2004
- Incremental back-up of server and aspera.
- snmpd to watch traffic on pc13. Add to cluster-view pages.
June 28th, 2004
- Cluster homepage updates :
  - Script to allow using MRTG[4] for viewing cluster activity (daily, weekly, monthly, yearly).
  - Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.
June 24th, 2004
- Hardware things :
  - Sent broken pc01 CD-RW for repair
  - Direct link between aspera & main (24-port) switch
  - Addition of an 8-port 10/100 switch
June 15th, 2004
- Synchronise server disks
- Image backup of server
- Increase TCP buffer sizes to 4 Mbytes throughout cluster
- DFSAlink /work now rests on server's /tmp (disabled tmpwatch on server)
- UPS on aspera (and visible via www pages)
- Stabilise crontab and chkconfig changes (e.g. snmpd)
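
The June 15th entry does not say how the TCP buffer sizes were raised. One common way on Linux is through sysctl settings; the sketch below assumes that route and simply renders the relevant lines for a 4 Mbyte maximum. The minimum and default values (4096, 87380, 65536) are illustrative assumptions, not values taken from this cluster, and the function name is hypothetical.

```python
from typing import List


def tcp_buffer_sysctl_lines(max_bytes: int = 4 * 1024 * 1024) -> List[str]:
    """Return sysctl.conf-style lines that raise TCP buffer limits to max_bytes."""
    return [
        # Upper bounds for socket receive and send buffers.
        f"net.core.rmem_max = {max_bytes}",
        f"net.core.wmem_max = {max_bytes}",
        # TCP triples are "min default max"; min and default are assumed values.
        f"net.ipv4.tcp_rmem = 4096 87380 {max_bytes}",
        f"net.ipv4.tcp_wmem = 4096 65536 {max_bytes}",
    ]
```

The resulting lines could be appended to /etc/sysctl.conf on every node and loaded with `sysctl -p`.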