> == September 2008 ==
> * PC2 & PC8 dead for good.
> == March 2008 ==
> * PC2 still missing in action.
> * Several disks ready for last rites? (pc6, pc13, pc15).
> == November 2007 ==
> * Weather getting cold, start a few jobs ;-)
> * PC2 dead for good. Pooh.
> == June 13th, 2007 ==
> * Everything still down.
> * Brand new UPS for server.
> == June 8th, 2007 ==
> * Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
> * The water spraying exercise took its toll: server's UPS misbehaving ('replace battery' & 'overload' lights on).
> == May 22nd - June 7th, 2007 ==
> * Electrical work in the building. The air conditioning unit never came back.
> * Attempts to fix the air conditioning failed.
> * Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
> * Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
> * Afternoon of June 7th: looking stable?
> * Server says "broken RAID". One of the disks not responding?
> * Reboot and rebuild array from within RAID BIOS. Looking Ok.
> == April 20th, 2007 ==
> * PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
> * Make new disk visible via NFS (/public3).
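> Making the disk visible via NFS amounts to an /etc/exports line plus a re-export. A minimal sketch of the idea (the subnet, the mount options, and the temp file used for the demo are assumptions, not taken from the log):

```shell
# Sketch of exporting the new /public3 area via NFS. Subnet and options
# are assumptions; the demo writes to a temp file instead of /etc/exports.
exports_file=$(mktemp)
echo '/public3  192.168.1.0/255.255.255.0(rw,sync)' >> "$exports_file"
cat "$exports_file"

# On the real server one would append that line to /etc/exports and then
# re-read the exports table:
#   exportfs -ra
# and on each client node mount the new area (by hand or via /etc/fstab):
#   mount -t nfs server:/public3 /public3
```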
> == February 14th, 2007 ==
> * Again, and again, and again ... (this time a thunderstorm was to blame).
> == February 4th, 2007 ==
> * Power failure again (and again). Bring everything up again ...
> == January 26th-28th, 2007 ==
> * Massive (regional) power failures. Everything went down and stayed down for three days.
> * Upon booting, server complained about filesystem corruption (fixed manually with fsck).
> * Synchronise server disks.
> == December 10th-12th, 2006 ==
> * Everything went to hell in a handbasket:
> ** On the 10th the power went down, everything gone with it.
> ** Upon rebooting, server died a couple of times.
> ** Tried to synchronise disks: failed.
> ** Server's power supply smelled like toasted wires. Replace it with a 400W box.
> ** Try again to synchronise disks: failed consistently.
> ** Try to synchronise via BIOS: copy failed, RAID broken.
> ** Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS): looking good.
> ** Boot server normally, and re-synchronise disks from within Unix: Ok.
> ** Do a 'restore -C' to look for corrupted system files. No surprises here.
> ** Bring everything up again.
> ** Test it: copy a 32G simulation from server to poppins via NFS: looks Ok.
> ** Back to normal?
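> The 'restore -C' step compares a dump against the live filesystem. The same idea can be illustrated with generic tools (everything below is a made-up demo, not the server's real layout): record a checksum manifest while the system is known-good, then re-verify the same files after a suspect crash.

```shell
# Illustrative only: a checksum manifest serving the same purpose as
# 'restore -C' (comparing known-good contents against the live tree).
# The directory and file here are hypothetical.
workdir=$(mktemp -d)
mkdir -p "$workdir/etc"
printf 'root:x:0:0:root:/root:/bin/sh\n' > "$workdir/etc/passwd"

# Record a manifest while the system is known-good:
( cd "$workdir" && find etc -type f -exec md5sum {} + > manifest.md5 )

# After a suspect crash, re-check the files against the manifest:
( cd "$workdir" && md5sum -c manifest.md5 )    # prints "etc/passwd: OK"
```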
> == November 26th, 2006 ==
> * PC6 replaced with a new box (Celeron 2.66 GHz, 256 MB).
> == November 1st, 2006 ==
> * PC6 dead for good. Replacement tower needed.
> * A couple of power units replaced.
> == July 11th, 2006 ==
> * Power supply replaced on PC10.
> * Power failure and air-conditioning servicing. Take everything up again.
> == May 23rd, 2006 ==
> * Server went down again (upon a simple grep). Will probably start worrying soon.
> == May 10th, 2006 ==
> * Power supplies replaced on PC1 & PC8. Looking good.
> == May 9th, 2006 ==
> * All went down due to power overload (AC+boxes+monitors).
> * PC1 & PC2 refuse to come back to life.
> == April 12th, 2006 ==
> * Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.
> == April 10th, 2006 ==
> * PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
> * Problems expected from nodes: pc1, pc2, pc8.
> == March 30th, 2006 ==
> * Tiny per node load graph added on cluster's front html page.
> == January 23rd, 2006 ==
> * PC3 and Poppins back from the dead and looking good.
> == January 7th, 2006 ==
> * Server crashed again upon a large file transfer. Worrying?
> == December 13th, 2005 ==
> * First tests with the connection to the University network -> looks good (max 1.1 Mbps).
> == December 12th, 2005 ==
> * Someone put the evil eye on us ...
> * PC3 went dead. It looks as if it is dead for good. Send it away ...
> == December 9th, 2005 ==
> * Server back from the dead.
> * fsck and disk synchronisation -> Ok
> * Restart (software-wise) cluster and job -> Ok (?)
> == December 7th-8th, 2005 ==
> * Server crashed violently twice or thrice.
> * Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
> * Send server for a check-up ...
> == October 26th, 2005 ==
> * Last power failure damaged pc3's sensors? Or not? Wait for next reboot ...
> == August 31st, 2005 ==
> * Image back-up of server (excl. /tmp & /home).
> * DVD apparently cooperative (in both single-session & multi-session modes).
> == August 28th, 2005 ==
> * Power failure. All (including UPSed) went down.
> * Take everything up again. PC1 & PC3 had a difficult time restarting.
> * Synchronise server's disks.
> * Grab the opportunity to install a DVD-RW to the server.
> == August 26th, 2005 ==
> * It appears that the memory leak is indeed due to oMFS.
> * Temperature monitoring now done with ''rsh'' clusterwide.
> == August 10th, 2005 ==
> * Power failure due to storms. All (non-UPSed) went down.
> == June 26th, 2005 ==
> * Slow but consistent memory leak on newer nodes?
> * Suspecting oMFS usage for temperature monitoring. Try using ''rsh'' on aspera.
> == June 15th, 2005 ==
> * PC2 and PC9 back from the dead.
> == May 27th, 2005 ==
> * Power failures took their toll: PC2 and PC9 dead. Replacement pending?
> == March 28th, 2005 ==
> * Power failure, all (but UPSed) went down.
> * Upon rebooting: fsck & memtest clusterwide.
> * Server's RAID synchronisation.
> == February 18th, 2005 ==
> * Clusterwide alarm system installed (!).
> == February 16th, 2005 ==
> * Cluster & NAMD jobs back to normal after kernel replacement on server.
> * S.M.A.R.T. disk monitoring installed clusterwide [http://smartmontools.sourceforge.net/].
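> smartd takes its per-disk directives from /etc/smartd.conf; a hypothetical fragment (the device name, test schedule, and mail address are assumptions, not the cluster's real configuration):

```
# /etc/smartd.conf -- hypothetical example entry:
# monitor /dev/hda with all attributes (-a), run a short self-test
# every night at 02:00 (-s), and mail root on trouble (-m).
/dev/hda -a -s (S/../.././02) -m root
```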
> == February 15th, 2005 ==
> * Power failure again. Bring everything up. As a side effect:
> * Server now has the new (OOM-killer enabled) kernel running.
> * Enable ntpdate clusterwide to correctly set time upon boot.
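> On Red Hat 7.x-era nodes a one-shot clock set at boot is enough; a hypothetical sketch (the server hostname and the choice of file are assumptions, not from the log):

```
# Hypothetical: set the clock once at boot, e.g. from /etc/rc.d/rc.local,
# before any jobs start (hostname is made up):
#   ntpdate -s server.cluster
# or, if the distribution ships an ntpdate init script, enable it:
#   chkconfig ntpdate on
```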
> == February 11th, 2005 ==
> * All sorts of NAMD stability problems: processes losing communication, inexplicable crashes, ...
> * [[The current best explanation ...]]
> * Gave up on having windoze as the default boot choice: GNU/Linux-oM is now the default (clusterwide).
> == February 9th, 2005 ==
> * Power failure overnight: all went down.
> * Upon rebooting: synchronise server RAID, fsck all cluster disks (it appears that they survived).
> == February 5th, 2005 ==
> * Rolled a kernel (oM 2.4.22-3) with the [[OOM killer enabled]] and started using it on pc08 and aspera for testing.
> * Copy the new kernel in /boot on all machines (except the server) to be ready to go upon the next reboot.
> == February 1st, 2005 ==
> * Cluster-wide motherboard temperature monitoring kernel modules installed.
> * Use MRTG to make the temperatures viewable ''via'' a web interface.
> == January 17th, 2005 ==
> * The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[http://netstation.sourceforge.net/]).
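> For the diskless X-terminals, dhcpd has to hand each box an address plus the TFTP server and boot image; a hypothetical dhcpd.conf fragment (the MAC, all addresses, and the image path are made up for illustration):

```
# Hypothetical /etc/dhcpd.conf entry for one netstation X-terminal:
host xterm1 {
  hardware ethernet 00:11:22:33:44:55;
  fixed-address 192.168.1.201;
  next-server 192.168.1.1;        # host running tftpd
  filename "netstation/vmlinuz";  # boot image fetched over TFTP
}
```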
> == November 27th, 2004 ==
> * Incremental system dumps (excl. /work and /home)
> == November 13th, 2004 ==
> * Farewell to Netscape: Firefox with Java 1.4.2 installed clusterwide.
> == October 23rd, 2004 ==
> * Image backup of server (excl. /work and /home)
> * Image backup of aspera
> * Level 0 dump of /usr/local
> == September 30th, 2004 ==
> * RAID: disk synchronisation.
> * Grid Engine fully functional? (excluding checkpointing, which may not even be feasible), see [[HOWTOs and FAQs]]
> == September 20th, 2004 ==
> * Time synchronisation daemon (ntpd) installed clusterwide.
> * SGE : tight integration with MPICH apparently working.
> == September 13th, 2004 ==
> * pc01's CD-RW returned (hopefully repaired).
> * Incremental server back-up.
> == July 20th, 2004 ==
> * Scripts and daemons to watch uptimes and maximal uptimes.
> * 'Documents' link (and content) added.
> == July 9th, 2004 ==
> * Sun Grid Engine[http://server.cluster.mbg.gr/pdf/packages/SGE53AdminUserDoc.pdf] version 5.3 installed cluster-wide. MPI integration pending.
> == July 8th, 2004 ==
> * Incremental server, aspera back-up.
> * snmpd to watch traffic on pc13. Add to cluster-view pages.
> == June 28th, 2004 ==
> * Cluster homepage updates:
> ** Script to allow using MRTG[http://people.ee.ethz.ch/~oetiker/webtools/mrtg/] for viewing cluster activity (daily, weekly, monthly, yearly).
> ** Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.
> == June 24th, 2004 ==
> * Hardware things:
> ** Sent broken pc01 CD-RW for repair
> ** Direct link between aspera & main (24-port) switch
> ** Addition of an 8-port 10/100 switch
> == June 15th, 2004 ==