MBG wiki: Maintenance

== September 2008 ==
* PC2 & PC8 dead for good.

== March 2008 ==
* PC2 still missing in action.
* Several disks ready for last rites ? (pc6, pc13, pc15).

== November 2007 ==
* Weather getting cold, start a few jobs ;-)
* PC2 dead for good. Pooh.

== June 13th, 2007 ==
* Everything still down.
* Brand new UPS for server.

== June 8th, 2007 ==
* Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
* The water spraying exercise took its toll: the server's UPS is misbehaving ("replace battery" & "overload" lights on).

== May 22nd - June 7th, 2007 ==
* Electrical work in the building. Air conditioning unit never came back.
* Attempts to fix the air conditioning failed.
* Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
* Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
* Afternoon of June 7th: looking stable ?
* Server says "broken RAID". One of the disks not responding ?
* Reboot and rebuild array from within RAID BIOS. Looking Ok.

== April 20th, 2007 ==
* PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
* Make new disk visible via NFS (/public3).

== February 14th, 2007 ==
* Again, and again, and again ... (this time a thunderstorm to blame).

== February 4th, 2007 ==
* Power failure again (and again). Bring everything up again ...

== January 26th-28th, 2007 ==
* Massive (regional) power failures. Everything went down and stayed down for three days.
* Upon booting, the server complained about filesystem corruption (fixed manually with fsck).
* Synchronise server disks.

== December 10th-12th, 2006 ==
* Everything went to hell in a handbasket :
** On the 10th the power went down, everything gone with it.
** Upon rebooting, server died a couple of times.
** Tried to synchronise disks : failed.
** Server's power supply smelled like toasted wires. Replace it with a 400W box.
** Try again to synchronise disks : failed consistently.
** Try to synchronise via BIOS : copy failed, RAID broken.
** Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS) : looking good.
** Boot server normally, and re-synchronise disks from within unix : Ok.
** Do a 'restore -C' to look for corrupted system files. No surprises here.
** Bring everything up again.
** Test it : copy a 32G simulation from server to poppins via NFS : Looks Ok.
** Back to normal ?

== November 26th, 2006 ==
* PC6 replaced with a new box (celeron 2.66, 256 MBytes).

== November 1st, 2006 ==
* PC6 dead for good. Replacement tower needed.
* A couple of power units replaced.

== July 11th, 2006 ==
* Power supply replaced on PC10.
* Power failure and air-conditioning servicing. Bring everything up again.

== May 23rd, 2006 ==
* Server went down again (upon a simple grep). Will probably start worrying soon.

== May 10th, 2006 ==
* Power supplies replaced on PC1 & PC8. Looking good.

== May 9th, 2006 ==
* All went down due to power overload (AC+boxes+monitors).
* PC1 & PC2 refuse to come back to life.

== April 12th, 2006 ==
* Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.

== April 10th, 2006 ==
* PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
* Problems expected from nodes: pc1, pc2, pc8.

== March 30th, 2006 ==
* Tiny per node load graph added on cluster's front html page.

== January 23rd, 2006 ==
* PC3 and Poppins back from the dead and looking good.

== January 7th, 2006 ==
* Server crashed again upon a large file transfer. Worrying ?

== December 13th, 2005 ==
* First tests with the connection to University network -> looks good (max 1.1Mbps).

== December 12th, 2005 ==
* Someone gave us the evil eye ...
* PC3 went dead. It looks as if it is dead for good. Send it away ...

== December 9th, 2005 ==
* Server back from the dead.
* fsck and disk synchronisation -> Ok
* Restart (software-wise) cluster and job -> Ok (?)

== December 7th-8th, 2005 ==
* Server crashed violently twice or thrice.
* Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
* Send server for a check-up ...

== October 26th, 2005 ==
* Last power failure damaged pc3's sensors ? Or not ? Wait for next reboot ...

== August 31st, 2005 ==
* Image back-up of server (excl. /tmp & /home).
* DVD apparently cooperative (in both single-session & multisession modes).

== August 28th, 2005 ==
* Power failure. All (including UPSed) went down.
* Bring everything up again. PC1 & PC3 had a difficult time restarting.
* Synchronise server's disks.
* Grab the opportunity to install a DVD-RW in the server.

== August 26th, 2005 ==
* It appears that the memory leak is indeed due to oMFS.
* Temperature monitoring now done with /rsh/ clusterwide.
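A minimal sketch of what rsh-based polling could look like. The log does not record the actual scripts, so everything here is an assumption: the helper names `read_remote` and `poll_temperatures` are hypothetical, the remote command is left to the caller, and the temperature is assumed to be the last whitespace-separated field of that command's output.

```python
import subprocess

def read_remote(node, remote_command):
    """Run remote_command on node through rsh and return its standard output."""
    result = subprocess.run(["rsh", node, remote_command],
                            capture_output=True, text=True, timeout=30)
    return result.stdout

def poll_temperatures(nodes, remote_command, reader=read_remote):
    """Return {node: temperature or None}; None marks a node that could not be read."""
    readings = {}
    for node in nodes:
        try:
            # Assumption: the temperature is the last field printed by the command.
            readings[node] = float(reader(node, remote_command).split()[-1])
        except (ValueError, IndexError, OSError, subprocess.SubprocessError):
            readings[node] = None
    return readings
```

Passing the reader as a parameter keeps the parsing testable without a live cluster; a node that is down or returns nothing simply shows up as None.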

== August 10th, 2005 ==
* Power failure due to storms. All (non-UPSed) went down.

== June 26th, 2005 ==
* Slow but consistent memory leak on newer nodes ? 
* Suspecting oMFS usage for temperature monitoring. Try using /rsh/ on aspera.

== June 15th, 2005 ==
* PC2 and PC9 back from the dead.

== May 27th, 2005 ==
* Power failures took their toll : PC2 and PC9 dead. Replacement pending ?

== March 28th, 2005 ==
* Power failure, all (but UPSed) went down.
* Upon rebooting : fsck & memtest clusterwide.
* Server's RAID synchronisation

== February 18th, 2005 ==
* Clusterwide alarm system installed (!).

== February 16th, 2005 ==
* Cluster & NAMD jobs back to normal after kernel replacement on server.
* S.M.A.R.T. disk monitoring installed clusterwide [http://smartmontools.sourceforge.net/].

== February 15th, 2005 ==
* Power failure again. Bring everything up. As a side effect :
* Server now has the new (OOM-killer enabled) kernel running.
* Enable ntpdate clusterwide to correctly set time upon boot.

== February 11th, 2005 ==
* All sorts of NAMD stability problems : processes losing communication, inexplicable crashes, ...
* [[The current best explanation ...]]
* Gave up on having windoze as the default boot choice : GNU/Linux-oM is now the default (clusterwide).

== February 9th, 2005 ==
* Power failure overnight : all went down.
* Upon rebooting : synchronise server RAID, fsck all cluster disks (it appears that they survived).

== February 5th, 2005 ==
* Rolled a kernel (oM 2.4.22-3) with the [[OOM killer enabled]] and started using it on pc08 and aspera for testing.
* Copy the new kernel in /boot on all machines (except the server) to be ready to go upon the next reboot.

== February 1st, 2005 ==
* Cluster-wide motherboard temperature monitoring kernel modules installed.
* Use MRTG to make the temperatures viewable /via/ web interface.

== January 17th, 2005 ==
* The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[http://netstation.sourceforge.net/]).

== November 27th, 2004 ==
* Incremental system dumps (excl. /work and /home)

== November 13th, 2004 ==
* Farewell to netscape : firefox with java 1.4.2 installed clusterwide.

== October 23rd, 2004 ==
* Image backup of server (excl. /work and /home)
* Image backup of aspera
* Level 0 dump of /usr/local

== September 30th, 2004 ==
* RAID : disk synchronisation.
* Grid Engine fully functional ? (excluding checkpointing, which may not even be feasible), see [[HOWTOs and FAQs]]

== September 20th, 2004 ==
* Time synchronisation daemon (ntpd) installed clusterwide.
* SGE : tight integration with MPICH apparently working.

== September 13th, 2004 ==
* pc01's CD-RW returned (hopefully repaired).
* Incremental server back-up.

== July 20th, 2004 ==
* Scripts and daemons to watch uptimes and maximal uptimes.
* 'Documents' link (and content) added.
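The uptime bookkeeping mentioned above can be sketched roughly as follows. This is not the script actually used on the cluster: the function names and the record layout are hypothetical, and the only fact relied on is that the first field of Linux's /proc/uptime is the uptime in seconds.

```python
def parse_uptime(proc_uptime_text):
    """Return the uptime in seconds from the contents of /proc/uptime."""
    # /proc/uptime holds two numbers: seconds since boot, then idle seconds.
    return float(proc_uptime_text.split()[0])

def update_records(records, host, uptime_seconds):
    """Store the current uptime and keep the largest uptime seen so far per host."""
    previous_max = records.get(host, {}).get("max", 0.0)
    records[host] = {"current": uptime_seconds,
                     "max": max(previous_max, uptime_seconds)}
    return records
```

A periodic job on each node could read /proc/uptime, call `update_records`, and save the dictionary; after a reboot the current value drops while the stored maximum survives.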

== July 9th, 2004 ==
* Sun Grid Engine[http://server.cluster.mbg.gr/pdf/packages/SGE53AdminUserDoc.pdf] version 5.3 installed cluster-wide. MPI integration pending.

== July 8th, 2004 ==
* Incremental server, aspera back-up.
* snmpd to watch traffic on pc13. Add to cluster-view pages.

== June 28th, 2004 ==
* Cluster homepage updates :
** Script to allow using MRTG[http://people.ee.ethz.ch/~oetiker/webtools/mrtg/] for viewing cluster activity (daily, weekly, monthly, yearly).
** Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.

== June 24th, 2004 ==
* Hardware things : 
** Sent broken pc01 CD-RW for repair
** Direct link between aspera & main (24-port) switch
** Addition of an 8-port 10/100 switch

== June 15th, 2004 ==
* Synchronise server disks
* Image backup of server
* Increase TCP buffer sizes to 4 Mbytes throughout cluster
* DFSAlink /work now rests on server's /tmp (disabled tmpwatch on server)
* UPS on aspera (and visible via www pages)
* Stabilise crontab and chkconfig changes (eg. snmpd)
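For reference, a TCP buffer increase of this kind is usually expressed through kernel sysctl settings. The log does not say which parameters were changed, so the sketch below is an assumption: it uses the standard Linux keys `net.core.rmem_max`, `net.core.wmem_max`, `net.ipv4.tcp_rmem` and `net.ipv4.tcp_wmem`, the minimum and default values are illustrative, and the helper `tcp_buffer_sysctl` is hypothetical.

```python
def tcp_buffer_sysctl(max_bytes=4 * 1024 * 1024):
    """Render sysctl.conf-style lines raising the TCP buffer limits to max_bytes."""
    # Minimum and default sizes are illustrative assumptions, not values from the log.
    min_bytes, default_rcv, default_snd = 4096, 87380, 65536
    settings = [
        ("net.core.rmem_max", str(max_bytes)),
        ("net.core.wmem_max", str(max_bytes)),
        ("net.ipv4.tcp_rmem", f"{min_bytes} {default_rcv} {max_bytes}"),
        ("net.ipv4.tcp_wmem", f"{min_bytes} {default_snd} {max_bytes}"),
    ]
    return [f"{key} = {value}" for key, value in settings]
```

Printing the returned lines gives text that could be appended to /etc/sysctl.conf on each node, with 4 Mbytes written out as 4194304 bytes.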
