Maintenance

Difference (from prior major revision)

Changed: 1c1,171

< <b>June 15th, 2004</b>

to

> == September 2008 ==
> * PC2 & PC8 dead for good.
> == March 2008 ==
> * PC2 still missing in action.
> * Several disks ready for last rites ? (pc6, pc13, pc15).
> == November 2007 ==
> * Weather getting cold, start a few jobs ;-)
> * PC2 dead for good. Pooh.
> == June 13th, 2007 ==
> * Everything still down.
> * Brand new UPS for server.
> == June 8th, 2007 ==
> * Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
> * The water spraying exercise took its toll: server's UPS misbehaving (replace battery & overload lights).
> == May 22nd - June 7th, 2007 ==
> * Electrical work in the building. Air conditioning unit never came back.
> * Attempts to fix the air conditioning failed.
> * Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
> * Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
> * Afternoon of June 7th: looking stable ?
> * Server says "broken RAID". One of the disks not responding ?
> * Reboot and rebuild array from within RAID BIOS. Looking Ok.
> == April 20th, 2007 ==
> * PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
> * Make new disk visible via NFS (/public3).
> == February 14th, 2007 ==
> * Again, and again, and again ... (this time a thunderstorm to blame).
> == February 4th, 2007 ==
> * Power failure again (and again). Bring everything up again ...
> == January 26th-28th, 2007 ==
> * Massive (regional) power failures. Everything went down and stayed down for three days.
> * Upon booting, server complained about filesystem corruption (fixed manually with fsck).
> * Synchronise server disks.
> == December 10th-12th, 2006 ==
> * Everything went to hell in a handbasket :
> ** On the 10th the power went down, everything gone with it.
> ** Upon rebooting, server died a couple of times.
> ** Tried to synchronise disks : failed.
> ** Server's power supply smelled like toasted wires. Replace it with a 400W box.
> ** Try again to synchronise disks : failed consistently.
> ** Try to synchronise via BIOS : copy failed, RAID broken.
> ** Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS) : looking good.
> ** Boot server normally, and re-synchronise disks from within unix : Ok.
> ** Do a 'restore -C' to look for corrupted system files. No surprises here.
> ** Bring everything up again.
> ** Test it : copy a 32G simulation from server to poppins via NFS : Looks Ok.
> ** Back to normal ?
> == November 26th, 2006 ==
> * PC6 replaced with a new box (Celeron 2.66, 256 MBytes).
> == November 1st, 2006 ==
> * PC6 dead for good. Replacement tower needed.
> * A couple of power units replaced.
> == July 11th, 2006 ==
> * Power supply replaced on PC10.
> * Power failure and air-conditioning servicing. Bring everything up again.
> == May 23rd, 2006 ==
> * Server went down again (upon a simple grep). Will probably start worrying soon.
> == May 10th, 2006 ==
> * Power supplies replaced on PC1 & PC8. Looking good.
> == May 9th, 2006 ==
> * All went down due to power overload (AC+boxes+monitors).
> * PC1 & PC2 refuse to come back to life.
> == April 12th, 2006 ==
> * Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.
> == April 10th, 2006 ==
> * PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
> * Problems expected from nodes: pc1, pc2, pc8.
> == March 30th, 2006 ==
> * Tiny per node load graph added on cluster's front html page.
> == January 23rd, 2006 ==
> * PC3 and Poppins back from the dead and looking good.
> == January 7th, 2006 ==
> * Server crashed again upon a large file transfer. Worrying ?
> == December 13th, 2005 ==
> * First tests with the connection to University network -> looks good (max 1.1Mbps).
> == December 12th, 2005 ==
> * Somebody gave us the evil eye ...
> * PC3 went dead. It looks as if it is dead for good. Send it away ...
> == December 9th, 2005 ==
> * Server back from the dead.
> * fsck and disk synchronisation -> Ok
> * Restart (software-wise) cluster and job -> Ok (?)
> == December 7th-8th, 2005 ==
> * Server crashed violently twice or thrice.
> * Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
> * Send server for a check-up ...
> == October 26th, 2005 ==
> * Last power failure damaged pc3's sensors ? Or not ? Wait for next reboot ...
> == August 31st, 2005 ==
> * Image back-up of server (excl. /tmp & /home).
> * DVD apparently cooperative (in both single-session & multisession modes).
> == August 28th, 2005 ==
> * Power failure. All (including UPSed) went down.
> * Take everything up again. PC1 & PC3 had a difficult time restarting.
> * Synchronise server's disks.
> * Grab the opportunity to install a DVD-RW to the server.
> == August 26th, 2005 ==
> * It appears that the memory leak is indeed due to oMFS.
> * Temperature monitoring now done with /rsh/ clusterwide.
> == August 10th, 2005 ==
> * Power failure due to storms. All (non-UPSed) went down.
> == June 26th, 2005 ==
> * Slow but consistent memory leak on newer nodes ?
> * Suspecting oMFS usage for temperature monitoring. Try using /rsh/ on aspera.
> == June 15th, 2005 ==
> * PC2 and PC9 back from the dead.
> == May 27th, 2005 ==
> * Power failures took their toll : PC2 and PC9 dead. Replacement pending ?
> == March 28th, 2005 ==
> * Power failure, all (but UPSed) went down.
> * Upon rebooting : fsck & memtest clusterwide.
> * Server's RAID synchronisation
> == February 18th, 2005 ==
> * Clusterwide alarm system installed (!).
> == February 16th, 2005 ==
> * Cluster & NAMD jobs back to normal after kernel replacement on server.
> * S.M.A.R.T. disk monitoring installed clusterwide [http://smartmontools.sourceforge.net/].
> == February 15th, 2005 ==
> * Power failure again. Bring everything up. As a side effect :
> * Server now has the new (OOM-killer enabled) kernel running.
> * Enable ntpupdate clusterwide to correctly set time upon boot.
> == February 11th, 2005 ==
> * All sorts of NAMD stability problems : processes losing communication, inexplicable crashes, ...
> * [[The current best explanation ...]]
> * Gave up on having windoze as the default boot choice : GNU/Linux-oM is now the default (clusterwide).
> == February 9th, 2005 ==
> * Power failure overnight : all went down.
> * Upon rebooting : synchronise server RAID, fsck all cluster disks (it appears that they survived).
> == February 5th, 2005 ==
> * Rolled a kernel (oM 2.4.22-3) with the [[OOM killer enabled]] and started using it on pc08 and aspera for testing.
> * Copy the new kernel in /boot on all machines (except the server) to be ready to go upon the next reboot.
> == February 1st, 2005 ==
> * Cluster-wide motherboard temperature monitoring kernel modules installed.
> * Use MRTG to make the temperatures viewable /via/ web interface.
> == January 17th, 2005 ==
> * The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[http://netstation.sourceforge.net/]).
> == November 27th, 2004 ==
> * Incremental system dumps (excl. /work and /home)
> == November 13th, 2004 ==
> * Farewell to netscape : firefox with java 1.4.2 installed clusterwide.
> == October 23rd, 2004 ==
> * Image backup of server (excl. /work and /home)
> * Image backup of aspera
> * Level 0 dump of /usr/local
> == September 30th, 2004 ==
> * RAID : disk synchronisation.
> * Grid Engine fully functional ? (excluding checkpointing, which may not even be feasible), see [[HOWTOs and FAQs]]
> == September 20th, 2004 ==
> * Time synchronisation daemon (ntpd) installed clusterwide.
> * SGE : tight integration with MPICH apparently working.
> == September 13th, 2004 ==
> * pc01's CD-RW return (hopefully repaired).
> * Incremental server back-up.
> == July 20th, 2004 ==
> * Scripts and daemons to watch uptimes and maximal uptimes.
> * 'Documents' link (and content) added.
> == July 9th, 2004 ==
> * Sun Grid Engine[http://server.cluster.mbg.gr/pdf/packages/SGE53AdminUserDoc.pdf] version 5.3 installed cluster-wide. MPI integration pending.
> == July 8th, 2004 ==
> * Incremental server, aspera back-up.
> * snmpd to watch traffic on pc13. Add to cluster-view pages.
> == June 28th, 2004 ==
> * Cluster homepage updates :
> ** Script to allow using MRTG[http://people.ee.ethz.ch/~oetiker/webtools/mrtg/] for viewing cluster activity (daily, weekly, monthly, yearly).
> ** Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.
> == June 24th, 2004 ==
> * Hardware things :
> ** Sent broken pc01 CD-RW for repair
> ** Direct link between aspera & main (24-port) switch
> ** Addition of an 8-port 10/100 switch
> == June 15th, 2004 ==

Changed: 4c174

< * Increase TCP buffer sizes throughout cluster

to

> * Increase TCP buffer sizes to 4 Mbytes throughout cluster
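
The entry above gives only the target size (4 Mbytes), not the settings that were changed. As a rough sketch, assuming the usual Linux sysctl keys (net.core.rmem_max, net.core.wmem_max, net.ipv4.tcp_rmem, net.ipv4.tcp_wmem) and reading "4 Mbytes" as 4 x 1024 x 1024 bytes, the lines one might add to /etc/sysctl.conf on each node can be generated like this. The function name and the minimum/default values are placeholders for illustration, not values taken from the cluster.

```python
# Sketch only: the log does not say which kernel settings were changed.
# The sysctl key names are standard Linux ones; everything else is assumed.

BUFFER_MBYTES = 4
BUFFER_BYTES = BUFFER_MBYTES * 1024 * 1024  # assumes binary megabytes


def tcp_buffer_settings(max_bytes, min_bytes=4096, default_bytes=87380):
    """Return sysctl.conf-style lines raising the TCP buffer limits to max_bytes.

    min_bytes and default_bytes are placeholder values for the first two
    fields of tcp_rmem/tcp_wmem (min, default, max).
    """
    triple = f"{min_bytes} {default_bytes} {max_bytes}"
    return [
        f"net.core.rmem_max = {max_bytes}",
        f"net.core.wmem_max = {max_bytes}",
        f"net.ipv4.tcp_rmem = {triple}",
        f"net.ipv4.tcp_wmem = {triple}",
    ]


if __name__ == "__main__":
    for line in tcp_buffer_settings(BUFFER_BYTES):
        print(line)
```

The printed lines would still have to be loaded on every node (for example with sysctl -p) before they take effect; how this was actually rolled out across the cluster is not recorded here.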

