Maintenance
== September 2008 ==
* PC2 & PC8 dead for good.

== March 2008 ==
* PC2 still missing in action.
* Several disks ready for last rites ? (pc6, pc13, pc15).

== November 2007 ==
* Weather getting cold, start a few jobs ;-)
* PC2 dead for good. Pooh.

== June 13th, 2007 ==
* Everything still down.
* Brand new UPS for server.

== June 8th, 2007 ==
* Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
* The water spraying exercise took its toll: server's UPS misbehaving (replace battery & overload lights).

== May 22nd - June 7th, 2007 ==
* Electrical work in the building. Air conditioning unit never came back.
* Attempts to fix the air conditioning failed.
* Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
* Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
* Afternoon of June 7th: looking stable ?
* Server says "broken RAID". One of the disks not responding ?
* Reboot and rebuild array from within RAID BIOS. Looking Ok.

== April 20th, 2007 ==
* PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
* Make new disk visible via NFS (/public3).

== February 14th, 2007 ==
* Again, and again, and again ... (this time a thunderstorm to blame).

== February 4th, 2007 ==
* Power failure again (and again). Bring everything up again ...

== January 26th-28th, 2007 ==
* Massive (regional) power failures. Everything went down and stayed down for three days.
* Upon booting, server complained about filesystem corruption (fixed manually with fsck).
* Synchronise server disks.

== December 10th-12th, 2006 ==
* Everything went to hell in a handbasket :
** On the 10th the power went down, everything gone with it.
** Upon rebooting, server died a couple of times.
** Tried to synchronise disks : failed.
** Server's power supply smelled like toasted wires. Replace it with a 400W box.
** Try again to synchronise disks : failed consistently.
** Try to synchronise via BIOS : copy failed, RAID broken.
** Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS) : looking good.
** Boot server normally, and re-synchronise disks from within unix : Ok.
** Do a 'restore -C' to look for corrupted system files. No surprises here.
** Bring everything up again.
** Test it : copy a 32G simulation from server to poppins via NFS : Looks Ok.
** Back to normal ?

== November 26th, 2006 ==
* PC6 replaced with a new box (Celeron 2.66, 256 MBytes).

== November 1st, 2006 ==
* PC6 dead for good. Replacement tower needed.
* A couple of power units replaced.

== July 11th, 2006 ==
* Power supply replaced on PC10.
* Power failure and air conditioning servicing. Bring everything up again.

== May 23rd, 2006 ==
* Server went down again (upon a simple grep). Will probably start worrying soon.

== May 10th, 2006 ==
* Power supplies replaced on PC1 & PC8. Looking good.

== May 9th, 2006 ==
* All went down due to power overload (AC+boxes+monitors).
* PC1 & PC2 refuse to come back to life.

== April 12th, 2006 ==
* Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.

== April 10th, 2006 ==
* PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
* Problems expected from nodes: pc1, pc2, pc8.

== March 30th, 2006 ==
* Tiny per-node load graph added on cluster's front html page.

== January 23rd, 2006 ==
* PC3 and Poppins back from the dead and looking good.

== January 7th, 2006 ==
* Server crashed again upon a large file transfer. Worrying ?

== December 13th, 2005 ==
* First tests with the connection to University network -> looks good (max 1.1Mbps).

== December 12th, 2005 ==
* Someone gave us the evil eye ...
* PC3 went dead. It looks as if it is dead for good. Send it away ...
== December 9th, 2005 ==
* Server back from the dead.
* fsck and disk synchronisation -> Ok
* Restart (software-wise) cluster and job -> Ok (?)

== December 7th-8th, 2005 ==
* Server crashed violently twice or thrice.
* Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
* Send server for a check-up ...

== October 26th, 2005 ==
* Last power failure damaged pc3's sensors ? Or not ? Wait for next reboot ...

== August 31st, 2005 ==
* Image back-up of server (excl. /tmp & /home).
* DVD apparently cooperative (in both single-session & multisession modes).

== August 28th, 2005 ==
* Power failure. All (including UPSed) went down.
* Bring everything up again. PC1 & PC3 had a difficult time restarting.
* Synchronise server's disks.
* Grab the opportunity to install a DVD-RW in the server.

== August 26th, 2005 ==
* It appears that the memory leak is indeed due to oMFS.
* Temperature monitoring now done with /rsh/ clusterwide.

== August 10th, 2005 ==
* Power failure due to storms. All (non-UPSed) went down.

== June 26th, 2005 ==
* Slow but consistent memory leak on newer nodes ?
* Suspecting oMFS usage for temperature monitoring. Try using /rsh/ on aspera.

== June 15th, 2005 ==
* PC2 and PC9 back from the dead.

== May 27th, 2005 ==
* Power failures took their toll : PC2 and PC9 dead. Replacement pending ?

== March 28th, 2005 ==
* Power failure, all (but UPSed) went down.
* Upon rebooting : fsck & memtest clusterwide.
* Server's RAID synchronisation

== February 18th, 2005 ==
* Clusterwide alarm system installed (!).

== February 16th, 2005 ==
* Cluster & NAMD jobs back to normal after kernel replacement on server.
* S.M.A.R.T. disk monitoring installed clusterwide [http://smartmontools.sourceforge.net/].

== February 15th, 2005 ==
* Power failure again. Bring everything up. As a side effect :
* Server now has the new (OOM-killer enabled) kernel running.
* Enable ntpupdate clusterwide to correctly set time upon boot.
== February 11th, 2005 ==
* All sorts of NAMD stability problems : processes losing communication, inexplicable crashes, ...
* [[The current best explanation ...]]
* Gave up on having windoze as the default boot choice : GNU/Linux-oM is now the default (clusterwide).

== February 9th, 2005 ==
* Power failure overnight : all went down.
* Upon rebooting : synchronise server RAID, fsck all cluster disks (it appears that they survived).

== February 5th, 2005 ==
* Rolled a kernel (oM 2.4.22-3) with the [[OOM killer enabled]] and started using it on pc08 and aspera for testing.
* Copy the new kernel to /boot on all machines (except the server) to be ready to go upon the next reboot.

== February 1st, 2005 ==
* Cluster-wide motherboard temperature monitoring kernel modules installed.
* Use MRTG to make the temperatures viewable /via/ web interface.

== January 17th, 2005 ==
* The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[http://netstation.sourceforge.net/]).

== November 27th, 2004 ==
* Incremental system dumps (excl. /work and /home)

== November 13th, 2004 ==
* Farewell to netscape : firefox with java 1.4.2 installed clusterwide.

== October 23rd, 2004 ==
* Image backup of server (excl. /work and /home)
* Image backup of aspera
* Level 0 dump of /usr/local

== September 30th, 2004 ==
* RAID : disk synchronisation.
* Grid Engine fully functional ? (excluding checkpointing, which may not even be feasible), see [[HOWTOs and FAQs]]

== September 20th, 2004 ==
* Time synchronisation daemon (ntpd) installed clusterwide.
* SGE : tight integration with MPICH apparently working.

== September 13th, 2004 ==
* pc01's CD-RW returned (hopefully repaired).
* Incremental server back-up.

== July 20th, 2004 ==
* Scripts and daemons to watch uptimes and maximal uptimes.
* 'Documents' link (and content) added.

== July 9th, 2004 ==
* Sun Grid Engine[http://server.cluster.mbg.gr/pdf/packages/SGE53AdminUserDoc.pdf] version 5.3 installed cluster-wide. MPI integration pending.

== July 8th, 2004 ==
* Incremental server, aspera back-up.
* snmpd to watch traffic on pc13. Add to cluster-view pages.

== June 28th, 2004 ==
* Cluster homepage updates :
** Script to allow using MRTG[http://people.ee.ethz.ch/~oetiker/webtools/mrtg/] for viewing cluster activity (daily, weekly, monthly, yearly).
** Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.

== June 24th, 2004 ==
* Hardware things :
** Sent broken pc01 CD-RW for repair
** Direct link between aspera & main (24-port) switch
** Addition of an 8-port 10/100 switch

== June 15th, 2004 ==
* Synchronise server disks
* Image backup of server
* Increase TCP buffer sizes to 4 Mbytes throughout cluster
* DFSAlink /work now rests on server's /tmp (disabled tmpwatch on server)
* UPS on aspera (and visible via www pages)
* Stabilise crontab and chkconfig changes (e.g. snmpd)