MBG wiki: Maintenance

September 2008

PC2 & PC8 dead for good.

March 2008

PC2 still missing in action.
Several disks ready for last rites ? (pc6, pc13, pc15).

November 2007

Weather getting cold, start a few jobs ;-)
PC2 dead for good. Pooh.

June 13th, 2007

Everything still down.
Brand new UPS for server.

June 8th, 2007

Actually, the air conditioning unit wasn't stable. Cluster room became a water park. Switch everything off.
The water spraying exercise took its toll: server's UPS misbehaving ('replace battery' & 'overload' lights).

May 22nd - June 7th, 2007

Electrical work in the building. Air conditioning unit never came back.
Attempts to fix the air conditioning failed.
Attempts to use the second air conditioning unit initially unsuccessful, but it finally came along (so to speak, see below).
Second unit suddenly started spraying PCs with water. One monitor lost in the process. Server escaped (?).
Afternoon of June 7th: looking stable ?
Server says "broken RAID". One of the disks not responding ?
Reboot and rebuild array from within RAID BIOS. Looking Ok.

April 20th, 2007

PC14's disk gone for good. Buy a new 80G disk, reinstall RH7.3, restore level 0 dump (from 2005), copy users' areas, /usr/local, passwd and groups.
Make new disk visible via NFS (/public3).

February 14th, 2007

Again, and again, and again … (this time a thunderstorm to blame).

February 4th, 2007

Power failure again (and again). Bring everything up again …

January 26th-28th, 2007

Massive (regional) power failures. Everything went down and stayed down for three days.
Upon booting, server complained about filesystem corruption (fixed manually with fsck).
Synchronise server disks.

December 10th-12th, 2006

Everything went to hell in a handbasket :
- On the 10th the power went down, everything gone with it.
- Upon rebooting, server died a couple of times.
- Tried to synchronise disks : failed.
- Server's power supply smelled like toasted wires. Replace it with a 400W box.
- Try again to synchronise disks : failed consistently.
- Try to synchronise via BIOS : copy failed, RAID broken.
- Get a new 80G disk (Western Digital), rebuild array, make the new disk bootable (everything from BIOS) : looking good.
- Boot server normally, and re-synchronise disks from within unix : Ok.
- Do a 'restore -C' to look for corrupted system files. No surprises here.
- Bring everything up again.
- Test it : copy a 32G simulation from server to poppins via NFS : Looks Ok.
- Back to normal ?

November 26th, 2006

PC6 replaced with a new box (Celeron 2.66, 256 MBytes).

November 1st, 2006

PC6 dead for good. Replacement tower needed.
A couple of power units replaced.

July 11th, 2006

Power supply replaced on PC10.
Power failure and air-conditioning servicing. Bring everything up again.

May 23rd, 2006

Server went down again (upon a simple grep). Will probably start worrying soon.

May 10th, 2006

Power supplies replaced on PC1 & PC8. Looking good.

May 9th, 2006

All went down due to power overload (AC+boxes+monitors).
PC1 & PC2 refuse to come back to life.

April 12th, 2006

Server went dead reproducibly during a 'less' on a large file. Memtest looks ok. Continue.

April 10th, 2006

PC1 went dead, possibly due to i2c bus problems. Thankfully, it agreed to boot again after a couple of hours.
Problems expected from nodes: pc1, pc2, pc8.

March 30th, 2006

Tiny per node load graph added on cluster's front html page.

January 23rd, 2006

PC3 and Poppins back from the dead and looking good.

January 7th, 2006

Server crashed again upon a large file transfer. Worrying ?

December 13th, 2005

First tests with the connection to University network → looks good (max 1.1Mbps).

December 12th, 2005

They gave us the evil eye …
PC3 went dead. It looks as if it is dead for good. Send it away …

December 9th, 2005

Server back from the dead.
fsck and disk synchronisation → Ok
Restart (software-wise) cluster and job → Ok (?)

December 7th-8th, 2005

Server crashed violently twice or thrice.
Memtest indicated problematic DIMM. Tried to locate it, but problems persisted.
Send server for a check-up …

October 26th, 2005

Last power failure damaged pc3's sensors ? Or not ? Wait for next reboot …

August 31st, 2005

Image back-up of server (excl. /tmp & /home).
DVD apparently cooperative (in both single-session & multisession modes).

August 28th, 2005

Power failure. All (including UPSed) went down.
Bring everything up again. PC1 & PC3 had a difficult time restarting.
Synchronise server's disks.
Grab the opportunity to install a DVD-RW in the server.

August 26th, 2005

It appears that the memory leak is indeed due to oMFS.
Temperature monitoring now done with rsh clusterwide.
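
A minimal sketch of what rsh-based temperature polling could look like. The log does not record the actual script, so the node names, the sensor file path and the "current reading is the last number on the line" format are all assumptions made for illustration.

```python
import subprocess

# Hypothetical values: neither the node list nor the path comes from the log.
NODES = ["pc01", "pc02", "pc03"]
SENSOR_FILE = "/proc/sys/dev/sensors/CHIP/temp1"  # placeholder path


def rsh_command(node, sensor_file=SENSOR_FILE):
    """Build the argv that reads one sensor file on a remote node via rsh."""
    return ["rsh", node, "cat", sensor_file]


def parse_reading(text):
    """Return the last number on the first non-empty line, or None.

    Assumes whitespace-separated numbers with the current reading last
    (e.g. "60.0 50.0 34.5"); adjust to the real sensor output.
    """
    for line in text.splitlines():
        fields = line.split()
        if fields:
            try:
                return float(fields[-1])
            except ValueError:
                return None
    return None


def poll(nodes=NODES, timeout=10):
    """Query every node in turn; unreachable or failing nodes map to None."""
    readings = {}
    for node in nodes:
        try:
            result = subprocess.run(
                rsh_command(node), capture_output=True, text=True, timeout=timeout
            )
            ok = result.returncode == 0
            readings[node] = parse_reading(result.stdout) if ok else None
        except (OSError, subprocess.TimeoutExpired):
            readings[node] = None
    return readings
```

Polling from one machine keeps the monitoring off the cluster filesystem, which is the point of the switch away from oMFS; a dead node simply shows up as None instead of blocking the loop.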

August 10th, 2005

Power failure due to storms. All (non-UPSed) went down.

June 26th, 2005

Slow but consistent memory leak on newer nodes ?
Suspecting oMFS usage for temperature monitoring. Try using rsh on aspera.

June 15th, 2005

PC2 and PC9 back from the dead.

May 27th, 2005

Power failures took their toll : PC2 and PC9 dead. Replacement pending ?

March 28th, 2005

Power failure, all (but UPSed) went down.
Upon rebooting : fsck & memtest clusterwide.
Server's RAID synchronisation

February 18th, 2005

Clusterwide alarm system installed (!).

February 16th, 2005

Cluster & NAMD jobs back to normal after kernel replacement on server.
S.M.A.R.T. disk monitoring installed clusterwide [1].

February 15th, 2005

Power failure again. Bring everything up. As a side effect :
Server now has the new (OOM-killer enabled) kernel running.
Enable ntpdate clusterwide to correctly set time upon boot.

February 11th, 2005

All sorts of NAMD stability problems : processes losing communication, inexplicable crashes, …
The current best explanation ...
Gave up on having windoze as the default boot choice : GNU/Linux-oM is now the default (clusterwide).

February 9th, 2005

Power failure overnight : all went down.
Upon rebooting : synchronise server RAID, fsck all cluster disks (it appears that they survived).

February 5th, 2005

Rolled a kernel (oM 2.4.22-3) with the OOM killer enabled and started using it on pc08 and aspera for testing.
Copy the new kernel in /boot on all machines (except the server) to be ready to go upon the next reboot.

February 1st, 2005

Cluster-wide motherboard temperature monitoring kernel modules installed.
Use MRTG to make the temperatures viewable via web interface.

January 17th, 2005

The server now offers DHCPd & tftpd (needed for dumb X-terminals based on netstation[2]).

November 27th, 2004

Incremental system dumps (excl. /work and /home)

November 13th, 2004

Farewell to netscape : firefox with java 1.4.2 installed clusterwide.

October 23rd, 2004

Image backup of server (excl. /work and /home)
Image backup of aspera
Level 0 dump of /usr/local

September 30th, 2004

RAID : disk synchronisation.
Grid Engine fully functional ? (excluding checkpointing, which may not even be feasible), see HOWTOs and FAQs.

September 20th, 2004

Time synchronisation daemon (ntpd) installed clusterwide.
SGE : tight integration with MPICH apparently working.

September 13th, 2004

pc01's CD-RW returned (hopefully repaired).
Incremental server back-up.

July 20th, 2004

Scripts and daemons to watch uptimes and maximal uptimes.
'Documents' link (and content) added.
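
A rough sketch of how an uptime and maximal-uptime watcher could work; the original scripts are not reproduced in this log. It assumes Linux's /proc/uptime (first field is seconds since boot) and a hypothetical per-node record file holding the best value seen so far.

```python
def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime)."""
    with open(path) as fh:
        return float(fh.read().split()[0])


def update_record(current, record_path):
    """Keep the largest uptime seen so far in record_path and return it."""
    try:
        with open(record_path) as fh:
            best = float(fh.read().strip())
    except (OSError, ValueError):
        best = 0.0  # no record yet, or an unreadable one
    if current > best:
        best = current
        with open(record_path, "w") as fh:
            fh.write(f"{best}\n")
    return best


def format_uptime(seconds):
    """Render seconds as days, hours and minutes for a status page."""
    days, rest = divmod(int(seconds), 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}h {rest // 60:02d}m"
```

Run from cron on each node, something like update_record(read_uptime_seconds(), "/var/tmp/max_uptime") would keep the record current; the record path is a placeholder.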

July 9th, 2004

Sun Grid Engine[3] version 5.3 installed cluster-wide. MPI integration pending.

July 8th, 2004

Incremental back-up of server and aspera.
snmpd to watch traffic on pc13. Add to cluster-view pages.

June 28th, 2004

Cluster homepage updates :
- Script to allow using MRTG[4] for viewing cluster activity (daily, weekly, monthly, yearly).
- Modification of openmosixwebview page to include the MRTG graphs, the network traffic graphs, and the running jobs.

June 24th, 2004

Hardware things :
- Sent broken pc01 CD-RW for repair
- Direct link between aspera & main (24-port) switch
- Addition of an 8-port 10/100 switch

June 15th, 2004

Synchronise server disks
Image backup of server
Increase TCP buffer sizes to 4 Mbytes throughout cluster
DFSAlink /work now rests on server's /tmp (disabled tmpwatch on server)
UPS on aspera (and visible via www pages)
Stabilise crontab and chkconfig changes (eg. snmpd)
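
For the TCP buffer change in the entry above, a small sketch that renders candidate sysctl settings for a 4 MByte ceiling. The log does not say which keys were actually changed; the four standard Linux keys below, and the minimum/default values in the tcp_rmem and tcp_wmem triplets, are assumptions.

```python
BUFFER_BYTES = 4 * 1024 * 1024  # "4 Mbytes", read here as 4 MiB = 4194304 bytes


def sysctl_lines(max_bytes=BUFFER_BYTES):
    """Return sysctl.conf-style lines raising the TCP buffer ceilings.

    The min/default numbers in the tcp_rmem/tcp_wmem triplets are common
    kernel defaults, not values taken from the log.
    """
    return [
        f"net.core.rmem_max = {max_bytes}",
        f"net.core.wmem_max = {max_bytes}",
        f"net.ipv4.tcp_rmem = 4096 87380 {max_bytes}",
        f"net.ipv4.tcp_wmem = 4096 65536 {max_bytes}",
    ]


if __name__ == "__main__":
    print("\n".join(sysctl_lines()))
```

The printed lines would go in /etc/sysctl.conf on every node, so that the setting survives the frequent reboots recorded in this log.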