As we're sure you noticed, we've had a bit of a disaster with the server over the past few months, and more specifically four weeks ago!
This time, our monitors alerted us to a failed drive in the server. We contacted the host to have it swapped and a replacement was scheduled, but a second drive started producing errors before the first one had been replaced.
This server runs a RAID10 array, which can in theory stay online through two drive failures, as long as both halves of a mirrored pair don't fail together. In this instance, however, the failing drive quickly corrupted the RAID array before it could be swapped out.
Both drives were replaced by Softlayer and we attempted a repair using fsck, but this failed, leaving us with no choice but to restore the server from a backup. The restore took 10 hours to complete, and a further filesystem check took a few more hours on top of that.
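For anyone wondering how a "two-drive tolerant" RAID10 array can still die from two bad drives: RAID10 mirrors disks in pairs and stripes data across the pairs, so survival depends on which drives fail, not just how many. Here's a minimal Python sketch of that rule; the four-disk layout and names are hypothetical, purely for illustration, and don't reflect our actual controller config:

# RAID10 stripes across mirrored pairs; the array stays up as long
# as every pair still has at least one healthy disk.
MIRROR_PAIRS = [("disk0", "disk1"), ("disk2", "disk3")]

def array_survives(failed):
    # Returns True if no mirrored pair has lost both of its disks.
    return all(not set(pair) <= failed for pair in MIRROR_PAIRS)

print(array_survives({"disk0", "disk2"}))  # True: two failures, one per pair
print(array_survives({"disk0", "disk1"}))  # False: both halves of one mirror

And even when the math is on your side, a dying drive can silently write corrupt data before the controller drops it, which is exactly what happened to us here.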
Moving forward, we'll be making a number of changes to ensure this doesn't happen again, and to improve restoration times and communication during any future issues.
1. This is the sixth drive failure we've had on this server, despite having replaced each drive more than once as well as the RAID controller card. We no longer trust this hardware and will be replacing it with a new server as soon as possible. An exact date and time will be emailed to you once we receive and set up the hardware.
2. Another server will be brought online within the next hour in Dallas so that files can be restored faster. The current backup servers are hosted in Los Angeles, so restores to Dallas take slightly longer than they should.
3. We're getting ready to move everything over to the new machine, and we're hoping it proves more stable than the last piece of crap we called a server, which made me lose even more hair than I'm already losing.
4. The new server is now online and has been set up. The transfer will begin within the next hour, and there are a few points to note. There will be no downtime during the transfer, so it's business as usual: keep posting. This is just a heads-up.