05-13-2011, 12:33 AM
Rogean
¯\_(ツ)_/¯

Join Date: Oct 2009
Location: Massachusetts
Posts: 5,381
Outage Summary May 12th

Just to give people an idea of what happened.

Sometime after 6 PM EST the data center experienced a power outage affecting the rack that our servers are in (more info on this is posted below). The servers were back up a few minutes later, but the outage caused a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in a complete loss of its data. The data center provided me with a KVM over IP (kind of like a remote monitor, keyboard, and mouse), which I used to confirm that the array could not be restored.

One of the safeguards I have put in place is active real-time replication from the database software to another database server (one of our other zone servers). Both servers lost power. The second server didn't incur any hard drive issues, but its data was still corrupted by the interrupted replication stream and by write processes that weren't completed (the RAID controllers have batteries, but the database software being halted causes issues as well). I ran that data through the database repair program to fix the incomplete writes and other oddities, and it notified me that the character data had lost 3 entries due to corruption. (These characters were later identified, rolled back, and compensated.)

Since the replicated database only holds select data (pretty much only player data), I had to first restore from the last *full* database backup, taken on May 7th, which included all necessary background information such as database table structure and content. After that, I had to overwrite that database with the more specific player data from the replicated copy, and then lastly I had to re-merge in our latest content patch. The secondary replication server then had to be reconfigured to be the new primary server, as well as continue to run the zones it had before.
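
For anyone curious why the order of those three steps matters, here is a toy Python sketch. It is not the actual tooling used for the restore. The function name, table names, and sample rows are all made up for illustration, and the real work was done with the database software's own backup and repair tools.

```python
import copy

def rebuild_database(full_backup, replica_tables, content_patch):
    """Layer three data sources in the order described in the post.

    All arguments are dicts of {table_name: {row_id: row}} (hypothetical shape).
    """
    # 1. Start from the last full backup: every table, structure and content.
    db = copy.deepcopy(full_backup)

    # 2. Overwrite the tables the replica holds (player data) with the
    #    newer replicated copy. Tables the replica lacks stay as backed up.
    for table, rows in replica_tables.items():
        db[table] = copy.deepcopy(rows)

    # 3. Re-merge the latest content patch on top (add or update rows).
    for table, rows in content_patch.items():
        db.setdefault(table, {}).update(copy.deepcopy(rows))

    return db

# Made-up sample data: the backup is older than the replica.
full_backup = {
    "characters": {1: {"name": "ExampleWarrior", "level": 10}},
    "items": {100: {"name": "Example Dagger"}},
}
replica_tables = {
    "characters": {
        1: {"name": "ExampleWarrior", "level": 12},
        2: {"name": "ExampleCleric", "level": 3},
    },
}
content_patch = {"items": {101: {"name": "Example Patched Sword"}}}

restored = rebuild_database(full_backup, replica_tables, content_patch)
```

In this sketch the player tables end up with the replica's newer rows, while the background tables come from the full backup plus the patch. Doing the steps in a different order would let older backup data clobber the newer player data.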

After all that it was just a matter of cleaning up internal configurations, etc., and getting the server booted up. After the server was up and we got through the initial rush of resurrections and got the 3 affected characters restored, I went ahead and set up the real-time replication to another separate server, in case of another disaster :P.

Figured I would give everyone a glimpse of some of the work involved.

Here is what the data center said about the power failure:

Quote:
Please be advised that the Netriplex AVL01 data center facility team has reported that the UPS on Power Bus 1 momentarily dropped the critical load during a utility power blip impacting some customers in our AVL01 datacenter. Our records indicate that some or all of your infrastructure is being powered by this Bus. This power Bus currently serves approximately 12 percent of the customers in our AVL01 data center.

Facilities management is currently investigating this issue and has called our UPS service & support vendor to come onsite to investigate this issue with our electrical team. Further information is not known at this time.
Quote:
This update is to inform you that our UPS emergency service vendor is onsite investigating the issue at this time. An estimated time of resolution is still unknown.
Quote:
This update is to inform you that our UPS service vendor has determined that the new batteries recently installed in the UPS on power bus 1 may have a factory flaw or defect which caused them to drop the load this afternoon during the utility blip. This is based on the fact that the older batteries which were in place last week held the load during a slightly longer utility blip. Additionally, they have informed us that when two or more new batteries are found to have failed so quickly, it is indicative of a batch defect during production. They are recommending a complete replacement of all batteries immediately. We are currently working with our battery vendor in Charlotte to provide new batteries from a different manufacturer. The current ETA is 9:00 AM Friday morning.
__________________
Sean "Rogean" Norton
Project 1999 Co-Manager

Project 1999 Setup Guide