
Outage Summary May 12th


Rogean
05-13-2011, 12:33 AM
Just to give people an idea of what happened.

Sometime after 6 PM EST, the data center experienced a power outage affecting the rack our servers are in (more info on this posted below). The servers were back up a few minutes later; however, the outage caused a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in complete loss of data. The data center provided me with KVM over IP (kind of like a remote monitor, keyboard, and mouse) to confirm that the array could not be restored.

One of the safeguards I have put in place is active real-time replication from the database software to another database server (one of our other zone servers). Both servers lost power, but the second server didn't incur any hard drive issues. Even so, the outage still caused some data corruption due to an interrupted replication stream and write processes that weren't completed (the RAID controllers have batteries, but halting the database software mid-write causes issues as well). That data was run through the database repair program to fix the incomplete writes and other oddities, and it notified me that the character data had lost 3 entries to corruption. (These characters were later identified, rolled back, and compensated.)
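
For anyone curious what that check-and-repair pass might look like in practice, here's a minimal sketch. It assumes the database is MySQL with MyISAM tables, which the post doesn't actually state, and the host, credentials, and schema name are placeholders, not the server's real ones.

# Hypothetical post-crash check/repair pass (MySQL + MyISAM assumed).
# Requires mysql-connector-python; all connection details are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="replica-host", user="eq",
                               password="secret", database="eqdb")
cur = conn.cursor()

cur.execute("SHOW TABLES")
tables = [row[0] for row in cur.fetchall()]

for table in tables:
    # CHECK TABLE returns rows of (Table, Op, Msg_type, Msg_text);
    # anything other than an "OK" Msg_text means the table needs attention.
    cur.execute(f"CHECK TABLE `{table}`")
    if any(row[3] != "OK" for row in cur.fetchall()):
        print(f"{table}: problems found, attempting repair...")
        cur.execute(f"REPAIR TABLE `{table}`")
        for row in cur.fetchall():
            print("  ", row)

conn.close()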

Since the replicated database only holds select data (pretty much just player data), I first had to restore from the last *full* database backup, taken on May 7th, which included all the necessary background information such as database table structure and content. After that, I had to overwrite that database with the up-to-date player data from the replicated copy, and lastly re-merge in our latest content patch. The secondary replication server then had to be reconfigured as the new primary server, while continuing to run the zones it had before.
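
Roughly, the restore order looks something like the sketch below, written against the stock MySQL command-line tools on the assumption that's what's in use; the database name, table names, file names, and credentials are all placeholders, not the server's real ones.

# Rough outline of the three-step restore described above (MySQL assumed).
import subprocess

MYSQL = ["mysql", "--user=eq", "--password=secret", "eqdb"]  # placeholders

# 1. Restore the last full backup (table structure + world content, May 7th).
with open("full_backup_2011-05-07.sql") as dump:
    subprocess.run(MYSQL, stdin=dump, check=True)

# 2. Overwrite the player tables with the newer data from the replication server.
player_tables = ["character_data", "character_inventory"]  # hypothetical names
dump = subprocess.run(
    ["mysqldump", "--host=replica-host", "--user=eq", "--password=secret",
     "eqdb"] + player_tables,
    capture_output=True, text=True, check=True)
subprocess.run(MYSQL, input=dump.stdout, text=True, check=True)

# 3. Re-merge the latest content patch on top of the restored database.
with open("latest_content_patch.sql") as patch:
    subprocess.run(MYSQL, stdin=patch, check=True)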

After all that, it was just a matter of cleaning up internal configurations and the like, and getting the server booted up. Once the server was up and we got through the initial rush of resurrections and restored the 3 affected characters, I went ahead and set up real-time replication to another, separate server, in case of another disaster :P.
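
Pointing a fresh replica at the newly promoted primary might look something like this, assuming MySQL-style replication; the host names, replication account, and binlog file/position (which would come from SHOW MASTER STATUS on the new primary) are all placeholders.

# Hypothetical setup of a new replica against the promoted primary (MySQL assumed).
import mysql.connector

replica = mysql.connector.connect(host="new-replica-host", user="root",
                                  password="secret")
cur = replica.cursor()
cur.execute(
    "CHANGE MASTER TO "
    "MASTER_HOST='new-primary-host', "
    "MASTER_USER='repl', MASTER_PASSWORD='secret', "
    "MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=4"
)
cur.execute("START SLAVE")
replica.close()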

Figured I would give everyone a glimpse of some of the work involved.

Here is what the data center said about the power failure:

Please be advised that the Netriplex AVL01 data center facility team has reported that the UPS on Power Bus 1 momentarily dropped the critical load during a utility power blip impacting some customers in our AVL01 datacenter. Our records indicate that some or all of your infrastructure is being powered by this Bus. This power Bus currently serves approximately 12 percent of the customers in our AVL01 data center.

Facilities management is currently investigating this issue and has called our UPS service & support vendor to come onsite to investigate this issue with our electrical team. Further information is not known at this time.
This update is to inform you that our UPS emergency service vendor is onsite investigating the issue at this time. An estimated time of resolution is still unknown.

This update is to inform you that our UPS service vendor has determined that the new batteries recently installed in the UPS on power bus 1 may have a factory flaw or defect which caused them to drop the load this afternoon during the utility blip. This is based on the fact that the older batteries which were in place last week held the load during a slightly longer utility blip. Additionally, they have informed us that when two or more new batteries are found to have failed so quickly, it is indicative of a batch defect during production. They are recommending a complete replacement of all batteries immediately. We are currently working with our battery vendor in Charlotte to provide new batteries from a different manufacturer. The current ETA is 9:00 AM Friday morning.

Nagash
05-13-2011, 07:40 AM
Flipping heck, that sounds like Japanese to me but I understand that the brown stuff did hit the fan and that, as usual, you spent a lot of your own time away from your loved ones to sort it out for the benefit of all of us. For that, I can only thank you and give you a cookie if I see you in game :)

Petitpas/Nagash

Messianic
05-13-2011, 08:04 AM
Thanks!

We appreciate all the work and safeguards you've put in.

Asfasfos
05-13-2011, 08:08 AM
however it resulted in a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in complete loss of data

How come you don't use RAID mirroring, Rogean?

One of the safeguards I have put in place is active real-time replication from the database software to another database server

That's a really good system; it's like Oracle Data Guard, so if one database has a physical shutdown you can bring the other one up (a primary-standby setup).

Good job Rogean :)

bulbousaur
05-13-2011, 08:48 AM
Thanks for all of your work to let us play this great game, Rogean.

Knightmare
05-13-2011, 10:07 AM
I understand little of the more intricate tech parts, but the gist of it shows not only how much work goes into this, but also how much time and energy you guys put into keeping this server up and running.

We get an XP bonus for the failure. Do they give you a discount or anything, Rogean? lol.

Either way, this post should go a long way toward explaining why we all need to ante up and kick in some donations and/or -- ------. If not just to help keep the server running, then to show some appreciation for what you guys do.

Thanks to all of you guys.

P.S. OMW to Charlotte this weekend, good time for me to go in and cast Fear or Screaming Terror on 'em all :)

Vonkaar
05-13-2011, 10:18 AM
Are you getting any SLA reimbursement for the power outage?

meh, offline discussion methinks

Vonkaar
05-13-2011, 10:21 AM
Anyway, that's a huge fuckup for a datacenter.

Teensy Weensy
05-13-2011, 10:33 AM
Thanks for all of your work to let us play this great game, Rogean.

Mcbard
05-13-2011, 01:44 PM
Thanks for the update and thanks for all the time and effort that goes into this place. :)

h0tr0d (shaere)
05-13-2011, 03:10 PM
Anyway, that's a huge fuckup for a datacenter.

I no rite? Unbelievable. Great job though mate.

Gibcarver
05-14-2011, 12:06 AM
I wonder who made their batteries. I am glad the data center is at least looking for a new manufacturer and not just letting it slide.

thefloydian
05-14-2011, 02:59 AM
The people who run this server are really great. Thanks for all the effort.

Knuckle
05-14-2011, 10:21 AM
Just to give people an update on what happened.
Several pasty overweight virgins killed themselves on May 12th.