Project 1999

  #1  
05-13-2011, 12:33 AM
Rogean
¯\_(ツ)_/¯

Join Date: Oct 2009
Location: Massachusetts
Posts: 5,381
Outage Summary May 12th

Just to give people an idea of what happened.

Sometime after 6 PM EST the data center experienced a power outage affecting the rack that our servers are in (more info on this is posted below). The servers were back up a few minutes later; however, it resulted in a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in complete loss of data on that server. The data center provided me with a KVM over IP (kind of like a remote monitor, keyboard, and mouse) so I could confirm that the array could not be restored.

One of the safeguards I have put in place is active real-time replication from the database software to another database server (one of our other zone servers). Both servers lost power, but the second server didn't incur any hard drive issues. Its data was still corrupted, though, because the replication stream was interrupted and some write processes weren't completed (the RAID controllers have batteries, but the database software being halted causes issues as well). That data was run through the database repair program to fix the incomplete writes and other oddities, and the program notified me that the character data had lost 3 entries due to corrupt data. (These characters were later identified, rolled back, and compensated.)
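
For readers wondering how lost rows like these could be pinned down, here is a minimal sketch. It is not necessarily how it was done here. It assumes the last full backup and the repaired replica can both be queried, and it simply diffs their character IDs. The `characters` table, its columns, and both helper functions are hypothetical, and SQLite stands in for the real database software.

```python
import sqlite3


def make_db(rows):
    """Build a throwaway in-memory database holding the given (id, name) rows."""
    db = sqlite3.connect(":memory:")
    # Hypothetical table layout, for illustration only.
    db.execute("CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT)")
    db.executemany("INSERT INTO characters VALUES (?, ?)", rows)
    return db


def missing_character_ids(backup, repaired):
    """Return IDs present in the backup but absent from the repaired copy."""
    backup_ids = {row[0] for row in backup.execute("SELECT id FROM characters")}
    repaired_ids = {row[0] for row in repaired.execute("SELECT id FROM characters")}
    return sorted(backup_ids - repaired_ids)


# Made-up sample data: IDs 2 and 3 did not survive the repair,
# and ID 5 was created after the backup was taken.
backup = make_db([(1, "Alpha"), (2, "Bravo"), (3, "Charlie"), (4, "Delta")])
repaired = make_db([(1, "Alpha"), (4, "Delta"), (5, "Echo")])

lost = missing_character_ids(backup, repaired)
print(lost)
```

A diff like this only yields candidates. Characters that players deleted after the backup would also show up, and a character created after the backup and then lost would not appear at all.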

Since the replicated database only holds select data (pretty much only player data), I had to first restore from the last *full* database backup, taken on May 7th, which included all necessary background information such as database table structure and content. After that, I had to overwrite that database with the more specific player data from the replicated copy, and then lastly I had to re-merge in our latest content patch. The secondary replication server then had to be reconfigured to be the new primary server, as well as continue to run the zones it had before.
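
The order of those three steps matters, and it can be sketched roughly as below. This is only an illustration under stated assumptions: SQLite stands in for the real database software, and the `restore` function, the `PLAYER_TABLES` list, and all table names and rows are hypothetical.

```python
import sqlite3

# Hypothetical: the tables that the replica carries (player data only).
PLAYER_TABLES = ["characters"]


def restore(full_backup_sql, replica, content_patch_sql):
    """Rebuild a database in the order described: backup, replica, patch."""
    db = sqlite3.connect(":memory:")
    # 1. Full backup: table structure plus all content as of backup time.
    db.executescript(full_backup_sql)
    # 2. Overwrite the player tables with the newer rows from the replica.
    for table in PLAYER_TABLES:
        rows = replica.execute(f"SELECT * FROM {table}").fetchall()
        db.execute(f"DELETE FROM {table}")
        if rows:
            marks = ", ".join("?" * len(rows[0]))
            db.executemany(f"INSERT INTO {table} VALUES ({marks})", rows)
    # 3. Re-apply the content patch that was released after the full backup.
    db.executescript(content_patch_sql)
    db.commit()
    return db


# Made-up sample data for the sketch.
full_backup_sql = """
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT, level INTEGER);
INSERT INTO items VALUES (1, 'Rusty Sword');
INSERT INTO characters VALUES (1, 'Alpha', 10);
"""
content_patch_sql = "INSERT INTO items VALUES (2, 'Patched Shield');"

replica = sqlite3.connect(":memory:")
replica.execute(
    "CREATE TABLE characters (id INTEGER PRIMARY KEY, name TEXT, level INTEGER)"
)
replica.executemany(
    "INSERT INTO characters VALUES (?, ?, ?)", [(1, "Alpha", 12), (2, "Bravo", 3)]
)

db = restore(full_backup_sql, replica, content_patch_sql)
characters = db.execute("SELECT * FROM characters ORDER BY id").fetchall()
items = db.execute("SELECT name FROM items ORDER BY id").fetchall()
```

The full backup has to go in first because the replica carries no table structure or background content, and the patch goes in last so the older backup cannot overwrite it.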

After all that, it was just a matter of cleaning up internal configurations and the like, and getting the server booted up. Once the server was up, we had gotten through the initial rush of resurrections, and the 3 affected characters were restored, I went ahead and set up the real-time replication to another separate server, in case of another disaster :P.

Figured I would give everyone a glimpse of some of the work involved.

Here is what the data center said about the power failure:

Quote:
Please be advised that the Netriplex AVL01 data center facility team has reported that the UPS on Power Bus 1 momentarily dropped the critical load during a utility power blip impacting some customers in our AVL01 datacenter. Our records indicate that some or all of your infrastructure is being powered by this Bus. This power Bus currently serves approximately 12 percent of the customers in our AVL01 data center.

Facilities management is currently investigating this issue and has called our UPS service & support vendor to come onsite to investigate this issue with our electrical team. Further information is not known at this time.
Quote:
This update is to inform you that our UPS emergency service vendor is onsite investigating the issue at this time. An estimated time of resolution is still unknown.
Quote:
This update is to inform you that our UPS service vendor has determined that the new batteries recently installed in the UPS on power bus 1 may have a factory flaw or defect which caused them to drop the load this afternoon during the utility blip. This is based on the fact that the older batteries which were in place last week held the load during a slightly longer utility blip. Additionally, they have informed us that when two or more new batteries are found to have failed so quickly, it is indicative of a batch defect during production. They are recommending a complete replacement of all batteries immediately. We are currently working with our battery vendor in Charlotte to provide new batteries from a different manufacturer. The current ETA is 9:00 AM Friday morning.
__________________
Sean "Rogean" Norton
Project 1999 Co-Manager

Project 1999 Setup Guide
  #2  
05-13-2011, 07:40 AM
Nagash
Sarnak

Join Date: Feb 2010
Posts: 478

Flipping heck, that sounds like Japanese to me, but I understand that the brown stuff did hit the fan and that, as usual, you spent a lot of your own time away from your loved ones to sort it out for the benefit of all of us. For that, I can only thank you and give you a cookie if I see you in game.

Petitpas/Nagash
  #3  
05-13-2011, 08:04 AM
Messianic
Planar Protector

Join Date: Jul 2010
Posts: 3,122

Thanks!

We appreciate all the work and safeguards you've put in.
__________________
Heat Wave - Wizard
Messianic - Monk
Melchi Zedek - Necro

Quote:
Originally Posted by Dumbledorf
I'll look into getting it changed to The Secret Order of the Silver Rose of Truth and Dragons.
  #4  
05-13-2011, 08:08 AM
Asfasfos
Kobold

Join Date: Dec 2009
Posts: 104

Quote:
however it resulted in a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in complete loss of data
How come you don't use RAID mirroring, Rogean?

Quote:
One of the safeguards I have put in place is active real-time replication from the database software to another database server
That's a really good system. It's like Data Guard in Oracle: if one database goes down physically, you can bring the other one up (a primary-standby system).

Good job, Rogean.
  #5  
05-13-2011, 08:48 AM
bulbousaur
Sarnak

Join Date: May 2011
Posts: 287

Thanks for all of your work to let us play this great game, Rogean.
  #6  
05-13-2011, 10:07 AM
Knightmare
Kobold

Join Date: Jul 2010
Location: Tennessee
Posts: 196

I understand little of the more intricate tech parts, but the gist of it shows how much work, time, and energy you guys put into keeping this server up and running.

We get an XP bonus for the failure. Do they give you a discount or anything, Rogean? lol.

Either way, this post should go a long way to explain why we all need to ante up and kick in some donations and/or -- ------. If not just to help keep the server running, then to show some appreciation for what you guys do.

Thanks to all of you guys.

P.S. OMW to Charlotte this weekend, good time for me to go in and cast Fear or Screaming Terror on 'em all.
__________________
~Knitemare T`Knite~
~Harbingers of Thule~

Quote:
Originally Posted by Uthgaard
I don't come into McDonalds and criticize your work.
Last edited by Rogean; 05-13-2011 at 11:45 AM..
  #7  
05-13-2011, 10:18 AM
Vonkaar
Aviak

Join Date: Oct 2010
Location: Flower Mound, TX
Posts: 90

Are you getting any SLA reimbursement? power outa

meh, offline discussion methinks
__________________
Vonkaar
Best Druid on Project 1999 - I have a blue ribbon to prove it.
Destroyer of Mooto - 7 times since 1999.
The Bane of Bixies©
Huge halfling balls.

  #8  
05-13-2011, 10:21 AM
Vonkaar
Aviak

Join Date: Oct 2010
Location: Flower Mound, TX
Posts: 90

Anyway, that's a huge fuckup for a datacenter.

  #9  
05-13-2011, 10:33 AM
Teensy Weensy
Aviak

Join Date: Nov 2010
Posts: 71

Quote:
Originally Posted by bulbousaur
Thanks for all of your work to let us play this great game, Rogean.
  #10  
05-13-2011, 01:44 PM
Mcbard
Banned

Join Date: Sep 2010
Location: Da U.P. eh
Posts: 992

Thanks for the update and thanks for all the time and effort that goes into this place.