Just to give people an idea of what happened.
Sometime after 6 PM EST, the data center experienced a power outage affecting the rack that our servers are in (more info posted below on this). The servers were back up a few minutes later, but the outage caused a hard drive failure in the primary database server's array, which ultimately destroyed the array and resulted in complete loss of its data. The data center provided me with a KVM over IP (kind of like a remote monitor, keyboard, and mouse) so I could confirm that it could not be restored.

One of the safeguards I have put in place is active real-time replication from the database software to another database server (one of our other zone servers). Both servers lost power. The second server didn't incur any hard drive issues, but its data was still corrupted by the interrupted replication stream and by write processes that never completed (the RAID controllers have batteries, but the database software being halted causes issues as well). I ran that data through the database repair program to fix the incomplete writes and other oddities, and it notified me that the character data had lost 3 entries to corruption. (These characters were later identified, rolled back, and compensated.)

Since the replicated database only holds select data (pretty much only player data), I first had to restore from the last *full* database backup, taken on May 7th, which included all the necessary background information such as database table structure and content. After that, I overwrote that database with the more specific player data from the replicated copy, and lastly re-merged our latest content patch. The secondary replication server then had to be reconfigured to be the new primary server while continuing to run the zones it had before. After all that, it was just a matter of cleaning up internal configurations and getting the server booted up.
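For anyone curious about why the order of those restore steps matters, here is a rough sketch in Python. It is not the actual tooling or database software used here. The table names, rows, and the `layered_restore` function are all made up for illustration, and tables are modeled as plain dictionaries. It only shows the layering idea: start from the full backup, overwrite the player tables with the newer replicated copy, then re-apply the content patch on top.

```python
from copy import deepcopy

def layered_restore(full_backup, replica_player_tables, content_patch):
    """Rebuild a database in three layers: full backup, player data, content patch."""
    # Step 1: start from the last full backup (table structure plus background content).
    restored = deepcopy(full_backup)
    # Step 2: replace only the tables the replica carries with its newer copies.
    for table, rows in replica_player_tables.items():
        restored[table] = deepcopy(rows)
    # Step 3: re-merge content released after the backup, row by row.
    for table, rows in content_patch.items():
        restored.setdefault(table, {}).update(deepcopy(rows))
    return restored

# Hypothetical data: table name -> {primary key -> row}
full_backup = {
    "characters": {1: {"name": "Aria", "level": 40}},
    "items": {100: {"name": "Rusty Sword"}},
}
replica_player_tables = {
    "characters": {1: {"name": "Aria", "level": 42}, 2: {"name": "Bren", "level": 5}},
}
content_patch = {"items": {101: {"name": "New Patch Blade"}}}

db = layered_restore(full_backup, replica_player_tables, content_patch)
print(db["characters"][1]["level"])  # 42 (replica copy wins over the older backup)
print(sorted(db["items"]))           # [100, 101] (backup content plus the patch)
```

If the player data were applied before the full backup, the older backup rows would overwrite the newer replicated ones, which is why the backup has to go in first.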
After the server was up, the initial rush of resurrections had passed, and the 3 affected characters were restored, I went ahead and set up real-time replication to another separate server in case of another disaster :P. Figured I would give everyone a glimpse of some of the work involved.

Here is what the data center said about the power failure:
Quote: