Back to normal?

March 23rd, 2007 at 10:13 pm

Hello everyone. It has been a rough couple of days for all of us. It's not an easy job taking responsibility for hundreds of online businesses. Usually it is terrific and sometimes we pay our dues. This is one of those "bumps in the road" that are never fun but definitely can build character.

You should know that we replace and repair servers all the time with ZERO downtime. This particular issue took some unexpected turns.

For those that are interested, I have a pretty detailed explanation and some answers to a few of your questions.

The technical details:

Wednesday: For unknown reasons our primary database server rebooted and recovered on its own. We decided at that time to let it recover instead of diverting to the backup database server. (If we use the backup server, the two servers must be resynchronized later, and that slows performance, so we avoid it if we can.)

Thursday: In the morning the same thing happened. So we immediately cut over to the backup database server and took down the primary server to diagnose the hardware. All servers and email were up and running.

When the primary server rebooted and rejoined the cluster, the SCSI channel card, for unexplained reasons, rearranged its physical channels and mounted its partitions incorrectly, causing a myriad of issues too long to explain here. The entire cluster was disabled while we corrected this. Why it happened is still under investigation. None of the IT professionals we have discussed this with so far has any idea why the hardware misbehaved in this fashion. It could be neither predicted nor avoided. So the downtime was primarily caused by this unexpected issue.

Once recovered the primary database server had to synchronize with the backup so that they would both contain the most current db transactions. This is what caused the speed to decrease for about 12 hours.

Friday: The database servers finished synchronizing. The primary database server was put back into the cluster and things were reported to be 100% again.

UNFORTUNATELY the primary server crashed again with the exact symptom it had shown on Wednesday and Thursday. We again replaced the hardware, this time without any downtime (the way it should have happened the first time), and began to troubleshoot software issues that might be the cause.

We had to restart the server hardware and software several times throughout the day, causing very short but annoying email and web site interruptions. Near the end of the day we determined that the issue was due to a Linux kernel bug. I won't bore you with the details, but the team was able to mitigate the issue.

We also performed a database server upgrade on the backup server and will complete the upgrade to the primary database server at 3:00 am Saturday morning. This will result in a very short downtime and server reboot.

Currently things are running 100% and running at the normal speed.

Answers to common questions:

Question: I thought redundant systems never fail. Why isn't the system running 100% of the time?

Answer: It sounds great, but the fact is that nobody can build a system that is up 100% of the time. No matter how much money or time you have, these things can happen. I did some research and found that a similar issue happened to Google's Gmail system recently. In fact, over the last few years most large companies have had a similar story (even Oracle systems, which purport to "never fail"). I'll add that the tradeoff for reliability is speed. You can't have both.

Question: What have you done to make sure this doesn't happen in the future?

Answer: We have written scripts for the servers to detect any changes in the SCSI addressing upon boot. This way if the drives mount incorrectly, the server services will not start. This will avoid the same issue happening in the future. Once we get word from the hardware manufacturer we may make additional adjustments.
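
For the technically curious, here is a minimal sketch of what a boot-time check of this kind might look like. It is an illustration only, not the actual scripts. It assumes a saved baseline that maps each SCSI address (written here as a hypothetical "host:channel:id:lun" string) to an identifier for the disk expected there, and it leaves out how the current mapping is collected from the system. All names and values are made up for the example.

```python
from typing import Dict, List


def find_address_changes(baseline: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Return one message per SCSI address whose disk differs from the baseline."""
    changes = []
    # Look at every address seen in either mapping so that missing and
    # unexpected devices are reported as well as swapped ones.
    for address in sorted(set(baseline) | set(current)):
        before = baseline.get(address)
        after = current.get(address)
        if before != after:
            changes.append(
                f"{address}: expected {before or 'nothing'}, found {after or 'nothing'}"
            )
    return changes


def services_may_start(baseline: Dict[str, str], current: Dict[str, str]) -> bool:
    """Allow services to start only if the addressing matches the baseline exactly."""
    return not find_address_changes(baseline, current)


if __name__ == "__main__":
    # Hypothetical baseline recorded while the server was known to be healthy.
    baseline = {"0:0:0:0": "disk-A", "0:1:0:0": "disk-B"}
    # Hypothetical state after a reboot in which the two channels swapped.
    after_reboot = {"0:0:0:0": "disk-B", "0:1:0:0": "disk-A"}

    for message in find_address_changes(baseline, after_reboot):
        print("SCSI addressing changed:", message)
    if not services_may_start(baseline, after_reboot):
        print("Refusing to start services until an administrator reviews this.")
```

The idea is to fail closed: if anything about the addressing differs from the known-good baseline, the services stay down and a person takes a look, rather than letting the database start on top of incorrectly mounted partitions.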

Question: Why didn't your tech support team give us this detailed information?

Answer: Primarily because the team was working feverishly and expected to have it resolved much faster. The tech support guys did not have this level of detail; only the server administrators and I knew the play-by-play. I apologize if any of you felt uninformed. We did try to contact and respond to anyone who had specific questions.

Please understand that the team worked non-stop until this issue was resolved. Some team members didn't sleep for the better part of 30 hours. We take ALL issues seriously and care about each and every customer. I cannot guarantee 100% uptime. However, I can guarantee that we will always work as hard as possible to provide the best possible service to our Website Forge customers.

Shane Merem
http://www.websiteforge.com/
Website Design and E-commerce

Posted in Important Notifications by Shane Merem

Jon Scott says:

March 23rd, 2007 at 10:31 pm

Thanks for the explanation. At first I thought I was having another system crash of my own, as I just had to re-format my personal hard drive due to a malicious virus. I bought better anti-virus software... While it was confusing as it happened, I am glad you got it fixed. And believe me, I know the helpless feeling: as an entertainer (DJ) I have been over a hundred miles from home and had an amp quit before. Even the time it takes to switch to the back-up amp is embarrassing, so I share your panic, belatedly. Good job.

Jon Scott

CLR Marine says:

March 24th, 2007 at 5:58 am

Thanks for the detailed explanation. I know I even went nuts when the system went down again. But in all truth, it is up about 99% of the time.

Brian says:

March 24th, 2007 at 9:24 am

Shane,

You managed a crisis situation well and kept working at it until the crisis was over. You have put safety measures in place to prevent this issue from happening again. This is all we can ask as customers. So thank you for putting the pressure on your people to fix it in a speedy manner. You also briefed us on the problem when it was possible and told us what happened. Now if you could teach this to a few more companies I deal with, I would be very happy. :)

Stuart says:

March 25th, 2007 at 9:32 am

My first blog entry - WOW!

Just wanted to say I really appreciated how quickly the message was posted on our sites saying there was a technical issue so I didn't get a bunch of calls and emails about something I would know nothing about...