ZDNet:
As you may have noticed during our live coverage of Apple’s iPad event Wednesday, ZDNet had a few performance issues. Actually, that’s a euphemism since we were pretty much dead in the water for a few hours.
Since we’re an IT site, and you go through these failures from time to time, we thought it would be instructive to have a gander at our post mortem.
Here’s a memo John Potter, our vice president of technology, sent out to our merry band of bloggers:
Everyone,
I wanted to reach out to you and give you all an explanation for the site outages we had during the Apple iPad event. Our Site Reliability team conducted a thorough post-mortem, and I’ve outlined the salient points below.
What happened?
The load balancer is a combination of software and hardware that acts as a gateway between the Internet and our servers. It helps route traffic to the appropriate servers and web applications.
During the Apple iPad event, the load balancer repeatedly concluded that most of our blog servers were not responding, so it sent traffic to only one or two at a time instead of to all the servers that were available.
This disruption impacted the blogs for all our sites. ZDNet Blogs was the site that suffered the most problems. However, during our attempts at recovery, BNET Industries, SmartPlanet, MoneyWatch, TechRepublic Training and ZDNet Reviews were also impacted.
Why did it happen?
The load balancer is configured to routinely check our servers to make sure they are ready to receive traffic. This can be done in more than one way, and the method selected depends upon the nature of the web application running on the server. For blogs, the load balancer was configured to send a request to the web application on each server and to expect an “I’m up” or “I’m down” response. If it receives the latter or no response at all, then it will stop routing traffic to that server. It does this kind of check every 5 seconds. [Read the rest]
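To make the health-check behavior in the memo concrete, here is a minimal sketch in Python. It is not ZDNet's actual load balancer configuration. The `Pool` class, the `probe` callable, the exact reply strings, and the round-robin routing are all assumptions made for illustration. Only the 5-second interval and the rule that "down" or no response removes a server from rotation come from the memo.

```python
from itertools import count
from typing import Callable, Dict, List, Optional

CHECK_INTERVAL_SECONDS = 5  # from the memo: each server is checked every 5 seconds

# Hypothetical reply strings; the memo only describes them as "I'm up" / "I'm down".
UP = "I'm up"
DOWN = "I'm down"


def is_healthy(response: Optional[str]) -> bool:
    """Only an explicit "up" reply passes; "down" or no reply (None) fails."""
    return response == UP


class Pool:
    """Toy model of a load balancer's view of its backend servers."""

    def __init__(self, servers: List[str]) -> None:
        self.servers = list(servers)
        # Assumption: every server starts in rotation until a check fails.
        self.in_rotation: Dict[str, bool] = {s: True for s in self.servers}
        self._counter = count()

    def run_checks(self, probe: Callable[[str], Optional[str]]) -> None:
        """One round of checks; a real balancer would repeat this on a timer."""
        for server in self.servers:
            self.in_rotation[server] = is_healthy(probe(server))

    def eligible(self) -> List[str]:
        """Servers currently allowed to receive traffic."""
        return [s for s in self.servers if self.in_rotation[s]]

    def route(self) -> Optional[str]:
        """Pick a server round-robin (an assumed policy); None if none are eligible."""
        live = self.eligible()
        if not live:
            return None
        return live[next(self._counter) % len(live)]


# Simulated round of checks: one server answers "up", one answers "down",
# and two do not answer in time (modeled as None).
pool = Pool(["blog1", "blog2", "blog3", "blog4"])
replies = {"blog1": UP, "blog2": None, "blog3": DOWN, "blog4": None}
pool.run_checks(replies.get)
print(pool.eligible())  # ['blog1']
```

In this toy run, three of the four servers drop out of rotation after a single round of checks, so every request lands on `blog1`. That mirrors the symptom described in the memo: a server that is merely slow to answer the check is treated the same as one that is down, and the remaining servers absorb all the traffic.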
posted by: gqjournal
