Crash your cloud, before it crashes itself: Netflix shares tool to help find unknown bugs
Summary: The just-released Chaos Monkey tool lets cloud administrators unleash a mischievous program onto their cloud to randomly break components. But why would anyone want to do this?
By Jack Clark for Cloud Watch | July 30, 2012 -- 16:25 GMT (09:25 PDT)
Name any major failure that has struck a cloud recently - Amazon, Microsoft, Heroku - and the reason for the failure will be the same: an unforeseen problem.
But it doesn't have to be that way. Netflix, which operates a vast multi-continent video distribution cloud on top of Amazon Web Services, got so annoyed with unforeseen bugs in its own software that it designed a tool named Chaos Monkey to go out into its cloud and break things. The only difference between Netflix's tool and a real outage is that Chaos Monkey runs only during office hours.
"Failures happen and they inevitably happen when least desired or expected. If your application can't tolerate an instance failure would you rather find out by being paged at 3am or when you're in the office and have had your morning coffee?" Cory Bennett and Ariel Tseitlin wrote in a post to the company's engineering blog on Monday.
"Over the last year Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don't happen again."
The tool runs within Amazon Web Services. It seeks out workloads running in Auto Scaling Groups and terminates the virtual machines (instances) at random.
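To make the idea concrete, here is a minimal Python sketch of the selection logic described above: pick one instance at random from each group, and only during office hours. It is not Netflix's code and makes no calls to Amazon Web Services. The group names, instance IDs, the 09:00 to 17:00 weekday window and the helper names (`in_office_hours`, `pick_victims`, `terminate`) are all assumptions made up for illustration, and "termination" is simulated by removing an ID from an in-memory list.

```python
import random
from datetime import datetime, time

# Hypothetical inventory: Auto Scaling Group name -> instance IDs.
# These names and IDs are invented for illustration only.
GROUPS = {
    "api-asg": ["i-0001", "i-0002", "i-0003"],
    "web-asg": ["i-0101", "i-0102"],
}


def in_office_hours(now, start=time(9, 0), end=time(17, 0)):
    """Return True on weekdays between start and end (assumed hours)."""
    return now.weekday() < 5 and start <= now.time() < end


def pick_victims(groups, now, rng, probability=1.0):
    """Choose at most one random instance per group, office hours only."""
    if not in_office_hours(now):
        return {}
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise select with the given probability.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


def terminate(groups, victims):
    """Simulate termination by dropping each chosen instance from its group."""
    return {
        name: [i for i in ids if i != victims.get(name)]
        for name, ids in groups.items()
    }


if __name__ == "__main__":
    rng = random.Random()
    chosen = pick_victims(GROUPS, datetime.now(), rng)
    print("would terminate:", chosen)
    print("remaining:", terminate(GROUPS, chosen))
```

Run outside the assumed office-hours window, the sketch selects nothing, which mirrors the point Bennett and Tseitlin make about finding failures while engineers are at their desks rather than at 3am.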
posted by: gqjournal