Wednesday, May 28, 2014

A cautionary SysAdmin tale: Rebooting a DC

Courtesy of @sql_handle on Twitter, I read an article from Joyent that everyone in a SysAdmin, DevOps, or PROD Support role (whatever title you put on it) needs to read.  One person was able to take down an ENTIRE data center with a command-line tool and no additional precautions.  Most customers were down for 20 minutes, but some were down for more than TWO hours.  Of course this is an extreme example, but it is important to realize that until the rubber meets the road, you don't understand just how badly things can go wrong from something this simple.  Would a "two-key" type system have prevented this?  What about requiring additional parameters on the command line?
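To make those two questions concrete, here is a minimal sketch of what such a guard could look like.  This is not how Joyent's tooling works; every name in it (`RebootRequest`, `check_reboot`, `confirm_count`, `approved_by`, `max_unapproved`) is hypothetical and made up for illustration.  The idea is that a reboot touching more than one host is refused unless the operator restates the exact host count (the "additional parameter") and a different person signs off (the "second key").

```python
from dataclasses import dataclass
from typing import Optional, Sequence


class RebootRefused(Exception):
    """Raised when a reboot request fails a safety check."""


@dataclass(frozen=True)
class RebootRequest:
    # All field names are hypothetical, chosen for this sketch only.
    requested_by: str
    targets: Sequence[str]                # hosts the operator asked to reboot
    confirm_count: Optional[int] = None   # operator restates how many hosts
    approved_by: Optional[str] = None     # second operator (the "second key")


def check_reboot(request: RebootRequest, max_unapproved: int = 1) -> None:
    """Raise RebootRefused unless the request passes every guard."""
    count = len(request.targets)

    # An empty target list must never be read as "everything".
    if count == 0:
        raise RebootRefused("no targets given; refusing to default to all hosts")

    # Small, routine reboots go through without extra ceremony.
    if count <= max_unapproved:
        return

    # Extra parameter: the operator has to state the blast radius explicitly.
    if request.confirm_count != count:
        raise RebootRefused(
            f"{count} hosts targeted but confirm_count={request.confirm_count}"
        )

    # Two-key rule: a different person has to approve the request.
    if not request.approved_by or request.approved_by == request.requested_by:
        raise RebootRefused("a second operator must approve this reboot")
```

With a check like this in front of the reboot command, `check_reboot(RebootRequest("alice", ["web01"]))` passes quietly, while the same call against a long list of hosts raises `RebootRefused` until both the count and a second approver are supplied.  Whether the extra friction is worth it for day-to-day work is a judgment call for each shop.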

Now that Joyent knows the flaws in their architecture (and there are some doozies, which we should all review to make sure they aren't in our own architectures), they have put out a plan for how to move past this.  Can they move fast enough and smart enough, while maintaining their internal controls, to avoid creating more downtime for customers as they roll out new and improved processes?  Time will tell, I guess, but this article helped me widen my view of what could become pain points in the future.
