Discussion on the state of cloud computing and open source software that helps build, manage, and deliver everything-as-a-service.
High Availability is one of those things often touted by cloud pundits as one of the miracle features of 'the cloud'. As a recovering sysadmin, things like availability, uptime, mean time to recovery, fault tolerance, and redundancy are near and dear to my heart.
How services and applications are built and used is changing. The older way of thinking was to centralize everything and buy the most reliable hardware you could get, making things as available as you could (read that as closer to mainframe thinking). That type of thinking generally worked OK (and still does in many cases), but it's not without problems: namely, it's expensive, and failure still occurs.
As the industry has continued down the path of consuming computing as a service, the need for reliability has grown dramatically. People want and need better reliability - and the old way of 'ensuring' availability doesn't scale very well. That led us to what most people talk about when they consider HA today: much less expensive machines with redundant components, and High Availability actually architected at the software level. This led to things like Linux-HA, Pacemaker, ZooKeeper, Corosync, etc., and to applications and services designing some of their own distributed (and thus more highly available) capabilities, such as database replication and web load balancing.
Real HA comes from proper architecting. We are moving away from a critical application running on a single piece of hardware. People have started to realize, through the pioneering work of folks like Amazon, Netflix, and Zynga, that failure is assured. Trying to avoid failure is fruitless - embracing failure and architecting systems to expect, and properly react to, failure is the path to availability.
Along the path, HA became a buzzword, and is still one of those essential checkboxes that must be completed for enterprise computing purchases. Like cloud-washing that we see so much of now, it has led to some abuse of the term, and over time the term has changed meanings. This is pretty taxing on people who actually care about the underlying technology. But this post isn't a rant about buzzword-washing and definition dilution - it's about HA in the cloud.
So first, my disclaimer, in case you started reading at this paragraph or below: 'the cloud' will not magically give you high availability any more than installing Linux-HA or Bucardo will. It will give you tools that you can use to increase availability. So what are some of those tools within the context of CloudStack and IaaS? Note that these are just a few of the tools that CloudStack provides.
The first is redundant routers. CloudStack debuted the redundant router feature a few releases ago. It makes use of VRRP to ensure that a router is up and functioning, and if something happens to the primary router, the redundant router will take over. How fast it takes over is configurable, and of course, CloudStack is 'intelligent' enough to physically separate those redundant pairs onto different sets of hardware. (Nothing like a SPOF router to kill off the HA hopes you had with load-balanced web servers and replicated databases, right?) You can of course replicate this type of redundancy with real hardware - but part of the cloud is using commodity hardware - and it doesn't get much more commodity than a virtual machine.
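To make the failover idea concrete, here is a minimal Python sketch of a VRRP-style election. It is not CloudStack's code and it leaves out most of what real VRRP does (advertisement timers, preemption, the virtual IP itself). The `Router` class, the priorities, and the host labels are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Router:
    name: str
    priority: int      # higher priority wins the election, as in VRRP
    host: str          # hypothetical label for the physical host
    alive: bool = True


def elect_master(routers: List[Router]) -> Optional[Router]:
    """Return the live router with the highest priority, or None if all are down."""
    live = [r for r in routers if r.alive]
    return max(live, key=lambda r: r.priority) if live else None


def on_separate_hosts(routers: List[Router]) -> bool:
    """True if no two routers in the group share a physical host."""
    hosts = [r.host for r in routers]
    return len(hosts) == len(set(hosts))


primary = Router("r-primary", priority=100, host="host-a")
backup = Router("r-backup", priority=50, host="host-b")
pair = [primary, backup]

before = elect_master(pair).name   # primary holds the master role
primary.alive = False              # simulate the primary router failing
after = elect_master(pair).name    # backup takes over
```

The `on_separate_hosts` check is the part that matters for availability: a redundant pair sharing one physical host is still a single point of failure.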
Another feature folks find useful is CloudStack's built-in load balancer. Yes, load balancing isn't unique to CloudStack, but CloudStack makes it easy, automatable, and end-user serviceable. Effectively, any CloudStack user can turn on load balancing, add hosts to be load balanced, etc. This is far from failproof (there's always the case where the load balancer itself fails, though you can solve that to a degree with the redundant router mentioned above) and, again, it's not true HA - but it's a tactic that can increase availability.
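As a rough illustration of why load balancing helps availability, the sketch below does round-robin selection that skips backends marked unhealthy. This is a toy, not how CloudStack's load balancer is implemented; the class name, the `mark` method, and the `web-N` host names are assumptions for the example.

```python
from itertools import cycle
from typing import Dict, Iterator, List


class RoundRobinBalancer:
    """Toy round-robin balancer that skips backends marked as down."""

    def __init__(self, backends: List[str]) -> None:
        self.healthy: Dict[str, bool] = {b: True for b in backends}
        self._ring: Iterator[str] = cycle(backends)
        self._size = len(backends)

    def mark(self, backend: str, up: bool) -> None:
        """Record the result of a (hypothetical) health check."""
        self.healthy[backend] = up

    def pick(self) -> str:
        """Return the next healthy backend, trying each one at most once."""
        for _ in range(self._size):
            candidate = next(self._ring)
            if self.healthy[candidate]:
                return candidate
        raise RuntimeError("no healthy backends")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
lb.mark("web-2", False)                   # simulate web-2 failing
picks = [lb.pick() for _ in range(4)]     # traffic flows around the failure
```

Note that the balancer object itself is a single point of failure here, which is exactly the caveat above.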
We also have what CloudStack calls HA - I personally don't like the term, but I suppose things like ARMTTRTRFPAIA (Automated, Rapid, Mean-Time-To-Recovery That Reduces Failure Potential And Increases Availability) were considered too hard to decipher. For better or worse, other folks call this capability HA as well. So how does it work? Essentially, it watches the VM instance and, upon detecting failure (of the instance or the hypervisor host it runs on), it will begin fencing operations and restart the VM on another node. We recently had a discussion about this on the cloudstack-devel mailing list - and out of that Alex Huang produced a great summation of how CloudStack handles HA, fencing, etc. It's definitely worth a read.
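The detect, fence, restart sequence can be sketched in a few lines of Python. To be clear, this is a simplified model under my own assumptions, not CloudStack's actual logic: the `Host` and `Cluster` classes, the placement map, and the "first healthy host" policy are all hypothetical. The point it tries to show is the ordering: the failed host is fenced before its VMs are started elsewhere, so the same VM is never running in two places.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Host:
    name: str
    alive: bool = True
    fenced: bool = False


@dataclass
class Cluster:
    hosts: Dict[str, Host]
    placement: Dict[str, str]                  # VM name -> host name
    log: List[str] = field(default_factory=list)

    def _pick_target(self) -> str:
        """Pick the first healthy, unfenced host (a deliberately naive policy)."""
        for host in self.hosts.values():
            if host.alive and not host.fenced:
                return host.name
        raise RuntimeError("no healthy host available")

    def recover(self) -> None:
        """Fence each newly failed host, then restart its VMs elsewhere."""
        for host in self.hosts.values():
            if host.alive or host.fenced:
                continue
            host.fenced = True                 # fence first ...
            self.log.append(f"fence {host.name}")
            for vm, where in list(self.placement.items()):
                if where == host.name:
                    target = self._pick_target()
                    self.placement[vm] = target    # ... then restart
                    self.log.append(f"restart {vm} on {target}")


cluster = Cluster(
    hosts={"kvm-1": Host("kvm-1"), "kvm-2": Host("kvm-2")},
    placement={"vm-a": "kvm-1", "vm-b": "kvm-2"},
)
cluster.hosts["kvm-1"].alive = False           # simulate a hypervisor failure
cluster.recover()
```

Notice what this does and does not buy you: `vm-a` comes back, but it was down in the meantime. That is recovery, which is why I'm not fond of calling it HA.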
In short - availability, especially High Availability, isn't a tool. It's not imbued by buying a specific piece of hardware or software, but it is achievable: the tools exist for you to make applications and services highly available, and CloudStack can help in that direction.