Downtime has been the enemy since the birth of the concept. The industry has attacked it as though it were electronic smallpox, feverishly developing one strategy after another, trying to get to the day when we can all look back on downtime as an eradicated problem from a more primitive time.
The industry isn’t there yet and may never be – all it takes is one overly enthusiastic gardening project slicing some fiber to invoke some downtime – but we’re close to overcoming planned downtime. That’s what we’re looking for here: deploying application updates without coming down. If we’re running VMs instead of PaaS, it’d be ideal to run OS updates without bringing the application down too. That’s not exactly in scope here, but the principle and execution are pretty similar.
In order to make this work, we need a few pieces in place:
At least one pre-production environment. Some place where we can work out the kinks of an update and of our deployment process for that update before it gets a chance to threaten production. Ideally, it’ll be a scaled-down mirror of the production environment with an identical deployment process. That lets us test not only the code, but the deployment time and any oversight in the process. If our deployment process doesn’t swap nodes in our load balancer, better to find that out before our users do.
A good code update. It doesn’t help anything to have a flawless zero downtime deployment if it succeeds perfectly in deploying something that doesn’t work. This should be a well-settled question long before a production deployment. Code defects, at least the kind that cause total site outage, should have been detected and corrected in lower-environment testing. Bugs are notoriously sneaky and it’s possible, maybe even common, to have one slip through testing and get into production. They just shouldn’t be the kind of bugs that crater the site.
As a subset of this: database updates can’t break the existing site. Updating web code is pretty simple and easy to stagger. Updating databases isn’t. If a database change for the next revision of the app doesn’t work with the current version of the app, we’ll go down during the window where the database change has been applied but not all deployment targets have been updated yet.
A deployment tool that can account for environmental differences and multiple targets. Octopus Deploy is great and works well with Azure out of the box. The interface is reasonably intuitive and the wiring to source control packaging isn’t painful. Lots of other options are available, each with its quirks and perks. Ideally, whatever you choose will be able to tell a load balancer to flag nodes up and down without you having to log into the LB device and enable/disable by hand.
Multiple deployment targets. If you only have one web server, it’ll have to stop for some amount of time while it’s being updated. Maybe that’s a couple of seconds for an app pool refresh, maybe it’s half an hour for a massive file system or permissions change. Neither of those is great. With two targets, we can deploy to one while we’re running on the other, then flip-flop them to update the second one, and then put them both back in service. Each server went down for however long it went down, but we didn’t have any of this distasteful downtime overall.
Some way of managing traffic flow to those multiple deployment targets. There are a lot of answers to this question, in varying degrees of complexity, expense, reliability, and ease of use. For our purposes here, it doesn’t really matter if it’s a top-end NetScaler or an Azure app service plan. Anything that lets us force all incoming traffic to one service over another will do.
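To make the database requirement a little more concrete, here’s a minimal sketch of an additive ("expand first") schema change, using Python’s built-in sqlite3 module purely for illustration. The `users` table, the `display_name` column, and the function names are all hypothetical; your schema and database engine will differ. The point is only that the old release’s queries keep working after the change is applied.

```python
import sqlite3


def build_db():
    """Create a throwaway in-memory database shaped like the current release expects."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("ada",), ("grace",)])
    conn.commit()
    return conn


def old_app_read(conn):
    """What the current release runs: it only knows about the `name` column."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def old_app_write(conn, name):
    """The current release inserts rows without mentioning the new column."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))
    conn.commit()


def new_app_read(conn):
    """What the next release runs: prefer the new column, fall back to the old one."""
    return conn.execute(
        "SELECT id, COALESCE(display_name, name) FROM users ORDER BY id"
    ).fetchall()


def additive_migration(conn):
    """Add a nullable column and backfill it. Nothing is renamed or dropped."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.execute("UPDATE users SET display_name = name")
    conn.commit()


conn = build_db()
before = old_app_read(conn)
additive_migration(conn)

# Old nodes are still serving traffic against the migrated schema.
after_old = old_app_read(conn)
old_app_write(conn, "linus")

# New nodes can read everything, including rows written by old nodes.
after_new = new_app_read(conn)
```

Renaming or dropping `name` in the same change would have broken `old_app_read` the moment the migration ran. If that cleanup is needed, it can wait for a later deployment, after every node is on the new code.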
Once all those pieces are in place, this is actually easier than it looks:
- Take Node A out of service in the test environment load balancer.
- Deploy to Node A.
- Deploy required database updates. Remember that thing we talked about earlier with database changes not launching torpedoes at the previous site rev.
- Put Node A back in service.
- Take Node B out of service.
- Deploy to Node B.
- Put Node B back in service.
- Test everything. Fix whatever needs fixing.
- Repeat steps 1-8 in production.
Most of those steps can be automated, depending on the specific environment and the details of the network gear. Octopus is capable of changing the up/down states of nodes in Azure load balancers. It can also run PowerShell, which can eventually be convinced to tell a hardware load balancer to do the same thing.
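For a feel of what that automation boils down to, here’s a rough sketch of the node-by-node sequence above in Python. It is not Octopus Deploy’s API or any real load balancer’s API: `FakeLoadBalancer`, `rolling_deploy`, and the callbacks are hypothetical stand-ins you would replace with calls into your own tooling. The per-node health check before re-enabling is an added assumption, not one of the steps listed above.

```python
from typing import Callable, Iterable, List, Set


class FakeLoadBalancer:
    """Hypothetical in-memory stand-in for a real load balancer's up/down controls."""

    def __init__(self, nodes: Iterable[str]):
        self.in_service: Set[str] = set(nodes)

    def disable(self, node: str) -> None:
        # Refuse to pull the last serving node; that would be downtime.
        if self.in_service <= {node}:
            raise RuntimeError("refusing to take the last in-service node down")
        self.in_service.discard(node)

    def enable(self, node: str) -> None:
        self.in_service.add(node)


def rolling_deploy(
    lb: FakeLoadBalancer,
    nodes: List[str],
    deploy: Callable[[str], None],
    migrate_db: Callable[[], None],
    healthy: Callable[[str], bool],
) -> List[str]:
    """Update one node at a time so at least one node is always serving."""
    events: List[str] = []
    for index, node in enumerate(nodes):
        lb.disable(node)
        events.append(f"out:{node}")
        deploy(node)
        events.append(f"deploy:{node}")
        if index == 0:
            # Database changes go in once, after the first node is updated.
            # They must still work with the old code running on the other nodes.
            migrate_db()
            events.append("migrate-db")
        if not healthy(node):
            # Assumption: leave a bad node out of rotation rather than re-enable it.
            raise RuntimeError(f"{node} failed its health check; left out of service")
        lb.enable(node)
        events.append(f"in:{node}")
    return events


# Usage with two pretend nodes and a pretend version bump.
versions = {"node-a": "1.0", "node-b": "1.0"}
lb = FakeLoadBalancer(versions)
events = rolling_deploy(
    lb,
    ["node-a", "node-b"],
    deploy=lambda node: versions.__setitem__(node, "1.1"),
    migrate_db=lambda: None,
    healthy=lambda node: versions[node] == "1.1",
)
```

The guard in `disable` is the single-server case in miniature: with only one node in service, there is nothing to fail over to, so the sketch stops rather than taking the site down.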
The Bad News
If this really is as easy as we’re saying, why isn’t everyone already doing it? It’s a reasonable question and there are probably as many answers as people not doing it, but there are a few common themes to address.
- Cost. Adding nodes to a web farm isn’t free (although adding app services to an Azure farm isn’t expensive). Load balancing gear isn’t free. All of those things should have at least minimal monitoring for day-to-day operations, and every node and LB is another point of alert generation. That necessarily means more overhead in monitoring and alert response. Overall, we’ve found that what you save in downtime and emergency site outages more than pays for the extra gear, but if you’re just looking at the bill up front and not factoring in the downtime savings later, it can look a little daunting.
- Complexity. Once it’s set up, deploying with it isn’t terribly difficult. The initial setup can get pretty tough, though. Load balancers speak their own language, Azure speaks its own language, and Octopus Deploy to an extent speaks its own language. Sitting them all down for a summit means having someone around who’s fluent enough in all three languages to draft an agreement. The existing documentation aggravates this: all three systems are pretty well-documented, but each assumes a wildly different level of background, which tends to make things look harder than they are.
- Fear of change. The rig’s been running this way for five years and it’s always been fine. Why go to the work and expense of changing it, when the new way might not even work as well? Risk aversion is common, but usually when it comes up, a quick review of support tickets and downtime reports belies “it’s always worked fine”. It may very well have always worked acceptably, but there’s no measure of how much business it lost when it was down for maintenance. Customer support issues and support man-hours are very measurable and may be higher than you think; if you can reduce those, that alone is worth it. Also, nothing lasts forever – disks get blown, memory breaks, fans fail and systems melt. You don’t have to ride the bleeding edge of future-proofing, but it’s just as bad to avoid change for tradition’s sake as it is to embrace change for the sheer sake of change.
I know, it’s a lot to take in, especially if the near-realization of the zero-downtime dream is new for you. If you’re looking at all of this and wondering how you get from wherever you are to living the dream (or as close as any of us get to living the dream), there really are just a couple of things to keep in mind.
- Minimizing downtime isn’t free, but it’s less expensive than experiencing downtime.
- Changing something on this scale is scary, but it really is much less dangerous than it sounds.
- Help is always available – if the dream sounds good but you can’t visualize how to get from point A to point B, there are people out there who do this for a living and have gotten really, really good at it.
If you’re looking at moving forward but don’t really know where to start, that’s common. The first real step is taking a good, hard, honest look at your rig, how it works, and what your downtime is costing you across the board. If you’re an operator, it may not hit your radar that downtime is shrinking revenue. If you’re a manager, it may not hit your radar that downtime is piling stress on your administrators. We’ve dropped a high-level look at the finish line; once you know where your starting point is, mapping the course between those two becomes much easier, and again – we’re always here to help.