This has always seemed a little weird to me, conceptually. Asking an application to test itself feels like “oh, we’ve investigated ourselves and found we’ve done nothing wrong” – if the app is the broken thing, how can we trust the broken thing to test itself and squeak about its pain? If our app is already lying about how well it works, why should we trust a check run by a now-known liar? This is part of why monitoring systems get so elaborate; we’re looking for a thousand windows into the application and trying to stitch all those views together into one complete tapestry of the app end-to-end.
It helped me out to think of app self-checks as a sort of spy that lives right next door to the app and tries everything the app does. If the spy can get away with all the needed steps, we can feel reasonably safe that its next-door neighbor can, too. If the spy gets hemmed up somewhere, it’s virtually certain that its neighbor is also going to choke at those same places. So, if our app’s daily planner looks like this:
- Get user name and password from form.
- Pass user data to SQL server for authentication.
- Pass authenticated data to SQL to SELECT user-relevant rows from a different table.
- Follow SELECT’d rows to a file share.
- Get list of files from that share.
- Show file list to user.
Then we build a copy of that plan and feed it to the spy; we tell him what user data to use and turn him loose. Every time we call him, he runs this exact procedure and gives us a green light if everything works. If any part of that process chokes, he fails at that point and tells us instead that he got stalled, and where. He’s got his own URL that we can call instead of calling the main app, and we call it frequently enough to get out in front of problems before users find them in the main app.
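Here is a minimal sketch of the spy's orders in Python, assuming each step of the plan can be wrapped in a function that raises an exception when it fails. The names `CheckResult` and `run_self_check` are made up for illustration, and the three step functions are hypothetical stand-ins rather than real database or file share calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class CheckResult:
    ok: bool
    failed_step: Optional[str] = None  # name of the step that stalled, if any
    detail: Optional[str] = None       # whatever the failing step's exception said


def run_self_check(steps: List[Tuple[str, Callable[[], None]]]) -> CheckResult:
    """Run each step in order and stop at the first one that raises."""
    for name, step in steps:
        try:
            step()
        except Exception as exc:  # broad on purpose: any failure gets reported
            return CheckResult(ok=False, failed_step=name, detail=str(exc))
    return CheckResult(ok=True)


# Hypothetical stand-ins for the real dependency calls.
def authenticate_test_user() -> None:
    pass  # e.g. run the login query with a dedicated test account


def select_user_rows() -> None:
    pass  # e.g. SELECT the rows that point at the file share


def list_file_share() -> None:
    # Simulated failure so the sketch shows the "stalled, and where" case.
    raise ConnectionError("file share unreachable")


result = run_self_check([
    ("authenticate", authenticate_test_user),
    ("select rows", select_user_rows),
    ("list files", list_file_share),
])
print(result)
```

In a real app, the endpoint behind the spy's URL would run something like `run_self_check` and serialize the result, so the caller sees both the red light and the step that caused it.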
More importantly, we call him immediately after every deployment and we call him via automation. As soon as our auto-deploy has finished its last step and has green lights on everything, we call the self-check and parse the response. If it fails out, for whatever reason, we can feel pretty sure that the app itself is also going to have some trouble and we really should look at rolling back and investigating the failure. If it comes back all smiles and sunshine, we know that all of the underlying functionality works and it’s unlikely that we’ll see anything showstopping in the main application.
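The post-deployment call can be sketched the same way. This assumes the self-check answers with HTTP 200 and a small JSON body containing `ok`, `failed_step`, and `detail`. That response shape is an assumption for illustration, not something any particular framework provides, and `fetch_health` and `deployment_verdict` are hypothetical names.

```python
import json
import urllib.error
import urllib.request
from typing import Tuple


def fetch_health(url: str, timeout: float = 10.0) -> Tuple[int, str]:
    """Call the self-check URL and return (status code, body text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # Non-2xx responses arrive as exceptions; the body may still hold details.
        return err.code, err.read().decode("utf-8", errors="replace")
    except OSError as err:
        # The endpoint could not be reached at all (DNS, refused, timeout).
        body = {"ok": False, "failed_step": "reach self-check", "detail": str(err)}
        return 0, json.dumps(body)


def deployment_verdict(status: int, body: str) -> Tuple[bool, str]:
    """Decide whether the deployment can stand, based on the self-check reply."""
    try:
        report = json.loads(body)
    except json.JSONDecodeError:
        return False, f"unparseable self-check response (HTTP {status})"
    if not isinstance(report, dict):
        return False, f"unexpected self-check response (HTTP {status})"
    if status == 200 and report.get("ok") is True:
        return True, "self-check passed"
    step = report.get("failed_step", "unknown step")
    detail = report.get("detail", "no detail given")
    return False, f"self-check failed at '{step}': {detail}"


# Usage as the last pipeline step (the URL is a placeholder):
#   passed, message = deployment_verdict(*fetch_health("https://app.example/selfcheck"))
#   print(message)
#   raise SystemExit(0 if passed else 1)  # non-zero exit lets the pipeline roll back
```

Anything other than a clean pass is treated as a failure, including a reply that cannot be parsed, since a health check that cannot be read is not a green light.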
It’s not perfect. Our spy here is purpose-built for infiltrating the underlying dependencies – back-end network, database access/permissions, file shares, Azure functions, whatever else we need to use outside of our own process. As long as it’s written correctly (and this is not as hard as it sounds), it does those things pretty well and very reliably. It’s less good at detecting strictly in-app problems; if you have a field for “username” and some enterprising user pastes in a thumbs-up emoji or the Gettysburg Address you can still find a situation where the app breaks but the health check succeeds. It can’t tell you how your app responds to excessive connection attempts or excessive session instantiation or any of the other hallmarks of deliberate attacks. We have other agents in our intelligence outfit that are hired to look for those kinds of things; it’s important not to overtask a process spy.
It’s also not a substitute for good diagnostics and monitoring elsewhere. Our spy can tell us “hey boss, I called the database server just like you asked, but the guy at the door slammed it in my face” and that’s useful information. It doesn’t tell us why, and doesn’t really give us enough detail to zero in on the doorman and ask him some pointed questions with a bright light shining in his face. Ideally, that doorman also works for us, and when the spy fails, we can go to the database monitors and see why a request was rejected. Or, maybe, the request never got there because of a network configuration issue or port blockage, in which case our eyes on the network should be able to highlight exactly where we failed so we can get right to addressing it.
Along those lines, though, we hired that spy to watch the underlying dependencies, and that has a couple of implications. If we make changes to the underlying dependencies – change the location of a database, re-configure network pathing, add firewalls, etc. – we have to update the spy’s orders accordingly. Otherwise, the spy will fail even if the main application works great, and it does not take many such failures to destroy confidence in our agent. Everyone’s had that one monitor that just cries all the time for no reason and gets ignored as a result. That’s never good (useless monitors should be disabled, not ignored) but it’s specifically deadly for the one guy we count on to tell us that every other thing is running 5×5.
That, in turn, means re-deploying to account for infrastructure changes. We were probably doing this anyway, for something as sweeping as a database location change, and it shouldn’t be painful as long as the deployment process has already been brought to heel. If it hasn’t, and we’ve hired an army of one to pick through config code manually updating connection strings, then the health check is a few more strings to update and one more place for human error to overlook something troublesome. If that’s the case, though, then it’s probably wiser to sort out deployment first than to add more moving parts to a problematic process. “Consider your problems in the order in which they may defeat you” and all that. The idea here is to make life easier, not harder, and adding a complex system to a cumbersome process does the exact opposite of what we want.
You may be thinking “Wow, so I can write what amounts to a second application just like the first one, maintain it, update it, and gum up my CD process with it, or I could instead just outsource that headache to my monitoring system and hire some synthetic transaction monitors to do the same job without adding stress to my own life?” You might be thinking it because I certainly did. I was already comfortable with synthetic transactions – they’re deliberately made very easy to configure, they walk as far into your app as you want them to, and they’re already handled and scheduled by the monitoring rig. They’re great, for what they do, and I’m all for having those alongside a self-check.
But there are a couple of things they don’t do and can’t, really. A synthetic transaction can exactly replicate your user workflow. It can tell when something didn’t work, but it doesn’t get any better information than your browser does. Your health check, with some love and care, can spotlight exactly what it was trying to do when it choked, down to the database it tried to reach or the file share it tried to enumerate. It’s almost certain to be faster at those tasks, since it’s not waiting on rendering or page load or whatever else a human needs that a check does not.
It’s not hard and fast – very little is – but as a quick rule of thumb:
Use a synthetic transaction monitor when:
- You’re looking for user experience pain.
- You want a look at the application from a completely external perspective.
- You have a monitoring system in place and the hands to manage it successfully.
Use a health check when:
- You have an application. Seriously, there’s almost no reason to skip this. The only thing that comes to mind is “problems higher on the priority list”.
- You want more detailed feedback about the breakdown point in a failure.
- You need a system-level dependency check that doesn’t get distracted by user experience issues.
Both of them in concert rapidly spotlight exactly where you need to put human eyes. If the health check succeeds and the synthetic transaction does not, you can skip looking at database, network, function, and file share access, because the health check has already shown that all of that is working. Instead, you can go right to the web server itself and see if it’s running/rendering slowly, if there’s something on the displayed page that’s a problem, and so on. If the health check fails, you don’t need to dig into the synthetic transaction logs; you already know why the synthetic transaction is failing and you won’t get distracted by all the stuff the synthetic is complaining about.
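That triage boils down to a small decision table. The sketch below is just the paragraph above restated as code; the function name `where_to_look` and the wording of the results are made up for illustration.

```python
def where_to_look(health_check_ok: bool, synthetic_ok: bool) -> str:
    """Suggest where to put human eyes, given the two check results."""
    if not health_check_ok:
        # A dependency is down; the synthetic's complaints are downstream noise.
        return "dependencies: start at the step the health check reported"
    if not synthetic_ok:
        # Database, network, functions, and file shares are already proven.
        return "web tier: check the server, rendering speed, and page content"
    return "nothing to chase: both checks passed"


print(where_to_look(health_check_ok=True, synthetic_ok=False))
```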
The long and short of it
The highlight reel on setting up a self-check:
- Catalog all of your app’s dependencies.
- Write an endpoint that tests all of those things. The better it is about reporting details of failures, the more useful it is.
- Deploy it as part of your app deployment.
- Call it as the last step in your deployment process.
- Have a plan in place for what happens if it fails – roll back, investigate, whatever makes sense in your organization, anything other than “it failed and it’s time to panic”.
- Once it’s deployed, call it periodically as part of your monitoring routine.
The real magic of a self-check is that you can feel pretty good about any given deployment if this test runs at the end and gives you the thumbs up. It adds a layer of immediate validation to a deployment without having to involve or wait for the bigger monitoring engine to pass judgement. It has a great secondary effect of being an immediately useful diagnostic tool day-to-day. It’s not usually a significant development effort – the guys that wrote the app should have very little trouble writing a stripped-down flavor – and it pays off a lot more in peace of mind and ease of troubleshooting than it costs in development/maintenance hours.