A Public Post Mortem of An Outage

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.  In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk – but the statistical numbers tell us that this is backwards and nearly everyone loses money on bonds and makes money on stocks.  Common “wisdom”, when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions, and the riskiest thing in investing is using emotion to drive investing strategies.

Similarly, with business risk assessments, the common approach is to respond emotionally to danger; this triggers a panic response and creates a strong tendency to overcompensate for perceived risk.  We see this commonly with small companies whose IT infrastructure generates very little revenue, or is not very key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.  This often becomes so dramatic that the mitigation process is handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decreasing it while spending very large sums of money.  Then, since the risk was mostly imaginary, they call the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.

In the recent past I got to be involved in an all out disaster for a small business.  The disaster hit what was nearly a “worst case scenario.”  Not quite, but very close.  The emotional response to the disaster at the time was strong, and once the disaster was fully under way it was common for nearly everyone to state and repeat that the disaster planning had been faulty and that the issue should have been avoided.  This is very common in any disaster situation: humans feel that there should always be someone to blame and that zero risk scenarios are achievable if we do our jobs correctly, but this is completely incorrect.

Thankfully we performed a full post mortem, as one should do after any true disaster, to determine what had gone wrong, what had gone right, how we could fix processes and decisions that had failed and how we could maintain ones that had protected us.  Typically, when some big systems event happens, I do not get to talk about it publicly.  But once in a while, I do.  It is so common to react to a disaster, to any disaster, and think “oh, if we had only…”  But you have to examine the disaster.  There is so much to be learned about processes and ourselves.

First, some back story.  A critical server, running in an enterprise datacenter, held several key workloads that were very important to several companies.  It was a little over four years old and had been running in isolation for many years.  Older servers are always a bit worrisome as they approach end of life.  Four years is hardly end of life for an enterprise class server but it was certainly not young, either.

This was a single server without any failover mechanism.  Backups were handled externally to an enterprise backup appliance in the same datacenter.  A very simple system design.

I won’t include all internal details as any situation like this has many complexities in planning and in operation.  Those are best left to an internal post mortem process.

When the server failed, it failed spectacularly.  The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on site techs at the datacenter.  Even the server vendor was unable to diagnose the issue.  This left us in a difficult position – how do you deal with a dead server when the hardware cannot reliably be fixed?  We could replace drives, we could replace power supplies, we could replace the motherboard.  Who knew what might be the fix?

In the end the decision was that the server as well as the backup system had to be relocated back to the main office where they could be triaged in person and with maximum resources.  Ultimately the system was able to be repaired and no data was lost.  The decision to refrain from going to backup was made because data recovery was more important than system availability.

When all was said and done the disaster was one of the most complete that could be imagined without experiencing actual data loss.  The outage went on for many days and a lot of spare equipment, man hours and attempted fixes were used.  The process was exhausting but when completed the system was restored successfully.

The long outage and sense of chaos as things were diagnosed and repair attempts were made led to an overall feeling of failure.  People started saying it, and that led to people believing it.  Under an emergency response condition it is very easy to become excessively emotional, especially when there is very little sleep to be had.

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: the triage operation and the initial risk planning had both been successful.

The mayhem that happens during a triage often makes things feel much worse than they really are.  But our triage handling had been superb.  Triage doesn’t mean magic, and there is a discovery phase and a reaction phase.  When we analyzed the order of events and laid them out in a timeline we found that we had acted so well that there was almost no possible place where we could have shortened the time frame.  We had done good diagnostics, engaged the right parties at the right time and gotten parts into logistical motion as soon as possible.  Most of what appeared to have been frenetic, wasted time was actually “filler time” where we were attempting to determine if additional options existed or mistakes had been made while we were waiting on the needed parts for repair.  This made things feel much worse than they really were, but all of this was the correct set of actions to have taken.

From the triage and recovery perspective, the process had gone flawlessly even though the outage ended up taking many days.  Once the disaster had happened and had happened to the incredible extent that it did, the recovery actually went incredibly smoothly.  Nothing is absolutely perfect, but it went extremely well.  The machine worked as intended.

The far more surprising part was looking at the disaster impact.  There are two ways to look at this.  One is the wiser one, the “no hindsight” approach.  Here we look at the impact cost of the disaster and the cost of mitigating it, apply the likelihood that the disaster would have happened, and determine if the right planning decision had been made.  This is hard to calculate because the risk factor is always a fudged number, but you can get accurate enough, normally, to know how good your planning was.  The second way is the 20/20 hindsight approach – what if we knew that this disaster was going to happen, what would we have done to prevent it?  It is obviously completely unfair to remove the risk factor and see what the disaster cost in raw numbers because we cannot know what is going to go wrong and plan only for that one possibility, or spend unlimited money on something that we don’t actually know will happen.  Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.
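The two ways of looking at the impact can be sketched in a few lines of Python.  This is only a rough illustration under simple assumptions: the names (`RiskScenario`, `expected_outage_loss` and so on) are made up for this example, and the dollar figures and the probability are hypothetical placeholders, not the actual numbers from this outage.

```python
from dataclasses import dataclass


@dataclass
class RiskScenario:
    """Inputs to a planning decision. All values used below are hypothetical."""
    outage_cost: float      # total cost if the disaster happens (downtime, parts, labour)
    mitigation_cost: float  # cost of preventing it (e.g. a redundant server plus hosting)
    probability: float      # estimated chance of the disaster over the system's life


def expected_outage_loss(s: RiskScenario) -> float:
    """'No hindsight' view: weight the outage cost by how likely the outage is."""
    return s.probability * s.outage_cost


def mitigation_pays_off_no_hindsight(s: RiskScenario) -> bool:
    """Mitigation is worth buying only if it costs less than the expected loss."""
    return s.mitigation_cost < expected_outage_loss(s)


def mitigation_pays_off_with_hindsight(s: RiskScenario) -> bool:
    """20/20 hindsight view: drop the risk factor and treat the outage as certain."""
    return s.mitigation_cost < s.outage_cost


# Hypothetical figures, chosen only to illustrate the comparison.
example = RiskScenario(outage_cost=20_000.0, mitigation_cost=30_000.0, probability=0.125)

print("Expected loss:", expected_outage_loss(example))
print("Mitigate (no hindsight)?", mitigation_pays_off_no_hindsight(example))
print("Mitigate (with hindsight)?", mitigation_pays_off_with_hindsight(example))
```

With these made-up inputs both checks say that mitigation would not have paid for itself.  The hindsight check is by far the harsher of the two because it treats the disaster as a certainty, which is why it is unfair to judge planners by it.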

In this case, we were decently confident that we had taken the right gamble from the start.  The system had been in place for most of a decade with zero downtime.  The overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.  It was not generally surprising to anyone that, once the risk factor was considered, our planning had been good.

What was surprising is that when we ran the calculations without the risk factor, even had we known that the system would fail and that an extended outage would take place, we still would have made the same decision!  This was downright shocking.  The cost of the extended outage was actually less than the cost of the needed equipment, hosting and labour to have built a functional risk mitigation system – in this case that would have been having a fully redundant server in the datacenter alongside the one that was in production.  In fact, accepting this extended outage had saved close to ten thousand dollars!

This turned out to be an extreme case where the outage was devastatingly bad, hard to predict, unable to be repaired quickly and yet still resulted in massive long term cost savings, but the lesson is an important one.  There is so much emotional baggage that comes with any disaster.  If we do not do proper post mortem analysis and work to remove emotional responses from our decision making, we will often leap to large scale financial loss or place blame incorrectly even when things have gone well.  Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring in the future, even when they had the math in front of them to tell them that doing so would waste money even if that event did recur!

There were other lessons to be learned from this outage.  We learned where communications had not been ideal, where the right people were not always in the right decision making spot, where customer communications were not what they should have been, where the customer had not informed us of changes properly and more.  But, by and large, the lessons were that we had planned correctly, that our triage operation had worked correctly and that we had saved the customer several thousand dollars over what would have appeared to have been the “conservative” approach.  By doing a good post mortem we managed to keep them, and us, from overreacting and turning a good decision into a bad one going forward.  Without a post mortem we might very likely have changed our good processes thinking that they had been bad ones.

The takeaway lessons here that I want to convey to you, the reader, are that post mortems are a critical step in any disaster, traditional conservative thinking is often very risky and emotional reactions to risk often cause financial disasters larger than the technical ones that they seek to protect against.

 

3 thoughts on “A Public Post Mortem of An Outage”

  1. What is the point of this piece? Something broke, it was fixed, and the people who fixed it did a good job. Who in IT would be opposed to fixing the broken thing? How does this get us to a criticism of “traditional conservative thinking”?

    It sounds like someone made a good decision: recover data from the dead server, instead of restoring from backup. This isn’t worthy of a self-congratulating 2000-word romp, it’s a decision made every day by IT professionals everywhere.

  2. The post mortem was of the decision making around the planning. I think you’ve missed the purpose of the piece as it was not at all about the process of fixing things after the disaster happened, but about the decision making that contributed to the outage and the determination that the outage was less costly than the cost of mitigating it – which is what goes against traditional thinking. It is extremely common, to the point that one is mocked for thinking otherwise, to believe that almost any cost should be incurred to avoid an outage, especially an extended one. But this is simply not true.

    Even when this happened, it was nearly everyone’s opinion, before the post mortem and the costs were analyzed, that a bad decision had been made originally and that this outage could have been avoided. But once we saw the costs of the outage and the cost of having mitigated it, it was clear that the right decision had been made, and dramatically so. But without the post mortem proving this mathematically, nearly everyone was going to write this off as a mistake that had to be recovered from.

  3. How does this get us to a criticism of “traditional conservative thinking”?

    Because traditional thinking would have said that we should have mitigated beforehand, rather than being in a position to have to recover. But the math showed that mitigating would have been a big mistake.
