Category Archives: Risk

Disaster Recovery Planning with Existing Platform Equipment

Disaster Recovery planning is always difficult.  There are so many factors and “what ifs” to consider, and investing too much in the recovery solution can itself become a bit of a disaster.  A factor that is often overlooked in DR planning is that, in the event of a disaster, you are generally able and very willing to make compromises where needed because a disaster has already happened.  It is triage time, not business as usual.

Many people immediately imagine that if you need capacity and performance of X for your live, production systems then you will need X for your disaster recovery systems as well.  In the real world, however, this is rarely true.  In the event of a disaster you can, with rare exception, work with lower performance and limit availability to just the more critical systems, and many maintenance operations, which often include archiving systems, are suspended until full production is restored.  This means that your disaster recovery system can often be much smaller than your primary production systems.

Disaster recovery systems are not investments in productivity; they are hedges against failure and need to be seen in that light.  Because of this, it is a common and effective strategy to approach DR system needs from the perspective of being “adequate” to maintain business activities rather than enough to do so comfortably or transparently.  If a full scale disaster hits and staff have to deal with sluggish file retrieval, slower than normal databases or hold off on a deep BI analysis run until the high performance production systems are restored, few people will complain.  Most workers, and certainly most business decision makers, can be very understanding that a system is in a failed state and that they may need to carry on as best they can until full capacity is restored.

With this approach in mind, it can be an effective strategy to re-purpose older platforms for use at Disaster Recovery sites when new platforms are purchased and implemented for primary production usage.  This creates a low cost, easily planned “DR pipeline” where the DR site always has the capacity of your last refresh which, in most DR scenarios, is more than adequate.  This can be a great way to make use of equipment that might otherwise be scrapped outright, or that might tempt its way back into production deployment by invoking the “sunk cost” emotional response that, in general, we want to avoid.

The sunk cost fallacy is a difficult one to avoid.  Already owning equipment makes it very easy to feel that redeploying it, outside of the designs and specifications of the newly implemented system, must be useful or good.  There are cases where this is true, but most likely it is not.  Yet just as we don’t want to become emotionally attached to equipment simply because we have already paid for it, we also don’t want to ignore the value remaining in the equipment that we already own.  This is where a planned pipeline into a Disaster Recovery role can leverage what we have already invested.  This is likely very useful equipment with a lot of value left in it, if we just know how to use it properly to meet our existing needs.

A strong production-to-disaster-recovery platform migration process can be a great way to lower spending while getting excellent disaster recovery results.

A Public Post Mortem of An Outage

Many things in life have a commonly accepted “conservative” approach and a commonly accepted “risky” approach that should be avoided, at least according to popular sentiment.  In investing, for example, we often see buying government or municipal bonds as low risk and investing in equities (corporate stocks) as high risk, but the statistical numbers tell us that this is backwards and that nearly everyone loses money on bonds and makes money on stocks.  Common “wisdom,” when put to the test, turns out to be based purely on emotions which, in turn, are based on misconceptions; the riskiest thing in investing is using emotion to drive investing strategies.

Similarly, with business risk assessments, the common approach is to feel an emotional response to danger, which triggers a panic response and a strong tendency to overcompensate for perceived risk.  We see this commonly with small companies, whose IT infrastructure generates very little revenue or is not key to short term operations, spending large sums of money to protect against a risk that is only partially perceived and very poorly articulated.  This often becomes so dramatic that the mitigation process is handled emotionally instead of intellectually, and we regularly find companies implementing bad system designs that actually increase risk rather than decreasing it, while spending very large sums of money and then, since the risk was mostly imaginary, calling the project a success based on layer after layer of misconceptions: imaginary risk, imaginary risk mitigation and imaginary success.

In the recent past I got to be involved in an all-out disaster for a small business.  The disaster hit what was nearly a “worst case scenario.”  Not quite, but very close.  The emotional response at the time was strong, and once the disaster was fully under way it was common for nearly everyone to state and repeat that the disaster planning had been faulty and that the issue should have been avoided.  This is very common in any disaster situation; humans feel that there should always be someone to blame and that, if we do our jobs correctly, scenarios should carry zero risk.  This is completely incorrect.

Thankfully we performed a full post mortem, as one should do after any true disaster, to determine what had gone wrong, what had gone right, how we could fix processes and decisions that had failed and how we could maintain ones that had protected us.  Typically, when some big systems event happens, I do not get to talk about it publicly.  But once in a while, I do.  It is so common to react to a disaster, any disaster, and think “oh, if we had only….”  But you have to examine the disaster.  There is so much to be learned about processes and ourselves.

First, some back story.  A critical server, running in an enterprise datacenter, held several key workloads that were very important to several companies.  It was a little over four years old and had been running in isolation for years.  Older servers are always a bit worrisome as they approach end of life.  Four years is hardly end of life for an enterprise class server, but it was certainly not young, either.

This was a single server without any failover mechanism.  Backups were handled externally to an enterprise backup appliance in the same datacenter.  A very simple system design.

I won’t include all internal details as any situation like this has many complexities in planning and in operation.  Those are best left to an internal post mortem process.

When the server failed, it failed spectacularly.  The failure was so complete that we were unable to diagnose it remotely, even with the assistance of the on site techs at the datacenter.  Even the server vendor was unable to diagnose the issue.  This left us in a difficult position: how do you deal with a dead server when the hardware cannot reliably be fixed?  We could replace drives, we could replace power supplies, we could replace the motherboard.  Who knew which might be the fix.

In the end the decision was that the server, as well as the backup system, had to be relocated back to the main office where they could be triaged in person and with maximum resources.  The system was ultimately repaired and no data was lost.  The decision to refrain from going to backup was made because data recovery was more important than system availability.

When all was said and done, the disaster was one of the most complete that could be imagined without experiencing actual data loss.  The outage went on for many days and consumed a lot of spare equipment, man hours and attempted fixes.  The process was exhausting, but when completed the system was restored successfully.

The long outage and the sense of chaos as things were diagnosed and repair attempts were made led to an overall feeling of failure.  People started saying it, and that led to people believing it.  Under emergency response conditions it is very easy to become excessively emotional, especially when there is very little sleep to be had.

But when we stepped back and looked at the final outcome, what we found surprised nearly everyone: the triage operation and the initial risk planning had been successful.

The mayhem that happens during a triage often makes things feel much worse than they really are.  But our triage handling had been superb.  Triage doesn’t mean magic; there is a discovery phase and a reaction phase.  When we analyzed the order of events and laid them out in a timeline, we found that we had acted so well that there was almost no place where we could have shortened the time frame.  We had done good diagnostics, engaged the right parties at the right time, and gotten parts into logistical motion as soon as possible.  Most of what appeared to have been frenetic, wasted time was actually “filler time” where we were attempting to determine whether additional options existed or mistakes had been made while we waited on the needed parts for repair.  This made things feel much worse than they really were, but all of it was the correct set of actions to have taken.

From the triage and recovery perspective, the process had gone nearly flawlessly even though the outage ended up taking many days.  Once the disaster had happened, and had happened to the incredible extent that it did, the recovery went remarkably smoothly.  Nothing is absolutely perfect, but it went extremely well.  The machine worked as intended.

The far more surprising part was looking at the disaster impact.  There are two ways to look at this.  One is the wiser one, the “no hindsight” approach.  Here we look at the disaster, the impact cost of the disaster and the mitigation cost, apply the likelihood that the disaster would have happened, and determine whether the right planning decision had been made.  This is hard to calculate because the risk factor is always a fudged number, but you can normally get accurate enough to know how good your planning was.  The second way is the 20/20 hindsight approach: if we had known that this disaster was going to happen, what would we have done to prevent it?  It is obviously completely unfair to remove the risk factor and see what the disaster cost in raw numbers, because we cannot know what is going to go wrong and plan only for that one possibility, or spend unlimited money on something that may never happen.  Companies often make the mistake of using the latter calculation and blaming planners for not having perfect foresight.
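
To make the “no hindsight” approach concrete, here is a minimal sketch of the arithmetic in Python.  All of the figures are hypothetical placeholders, not numbers from the actual incident; the point is only the structure of the comparison: weight the impact cost by its likelihood before comparing it to the cost of mitigation.

```python
# A minimal sketch of the "no hindsight" risk calculation.
# All figures are hypothetical placeholders, not numbers from the actual incident.

annual_failure_probability = 0.05   # estimated chance of a failure like this in a given year
outage_impact_cost = 40_000         # estimated business cost if the outage occurs
mitigation_cost_per_year = 12_000   # annualized cost of the redundant system that would prevent it

# "No hindsight" view: weight the impact by how likely it is to happen.
expected_annual_loss = annual_failure_probability * outage_impact_cost

if expected_annual_loss < mitigation_cost_per_year:
    print(f"Accepting the risk (expected loss ${expected_annual_loss:,.0f}/yr) "
          f"is cheaper than mitigating it (${mitigation_cost_per_year:,.0f}/yr).")
else:
    print("The mitigation pays for itself; build the redundant system.")

# The "20/20 hindsight" view effectively sets the probability to 1.0 and compares
# raw costs, which is unfair to planners because no one can know in advance
# which specific failure will occur.
```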

In this case, we were reasonably confident that we had taken the right gamble from the start.  The system had been in place for most of a decade with zero downtime.  The overall system cost had been low, the triage cost had been moderate and the event had been extremely unlikely.  That we had done good planning once the risk factor was considered was not generally surprising to anyone.

What was surprising was that when we ran the calculations without the risk factor, even had we known that the system would fail and that an extended outage would take place, we still would have made the same decision!  This was downright shocking.  The cost of the extended outage was actually less than the cost of the equipment, hosting and labour needed to build a functional risk mitigation system, which in this case would have meant a fully redundant server in the datacenter alongside the one in production.  In fact, accepting this extended outage had saved close to ten thousand dollars!

This turned out to be an extreme case where the outage was devastatingly bad, hard to predict and unable to be repaired quickly, yet still resulted in massive long term cost savings, but the lesson is an important one.  There is so much emotional baggage that comes with any disaster that if we do not do proper post mortem analysis and work to remove emotional responses from our decision making, we will often leap to large scale financial loss or place blame incorrectly even when things have gone well.  Many companies would have looked at this disaster and reacted by overspending dramatically to prevent the same unlikely event from recurring, even with the math in front of them telling them that doing so would waste money even if that event did recur!

There were other lessons to be learned from this outage.  We learned where communications had not been ideal, where the right people were not always in the right decision making spots, where customer communications were not what they should have been, where the customer had not informed us of changes properly, and more.  But, by and large, the lessons were that we had planned correctly and our triage operation had worked correctly.  We had saved the customer several thousand dollars over what would have appeared to be the “conservative” approach and, by doing a good post mortem, managed to keep them, and us, from overreacting and turning a good decision into a bad one going forward.  Without a post mortem we might very likely have changed our good processes, thinking that they had been bad ones.

The takeaway lessons that I want to convey to you, the reader, are that post mortems are a critical step in any disaster, that traditional “conservative” thinking is often very risky, and that emotional reactions to risk often cause financial disasters larger than the technical ones they seek to protect against.


The Jurassic Park Effect

“If I may… Um, I’ll tell you the problem with the scientific power that you’re using here, it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you even knew what you had, you patented it, and packaged it, and slapped it on a plastic lunchbox, and now …” – Dr. Ian Malcolm, Jurassic Park

When looking at building a storage server or NAS, there is a common feeling that what is needed is a “NAS operating system.”  This is an odd reaction, I find, since the term NAS means nothing more than a “fileserver with a dedicated storage interface,” or, in other words, just a file server with limited exposed functionality.  The reason that we choose physical NAS appliances is for the integrated support and sometimes for special, proprietary functionality (NetApp being a key example, offering extensive SMB and NFS integration and some really unique RAID and filesystem options, or Exablox offering fully managed scale out file storage and RAIN style protection.)  Using a NAS to replace a traditional file server is, for the most part, a fairly recent phenomenon and one that I have found is often driven by misconception or the impression that managing a file server, one of the most basic IT workloads, is special or hard.  File servers are generally considered the most basic form of server; traditionally they are what people meant when using the term “server” unless additional description was added, and they are the only form commonly integrated into the desktop (every Mac, Windows and Linux desktop can function as a file server and it is very common to do so.)

There is, of course, nothing wrong with turning to a NAS instead of a traditional file server to meet your storage needs, especially as some modern NAS options, like Exablox, offer scale out and storage options that are not available in most operating systems.  But it appears that the trend to use a NAS instead of a file server has led to some odd behaviour when IT professionals turn back to considering file servers again.  A cascading effect, I suspect, where the reasons why a NAS is sometimes preferred and the goal level thinking behind them are lost, yet the resulting idea of “I should have a NAS” remains, so that when returning to look at file server options there is a drive to “have a NAS” regardless of whether there is a logical reason for feeling that this is necessary.

First we must consider that the general concept of a NAS is a simple one: take a traditional file server, simplify it by removing options and package it with all of the necessary hardware to make a simplified appliance with all of the support included, from the interface down to the spinning drives and everything in between.  Storage can be tricky when users need to determine RAID levels and drive types, monitor effectively, etc.  A NAS addresses this by integrating the hardware into the platform.  This makes things simple but can add risk, as you have fewer support options and less ability to fix or replace things yourself.  A move from a file server to a NAS appliance is almost exclusively about support and is generally a very strong commitment to a single vendor.  You choose the NAS approach because you want to rely on a vendor for everything.

When we move to a file server we go in the opposite direction.  A file server is a traditional enterprise server like any other.  You buy your server hardware from one vendor (HP, Dell, IBM, etc.) and your operating system from another (Microsoft, Red Hat, Suse, etc.).  You specify the parts and the configuration that you need and you have the most common computing model in all of IT.  With this model you generally are using standard, commodity parts, allowing you to easily migrate between hardware vendors and between software vendors.  You have “vendor redundancy” options and generally everything is done using open, standard protocols.  You get great flexibility and can manage and monitor your file server just like any other member of your server fleet, including keeping it completely virtualized.  You give up the vertical integration of the NAS in exchange for horizontal flexibility and standardization.

What is odd, therefore, is returning to the commodity model but seeking what is colloquially known as a “NAS OS.”  Common examples of these include NAS4Free, FreeNAS and OpenFiler.  This category of products is generally nothing more than a standard operating system (often FreeBSD because it has ideal licensing, or Linux because it is well known) with a “storage interface” put onto it and no special or additional functionality that would not exist with the normal operating system.  In theory they are a “single function” operating system that does only one thing.  But this is not reality.  They are general purpose operating systems with an extra GUI management layer added on top.  One could say the same thing about most physical NAS products themselves, but those typically include custom engineering even at the storage level, special features and, most importantly, an integrated support stack and true isolation of the “generalness” of the underlying OS.  A “NAS OS” is not a simpler version of a general purpose OS; it is a more complex, yet less functional, version of it.

What is additionally odd is that general OSes, with rare exception, already come with very simple, extremely well known and fully supported storage interfaces.  Nearly every variety of Windows or Linux server, for example, has included simple graphical interfaces for these functions for a very long time.  These included GUIs are often shunned by system administrators as being too “heavy and unnecessary” for a simple file server.  So it is even more unusual that adding a third party GUI, one that is not patched and tested by the OS team and not widely known and supported, would be desired, as this goes against the common ideals and practices of running a server.

And this is where the Jurassic Park effect comes in.  The OS vendors (Red Hat, Microsoft, Oracle, FreeBSD, Suse, Canonical, et al.) are giants with amazing engineering teams, code review, testing, oversight and enterprise support ecosystems, while the “NAS OS” vendors are generally very small companies, some with just one part time person, who stand on the shoulders of these giants and build something that they knew they could but never stopped to ask if they should.  The resulting products are wholly negative compared to their pure OS counterparts: they do not make systems management easier, nor do they fill a gap in the market’s service offerings.  Solid, reliable, easy to use storage is already available; more vendors are not needed to fill this place in the market.

The logic often applied to looking at a NAS OS is that they are “easy to set up.”  This may or may not be true, as easy, here, must be a relative term.  For there to be any value, a NAS OS has to be easy in comparison to the standard version of the same operating system.  So in the case of FreeNAS, this would mean FreeBSD: FreeNAS would need to be appreciably easier to set up than FreeBSD for the same, dedicated functions.  And this is easily true; setting up a NAS OS is generally pretty easy.  But this ease is deceptive, and it is something of which IT professionals need to be quite aware.  Making something easy to set up is not a priority in IT; making something that is easy to operate and repair when there are problems is what is important.  Easy to set up is nice, but if it comes at the cost of not understanding how the system is configured and makes operational repairs more difficult, it is a very, very bad thing.  NAS OS products routinely make it dangerously easy to get a product into production for a storage role, which is almost always the most critical or nearly the most critical role of any server in an environment, that IT has no experience or likely skill to maintain, operate or, most importantly, fix when something goes wrong.  We need exactly the opposite: a system that is easy to operate and fix.  That is what matters.  So we have a second case of “standing on the shoulders of giants” and building a system that we knew we could, but did not know if we should.

What exacerbates this problem is that the very people who feel the need to turn to a NAS OS to “make storage easy” are, by the very nature of the NAS OS, the exact people for whom operational support and repair of the system are most difficult.  System administrators who are comfortable with the underlying OS would naturally not see a NAS OS as a benefit and, for the most part, avoid it.  It is uniquely the people for whom it is most dangerous to run a not fully understood storage platform who are likely to attempt it.  And, of course, most NAS OS vendors earn their money, as we could predict, on post-installation support calls from customers who deployed, got stuck once they were in production and are now at the mercy of the vendors and their exorbitant support pricing.  It is in the interest of the vendors to make it easy to install and hard to fix.  Everything is working against the IT pro here.

If we take a common example and look at FreeNAS, we can see how this is a poor alignment of “difficulties.”  FreeNAS is FreeBSD with an additional interface on top.  Anything that FreeNAS can do, FreeBSD can do.  There is no loss of functionality by going to FreeBSD.  When something fails, in either case, the system administrator must have a good working knowledge of FreeBSD in order to effect repairs.  There is no escaping this.  FreeBSD knowledge is common in the industry and getting outside help is relatively easy.  Using FreeNAS adds several complications, the biggest being that any and all customizations made by the FreeNAS GUI are special knowledge needed for troubleshooting on top of the knowledge already needed to operate FreeBSD.  So this is a larger knowledge set as well as more things to fail.  It is also a relatively uncommon knowledge set, as FreeNAS is a niche storage product from a small vendor while FreeBSD is a major enterprise IT platform (all use of FreeNAS is FreeBSD use, but only a tiny percentage of FreeBSD use is FreeNAS.)  So we can see that using a NAS OS just adds risk over and over again.

This same issue carries over into the communities that grow up around these products.  If you look to communities around FreeBSD, Linux or Windows for guidance and assistance, you deal with large numbers of IT professionals, skilled system admins and those with business and enterprise experience.  Of course hobbyists, the uninformed and others participate too, but these are the enterprise IT platforms and all the knowledge of the industry is available to you when implementing these products.  Compare this to the community of a NAS OS.  By its very nature, only people struggling with the administration of a standard operating system and/or storage basics would look at a NAS OS package, so this naturally filters the membership of these communities down to exactly the people from whom we would least want advice.  This creates an isolated culture of misinformation and misunderstandings around storage and storage products.  Myths abound, guidance often becomes reckless and dangerous, and industry best practices are ignored as if decades of accumulated experience had never happened.

A NAS OS also commonly introduces lags in patching and updates.  A NAS OS will almost always, and almost necessarily, trail its parent OS on security and stability updates and will very often follow months or years behind on major features.  In one very well known scenario, OpenFiler, the product was built on an upstream non-enterprise base (rPath Linux) which lacked community and vendor support, failed and was abandoned, leaving downstream users, including everyone on OpenFiler, without the ecosystem needed to support them.  Using a NAS OS means trusting not just the large, well known enterprise vendor that makes the base OS but trusting the NAS OS vendor as well.  And the NAS OS vendor is orders of magnitude more likely to fail than the enterprise class vendors whose base OSes they build upon.

Storage is a critical function and should not be treated carelessly, as if its criticality did not exist.  NAS OSes tempt us to install quickly and forget, hoping that nothing ever goes wrong or that we can move on to other roles or companies entirely before bad things happen.  They set us up for failure where failure is most impactful.  When a typical application server fails we can always copy the files off of its storage and start fresh.  When storage fails, data is lost and systems go down.

“John Hammond: All major theme parks have delays. When they opened Disneyland in 1956, nothing worked!

Dr. Ian Malcolm: Yeah, but, John, if The Pirates of the Caribbean breaks down, the pirates don’t eat the tourists.”

When storage fails, businesses fail.  Taking the easy route to setting up storage, ignoring the long term support needs and seeking advice from communities that have filtered out the experienced storage and systems engineers increases risk dramatically.  Sadly, the nature of a NAS OS is that the very reason that people turn to it (lack of deep technical knowledge to build the systems) is the very reason they must avoid it (even greater need for support.)  The people for whom NAS OSes are effectively safe to use, those with very deep and broad storage and systems knowledge, would rarely consider these products because for them they offer no benefits.

At the end of the day, while the concept of a NAS OS sounds wonderful, it is not a panacea.  The value of a NAS does not carry over from the physical appliance world to the installed OS world, and the value of standard OSes is far too great for NAS OSes to add real value.

“Dr. Alan Grant: Hammond, after some consideration, I’ve decided, not to endorse your park.

John Hammond: So have I.”

The Weakest Link: How Chained Dependencies Impact System Risk

When assessing system risk scenarios it is very easy to overlook “chained” dependencies.  We are trained to look at risk at a “node” level, asking “how likely is this one thing to fail?”  But system risk is far more complicated than that.

In most systems there are some components that rely on other components. The most common place that we look at this is in the design of storage for servers, but it occurs in any system design.  Another good example is how web applications need both application hosts and database hosts in order to function.

It is easiest to explain chained dependencies with an example.  We will look at a standard virtualization design with SAN storage to understand where failure domain boundaries exist and where chained dependencies exist and what role redundancy plays in system level risk mitigation.

In a standard SAN (storage area network) design for virtualization you have virtualization hosts (which we will call the “servers” for simplicity), SAN switches (switches dedicated for the storage network) and the disk arrays themselves.  Each of these three “layers” is dependent on the others for the system, as a whole, to function.  If we had the simplest possible set with one server, one switch and one disk array we very clearly have three devices representing three distinct points of failure.  Any one of the three failing causes the entire system to fail.  No one piece is useful on its own.  This is a chained dependency and the chain is only as strong as its weakest link.
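
A rough back-of-the-envelope calculation shows why this matters.  Assuming, purely for illustration, that each device is available 99% of the time, the availability of the chain is the product of the individual availabilities:

```python
# A rough sketch of why a chained dependency is only as strong as its weakest link.
# The availability figures are illustrative assumptions, not measured values.

server = 0.99      # each device is assumed to be up 99% of the time
switch = 0.99
disk_array = 0.99

# Every device must be up for the system to be up, so availabilities multiply.
system = server * switch * disk_array
print(f"Chained system availability: {system:.4f}")   # roughly 0.9703

# Three "99% reliable" devices in a chain yield a system that is
# noticeably less reliable than any single device on its own.
```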

In our simplistic example, each device represents a failure domain.  We can mitigate risk by improving the reliability of each domain.  We can add a second server and implement a virtualization layer high availability or fault tolerance strategy to reduce the risk of server failure.  This improves the reliability of one failure domain but leaves two untouched and just as risky as they were before.  We can then address the switching layer by adding a redundant switch and configuring a multi-pathing strategy to handle the loss of a single switching path, reducing the risk at that layer.  Now two failure domains have been addressed.  Finally we have to address the storage failure domain, which is done, similarly, by adding redundancy through a second disk array that is mirrored to the first and able to failover transparently in the event of a failure.

Now that we have beefed up our system, we still have three failure domains in a dependency chain.  What we have done is made each “link” in the chain, each failure domain, extra resilient on its own.  But the chain still exists.  This means that the system, as a whole, is far less reliable than any single failure domain within the chain is alone.  We have made something far better than where we started, but we still have many failure domains.  These risks add up.
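
Extending the same illustrative arithmetic, we can see both the improvement and its limit.  With the common assumption that either member of a redundant pair can carry the load on its own, each domain's chance of being down becomes the square of a single device's chance of being down, yet the three domains still multiply together (and this ignores any imperfection in the failover mechanisms themselves):

```python
# Extending the sketch: each failure domain is made redundant, but the chain remains.
# Numbers are illustrative assumptions and ignore failover imperfections.

def redundant_pair(availability: float) -> float:
    """Availability of a domain where either of two identical devices can carry the load."""
    return 1 - (1 - availability) ** 2

device = 0.99
domain = redundant_pair(device)    # roughly 0.9999 for one redundant layer
system = domain ** 3               # three redundant domains still chained together
print(f"Single redundant domain: {domain:.6f}")
print(f"Full chained system:     {system:.6f}")

# Each link is now far stronger, but the system as a whole is still
# less reliable than any one of its links, and this assumes the
# failover mechanisms themselves never misbehave.
```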

What is difficult in determining overall risk is that we must assess the risk of each item, then determine the new risk after mitigation (through the addition of redundancy) and then find the cumulative risk of each of the failure domains together in a chain to determine the total risk of the entire system.  It is extremely difficult to determine the risk within each failure domain as the manner of risk mitigation plays a significant role.  For example a cluster of storage disk arrays that fails over too slowly may result in an overall system failure even when the storage cluster itself appears to have worked properly.  Even defining a clear failure can therefore be challenging.

It is often tempting to take a “from the top” view assessment of risk, which is very dangerous but very common for people who are not regular risk assessment practitioners.  The tendency here is to assess risk by viewing only the “top most” failure domain, generally the servers in a case like this, and to ignore any risks that sit beneath that point, considering those to be “under the hood” rather than part of the risk assessment.  It is easy to ignore the more technical, less exposed and more poorly understood components like networking and storage and focus on the relatively easy to understand and heavily marketed reliability aspects of the top layer.  This “top view” means that the risks under the top level are obscured and generally ignored, leading to high risk without a good understanding of why.

Understanding the concept of chained dependencies explains why complex systems, even with complex risk mitigation strategies, often result in being far more fragile than simpler systems.  In our above example, we could do several things to “collapse” the chain resulting in a more reliable system as a whole.

The most obvious component which can be collapsed is the networking failure domain.  If we were to remove the switches entirely and connect the storage directly to the servers (not always possible, of course) we would effectively eliminate one entire failure domain and remove a link from our chain.  Now instead of three links, each of which has some potential to fail, we have only two.  Simpler is better, all other things being equal.

We could, in theory, also collapse the storage failure domain by going from external storage to storage local to the servers themselves, taking us from two failure domains down to a single one.  The one remaining domain, of course, carries more complexity than it did before the collapsing, but the overall system complexity is greatly reduced.  Again, this is with all other factors remaining equal.
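
The effect of collapsing the chain can be sketched the same way.  Holding an assumed per-domain availability constant and varying only the number of links:

```python
# Comparing chains of different lengths, all else assumed equal.
# The availability value is a placeholder assumption for illustration only.

domain = 0.999   # assumed availability of one (redundant or well-built) failure domain

for links in (3, 2, 1):
    print(f"{links}-link chain: {domain ** links:.6f}")

# Removing a link (e.g. eliminating the SAN switches, or moving to local storage)
# raises overall reliability without making any individual component better.
```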

Another approach to consider is making single nodes more reliable on their own.  It is trendy today to look at larger systems and approach risk mitigation that way, by adding redundant, low cost nodes to add reliability to failure domains.  But traditionally this was not the default path to reliability.  It was far more common in the past, as shown by the former prevalence of mainframes and similarly classed systems, to build a high degree of reliability into a single node.  Mainframe and high end storage systems, for example, still do this today.  This can be an extremely effective approach but fails to address many scenarios and is generally extremely costly, often magnified by a need to have systems partially or even completely maintained by the vendor.  It tends to work out only in special niche circumstances and is not practical on a more general scope.

So in any system of this nature we have three key risk mitigation strategies to consider: improve the reliability of a single node, improve the reliability of a single domain or reduce the number of failure domains (links) in the dependency chain.  Putting these together as is prudent can help us to achieve the risk mitigation level appropriate for our business scenario.

Where the true difficulty exists, and will remain, is in the comparison of different risk mitigation strategies.  The risk of a single node can generally be estimated with some level of confidence.  A redundancy strategy within a single domain is far harder to estimate; some redundancy strategies are highly effective, creating extremely reliable failure domains, while others can actually backfire and reduce the reliability of a domain!  The complexity that comes with redundancy strategies is never without caveat and, while it will typically pay off, it rarely carries the degree of reliability benefit that is initially expected.  Estimating the risk of a dependency chain is that much more difficult again, as it requires a clear understanding of the risks associated with each of the failure domains individually as well as an understanding of the failure opportunities that exist at the domain boundaries (like the storage failover delay failure noted earlier.)

Let’s explore the issues around determining risk in two very common approaches to the same scenario, building on what we have discussed above.

Two extreme examples of the same situation we have been discussing are a single server with internal storage used to host virtual machines, versus a six device “chain” with two servers using a high availability solution at the server layer, two switches providing redundancy at the switching layer and two disk arrays providing high availability at the storage layer.  If we change any large factor here we can generally provide a pretty clear estimate of relative risk: if any of the failure domains lacks reliable redundancy, for example, we can pretty clearly determine that the single server is the more reliable overall system, except in cases where an extreme amount of reliability is built into that single node, which is generally an impractical strategy financially.  But with each failure domain maintaining redundancy we are forced to compare the relative risks of intra-domain reliability (the redundant chain) vs. inter-domain reliability (the collapsed chain, single server.)

With the two entirely different approaches there is no simple way to assess the comparative risks of the two means of risk mitigation.  It is generally accepted that the six (or more) node approach with extensive intra-domain risk mitigation is the more reliable of the two, and this is almost certainly true in general.  But it is not always true, and rarely does this approach outperform the single node strategy by a truly significant margin, while commonly costing four to ten times as much as the single server strategy.  That is potentially a very high cost for what is likely a small gain in reliability and a small potential risk of a loss in reliability.  Each additional piece of redundancy adds complexity that a human must implement, monitor and maintain, and with complexity and human interaction comes more and more risk.  Avoiding human error can often be more important than avoiding mechanical failure.
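
To illustrate how sensitive this comparison is, here is a sketch that plugs assumed numbers into both designs.  Every figure is a guess chosen for illustration, including the factor representing the chance that the failover machinery itself behaves correctly; shift any of these inputs slightly and the ranking can change.

```python
# A sketch of how sensitive the comparison is to numbers we cannot truly know.
# Every figure below is an assumption for illustration, not a measurement.

def redundant_pair(a: float) -> float:
    """Availability of a redundant pair where either member can carry the load."""
    return 1 - (1 - a) ** 2

single_server = 0.995            # assumed availability of one well-built server with local storage

server_layer  = redundant_pair(0.99)
switch_layer  = redundant_pair(0.99)
storage_layer = redundant_pair(0.99)
failover_works = 0.999           # assumed chance the HA/failover machinery itself behaves correctly

redundant_chain = server_layer * switch_layer * storage_layer * failover_works

for name, a in (("Single server", single_server), ("Six-device chain", redundant_chain)):
    downtime_hours = (1 - a) * 24 * 365
    print(f"{name}: availability {a:.6f}, roughly {downtime_hours:.0f} hours of downtime per year")

# With these guessed inputs the chain comes out ahead, but not dramatically,
# while potentially costing several times as much.  Nudge the assumptions and
# the ranking can flip, which is exactly why comparing the two strategies is so hard.
```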

We must also consider the cost of recovery.  If failure is to occur it is generally trivial to recover from the failure of a simple system.  An extremely complex system, having failed, may take a great degree of effort to restore to a working condition.  Complex systems also require much broader and deeper degrees of experience and confidence to maintain.

There is no easy answer to determining the reliability of systems.  Modern information delivery systems are simply too large and too complex, with too many indeterminable factors, to be evaluated in all cases.  With a good understanding of chained dependencies, however, and an understanding of risk mitigation strategies, we can take practical steps to determine roughly relative risk levels, see how similar risk scenarios compare in cost, identify points of fragility, recognize failure domains and dependency chains, and appreciate how changes in system design will move us clearly towards or away from reliability.