All posts by Scott Alan Miller

When a Backup Is Not A Backup

Conceptually the idea of “backup” has become a murky area within IT.  Everyone seems to have their own concepts of what a backup is and how they expect it to behave.  This can be dangerous when the person supplying backup and the person consuming backup have a mismatch in expectations.  I see this happen every day even with traditional backup mechanisms.  With new types of backups appearing on a regular basis the opportunities for miscommunications and loss of data become much more pronounced.

By traditional backups I refer to the traditional world of tape-based backups with a grandfather – father – son rotational strategy in place, just to set the stage for the discussion.  New backups might include system images, disk-based backups, continuous backups and backups to “the cloud” or online backups.  The world of backups is evolving rapidly and now is when misunderstandings begin to put corporate data resources at risk.

So what exactly is a “backup”?  The concept sounds simple, but what do we really mean when we use the term?  Do we mean the ability to restore a system after it has failed?  The ability to roll back to an earlier version of a file?  Perhaps archiving of data when the original no longer exists?  How long do which files get kept?  Does this apply only to file data or are emails and databases included too?  Do we only need to restore in case of system failure or do we need the ability to restore granular data as well?  Do we need only one copy or do we need copies of every version of a file?

Now, with the additional risks posed by threats like ransomware, we have more concerns than ever before.  Ideas such as not just versioning but potentially unlimited versioning, and air gapping between systems and backups, have become real concerns where before they generally were not.

Many organizations, especially smaller ones, approach backups a bit differently from enterprises and often eschew true backups completely.  Instead they “take backups” but then often delete the original files.  And instead of keeping many copies of the files that have been “backed up” they opt to keep only a single copy (or multiple versions that are co-dependent on each other).  This means that what they have is not really a backup, but rather an archive.  If the one disk or tape on which the file is stored becomes damaged, the file is lost completely.

The term backup implies that there are at least two copies of some piece of data that do not rely on each other.  An archive does not imply this and just implies that we have taken data from production to another system, presumably one that is lower cost and likely much slower and harder to retrieve from.  Archived data implies no redundancy, unlike the term backup.

If we “take a backup” and then proceed to delete the original data we no longer have a backup and the file that is stored in the “backup system”, whether this is on disk, a tape in a vault or whatever, turns into an archive of the original data rather than a backup of it.  It is now our source file, rather than being a copy.  This is some of the magic of digital media: copies are clones rather than mimics, so the archival component is legitimately the original in every sense.

This may seem pedantic but it truly is not.  If a business is paying for backups, they likely assume that that cost is going towards having some redundancy, not just a single copy of data.  And if you have regulations around being required to keep backups for compliance reasons, only having an archival copy is a clear violation of that requirement.  Having two systems fail and being unable to retrieve data is an edge case that all compliance must accept.  But having an archival system fail where a backup is required but was not kept, is not an acceptable scenario.

For this reason, and many more, concepts like the 3-2-1 backup methodology make sense because this approach guarantees that backups are kept within the backup system and originals do not need to be kept on production.  In some ways of thinking, this approach could be thought of as merging archiving and backups into a single system which adds much clarity to the design.
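As a rough illustration, the 3-2-1 rule is commonly stated as at least three copies of the data, on at least two different types of media, with at least one copy off site.  The sketch below checks an inventory of copies against that rule and against the simpler "at least two independent copies" test for whether something is a backup at all.  The `Copy` record and its field names are hypothetical, invented only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    """One stored copy of a data set (hypothetical record for illustration)."""
    location: str   # e.g. "hq-server-room", "offsite-vault"
    media: str      # e.g. "disk", "tape", "cloud"
    offsite: bool   # True if stored away from the primary site


def satisfies_3_2_1(copies: list[Copy]) -> bool:
    """Return True if the copies meet the common 3-2-1 formulation."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


def is_backup(copies: list[Copy]) -> bool:
    """A single remaining copy is an archive; a backup needs at least two."""
    return len(copies) >= 2
```

Note that this sketch only counts copies.  It does not verify that they are truly independent of each other, which still has to be judged by looking at shared failure domains.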

Whatever backup system works for you, be cognizant that backups mean independent copies and that, in many ways, independent copies that do not share failure domains have become nearly a requirement for all backups today.

Hiring IT: Speed Matters

After decades of IT hiring, something that I have learned is that companies serious about hiring top talent always make hiring decisions very quickly.  They may spend months or even years looking for someone that is a right fit for the organization, but once they have found them they take action immediately.

This happens for many reasons.  But mostly it comes down to wanting to secure resources once they have been identified.  Finding good people is an expensive and time consuming process.  Once you have found someone that is considered to be the right fit for the need and the organization, there is a strong necessity to reduce risk by securing them as quickly as possible.  A delay in making an offer presents an opportunity for that resource to receive another offer or decide to go in a different direction.  Months of seeking a good candidate, only to lose them because of a delay of a few hours or days in making an offer is a ridiculous way to lose money.

Delays in hiring suggest that either the situation has not yet been decided upon or that the process has not been given priority and that other decisions or actions inside of the company are seen as more important than the decisions around staffing.  And, of course, it may be true that other things are more important.

Other factors being more important are exactly the kinds of things that potential candidates worry about.   Legitimate priorities might include huge disasters in the company, things that are not a good sign in general.  Or worse, maybe the company just doesn’t see acquiring the best talent as being important and delays are caused by vacations, parties, normal work or not even being sure that they want to hire anyone at all.

It is extremely common for companies to go through hiring rounds just to “see what is out there.”  This doesn’t necessarily mean that they will not consider hiring someone if the right person does come along, but it easily means that the hiring is not fully approved or funded and might not even be possible.  Candidates go through this regularly.  A great interview might result in no further action, so they know better than to sit around waiting on positions, even ones that seem very likely and possible.  The risks are too high, and if a different, good opportunity comes along, they will normally move ahead with that.  Few things signal more clearly that a job offer is not forthcoming, or that a job is not an ideal one, than delays in the hiring process.

Candidates, especially senior ones, know that good jobs hire quickly.  So if the offer has not arrived promptly it is often assumed that offer(s) are being made to other candidates or that something else is wrong.  In either situation, candidates know to move on.

If hiring is to be a true priority in an organization, it must be prioritized.  This should go without saying, but good hiring slips through the cracks more often than not.  It is far too often seen as a background activity; one that is approached casually and haphazardly.  It is no wonder that so many organizations waste countless hours of time on unnecessary candidate searches and interviews and untold time attempting to fill positions when, for all intents and purposes, they are turning away their best options all the while.

When to Consider High Availability?

“High Availability isn’t something you buy, it’s something that you do.”  – John Nicholson

Few things are more universally desired in IT than High Availability (HA) solutions.  I mean really, say those words and any IT Pro will instantly say that they want that.  HA for their servers, their apps, their storage and, of course, even their desktops.  If there was a checkbox next to any system that simply said “HA”, why wouldn’t we check it?  We would, of course.  No one voluntarily wants a system that fails a lot.  Failure bad, success good.

First, though, we must define HA.  HA can mean many things.  At a minimum, HA must mean that the availability of the system in question must be higher than “normal”.  What is normal?  That alone is hard enough to define.  HA is a loose term, at best.  In the context of its most common usage, though, which is common applications running on normal enterprise hardware I would offer this starting point for HA discussions:

Normal or Standard Availability (SA) would be defined as the availability from a common mainline server running a common enterprise operating system running a common enterprise application in a best practices environment with enterprise support.  Good examples of this might include Exchange running on Windows Server running on the HP ProLiant DL380 (the most common mainline commodity server.)  Or BIND (the DNS server) running on Red Hat Enterprise Linux on the Dell PowerEdge R730.  These are just examples to be used for establishing a rough baseline.  There is no great way to measure this, but with a good support contract and rapid repair or replacement in the real world, reliability of a system of this nature is believed to be between four and five nines of reliability (99.99% uptime or higher) when human failure is not included.

High Availability (HA) should be commonly defined as having an availability significantly higher than that of Standard Availability.  Significantly higher should be a minimum of one order of magnitude increase.  So at least five nines of reliability and more likely six nines (99.9999% uptime).

Low Availability (LA) would be commonly defined as having an availability significantly lower than that of Standard Availability with significantly, again, meaning at least one order of magnitude.  So LA would typically be assumed to be around 99% to 99.9% or lower availability.
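To make these definitions concrete, here is a minimal sketch that converts an availability figure into expected downtime and labels it LA, SA or HA using the order-of-magnitude rule described above.  The four nines default baseline is an assumption taken from the low end of the SA range given earlier, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years
TOLERANCE = 1e-9  # relative slack to absorb floating-point rounding at boundaries


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR * 60.0


def classify(availability: float, baseline: float = 0.9999) -> str:
    """Label availability as "LA", "SA" or "HA" relative to a baseline.

    HA means at least ten times less downtime than the baseline and LA
    means at least ten times more. Everything in between counts as SA.
    """
    baseline_downtime = 1.0 - baseline
    downtime = 1.0 - availability
    if downtime <= baseline_downtime / 10.0 * (1.0 + TOLERANCE):
        return "HA"
    if downtime >= baseline_downtime * 10.0 * (1.0 - TOLERANCE):
        return "LA"
    return "SA"
```

With these assumptions, four nines works out to roughly 53 minutes of downtime per year, five nines classifies as HA and three nines as LA.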

Measurement here is very difficult as human, environmental and other factors play a massive role in determining the uptime of different configurations.  The same gear used in one role might achieve five nines while in another fail to achieve even one.  The quality of the datacenter, skill of the support staff, rapidity of parts replacement, granularity of monitoring and a multitude of other factors will affect the overall reliability significantly.  This is not necessarily a problem for us, however.  In most cases we can evaluate the portions of a system design that we control in such a way that relative reliability can be determined.  That at least lets us show that one approach is going to be superior to another, so that we can make well informed decisions even if accurate failure rate models cannot be easily built.

It is important to note that other than providing a sample baseline set of examples from which to work there is nothing in the definitions of high availability or low availability that talks about how these levels should be achieved; that is not what the terms mean.  The terms are resultant sets of reliability in relation to the baseline and nothing else.  There are many ways to achieve high availability without using commonly assumed approaches and practically unlimited ways to achieve low availability.

Of course HA can be defined at every layer.  We can have HA platforms or OS but have fragile applications on top.  So it is very important to understand at what level we are speaking at any given time.  At the end of the day, a business will only care about the high availability delivery of services regardless of how it is achieved, or where.  The end result is what matters not the “under the hood” details of how it was accomplished or, as always, the ends justify the means.

It is extremely common today for IT departments to become distracted by new and flashy HA tools at the platform layer and forget to look for HA higher and lower in the stack.  We must ensure that we provide highly available services to the business rather than only looking at the one layer while leaving the business just as vulnerable as ever, or more so.

In the real world, though, HA is not always an option and, when it is, it comes at a cost.  That cost is almost always monetary and generally comes with extra complexity as well.  And as we well know, any complexity also carries additional risk and that risk could, if we are not careful, cause an attempt to achieve HA to actually fail and might even leave us with LA, or Low Availability.

Once we understand this necessary language for describing what we mean, we can begin to talk about when high availability, standard availability and even low availability may be right for us.  We use this high level of granularity because it is so difficult to measure system reliability that getting too detailed becomes valueless.

Conceptually, all systems come with risk of downtime and nothing can be always up, that’s impossible.  Reliability costs money, generally, all other things being equal.  So to determine what level of availability is most appropriate for a workload we must determine the cost of risk mitigation (the amount of money that it takes to change the average amount of downtime) and compare that against the cost of the downtime itself.

This gets tricky and complicated because determining the cost of downtime is difficult enough, and determining the risk of downtime is even more difficult.  In many cases, the cost of downtime is not a flat rate, but it might be.  This cost could be expressed as $5/minute or $20K/day or similar.  But an even better tool would be to create a “loss impact curve” that shows how money is lost over time (within a reasonable interval.)

For example, a company might easily face no loss at all for the first five minutes with slowly increasing, but small, losses until about four hours when work stops because people can no longer go to paper or whatever and then losses go from almost zero to quite large.  Or some companies might take a huge loss the moment that the systems are down but the losses slowly dissipate over time.  Loss might only be impactful at certain times of day.  Maybe outages at night or during lunch are trivial but mid morning or mid afternoon are major.  Every company’s impact, risk and ability to mitigate that risk are different, often dramatically so.
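One simple way to represent a loss impact curve is as a handful of breakpoints with straight lines between them.  The sketch below follows the first scenario described above: no loss for the first five minutes, small losses until about four hours, then steep losses.  Every dollar figure in `CURVE` is hypothetical and is there only to give the curve a shape.

```python
from bisect import bisect_right

# Hypothetical breakpoints: (minutes of outage, cumulative loss in dollars).
# No loss for five minutes, small losses until four hours, steep afterwards.
CURVE = [(0, 0.0), (5, 0.0), (240, 500.0), (480, 20000.0)]


def cumulative_loss(minutes: float, curve=CURVE) -> float:
    """Linearly interpolate the cumulative loss for an outage of this length."""
    if minutes <= curve[0][0]:
        return curve[0][1]
    if minutes >= curve[-1][0]:
        # Past the last breakpoint, keep extending the final segment's slope.
        (x0, y0), (x1, y1) = curve[-2], curve[-1]
    else:
        xs = [x for x, _ in curve]
        i = bisect_right(xs, minutes)
        (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (minutes - x0)
```

A curve like this makes it obvious that, for this imaginary company, shortening a six hour outage matters far more than shortening a ten minute one.  A company whose losses are front-loaded would simply use a different set of breakpoints.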

Sometimes it comes down to the people working at the company.  Will they all happily take needed bathroom, coffee, snack or even lunch breaks at the time that a system fails so that they can return to work when it is fixed?  Will people go home early and come in early tomorrow to make up for a major outage?  Is there machinery that is going to sit idle?  Will the ability to respond to customers be impacted?  Will life support systems fail?  There are countless potential impacts and countless potential ways of mitigating different types of failures.  All of this has to be considered.  The cost of downtime might be a fraction of corporate revenues on a minute by minute basis or downtime might cause a loss of customers or faith that is more impactful than the minute by minute revenue generated.

Once we have some rough loss numbers to deal with we at least have a starting point.  Even if we only know that revenue is ~$10/minute and losses are expected to be around ~$5/minute we have a starting point of sorts.  If we have a full curve or a study done with some more detailed numbers, all the better.  Now we need to figure out roughly what our baseline is going to be.  A well maintained server, running on premises, with a good support contract and good backup and restore procedures can pretty easily achieve four nines of reliability.  That means that we would experience about five hours of downtime every five years.  This is actually less than the generally expected downtime of SA in most environments and potentially far less than expected levels in excellent environments like high quality datacenters with nearby parts and service.

So, based on our baseline example of about five hours every five years we can figure out our potential risk.  If we lose about $5/minute and we expect roughly 300 minutes of downtime every five years, we are looking at a potential loss of $1,500 every half decade.
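The arithmetic here is small enough to write down directly.  This is a minimal sketch using the flat per-minute loss rate from the example.  The function names are invented for illustration, and a real analysis might substitute a loss curve for the flat rate.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def expected_downtime_minutes(availability: float, years: float) -> float:
    """Expected minutes of downtime over a period at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR * years


def expected_loss(loss_per_minute: float, downtime_minutes: float) -> float:
    """Expected loss assuming a flat cost per minute of downtime."""
    return loss_per_minute * downtime_minutes


# The article's rounded figures: $5/minute over 300 minutes.
article_loss = expected_loss(5.0, 300)

# The same estimate starting from four nines over five years.
computed_minutes = expected_downtime_minutes(0.9999, 5)
computed_loss = expected_loss(5.0, computed_minutes)
```

Strictly computed, four nines over five years comes to about 263 minutes rather than 300, so the rounded figure of $1,500 is slightly on the conservative side of roughly $1,315.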

That means that at the most extreme we could never spend $1,500 to mitigate that risk, that would be financially absurd.  This happens for several reasons.  One of the biggest is that this is only a risk, spending $1,500 to protect against losing $1,500 makes little sense, but it is a very common mistake to make when people do not analyze these numbers carefully.

The biggest factor is that any mitigation technique is not completely effective.  If we manage to move our four nines system to a five nines system we would reduce only 90% of the average downtime moving us from $1,500 of loss to $150 of loss.  If we spent $1,500 for that reduction, the total “loss” would still be $1,650 (the cost of protection is a form of financial loss.)  The cost of the risk mitigation combined with the anticipated remaining impact when taken together must still be lower than the anticipated cost of the risk without mitigation or else the mitigation itself is pointless or actively damaging.
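The comparison described above can be sketched as a pair of small functions.  The names are hypothetical and `reduction` is the fraction of expected downtime that the mitigation removes, for example 0.9 when moving from four nines to five nines.

```python
def total_cost_with_mitigation(
    unmitigated_loss: float, reduction: float, mitigation_cost: float
) -> float:
    """Up-front mitigation cost plus the loss still expected afterwards."""
    remaining_loss = unmitigated_loss * (1.0 - reduction)
    return mitigation_cost + remaining_loss


def mitigation_is_worthwhile(
    unmitigated_loss: float, reduction: float, mitigation_cost: float
) -> bool:
    """True only if mitigating is strictly cheaper than accepting the risk."""
    total = total_cost_with_mitigation(unmitigated_loss, reduction, mitigation_cost)
    return total < unmitigated_loss
```

Plugging in the numbers from the example, spending $1,500 to remove 90% of a $1,500 expected loss gives a total of about $1,650, so the check fails.  This test is necessary but not sufficient: it treats the uncertain future loss as if it were as certain as the up-front spend, which it is not.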

Many may question why the total cost of risk mitigation must be lower and not simply equal.  Surely equal cost must mean that we are at a “risk break even” point?  This seems true on the surface, but because we are dealing with risk this is not the case.  Risk mitigation is a certain cost: financial damage that we take up front in the hopes of reducing losses tomorrow.  But the risk for tomorrow is a guess, hopefully a well educated one, but only a guess.  The cost today is certain.  Taking on certain damage today in the hopes of reducing possible damage tomorrow only makes sense when the damage today is small and the expected or possible damage tomorrow is very large and the effectiveness of mitigation is significant.

Included in the idea of “certain cost up front” to reduce “possible cost tomorrow” is the idea of the time value of money.  Even if an outage was of a known size and time, we would not spend the same money today to mitigate it tomorrow because our money is more valuable today.
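The time value of money is usually expressed with the standard present value formula, which discounts a future amount by a yearly rate.  The sketch below applies it to a future loss.  The 5% discount rate used in the example is purely an assumption for illustration and is not a figure from this article.

```python
def present_value(future_amount: float, annual_rate: float, years: float) -> float:
    """Discount a future amount back to today's money.

    Standard formula: PV = FV / (1 + r) ** n
    """
    return future_amount / (1.0 + annual_rate) ** years


# A $1,500 loss expected five years out, discounted at an assumed 5% per year.
loss_in_todays_money = present_value(1500.0, 0.05, 5)
```

Under that assumed rate, a $1,500 loss five years from now is worth only about $1,175 today, which lowers even further the amount it could make sense to spend up front.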

In the most dramatic cases, we sometimes see IT departments demanding tens or hundreds of thousands of dollars be spent up front to avoid losing a few thousand dollars, maybe, at some point possibly many years in the future.  This is a strategy that we can refer to as “shooting ourselves in the face today to avoid maybe getting a headache tomorrow.”

It is included in the concept of evaluating the risk mitigation but it should be mentioned specifically that in the case of IT equipment there are many examples of attempted risk mitigation that may not be as effective as they are believed to be.  For example, having two servers that sit in the same rack will potentially be very effective for mitigating the risk of host hardware failure, but will not mitigate against natural disasters, site loss, fire, most cases of electrical shock, fire suppression activation, network interruptions, most application failure, ransomware attack or other reasonably possible disasters.

It is common for storage devices to be equipped with “dual controllers” which gives a strong impression of high reliability, but generally these controllers are inside a single chassis with shared components.  Even if the components are not shared, often the firmware is shared and communications between components are complex, often leading to failures where the failure of one component triggers the failure of another.  This quite frequently makes them LA devices rather than the SA or HA that people expected when purchasing them.  So it is very critical to consider which risks a mitigation strategy will actually mitigate and whether the mitigation technique is likely to be effective.  No technique is completely effective, there is always a chance for failure, but some strategies and techniques are more broadly effective than others and some are simply misleading or actually counter productive.  If we are not careful, we may implement costly products or techniques that actively undermine our goals.

Some techniques and products used in the pursuit of high availability are rather expensive, which might include buying redundant hardware, leasing another building, installing expensive generators or licensing special software.  There are low cost techniques and software as well, but in most cases any movement towards high availability will result in a respectively large outlay of investment capital in order to achieve it.  It is absolutely critical to keep in mind that high availability is a process, there is no way to simply buy high availability.  Achieving HA requires good documentation, procedures, planning, support, equipment, engineering and more.  In the systems world, HA is normally approached first from an environmental perspective with failover power generators, redundant HVAC systems, power conditioning, air filtration, fire suppression systems and more to ensure that the environment for the availability is there.  This alone can often make further investment unnecessary as this can deliver incredible results.  Then comes HA system design ensuring that not just one layer of a technology stack is highly available but that the entire stack is allowing for the critical applications, data or services to remain functional during as much time as possible.  Then looking at site to site redundancy to be able to withstand floods, hurricanes, blizzards, etc.  Of course there are completely different techniques such as utilizing cloud computing services hosted remotely on our behalf.  What matters is that high availability requires broad thinking and planning, cannot simply be purchased as a line item and is judged by the ability to return a risk factor providing a resulting uptime or likelihood of uptime much higher than a “standard” system design would deliver.

What is often surprising, almost shocking, to many businesses and especially to IT professionals is what happens when the numbers are crunched.  IT professionals rarely undertake financial risk analysis and are constantly being told that HA is a necessity for any business and that buying the latest HA products is unquestionably how their budgets should be spent.  Yet when the reality of the costs and effectiveness of risk mitigation strategies is considered, high availability has very little place in any organization, especially those that are small or have highly disparate workloads.  In the small and medium business market it is almost universal to find that high availability approaches are far too costly and complex (complexity in turn brings risk, mostly in the form of a lack of experience around techniques and risk assessment) to ever be justified by the potential damage of the outage from which the mitigation is hoped to protect.  There are exceptions, of course, and there are many businesses for which high availability solutions are absolutely sensible, but these are the exception and very far from being the norm.

It is also sensible to think of the need for high availability on a per-workload basis and not department, company or technology wide.  In a small business it is common for all workloads to share a common platform and the need of a single workload for high availability may sweep other, less critical, workloads along with it.  This is perfectly fine and a great way to offset the cost of the risk mitigation of the critical workload through ancillary benefit to the less critical workloads.  In a larger organization where there is a plethora of platform approaches used for differing workloads it is common for only certain workloads that are both highly critical (in terms of risk from downtime impact) and whose risk can be practically mitigated (the ability to mitigate risk can vary dramatically between different types of workloads) to have high availability applied to them and for other workloads to be left to standard techniques.

Examples of workloads that may be critical and can be effectively addressed with high availability might be an online ordering system where the latency created by multi-regional replication has little impact on the overall system but losing orders could be very financially impactful should a system fail.  An example of a workload where high availability might be easy to implement but ineffectual would be an internal intranet site serving commonly asked HR questions; it would simply not be cost effective to avoid small amounts of occasional downtime for a system like this.  An example of a system where risk is high but the cost or effectiveness of risk mitigation makes it impractical or even impossible might be a financial “tick” database requiring massive amounts of low latency data to be ingested and the ability to maintain a replica would not only be incredibly costly but could introduce latency that would undermine the ability of the system to perform adequately.  Every business and workload is unique and should be evaluated carefully.

Of course high availability techniques can be actioned in stages; it is not an all or nothing endeavor.  It might be practical to mitigate the risk of system level failure by having application layer fault tolerance to protect against failure of system hardware, virtualization platform or storage.  But for the same workload it might not be valuable to protect against the loss of a single site.  If a workload only services a particular site or is simply not valuable enough for the large investment needed to make it fail over between sites it could easily fall “in the middle.”  It is very common for workloads to implement only partial high availability solutions, often because an IT department may only be responsible for a portion of them and have no say over things like power support and HVAC, but probably most commonly because some high availability techniques are seen as high visibility and easy to sell to management while others, such as high quality power and air conditioning, often are not even though they may easily provide a better bang for the buck.  There are good reasons why certain techniques may be chosen and not others as they affect different risk components and some risks may have a differing impact on an individual business or workload.

High availability requires careful thought as to whether it is worth considering and even more careful thought as to implementation.  Building true HA systems requires a significant amount of effort and expertise and generally substantial cost.  Understanding which components of HA are valuable and which are not requires not just extensive technical expertise but financial and managerial skills as well.  Departments must work together extensively to truly understand how HA will impact an organization and when it will be worth the investment.  It is critical that it be remembered that the need for high availability in an organization or for a workload is anything but a foregone conclusion and it should not be surprising in the least to find that extensive high availability or even casual high availability practices turn out to be economically impractical.

In many ways this is because standard availability has reached such a state that there is continuously less and less risk to mitigate.  Technology components used in a business infrastructure, most notably servers, networking gear and storage, have become so reliable that the amount of downtime that we must protect against is quite low.  Most of the belief in the need for knee jerk high availability comes from a different era when reliable hardware was unaffordable and even the most expensive equipment was rather unreliable by modern standards.  This feeling of impending doom that any device might fail at any moment comes from an older era, not the current one.  Modern equipment, while obviously still carrying risks, is amazingly reliable.

In addition to other risks, over-investing in high availability solutions carries financial and business risks that can be substantial.  It increases technical debt in the face of business uncertainty.  What if the business suddenly grows, or worse, what if it suddenly contracts, changes direction, gets purchased or goes out of business completely?  The investment in the high availability is already spent even if the need for its protection disappears.  What if technology or location change?  Some or all of a high availability investment might be lost before it would have been at its expected end of life.

As IT practitioners, evaluating the benefits, risks and costs of technology solutions is at the core of what we do.  Like everything else in business infrastructure, determining the type of risk mitigation, the value of protection and how much is financially proper is our key responsibility and cannot be glossed over or ignored.  We can never simply assume that high availability is needed, nor that it can simply be skipped.  It is in analysis of this nature that IT brings some of its greatest value to organizations.  It is here that we have the potential to shine the most.

IT’s Most Needed Skills

IT does not exist in a bubble.  IT is a business enabler, a way of taking an existing business and making it more efficient, cost effective, nimble and capable.  Except for the home hobbyist, and even there this isn’t quite true, IT is subject to the business that it supports.  It has a goal, an objective and a budget.  The business provides the context in which IT exists.

I speak with a wide array of IT professionals every day.  I work with both the enterprise and small business markets and I work with a review board overseeing IT and software development programs at a university.  On that board we were asked, “What is the single, most critical skill lacking in college graduates seeking jobs in IT today?”

The answer to that question was overwhelmingly “the ability to write and communicate effectively.”  No, it was not a technology skill.  It was not even a skill taught by the computer or technology department.  What we needed was for the English department to push these students harder and for the university to require more and harder English classes for non-majors and to demand that those skills be applied to classes taken in all disciplines and not relegate those skills purely for use in English-focused classes.

The ability to communicate effectively is critical in any profession but IT especially is a field where we need to be able to communicate large amounts of data, both technical and esoteric, rapidly, accurately and with extreme precision.  Most fields don’t penalize you to the same degree as IT for not knowing the correct use of white space or capitalization, spelling or context.  IT demands a level of attention to detail rare even in professional fields.

As a prime example I have seen the misuse of “Xen server” to mean “XenServer” no fewer than twenty times in attempts to get technical assistance, which inevitably led to useless advice since these are the proper names of two different products with unique configurations, vendors and troubleshooting procedures.  How many lost hours of productivity for each of those companies have happened just because someone cannot properly identify or communicate the software product with which they are seeking assistance?  Worse yet, I’ve seen this same product referred to as ZenServer or ZEN server, both of which are the correct names for other software products.  Yes, four different products that are all homonyms that require proper spelling, spacing and capitalization to reliably differentiate one from another.  The worst scenario is when someone writes “Xenserver” or “Xen Server”, neither being the exact name of any product, where the ambiguity means that there are at least two products equally far from matching what is given.  The person speaking often feels that the need for precision is “annoying” but fails to understand why the advice that they receive doesn’t seem to apply to their situation.

I’ve seen confusion come from many written inaccuracies, such as mistaking versions of Windows or confusing “VMware Server” for “VMware ESXi” because someone refers to both or either simply by the name of the vendor and not of the product, forgetting that that one vendor makes at least five different virtualization products.  These are basic written language skills necessary to successful work in IT.  Not only does lacking this skill create technical challenges in communicating to peers but it also implies an inability to document or search for information reliably, some of the most common and critical IT skills.  This, of course, also means that an IT professional in this position may be responsible for purchasing the wrong product from the wrong vendor simply because they did not take the time to be accurate in repeating product or vendor names or may cause system damage by following inappropriate advice or documentation.

Good communications skills go far beyond technical documentation and peer interactions – being able to communicate to the business or other support groups within the organization, to vendors or to customers is extremely important as well.  IT, more than nearly any other field, acquires, processes and disseminates information.  If an IT professional is unable to do so accurately their value diminishes rapidly.

The IT professional seeking to advance their career beyond pure technical pursuits needs the ability to interact with other departments most notably operations and business management in most cases.  These are areas within most companies where written word, as well as presentation, is highly valued and the IT team member able to present recommendations to management will have better visibility within the organization.  Technology departments need people with these skills in order to successfully present their needs to the business.  Without this skill within the ranks, IT departments often fail to push critical projects, secure funding or obtain necessary visibility to work effectively within the organization.

The second big skill needed in IT departments today is an understanding of business, both business in general and the specific business of their own organization.  As I said at the beginning of this article, IT is a business enabler.  If IT professionals do not understand how IT relates to their business they will be poorly positioned to evaluate IT needs and make recommendations in the context of the business.  Everything that IT does it does for the business, not for technology and not for its own purposes.

Within the IT ranks it is easy to become excited about new products and approaches.  We love these things and likely this was a factor in our wanting to work in IT.  But finding the latest software release to be exciting or the latest, fastest hardware to be “neat” are not sentiments that will pass muster with a business professional who needs to understand the ramifications of a technology investment.  IT professionals wishing to move beyond being purely technology implementers into being technology recommenders and advisers need to be able to speak fluently to business people in their own language and to frame IT decisions within the context of the business and its needs.