When a Backup Is Not A Backup

Conceptually the idea of “backup” has become a murky area within IT.  Everyone seems to have their own concepts of what a backup is and how they expect it to behave.  This can be dangerous when the person supplying backup and the person consuming backup have a mismatch in expectations.  I see this happen every day even with traditional backup mechanisms.  With new types of backups appearing on a regular basis the opportunities for miscommunications and loss of data become much more pronounced.

By traditional backups I refer to the traditional world of tape-based backups with a grandfather – father – son rotational strategy in place, just to set the stage for the discussion.  New backups might include system images, disk-based backups, continuous backups and backups to “the cloud” or online backups.  The world of backups is evolving rapidly and now is when misunderstandings begin to put corporate data resources at risk.

So what exactly is a “backup”?  The concept sounds simple, but what do we really mean when we use the term?  Do we mean the ability to restore a system after it has failed?  The ability to roll back to an earlier version of a file?  Perhaps archiving of data when the original no longer exists?  How long do which files get kept?  Does this apply only to file data or are emails and databases included too?  Do we only need to restore in case of system failure or do we need the ability to restore granular data as well?  Do we need only one copy or do we need copies of every version of a file?

Now, with the additional risks posed by threats like ransomware, we have more concerns than ever before.  Ideas such as not just versioning but potentially unlimited versioning, and air gapping between systems and backups, have become a concern where before they generally were not.

Many organizations, especially smaller ones, choose to approach backups a bit differently from enterprises and, in effect, eschew backups completely.  They “take backups” but then often delete the original files.  And instead of keeping many copies of the files that have been “backed up” they opt to keep only a single copy (or multiple versions that are co-dependent on each other).  This means that what they have is not really a backup, but rather an archive.  If the one disk or tape on which the file is stored becomes damaged, the file is lost completely.

The term backup implies that there are at least two copies of some piece of data that do not rely on each other.  An archive does not imply this and just implies that we have taken data from production to another system, presumably one that is lower cost and likely much slower and harder to retrieve from.  Archived data implies no redundancy, unlike the term backup.

If we “take a backup” and then proceed to delete the original data we no longer have a backup, and the file that is stored in the “backup system”, whether this is on disk, a tape in a vault or whatever, turns into an archive of the original data rather than a backup of it.  It is now our source file, rather than being a copy.  This is some of the magic of digital media: copies are clones rather than mimics, so the archival copy is legitimately the original in every sense.
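
To make the distinction concrete, here is a minimal Python sketch of the rule described above.  The `DataCopy` class, its `depends_on` field and the `classify` function are hypothetical names invented for illustration; they do not come from any backup product, and the sketch deliberately ignores shared failure domains.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCopy:
    """One stored copy of a piece of data (hypothetical model)."""
    location: str                     # e.g. "production", "tape-vault"
    depends_on: Optional[str] = None  # another copy needed to restore this one


def independent_copies(copies: List[DataCopy]) -> List[DataCopy]:
    """Return only the copies that can be restored without any other copy."""
    return [c for c in copies if c.depends_on is None]


def classify(copies: List[DataCopy]) -> str:
    """Label the data by how many independent copies still exist."""
    count = len(independent_copies(copies))
    if count == 0:
        return "lost"
    if count == 1:
        return "archive"  # a single source of truth, no redundancy
    return "backup"       # at least two copies that do not rely on each other


original = DataCopy("production")
tape = DataCopy("tape-vault")

print(classify([original, tape]))  # both copies exist
print(classify([tape]))            # the original has been deleted
```

The first call reports a backup and the second an archive: nothing about the tape changed, only the existence of the original.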

This may seem pedantic but it truly is not.  If a business is paying for backups, they likely assume that that cost is going towards having some redundancy, not just a single copy of data.  And if you have regulations requiring you to keep backups for compliance reasons, only having an archival copy is a clear violation of that requirement.  Having two systems fail and being unable to retrieve data is an edge case that all compliance must accept.  But having an archival system fail where a backup was required but was not kept is not an acceptable scenario.

For this reason, and many more, concepts like the 3-2-1 backup methodology make sense because this approach guarantees that backups are kept within the backup system and originals do not need to be kept on production.  In some ways of thinking, this approach could be thought of as merging archiving and backups into a single system which adds much clarity to the design.
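
As a rough illustration, the rule can be checked mechanically.  This Python sketch assumes the common reading of 3-2-1 (at least three copies, on at least two kinds of media, with at least one offsite); the `StoredCopy` class and `meets_3_2_1` function are hypothetical names, and which copies you choose to count is a policy decision the code does not make for you.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StoredCopy:
    """One copy of the protected data (hypothetical model)."""
    name: str
    media: str     # e.g. "disk", "tape", "object-storage"
    offsite: bool  # True if held away from the primary site


def meets_3_2_1(copies: List[StoredCopy]) -> bool:
    """Check the three conditions of the assumed 3-2-1 reading."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


plan = [
    StoredCopy("backup-server", media="disk", offsite=False),
    StoredCopy("weekly-tape", media="tape", offsite=False),
    StoredCopy("hosted-copy", media="object-storage", offsite=True),
]

print(meets_3_2_1(plan))      # all three conditions hold
print(meets_3_2_1(plan[:2]))  # only two copies, none offsite
```

Note that the example plan satisfies the rule without counting production at all, which is the point made above: the backup system stands on its own.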

Whatever backup system works for you, be cognizant that backups mean independent copies and that, in many ways, independent copies that do not share failure domains have become nearly a requirement for all backups today.

The Software RAID Inflection Point

In June, 2001 something amazing happened in the IT world: Intel released the Tualatin based Pentium IIIS 1.0 GHz processor. This was one of the first few Intel processors (IA32 architecture) to have crossed the 1 GHz clock barrier and the first of any significance. It was also special in that it had dual processor support and a double sized cache compared to its Coppermine based forerunners or its non-“S” Tualatin successor (that followed just one month behind). The PIIIS system boards were insanely popular in their era and formed the backbone of high performance commodity servers, such as Proliant and PowerEdge, in 2001 and for the next few years, culminating in the Pentium IIIS 1.4GHz dual processor systems that were so important that they resulted in kicking off the now famous HP Proliant “G” naming convention. The Pentium III boxes were “G1”.

What does any of this have to do with RAID? Well, we need to step back and look at where RAID was up until May, 2001. From the 1990s and up to May, 2001 hardware RAID was the standard for the IA32 server world which mainly included systems like Novell Netware, Windows NT 4, Windows 2000 and some Linux. Software RAID did exist for some of these systems (not Netware) but servers were always struggling for CPU and memory resources and expending these precious resources on RAID functions was costly and would cause applications to compete with RAID for access and the systems would often choke on the conflict. Hardware RAID solved this by adding dedicated CPU and RAM just for these functions.

RAID in the late 1990s and early 2000s was also very heavily based around parity striping: RAID 5 and, to a lesser degree, RAID 6. Disks were tiny and extremely expensive for their capacity, so squeezing maximum capacity out of the available disks was of utmost priority. Risks like URE (unrecoverable read errors) were so trivial at such small capacities that parity RAID was very reliable, all things considered. The factors were completely different than they would be by 2009. In 2001, it was still common to see 2.1GB, 4.3GB and 9GB hard drives in enterprise servers!

Because parity RAID was the order of the day, and many drives were typically used on each server, RAID had more CPU overhead on average in 2000 than it did in 2010! So the impact of RAID on system resources was very significant.

And that is the background. But in June, 2001 suddenly the people who had been buying very low powered IA32 systems had access to the Tualatin Pentium IIIS processors with greatly improved clock speeds, efficient dual processor support and double sized on chip caches that presented an astounding leap in system performance literally overnight. With all this new power and no corresponding change in software demands, systems that traditionally were starved for CPU and RAM suddenly had more than they knew how to use, especially as additional threads were available and most applications of the time were single threaded.

The system CPUs, even in the Pentium III era, were dramatically more powerful than the small CPUs on the hardware RAID controllers, which were often entry level PowerPC or MIPS chips. The available system memory was often much larger than the hardware RAM caches, and investing in extra system memory was often far more effective and generally advantageous. So, with free capacity available on the main system, RAID functions could, on average, be moved from the hardware RAID cards to the central system and gain performance, even while giving up the additional CPU and RAM of the hardware RAID cards. This was not true on overloaded systems, those starved for resources. The gain was also more relevant for parity RAID systems, with RAID 6 benefiting the most and non-parity systems like RAID 1 and 0 benefiting the least.

But June, 2001 was the famous inflection point: before that date the average IA32 system was faster when using hardware RAID, and after June, 2001 new systems purchased would on average be faster with software RAID. With each passing year the advantages have leaned more and more towards software RAID as the abundance of underutilized CPU cores, idle threads and spare RAM has exploded. The only factor that has moved in favor of hardware RAID is the drop in parity RAID usage, as mirrored RAID took over as the standard once disk sizes increased dramatically and capacity costs dropped.

Today it has been more than fifteen years since the notion that hardware RAID would be faster was retired. The belief lingers on due primarily to the odd “Class of 1998” effect. But this has long been a myth repeated improperly by those that did not take the time to understand the original source material. Hardware RAID continues to have benefits, but performance has not been one of them for the majority of the time that we’ve had RAID and is not expected to ever rise again.

New Hyperconvergence, Old Storage

We all dream of the day that we get to build a new infrastructure from the ground up without any existing technical debt to hold us back.  A greenfield deployment where we pick what is best, roll it out fresh and enjoy.  But most of us live in the real world where that is not very realistic and what we actually face is a world where we have to plan for the future but work with what we already have, as well.

Making do with what we have is nearly an inevitable fact of life in IT, and approaching storage while moving from an existing architecture to hyperconvergence will be no different.  In a great many cases we will be facing a situation where an existing investment in storage is in place that we do not want to simply discard but that does not necessarily fit neatly into our vision of a hyperconverged future.

There are obvious options to consider, of course, such as returning leased gear, retiring older equipment or selling still useful equipment outright.  These are viable options and should be considered.  Eliminating old gear or equipment that does not fit well into the current plans can be beneficial as we can simplify our networks, reduce power consumption and possibly even recoup some degree of our investments.

In reality, however, these options are rarely financially viable and we need to make more productive use of our existing technology investments.  What options are available to us, of course, depend on a range of factors.  But we will look at some examples of how common storage devices can be re-purposed in a new hyperconverged-based system in order to maintain their utility either until they are ready to retire or even, in some cases, indefinitely.

The easiest re-purposing of existing storage, and this applies equally to both NAS and SAN in most cases, is to designate them as backup or archival targets.  Traditional NAS and SAN devices are excellent backup hardware and are generally usable by nearly any backup mechanism, regardless of approach or vendor.  And because they are generic backup targets if a mixture of backup mechanisms are used, such as agent based, agentless and custom scripts, these can all work to the same target.  Backups so rarely get the attention and investment that they deserve that this is not just the easiest but often the most valuable use of pre-existing storage infrastructure.

Of course anything that is appropriate for backups can also be used for archival storage.  Archival needs are generally less common (only a percentage of firms need archival storage while all need backups) and are of lower priority, so this is more of an edge reuse case, but still one to consider, especially for organizations that may be working to re-purpose a large number of possibly disparate storage devices.  However it is worth noting that moving to hyperconvergence does tend to “flatten” the compute and storage space in a way that may easily introduce a value for lower performance, lower priority archival storage that may not have existed, or not existed so obviously, prior to the rearchitecting of the environment.

NAS has the uniquely advantageous use case of being usable as general purpose network storage, especially for home directories of end users.  NAS storage can be used in so many places on the network that it is very easy to continue to use after moving core architectures.  The most popular case is for users’ own storage needs, with the NAS connected directly to end user devices, which allows storage capacity, performance and network traffic to be offloaded from the converged infrastructure to the NAS.  It would actually be very rare to remove a NAS from a hyperconverged network as its potential utility is so high and apparent.

Both SAN and NAS have the potential to be attached directly to the virtual machines running on top of a hyperconverged infrastructure as well.  In this way they can continue to be utilized in a traditional manner until such time as they are no longer needed or appropriate.  While attaching network storage to a VM directly is not often the recommended approach, there are use cases for it, and it allows systems to continue to behave as they always have in the physical realm.  This is especially useful for mapped drives and user directories via a NAS, much as we mentioned for end user devices, but the cases are certainly not limited to this.

A SAN can provide some much needed functionality in some cases for certain workloads that require shared block storage that otherwise is not available or exposed on a platform.  Workloads on a VM will use the SAN as they always have and not even be aware that they are virtualized or converged.  Of course we can also attach a SAN to a virtualized file server or NAS head running on our hyperconverged infrastructure if the tiering for that kind of workload is deemed appropriate as well.

Working with existing infrastructure when implementing a new one does present a challenge, but one that we can tackle with creativity and a logical approach.  Storage is a nearly endless challenge and having existing storage to re-purpose may easily end up being exceptionally advantageous.

SMBs Must Stop Looking to BackBlaze for Guidance

I have to preface this article, because people often take these things out of context and react strongly to things that were never said, with the disclaimer that I think that BackBlaze does a great job, has brilliant people working for them and has done an awesome job of designing and leveraging technology that is absolutely applicable and appropriate for their needs.  Nothing, and I truly mean nothing, in this article is ever to be taken out of context and stated as a negative about BackBlaze.  If anything in this article appears or feels to state otherwise, please reread and confirm that such was said and, if so, inform me so that I may correct it.  There is no intention of this article to imply, in any way whatsoever, that BackBlaze is not doing what is smart for them, their business and their customers.  Now on to the article:

I have found over the past many years that many small and medium business IT professionals have become enamored of what they see as a miracle of low cost, high capacity storage in what is known as the BackBlaze POD design.  Essentially the BackBlaze POD is a low cost, high capacity, low performance, nearly whitebox storage server built from a custom chassis and consumer parts to make a disposable storage node used in large storage RAIN arrays leveraging erasure encoding.  BackBlaze custom designed the original POD, and released its design to the public, for exclusive use in their large scale hosted backup datacenters where the PODs function as individual nodes within a massive array of nodes with replication between them.  Over the years, BackBlaze has updated its POD design as technology has changed and issues have been addressed.  But the fundamental use case has remained the same.

I have to compare this to the SAM-SD approach to storage which follows a similar tack but does so using enterprise grade, supported hardware.  These differences sometimes come off as trivial, but they are anything but trivial; they are key underpinnings to what makes the different solutions appropriate in different situations.  The idea behind the SAM-SD is that storage needs to be reliable and designed from the hardware up to be as reliable as possible and well supported for when things fail.  The POD takes the opposite approach, making the individual server unreliable and ephemeral in nature and designed to be disposed of rather than repaired at all.  The SAM-SD design assumes that the individual server is important, even critical – anything but disposable.

The SAM-SD concept, which is literally nothing more than an approach to building open storage, is designed with the SMB storage market in mind.  The BackBlaze POD is designed with an extremely niche, large scale, special case consumer backup market in mind.  The SAM-SD is meant to be run by small businesses, even those without internal IT.  The POD is designed to be deployed and managed by a full time, dedicated storage engineering team.

Because the BackBlaze POD is designed by experts, for experts in the very largest of storage situations it can be confusing and easily misunderstood by non-storage experts in the small business space.  In fact, it is so often misunderstood that objections to it are often met with “I think BackBlaze knows what they are doing” comments, which demonstrates the extreme misunderstanding that exists with the approach.  Of course BackBlaze knows what they are doing, but they are not doing what any SMB is doing.

The release of the POD design to the public causes much confusion because it is only one part of a greater whole.  The design of the full data center and the means of redundancy and mechanisms for such between the PODs is not public, but is proprietary.  So the POD itself represents only a single node of a cluster (or Vault) and does not reflect the clustering itself, which is where the most important work takes place.  In fact the POD design itself is nothing more than the work done by the Sun Thumper and SAM-SD projects of the past decade without the constraints of reliability.  The POD should not be a novel design, but an obvious one.  One that has for decades been avoided in the SMB storage space because it is so dramatically non-applicable.

Because the clustering and replication aspects are ignored when talking about the BackBlaze POD, some huge assumptions tend to be made about the capacity of a POD, assuming much lower overhead than BackBlaze themselves get for the POD infrastructure, even at scale.  For example, in RAID terms, this would be similar to assuming that the POD is RAID 6 (with only 5% overhead) because that is the RAID of an individual component when, in fact, RAID 61 (more than 50% overhead) is used!  In fact, many SMB IT professionals looking to use a POD design actually consider simply using RAID 6 on top of using only a single POD.  The degree to which this does not follow BackBlaze’s model is staggering.
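
The arithmetic behind this comparison is simple enough to sketch.  The Python below assumes a single RAID 6 array that gives up two drives to parity, and a RAID 61 that mirrors two such arrays; the count of 40 drives is an assumption chosen only because it yields the 5% figure, and the function names are hypothetical.

```python
def raid6_overhead(drives: int) -> float:
    """Fraction of raw capacity lost to parity in one RAID 6 array."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least four drives")
    return 2 / drives  # two drives' worth of parity, regardless of array size


def mirrored_overhead(inner_overhead: float) -> float:
    """Overhead when two identical arrays are mirrored (for example RAID 61)."""
    usable_fraction = (1 - inner_overhead) / 2  # mirroring halves what is usable
    return 1 - usable_fraction


single = raid6_overhead(40)            # assumed drive count, for illustration only
mirrored = mirrored_overhead(single)

print(f"RAID 6:  {single:.1%} overhead")
print(f"RAID 61: {mirrored:.1%} overhead")
```

Under these assumptions the single array loses about a twentieth of its raw capacity while the mirrored layout loses a little over half, which is the size of the gap being described.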

BackBlaze: “Backblaze Vaults combine twenty physical Storage Pods into one virtual chassis. The Vault software implements our own Reed-Solomon encoding to spread data shards across all twenty pods in the Vault simultaneously, dramatically improving durability.”

To make the POD a consideration for the SMB market it is required that the entire concept of the POD be taken completely out of context.  Both its intended use case and its implementation.  What makes BackBlaze special is totally removed and only the most trivial, cursory aspects are taken and turned into something that in no way resembles the BackBlaze vision or purpose.

Digging into where the BackBlaze POD differs in design from the standard needs of a normal business, we find these problems:

  • The POD is designed to be unreliable, to rely upon a reliability and replication layer at the super-POD level that requires a large number of PODs to be deployed and for data to be redundant between them by means of custom replication or clustering.  Without this layer, the POD is completely out of context.  The super-POD level is known internally as the BackBlaze Vault.
  • The POD is designed to be deployed in an enterprise datacenter with careful vibration dampening, power conditioning and environmental systems.  It is less resilient to these issues than standard enterprise hardware.
  • The POD is designed to typically be replaced as a complete unit rather than repairing a node in situ.  This is the opposite of standard enterprise hardware with hot swap components designed to be repaired without interruption, let alone without full replacement.  We call this a disposable or ephemeral use case.
  • The POD is designed to be incredibly low cost for very slow, cold storage needs.  While this can exist in an SMB, typically it does not.
  • The POD is designed to be a single, high capacity storage node in a pool of insanely large capacity.  Few SMBs can leverage even the storage potential of a single POD let alone a pool large enough to justify the POD design.
  • The BackBlaze POD is designed to use custom erasure encoding, not RAID.  RAID is not effective at this scale even at the single POD level.
  • An individual POD is designed for 180TB of capacity and a Vault scale of 3.6PB.
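
The capacity figures in the last bullet can be cross-checked with a little arithmetic.  In this Python sketch the 180TB per POD and the twenty PODs per Vault come from the text above, while the split of each stripe into 17 data shards and 3 parity shards is an assumption used for illustration, not something stated in this article; the function names are hypothetical.

```python
def raw_capacity_tb(pod_tb: float, pods: int) -> float:
    """Raw capacity of a group of identical PODs, in terabytes."""
    return pod_tb * pods


def usable_capacity_tb(raw_tb: float, data_shards: int, parity_shards: int) -> float:
    """Capacity left for data once erasure coding parity is set aside."""
    total_shards = data_shards + parity_shards
    return raw_tb * data_shards / total_shards


raw = raw_capacity_tb(pod_tb=180, pods=20)
usable = usable_capacity_tb(raw, data_shards=17, parity_shards=3)  # assumed split

print(f"raw: {raw / 1000:.2f} PB, usable: {usable / 1000:.2f} PB")
```

The raw figure matches the 3.6PB Vault scale; the usable figure depends entirely on the assumed shard split.  A single POD run on its own gets none of this cross-POD protection, which is the heart of the problem for an SMB.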

Current reference of the BackBlaze POD 5:

In short, the BackBlaze POD is a brilliant underpinning to a brilliant service that meets a very specific need that is as far removed from the needs of the SMB storage market as one can reasonably be.  Respect BackBlaze for their great work, but do not try to emulate it.