Hot Spare or a Hot Mess

A common approach to adding a layer of safety to RAID is to have spare drive(s) available so that replacement time for a failed drive is minimized.  The most extreme form of this is referred to as having a “hot spare”: a spare drive sitting in the array but unused until the array detects a drive failure.  At that point the system automatically disables the failed drive and enables the hot spare, the same as if a human had popped the one drive out of the array and popped in the other, allowing a resilver operation (a rebuilding of the array) to begin as soon as possible.  This can bring the time to swap in a new drive from hours or days down to seconds and, in theory, can provide an extreme increase in safety.

First, I’d like to address what I personally feel is a mistake in the naming conventions. What we refer to as a hot spare should, I believe, actually be called a warm spare because it is sitting there ready to go but does not contain the necessary data to be used immediately.  A spare drive stored outside of the chassis, one that requires a human to step in and swap the drives manually, would be a cold spare.  To truly be a hot spare a drive should be full of data and, therefore, would be a participatory member of the RAID array in some capacity.  Red Hat has a good article on how this terminology applies to disaster recovery sites for reference.  This differentiation is important because what we call a hot spare does not already contain data and does not immediately step in to replace the failed drive but instead steps in to immediately begin the process of restoring the lost drive – a critical differentiation.

In order to keep concepts clear, from here on out I will refer to what vendors call hot spares as “warm spares.”  This will make sense in short order.

There are two main concerns with warm spares.  The first is the ineffectual nature of the warm spare in most use cases and the second is the “automated array destruction” risk.

Most people approach the warm spare concept as a means of mitigating the high risk of secondary drive failure on a parity RAID 5 array.  RAID 5 arrays protect only against the failure of a single disk within the array.  Once a single disk has failed the array is left with no form of parity and any additional drive failure results in the total loss of the array.  RAID 5 is chosen because it is very low cost for the given capacity and sacrifices reliability in order to achieve this cost effectiveness.   Because RAID 5 is therefore risky in comparison to other RAID options, such as RAID 6 or RAID 10, it is common to implement a warm spare in order to minimize the time that the array is left in a degraded state allowing the array to begin resilvering itself as quickly as possible.

The more relevant takeaway here is that warm spares are generally used to compensate for choosing less reliable RAID array types as a cost saving measure.  Warm spares are dramatically more common in RAID 5 arrays, followed by RAID 6 arrays, both of which are chosen over RAID 10 due to cost for capacity, not for reliability or performance.  There is one case where the warm spare idea truly does make sense for added reliability, and that is in RAID 10 with a warm spare, but we will come to that.  Outside of that scenario I feel that warm spares make little sense in the real world.

We will start by examining RAID 1 with a warm spare.  RAID 1 consists of two or more drives in a mirror.  Adding a warm spare is nice in that if one of the mirrored drives dies the warm spare will immediately begin mirroring the remaining drive and you will be protected again in short order.  That is wonderful, except for one minor flaw: instead of being used as a warm spare, that same drive could have been added to the RAID 1 array all along, where it would have been a tertiary mirror.  In this tertiary mirror capacity the drive would have added to the overall performance of the array, giving a nearly fifty percent read performance boost with write performance staying level, and would have provided instant protection in case of a drive failure rather than “as soon as it remirrors” protection.  Basically it would have been a true “hot spare” rather than a warm spare.  So without spending a penny more the system would have had better drive array performance and better reliability simply by having the extra drive in a hot “in the array” capacity rather than sitting warm and idle waiting for disaster to strike.

With RAID 5 we see an even more dramatic warning against the warm spare concept, here where it is more common than anywhere else.  RAID 5 is single parity RAID with the ability to rebuild, using the parity, any one drive in the array that fails.  This is where the real problems begin.  Unlike in RAID 1, where a remirroring operation might be quite quick, a RAID 5 resilver (rebuild) has the potential to take quite a long time.  The warm spare will not assist in protecting the array until this resilver process completes successfully.  This commonly takes many hours, easily takes days, and can possibly take weeks or months depending on the size of the array and how busy the array is.  If we took that same warm spare drive and instead tasked it with being a member of the array with an additional parity stripe we would achieve RAID 6.  The same set of drives that we have for RAID 5 plus warm spare would create a RAID 6 array of the exact same capacity.  Again, like the RAID 1 example above, this would be much like having a hot spare, where the drive is participating in the array with live data rather than sitting idly by waiting for another drive to fail before kicking in to begin the process of taking over.  In this capacity the array degrades to a RAID 5 equivalent in case of a failure but without any rebuild time, so the additional drive is useful immediately rather than only after a possibly very lengthy resilver process.  So for the same money and the same capacity, setting up the drives in RAID 6 rather than RAID 5 plus warm spare is a complete win.
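
The capacity claim is easy to check with a little arithmetic.  What follows is only a minimal Python sketch, assuming equal-sized drives and the usual parity overhead of one drive for RAID 5 and two for RAID 6.  The function name `usable_drives` and the eight drive example are made up for illustration and do not come from any RAID tool.

```python
def usable_drives(level: str, members: int) -> int:
    """Return how many drives' worth of data an array can hold.

    `members` counts only drives actively in the array (an idle spare
    is not a member). Assumes all drives are the same size.
    """
    parity_overhead = {"raid5": 1, "raid6": 2}
    if level not in parity_overhead:
        raise ValueError(f"unsupported level: {level}")
    minimum = parity_overhead[level] + 2  # 3 drives for RAID 5, 4 for RAID 6
    if members < minimum:
        raise ValueError(f"{level} needs at least {minimum} drives")
    return members - parity_overhead[level]


purchased = 8  # hypothetical drive count, chosen only for illustration

# Option A: RAID 5 across seven drives, with the eighth sitting idle as a spare.
raid5_plus_spare = usable_drives("raid5", purchased - 1)

# Option B: RAID 6 across all eight drives.
raid6 = usable_drives("raid6", purchased)

print(f"RAID 5 + spare: {raid5_plus_spare} drives of capacity")
print(f"RAID 6:         {raid6} drives of capacity")
```

Both options come out to the same usable capacity for any drive count; the only difference is whether the extra drive is doing work or sitting idle.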

We can continue this example with RAID 6 plus warm spare.  This one is a little less easy to define because in most RAID systems, except for the somewhat uncommon RAIDZ3 from ZFS, there is no triple parity system available one step above RAID 6 (imagine if there were a RAID 7, for example.)  If there were, the exact argument made for RAID 5 plus warm spare would apply to RAID 6 plus warm spare.  In the majority of cases RAID 6 with a warm spare must justify itself against a RAID 10 array.  RAID 10 is more performant and far more reliable than a RAID 6 array, but RAID 6 is generally chosen to save money in comparison to RAID 10.  To offset RAID 6’s fragility, warm spares are sometimes employed.  In some cases, such as a small five disk RAID 6 array with a warm spare, this is dollar for dollar equivalent to a six disk RAID 10 array without a warm spare.  In larger arrays the cost benefit of RAID 6 does become apparent, but the larger the cost savings the larger the risk differential, as parity RAID systems increase risk with array size much more quickly than do mirror based RAID systems like RAID 10.  Any money saved today is saved at the risk of outage or data loss tomorrow.
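
To see where the costs are equal and where they diverge, here is a rough Python sketch that counts how many drives must be purchased to reach a given usable capacity.  It assumes equal-sized, equally priced drives; `drives_to_buy` and the layout names are hypothetical helpers written for this illustration only.

```python
def drives_to_buy(layout: str, data_drives: int) -> int:
    """Drives purchased to get `data_drives` drives' worth of usable space."""
    if data_drives < 2:
        raise ValueError("both layouts need at least two drives of capacity")
    if layout == "raid6_plus_spare":
        # data drives + two drives of parity + one idle warm spare
        return data_drives + 2 + 1
    if layout == "raid10":
        # every data drive has exactly one mirror partner
        return data_drives * 2
    raise ValueError(f"unsupported layout: {layout}")


for data_drives in (3, 4, 6, 8, 12):
    r6 = drives_to_buy("raid6_plus_spare", data_drives)
    r10 = drives_to_buy("raid10", data_drives)
    print(f"{data_drives:>2} drives of capacity: "
          f"RAID 6 + spare buys {r6:>2}, RAID 10 buys {r10:>2}, "
          f"difference {r10 - r6}")
```

At three drives of capacity the two layouts cost the same (the five disk RAID 6 plus spare versus six disk RAID 10 case), and the gap in drive count only opens up as the array grows.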

Where a warm spare comes into play effectively is in a RAID 10 array.  There a warm spare rebuild is a mirror rebuild, like in RAID 1, which does not carry parity risks, and there is no logical extension RAID system above RAID 10 from which we are trying to save money by going with a more fragile system.  Here adding a warm spare may make sense for critical arrays because there is no more cost effective way to gain the same additional reliability.  However, RAID 10 is so reliable without a warm spare that any shop contemplating RAID 5 or RAID 6 with a warm spare would logically stop at simple RAID 10, having already surpassed the reliability they were considering settling for previously.  So only shops not considering those more fragile systems and looking for the most robust possible option would logically look to RAID 10 plus warm spare as their solution.

Just for technical accuracy, RAID 10 can be expanded for better read performance and dramatic improvement in reliability (but with a fifty percent cost increase) by moving to three disk RAID 1 mirrors in its RAID 0 stripe rather than standard two disk RAID 1 mirrors, just like we showed in our RAID 1 example.  This is a level of reliability seldom sought in the real world but it can exist and is an option.  Normally this is curtailed by drive count limitations in physical array chassis, and it competes poorly against building a completely separate secondary RAID 10 array in a different chassis and then mirroring these at a high level, effectively creating RAID 101, which is the effective result of common, high end storage array clusters today.

Our second concern is that of “automated array destruction.”  This applies only to the parity RAID scenarios of RAID 5 and RAID 6 (or the rare RAID 2, RAID 3, RAID 4 and RAIDZ3.)  With the warm spare concept, the idea is that when a drive fails the warm spare is automatically and instantly swapped in by the array controller and the process of resilvering the array begins immediately.  If resilvering were a completely reliable process this would obviously be highly welcomed.  The reality is, sadly, quite different.

During a resilver process a parity RAID array is at risk of Unrecoverable Read Errors (UREs) cropping up.  If a URE occurs in a single parity RAID resilver (that is, RAID 2 through 5) then the resilvering process fails and the array is lost completely.  This is critical to understand because no additional drive has failed.  So if the warm spare had not been present then the resilvering would not have commenced and the data would still be intact and available, just not as quickly as usual and at the small risk of secondary drive failure.  URE rates are very high with today’s large drives and with large arrays the risks can become so high as to move from “possible” to “expected” during a standard resilvering operation.
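
A back-of-the-envelope model shows how the risk moves from “possible” toward “expected” as arrays grow.  The Python sketch below is an illustration under loud assumptions, not a measurement: it treats a rated URE figure of one error per 10^14 bits read as an independent per-bit probability, it assumes a single parity resilver must read every surviving drive in full, and the 4 TB drive size is hypothetical.  Real drives may behave better or worse than this simple model suggests.

```python
import math


def ure_probability(bytes_read: float, ure_rate_per_bit: float = 1e-14) -> float:
    """Chance of at least one URE while reading `bytes_read` bytes.

    Simplified model: each bit read fails independently with probability
    `ure_rate_per_bit`, so P = 1 - (1 - rate) ** bits. log1p/expm1 keep
    the calculation accurate for such tiny per-bit rates.
    """
    bits_read = bytes_read * 8
    return -math.expm1(bits_read * math.log1p(-ure_rate_per_bit))


TB = 1e12               # drive vendors use decimal terabytes
drive_size = 4 * TB     # hypothetical drive size for illustration

for total_drives in (3, 4, 6, 8, 12):
    surviving = total_drives - 1  # a single parity resilver reads all of these
    p = ure_probability(surviving * drive_size)
    print(f"{total_drives:>2} drive RAID 5: reads {surviving * 4:>2} TB, "
          f"modelled URE chance {p:.0%}")
```

Under these assumptions even a small array carries a substantial chance of hitting a URE during the rebuild, and the chance climbs steadily with every drive added.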

So in many cases the warm spare itself might actually be the trigger for the loss of data rather than the savior of the data as expected.  An array that would have survived might be destroyed by the resilvering process before the human who manages it is even alerted to the first drive having failed.  Had a human been involved they could have, at the very least, taken the step to make a fresh backup of the array before kicking off the resilver knowing that the latest copy of the data would be available in case the resilver process was unsuccessful.  It would also allow the human to schedule when the resilver should begin, possibly waiting until business hours are over or the weekend has begun when the array is less likely to experience heavy load.

Dual and triple parity RAID (RAID 6 and RAIDZ3 respectively) share URE risks as well as they too are based on parity.  They mitigate this risk through the additional levels of parity and do so successfully for the most part.  The risk still exists, especially in very large RAID 6 arrays, but for the next several years the risks remain generally quite low for the majority of storage arrays until far larger spindle-based storage media is available on the market.

The biggest problem with parity RAID and the URE risk is that the driver towards parity RAID (willing to face additional data integrity risks in order to lower cost) is the same driver that introduces heightened URE risk (purchasing lower cost, non-enterprise SATA hard drives.)  Shops facing parity RAID generally do so with large, low cost SATA drives bringing two very dangerous factors together for an explosive combination.  Using non-parity RAID 1 or RAID 10 will completely eliminate the issue and using highly reliable enterprise SAS drives will drastically reduce the risk factor by an order of magnitude (not an expression, it is actually a change of one order of magnitude.)
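
The order of magnitude difference can be made concrete by counting how many UREs a full resilver read would be expected to hit.  This Python sketch assumes the commonly quoted spec-sheet ratings of one error per 10^14 bits for consumer SATA drives and one per 10^15 bits for enterprise SAS drives, and a hypothetical 12 TB of data read during the resilver.  The function name `expected_ures` is invented for this example.

```python
def expected_ures(bytes_read: float, bits_per_error: float) -> float:
    """Average number of UREs expected while reading `bytes_read` bytes,
    given a drive rated at one unrecoverable error per `bits_per_error` bits."""
    return bytes_read * 8 / bits_per_error


TB = 1e12
resilver_read = 12 * TB  # hypothetical amount read from the surviving drives

sata = expected_ures(resilver_read, 1e14)  # assumed consumer SATA rating
sas = expected_ures(resilver_read, 1e15)   # assumed enterprise SAS rating

print(f"SATA (1 per 1e14 bits): {sata:.3f} expected UREs per resilver")
print(f"SAS  (1 per 1e15 bits): {sas:.3f} expected UREs per resilver")
print(f"ratio: {sata / sas:.1f}x")
```

The ratio between the two stays at ten regardless of array size, which is the one order of magnitude change described above; what changes with array size is how close the SATA figure gets to a near certainty.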

Additionally, during resilver operations it is possible for performance to degrade on parity systems so drastically as to equate to a long-term outage.  The resilver process, especially on large arrays, can be so intensive that end users cannot differentiate between a completely failed array and a resilvering array.  In fact, resilvering at its extreme can take so long and be so disruptive that the cost to the business can be higher than if the array had simply failed completely and a restore from backup had been done instead.  This resilver issue does not affect RAID 1 and RAID 10, again, because they are mirrored, not parity, RAID systems: their resilver process is trivial and the performance degradation of the system is minimal and short lived.  At its most extreme, a parity resilver could take weeks or months during which time the systems act as though they are offline, and at any point during this process there is the potential for UREs to arise as mentioned above, which would end the resilver and force the restore from backup anyway.  (Typical resilvers do not take weeks but do take many hours, and taking days is not at all uncommon.)

Our final overview can be broken down to the following (conventional term “hot spare” used again): RAID 10 without a “hot spare” is almost always a better choice than RAID 6 with a “hot spare.”  RAID 6 without a “hot spare” is always better than RAID 5 with a “hot spare.”  RAID 1 with an additional mirror member is always better than RAID 1 with a “hot spare.”  So whatever RAID level with a hot spare you decide upon, simply move up one level of RAID reliability and drop the “hot spare” to maximize both performance and reliability for equal or nearly equal cost.

Warm spares, like parity RAID, had their day in the sun.  In fact it was when parity RAID still made sense for widespread use, when UREs were unlikely and disk costs were high, that warm spare drives made sense as well.  They were well paired; when one made sense the other often did too.  What is often overlooked is that as parity RAID, especially RAID 5, has lost effectiveness it has pulled the warm spare along with it in unexpected ways.

23 thoughts on “Hot Spare or a Hot Mess”

  1. I’m a bit confused on the URE issue. I see it brought up a lot and the impression is always given that a single URE instantly equals the loss of access to all data on the entire array.

    Is this actually the case?

    Does the controller encounter a URE and just disable access to the array completely, right away? It seems like the controller should be able to just fail to read or write that block of data but continue to give access to areas not affected by the URE (similar to a failed sector on a standalone HDD).

    Does this vary from controller to controller?

    I haven’t actually seen any real-world reports of what happened during/after an URE event.

    Also, what are your thoughts on hot/warm spares in situations where a single spare can be allocated to serve multiple arrays?

    Thanks for the post.

  2. UREs don’t cause healthy arrays to fail because either parity or mirroring contains the same bit elsewhere and can reconstruct the data without issue. It is only in a degraded parity array (where there is no longer any parity) that the URE causes a full array failure and this is because during a resilver operation the array is unstable and when a URE is encountered the parity is unable to reconstruct the stripe and the resilver fails causing the array itself to fail.

    UREs, in the real world during a parity resilver, really do cause a complete loss of the array. It’s catastrophic level failure.

    Mirrored RAID (RAID 1 or RAID 10, for example) does not do a computation to reconstruct a stripe and so a URE does not cause a resilver to fail. It is specifically parity reconstruction + URE that is the danger.

  3. Warm spares shared with multiple arrays share the same problems except that they are more cost effective. If they automate array destruction, they are bad. But if shared between several RAID 10 arrays, for example, they are excellent.

  4. All good, and I know this topic wasn’t set to address performance, but from my understanding RAID6 takes a 50% write performance hit vs. RAID5 due to writing the parity twice. So if performance is a concern as well as budget, wouldn’t RAID5 with a cold spare be a good option, knowing that the URE rating for enterprise SAS drives is significantly higher than that of SATA drives?

  5. RAID 5 has a 4x write penalty and RAID 6 has a 6x penalty (so 50% greater, yes.) But neither RAID 5 nor RAID 6 is chosen for performance. RAID 0, 1 and 10 are the performance choices (depending on needs.) Basically any scenario where a new RAID 5/6 array is being put in can be made as fast while costing less and/or being more reliable, depending on need, using a non-parity option.

    In read-only systems there would be exceptions to this, but in a situation where writes are trivial the performance advantage would heavily fall to RAID 6 over RAID 5 + Hot Spare due to the extra spindle.

  6. Mr. Miller
    You have written a very good article. I am not an IT guy, but it allowed me to understand, for the first time, the real cost / benefit of the different RAID scenarios. When the time comes to consult the IT consultant, we will be in a better position to make the proper decision.

  7. How can you say “RAID 10 is ….far more reliable than a RAID 6 array”? In RAID10 the same data are stored only twice. So a two disk failure could cause data loss if the “right” pair of disks fails. Is there something I am missing?

  8. @Michal Lei – the number of times that the data is stored isn’t what creates reliability. It is a factor, to be sure, but it is far from the end all of reliability. If drives failed equally and there were no other reliability factors, then yes, RAID 6 would beat RAID 10.

    But RAID 10 has several advantages over RAID 6. It doesn’t suffer URE resilver failure which makes RAID 6’s extra redundancy potentially useless (or at best, less useful than assumed.) RAID 10 rebuilds in a fraction of the time that RAID 6 does (reducing the window for further drive failure.) RAID 10 is less likely, due to wear leveling, to experience the first drive failure. It is less likely as well to experience a second one. And unlike parity RAID, RAID 10 does not exhibit the “drive killing” behaviour that parity RAID does before resilvering is complete.

    And the big one, RAID 10, after having lost 1 drive is only vulnerable to losing a single additional drive – the matched pair of the one that has failed. RAID 6 having lost a single drive is at risk of any additional drive failing anywhere in the array at which point it will continue to function but is completely exposed to UREs so may fail to resilver even if no further drives fail.

    Taken together, RAID 10, especially as arrays grow in size, keeps reliability pace far in front of RAID 6. The bigger the array, the more reliable RAID 10 becomes in relation to RAID 6. The failure domain size never grows whereas with RAID 6 it just gets bigger and bigger.

    Parity RAID “redundancy level” is extremely misleading because it is not mirrored redundancy but is often enumerated as if it were. RAID 6 does not have three copies of each block, not even two copies. But we treat it as having triple redundancy, but that would be a three way mirror. Parity Redundancy is not the same ‘class’ of redundancy as mirroring and has to be looked at in a different light. The chance that a RAID 6 array could actually survive the loss of two drives is relatively low, much lower than the chances of RAID 10 surviving the same thing. (This may be reversed in very small arrays with very high reliability URE drives.)

  9. This article makes perfect sense. To take things a step further, we utilize the additional reliability of mirrored arrays over parity array by using consumer grade drives along with the flexibility of MDADM. We use a 3 drive, 3 copy MD RAID 10 with inexpensive new drives or used drives depending on what is available for purchase. This is all in addition to a backup scheme.

    We feel that even with consumer grade drives, it is highly unlikely that we would have two drives fail at once given they are in a mirrored array and rebuilds do not require the parity calculation, thereby reducing the wear on the array during rebuild and reducing rebuild times significantly, which in turn reduces the second failure window. A 1 TB array takes 4 to 6 hours to rebuild even under light load for our few in office users. This means that typically the time from failure to a fully rebuilt array will almost always be less than 24 hours if we are slow to replace the drive, and could be less than a work day if we are on top of it, leaving a very small window for a second failure.

    But the drives we buy easily cost 40%-70% less than enterprise drives, but do not necessarily offer 40%-70% less life. All of this is afforded by using mirrored vs parity arrays. From our understanding, a RAID 5 rebuild on aged consumer drives would probably leave us without fingernails, and possibly data on a regular basis where with the mirrored arrays, this is not the case. Throw in URE problems and we feel much better about our consumer hardware choices.

    The big vision is that the cost for one server is low enough, that we plan on adding a second one and mirroring them using DRBD and associated services giving us an HA setup for around $1000 or less to start, with room for future drives for expansion of the arrays. Then we could have a whole server go down and have a window of time to get it back up, array and all. Again, to use lower priced consumer grade hardware, we need to keep things as simple as possible. Parity is not as simple as mirroring. We also keep our drives in pass through mode so no hardware compatibility in the way in case of controller failure.

    With articles like these, we gravitate towards things like mirroring and feel we are better off because of it. Thanks for writing it up!

  10. RAID6 is more reliable than RAID1 in smaller drive arrays, because you need to lose three drives to lose data, whereas RAID1 only needs to lose two drives – granted, a specific pair, but still only two. You’re pushing it a bit if you go much above 6 drives, though the chances of data loss are still quite low.

    You bring up the bogeyman of URE like it’s a trump card for RAID10 (and we can drop the 0, here, striping doesn’t do anything for your reliability). I suspect you’ve taken the “why-raid-6-stops-working-in-2019” article by Robin Harris a bit too seriously; you probably also trust a bit too much in the fairy dust stats that HD manufacturers use when they oversimplify MTTDL. Google’s HDD paper and just recently, Backblaze, with many petabytes and tens of thousands of drives under management, disagree with such simplistic numbers like 1e14 vs 1e15.

    Thing is, if you see an error during a rebuild, was the data really there to begin with? You should be scrubbing your RAID arrays regularly so you won’t be surprised during a resilver. The URE panic promulgated by Robin would have you believing you can’t reliably complete a single scrub.

    But I have to laugh. This is all a bit of a bogus argument. RAID6 is for archival, to reduce the cost of redundancy, at the expense of performance; if you’re not streaming big contiguous blobs, your life will be miserable. RAID10 is generally used to add reliability and extra read performance to RAID0. Neither counts as backup; you still need an extra copy somewhere else. RAID10 is for when you need uptime with good performance, while RAID6 is more like a reliable and reasonably cheap cache of your offsite backup. They don’t actually compete much with one another, as they serve different purposes.

  11. In setups with many disks you should consider that you get much more usable space from raid 6 rather than raid 10. We have installed several systems with 16 disks, where we get just over 40TB of usable space using 3TB disks, where we would only get 24 if we were running raid 10.

    I think this factor is worth mentioning in your article, and perhaps elaborate on how many disks you should add to a one raid set/volume set

  12. RAID 6 is never more reliable than RAID 1, the math makes no sense. Yes, RAID 6 can potentially lose more drives but it always has more to lose plus introduces other risks. RAID 1 has redundancy rather than parity, is much more stable, doesn’t introduce resilver risks and, contrary to popular myth, is not limited to just two drives. You can expand RAID 1 to as many drives as you like. Triple mirrored RAID 1 is ridiculously reliable. But no matter how you slice or dice it, RAID 1 always wins the reliability question.

  13. Scrubbing does not fix the problem. URE risk calculations are done with the assumption that the disks were freshly scrubbed. This is a common myth brought out anytime someone wants to discredit URE fears. But it is assumed in the risk mentioned.

    Yes, if you don’t scrub then maybe you are in even more risk. But that only makes things worse, not better.

  14. Barry, nothing you talk about makes sense. Of course RAID is not backup. That’s a level of misconception not being dealt with here. RAID is about reducing the need to go to backup, about avoiding data loss rather than minimizing it.

    RAID 6 is common in archival systems, yes, but that is far from its only place.

  15. Hi Scott,

    Thanks for taking your time to write articles like this to educate the unwashed masses such as myself. I have read a lot of your content and really appreciate it.

    There is one thing still bugging me though. This question was asked before but I think you kind of misread/dodged the question.

    Why would a RAID 5’s controller fail an array when encountering a URE during rebuild? Why wouldn’t you just get some corrupted bits, which presumably could be lived with? I don’t see what is causing a single URE to mandate a total array failure.

  16. Because parity RAID is a lot like compression in that the result is, in some ways, like a single file. If you have a zip file, for example, and it becomes corrupt, even if it contains many files inside of it, you expect them all to be corrupt and unreadable because the container in which they are held becomes corrupt. Parity RAID acts the same way, only not to make the drive smaller, obviously. The entire array behaves as a single file and if that file is corrupt and unreadable, the system cannot put the pieces back together to reconstruct the file and everything contained within that file, the filesystems on top of the array, is lost.

    This isn’t a feeling about how it works or something that comes from me. This is the well known and documented behavior of the RAID 5 specification. It’s not a theoretical problem but a commonly observed one.

    Now, why do RAID manufacturers not come up with an algorithm that can protect against that and limit loss? I think that the answer probably comes down to the fact that it is not a practical use of financial resources. How difficult it is to solve and how effective the solution could be, I am not sure.

  17. So your big raid 10 is sets of 2 drives mirroring, and then a stripe across these sets, right? And a random drive fails. That means one of your 2 drive mirrors is now a raid1 mirror with only one member, the warm spare gets added to that array, and the controller starts rebuilding that drive from the existing drive.

    Now a URE happens. That data is still in only one place at that time. It’s now in 0 places. That should invalidate the raid1 set just as much as a URE encountered during a raid5 rebuild, wouldn’t it? And therefore the stripe across the mirrors and so the entire array lost?

    The difference is the *likelihood* of a URE occurring. In a raid5 rebuild, you read *all* the data on *all* the remaining disks. In the case of the mirror rebuild, you read all the data on just *one* disk — and so the risk of a URE is correspondingly less, anywhere from 33% as much to 10% of the risk, depending on the size of the raid5 array.

    It’s not zero or even negligible, though. (which is why RAID Is Not A Backup)

  18. Hi Scott,

    I’ve run recently into the famous URE problem with RAID 5 while I was confidently upgrading my Netgear ReadyNAS following their procedure. After a long bit of reading (mostly from your blog and spiceworks), I really better understand those problems now (Many thanks !), but I still have some questions:

    You (among others) say that RAID 6 will encounter the same problem as RAID 5 in the next 5 years but I saw nowhere a precise calculation or description of when the RAID 6 really becomes risky ?

    To solve my problem, I bought a new 6 drives NAS with 6 x 4 TB SATA (classic 10^14 drives). I know that it’s not the best that can be done but it’s a home NAS, I’m not Bill Gates and I need to have over 10 TB of useable capacity, so this was the best price/security ratio. I’m now in the trouble of having to choose between RAID 10 and RAID 6.

    RAID 10 is clearly the best choice concerning security because it’s not affected by the Parity RAIDs URE disaster but I could really use the extra 4 TB that RAID 6 would offer and since RAID 6 seems to be capable to survive URE while resilvering, I would like to have your thoughts on this. Too risky or not ? What do you think ?

    If a URE appears on a disk (in a RAID 6 configuration) while resilvering, will this definitely get this disk out of the rebuild (leaving the resilvering vulnerable to another URE) or will the NAS just get the good data from the other disks, write it on a new sector and go on with the rebuild (leaving the operation vulnerable only to 2 simultaneous UREs or another disk failure + URE on a third one) ?

    As you can see, I think that I have well understood the uselessness of RAID 5 nowadays and the power of RAID 10 but I’m in trouble when I need to evaluate RAID 6.

    Many thanks in advance and please excuse my bad English because it’s not my native language.
