Dreaded Array Confusion

Dreaded Array Confusion, or DAC, is a term given to a group of RAID array failure types which are effectively impossible to diagnose but are categorized by the commonality that they experience no drive failure in conjunction with complete array failure resulting in total data loss.  It is hypothesized that three key causes result in the majority of DAC:

Software or Firmware Bugs: While dramatic bugs in RAID behavior are rare today, they are always possible, especially with more complicated array types such as parity RAID where reconstructive calculations must be performed on the array.  A bug in RAID software or firmware (depending on if we are talking about software of hardware RAID) could manifest itself in any number of ways including the accidental destruction of the array.  Firmware issues could occur in the drives themselves as well.

Hardware Failure:  Failure in hardware such as processors, memory or controllers can have dramatic effects on a RAID array.  Memory errors especially could easily result in total array loss.  This is thought to be the least common cause of DAC.

Drive Shake: In this scenario individual drives shake loose and disconnect from the backplane and later shake back into place triggering a resilvering event.  If this were to happen with multiple drives during a resilver cycle or if a URE were encountered during a resilver we would see total array loss on parity arrays potentially even without any hardware failure occurring.

Because of the nature of DAC and because it is not an issue with RAID itself but with the support components for it we are left in a very difficult position to attempt to identify or quantify the risk.  No one knows how likely DAC is to happen and while we know that DAC is a more significant threat on parity RAID systems we do not know by how much.  Anecdotal evidence suggests the risk on mirrored RAID is immeasurably low and on parity RAID may rise above background noise in risk analysis.  Of the failure modes, software bugs and drive shake both present much higher risk to systems running on parity RAID because URE risk only impacts parity arrays and the software necessary for parity is far more complex than the software needed for mirroring.  Parity RAID simply is more fragile and carries more types of risks exposing it to DAC in more ways than mirrored RAID is.

Because DAC is a number of possibilities and because it is effectively impossible to identify after it has occurred there is little possible means of any data being collected on it.  Since having identified DAC as a risk many people have come forth, predominantly in the Spiceworks community, to provide anecdotal eye witness accounts of DAC array failures.  The nature of end user IT is that statistics, especially on nebulous concepts like DAC which are not widely known, are not gathered and cannot be.  DAC arises in shops all over the world where a system administrator returns to the office to find a server with all data gone and no hardware having failed.  The data is already lost.  Diagnostics will not likely be run, logs will not exist and even if the issue can be identified to whom would it be reported and even if reported, how do we quantify how often it happens versus how often it does not or how often it might but not be reported.  Sadly all I know is that in having identified and somewhat publicized the risk and its symptoms that suddenly many people came forth acknowledging that they had seen DAC first hand as well and had no idea what had happened.

If my anecdotal studies are any indicator, it would seem that DAC actually poses a sizable risk to parity arrays with failures existing in an appreciable percentage of arrays but the accuracy and size of the cross section of that data collection was tiny.  However, it was original though that DAC was so rare that theoretically you would be unable to find anyone who had ever observed it but this does not appear to be the case.  I already am aware of many people who have experienced it.

We are forced, by the nature of the industry, to accept DAC as a potential risk and list it as an unknown “minor” risk in risk evaluations and be prepared for it but cannot calculate against it.  But knowing that it can be a risk and understanding why it can happen are important in evaluating risk and risk mitigation.

[Anecdotal evidence suggests that DAC is almost always exclusive to hardware RAID implementations of single parity RAID arrays on SCSI controllers.]