Choosing an Email Architecture: Internal or Hosted

Talk to email specialists and, in my small, anecdotal survey of the market, roughly half will tell you to simply install email locally, normally Microsoft Exchange, while the other half will tell you to go with a hosted (a.k.a. Software-as-a-Service / SaaS or “in the cloud”) service, most often Google Apps.  Email, however, is not such a simple architectural component that it should be distilled to trite answers.  Email is one of the most important components of your business’ communications infrastructure, often surpassing telephony, and choosing the right delivery methodology for your company is critical to your long term success.

We will start by considering some basic factors in email hosting.  Email systems require a good deal of bandwidth, a significant amount of storage, high reliability, careful management and serious security consideration.

Bandwidth is the first area to consider.  Every email sent and received must travel between the end user and the email server as well as, for externally destined email, between the email server and the outside world.  In small businesses nearly all email is destined to leave the company network to go to clients, customers, vendors, etc.  In larger enterprises email use changes, and as we approach the Fortune 100 email shifts from being almost exclusively a tool for communicating with people outside the organization to a platform used primarily for internal communications.

This shift in how email itself is used is a very important factor in deciding how to deploy email services.  If email is used almost exclusively for intra-staff communications then this lends itself very well to hosting email systems in-house to increase security and improve WAN bandwidth utilization.  The caveat here, of course, is that a highly distributed company of any size would not keep this traffic on a LAN and so should treat its email usage as external regardless of whether or not it is intra-staff.  Small companies whose communications happen primarily with external parties will generally be better served by a hosted service.

Storage is often a smaller factor in email architecture decision making than it may at first appear.  Traditionally email’s storage requirements made a compelling argument for hosting internally due to the cost benefit of keeping large storage, especially that used for archival needs, local.  Recently, large hosted email vendors such as Rackspace and Google have brought the price of online, archival email storage so low that, in many cases, it may actually be more cost effective to use hosted storage rather than local storage or, at least, the costs are at parity.  Even long term archival storage can be had very cost effectively in a hosted solution today.

Reliability is a rather complex subject.  Email is critical to any organization.  If an email system goes down many companies simply grind to a halt.  In some cases, the company effectively shuts down when email stops flowing.  Not only do employees stop communicating with each other but customers, vendors, suppliers and others see the company as being offline at best and out of business at worst.  Interrupting communications with the outside world can represent immediate and serious financial impact to almost any business.

Hosted email has the obvious advantage of being run from a large, commercial datacenter with redundancy at every level (assuming a top tier vendor), from hardware to storage to networking to power to support.  Hosting email in house requires a business to determine the level of redundancy that is most cost effective given its ability to withstand email downtime, and this is generally an exercise in compromises: how much reliability can the company afford to forgo given the cost necessary to provide it?

Some companies will opt to host email servers at a colocation facility, which provides many redundant components, but to match the capabilities of a Rackspace or Google level offering multiple datacenters would likely be needed.  Colocation is a halfway option, providing much of the technical resilience of hosted options while retaining the management control and flexibility of in-house email systems.

A more common scenario, though, is for companies to host a single email server completely within their own walls, relying on their internal power, hardware and network connection.  In a scenario like this a company must either take extreme measures to ensure uptime – such as hosting a completely redundant site at immense cost – or front-end their entire email infrastructure with a reliable online spooling service such as Postini, MessageLabs or MXLogic.  The cost of such services, while critical for the reliability most companies need, is often equal to or even greater than the cost of complete email hosting.  This spooling service adds an ongoing, scaling cost that will often make fully hosted email services less expensive than in-house hosting.

Management cost is very difficult to determine but requires attention.  A fully hosted solution requires relatively little technical knowledge: the time needed to manage it is low and so is the skill level necessary to do so.  With an in-house solution your company must supply infrastructure, networking, security, system and email skills.  Depending on your needs and your available staff this may be part time work for a single professional or it may require multiple FTEs or even outside consultants.  The total time necessary to manage an in-house email system varies dramatically and is often very hard to calculate due to the complexity of the situation but, at a minimum, it is orders of magnitude greater than that of a hosted solution.

Security is the final significant consideration.  Beyond traditional system-level security, email requires spam filtering.  Spam can be handled in many ways: in software on the email server, on an appliance located on the local network, farmed out to a spam filtering service or left to the hosted email provider.  Spam filtering, if handled internally, is seldom a set-and-forget service but one that requires regular attention and generally extra cost in licensing and management.

After looking at these main considerations every company should sit down, crunch the numbers, and determine which solution makes the most sense for them on an individual level.  Often it is necessary to use a spreadsheet and play with several scenarios to see what each solution will cost both up front and over time.  This, combined with a valuation of features and their applicability to the company, will be critical in determining the appropriateness of each option.
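
As a rough illustration of the scenario modeling described above, the sketch below compares the cumulative cost of a hosted service against an in-house server over several years.  Every figure in it is a hypothetical placeholder, not a vendor quote; the point is only the shape of the comparison, so substitute your own numbers.

    # Hypothetical cost model comparing hosted and in-house email over time.
    # All dollar figures are invented placeholders -- substitute real quotes.

    def hosted_cost(users, months, per_user_per_month=5.0):
        """Cumulative cost of a hosted service billed per user per month."""
        return users * per_user_per_month * months

    def in_house_cost(users, months, server_capex=4000.0,
                      license_per_user=60.0, admin_per_month=500.0):
        """Up-front hardware plus per-user licensing plus ongoing admin time."""
        return server_capex + users * license_per_user + admin_per_month * months

    if __name__ == "__main__":
        users = 25
        for years in (1, 3, 5):
            months = years * 12
            print(f"{years} yr: hosted ${hosted_cost(users, months):,.0f} "
                  f"vs in-house ${in_house_cost(users, months):,.0f}")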

The secret weapons of the in-house solution are features, integration and flexibility.  In-house email options can be extended or modified to offer exactly the feature set that the organization requires – sometimes at additional cost.  A perfect example of this is Zimbra’s instant messaging integration which can be a significant value-add for an email platform.  This has to be considered in addition to raw cost.  Integration with existing internal authentication mechanisms can be an important factor as well.

In my own experience and cost calculations, hosted solutions represent the vast majority of appropriate solutions in the SMB space due to raw economics, while large and enterprise class customers will find significant benefits in the flexibility and internal communications advantages of in-house solutions.  Small businesses struggle mostly with cost while large businesses struggle primarily with the communications complexity of their scale.  Large businesses also get the best value from in-house solutions due to “professional density”: they have enough IT professionals on staff that a dedicated email platform can be supported without wasting a disproportionate share of anyone’s time on corporate scale inefficiencies.

Today, whether a business chooses to host their own email or to receive email as a service, there are many options from which to choose even once a basic architecture is chosen.  Traditionally only a few in-house options such as MS Exchange and Lotus Notes would be considered, but new alternatives such as Zimbra (recently acquired by VMware), Scalix and Kerio are expanding the landscape with lower costs, new deployment options and aggressive feature sets.  Hosting’s relative newcomer, and overnight industry heavyweight, Rackspace is drawing a lot of attention with new email offerings that more closely mimic traditional in-house offerings, while Google continues to get attention with its unique Gmail services.  I expect to see the hosted email space continue to become more competitive with new integration features being a key focus.

Every business is unique and the whole of the factors must be considered.  Using a combination of business and IT skills is necessary to evaluate the available options and opportunities and no one discipline should be making these decisions in isolation.  This is a perfect example of where IT managers must understand the economics of the business in addition to the technological aspects of the solution.

Linux Virtualization Deployment Advantage

As more and more businesses begin to deploy virtualization broadly, we must step back and reconsider the opportunities presented to us by this shift in datacenter architecture.  Virtualization comes with new challenges but also with new potential, not only for cost savings but for aggressive project implementation.  Small businesses especially, once they are using virtualization, find themselves positioned for projects that they could never have envisioned during the era of physical-only servers.

The big winners in this space of emerging virtualization opportunity are the open source operating systems such as Linux, OpenSolaris and FreeBSD.  The reason that these particular operating systems have unique opportunities that Windows and Mac OS X do not is the way that they are, or can be, licensed.  Each of these operating systems has an option by which it is available completely for free – something that cannot be said of Windows or Mac OS X.

Traditionally, when purchasing a new server a business would price out expensive hardware with relatively inexpensive software.  An enterprise operating system, such as Windows, would typically represent a relatively small percentage of the cost of a new server.  Even a small server would cost a few thousand dollars and Windows Server can easily be purchased for less than one thousand dollars.  In this scenario a business looking to purchase a new server would see only a very small cost savings in opting for a “free” operating system since introducing a new OS has its own risks and the bulk of the cost of the new server is in the hardware which would still need to be purchased.

Given that equation, only a rare small business would consider the purchase of a non-Windows-based server.  The opportunity for failure is too high given the risk of change, and the cost savings are too small.  Today, though, virtualization is commonplace and becoming more ubiquitous every day.  Businesses virtualizing their infrastructure typically have excess capacity on their servers that is going unused.  As these businesses and their IT departments begin to look to utilize this spare capacity they will increasingly find that the cost of deploying a virtualized Windows Server remains high while the cost of deploying a virtualized Linux or OpenSolaris server is nominal – generally nothing more than the effort to do so, with no capital expenditure or its associated risk.
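
To make that shift in pricing structure concrete, here is a minimal sketch of the marginal cost of one more server in the physical era versus on spare virtualization capacity.  The dollar amounts are assumptions invented for illustration, not vendor list prices.

    # Illustrative marginal cost of adding one server in each era.
    # HARDWARE and WINDOWS_LICENSE are hypothetical placeholder prices.

    HARDWARE = 3000.0         # physical server purchase (assumed)
    WINDOWS_LICENSE = 800.0   # Windows Server license (assumed)
    FREE_OS = 0.0             # Linux, OpenSolaris, FreeBSD

    def marginal_cost(era, windows):
        """Cost of deploying one additional server instance."""
        os_cost = WINDOWS_LICENSE if windows else FREE_OS
        if era == "physical":
            return HARDWARE + os_cost   # hardware dominates either way
        if era == "spare virtual capacity":
            return os_cost              # hardware is already owned and idle
        raise ValueError(era)

    for era in ("physical", "spare virtual capacity"):
        print(f"{era:>22}: Windows ${marginal_cost(era, True):,.0f} "
              f"vs free OS ${marginal_cost(era, False):,.0f}")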

The ability to deploy new servers, at any time, without any cost is a significant advantage that companies have not begun to truly comprehend.  If a business wants a new web server, for instance, they can have one provisioned and built in thirty minutes without buying any licenses.  Having redundant virtualization hardware means that a redundant web server can be had as well – again without any capital cost.  Unlike Windows (or other commercial operating systems) there is no need to purchase a second license just to have a backup server.

This means that for the first time many businesses can begin to consider clusters as well.  Typically the cost of licensing software for clustering has been prohibitive, but when that licensing becomes free, clusters suddenly become very attractive options.

Of course, as open source proponents will point out, the low cost of Linux and other free and open source solutions has long been a reason to move to these platforms, but this discounts the incredible shift in pricing structure that occurs only when spare usable capacity meets previously existing free licenses.  It is only because so many businesses have already implemented virtualization strategies, or are in the process of doing so, that this new opportunity truly presents itself.

The first challenge will be in getting businesses to begin to think of operating systems and application platforms as being free.  The ways in which businesses may take advantage of this have yet to be seen.  Businesses are so used to being hamstrung by the need to buy new hardware and expensive server software licenses for every new system deployment that the widespread availability of spare capacity for new server instances is quite novel indeed.

Of course, as with many new technology changes, it is the small and medium business space where the greatest change will likely take place.  Large enterprises are already doing datacenter consolidation and do not necessarily have spare capacity available to them as their capacity plan already takes into account virtualization.  But in the smaller business space where capacity planning is a practically non-existent practice we see a different type of opportunity.

What we typically see in small businesses moving to virtualization is an over-purchasing of hardware.  This generally comes from a misunderstanding of how capacity planning and virtual guest interaction will work in the virtualized environment, from a desire to err on the side of overpowered rather than underpowered, and from the fact that virtualization capacity planning remains a bit of a “black art”.  Because of this, many small businesses have server resources sitting idle.  It is not uncommon to see a powerful server virtualizing just two server instances when there is capacity to virtualize a dozen or more.

It is this overprovisioning of hardware that offers unique opportunity.  Many small businesses, and even medium sized businesses, may manage to effectively virtualize their entire existing server infrastructure leaving no further opportunity for cost savings through consolidation.  At this point the spare capacity of the existing servers offers no further cost savings and can now be viewed as capacity for growth instead.

This raises the question: what new deployment opportunities exist given this spare capacity?  The question is difficult to answer as it will be different for nearly every business, but we can look at some commonalities to build a rough picture of where we may see new value presenting itself.

The most obvious new opportunity is in new web applications.  Small businesses often would like to take advantage of free web-based applications but do not want to risk deploying new, low-priority applications to their existing Windows-based web server or do not even have a server available to do so.  Creating one or more open source application servers is incredibly simple.  Deploying a wiki, corporate web portal, blogging engine or news site, bug or incident tracking application, microblogging platform (a la laconi.ca), CRM, ERP or any of thousands of similar applications can be done quickly and easily with minimal cost using only “spare” time from the existing IT resources.  Any number of internal applications such as these could bring value to the company while producing very little load on the virtualization platform, so many could be deployed using only a small amount of excess capacity.

Beyond obvious web apps there are more feature-rich systems that could be deployed at no cost.  A great example is the OpenFire instant messaging and presence server.  Companies can suddenly roll out a complete enterprise class, secure, internal instant messaging platform at no cost whatsoever.  Another example is monitoring systems such as Nagios, Zenoss or Zabbix – all of which are available for free and represent a real benefit for companies that currently have no such system.  Enterprise monitoring, completely for free.

Beyond new applications there is also an “environmental” benefit to be had.  In an enterprise environment changes going into production pass through a series of testing stages.  Typically big businesses will maintain a development environment, a user acceptance testing environment and then the production environment.  For a small business to do this with Windows is extremely cost prohibitive as the servers in each environment need to be licensed.  But with open source servers virtualized on spare capacity, deploying virtual servers for each of these environments is completely free, allowing small businesses to test their own changes before they reach production and gaining a level of stability that was previously unaffordable to them.

After all of these growth benefits there is one additional benefit to consider – flexibility.  Because these new systems can be deployed and tested at no cost, small shops have a new opportunity to evaluate open source solutions that may replace expensive Windows solutions that they are currently using.  This could include replacing Exchange with Zimbra, IIS with Apache or Active Directory with an LDAP server.  Doing a project like this would be risky and potentially costly if the hardware and software had to be purchased up front.  But if the project can be done using only free time from the existing IT department, and can begin as a free proof of concept before moving to a pilot and then a full production replacement, then risk is minimized and the entire project can be effectively free.

While a full architectural replacement may be very aggressive for an average small business it is also a very significant potential cost savings.  Moving completely to open source systems is not for everyone and should be evaluated carefully.  The ability to evaluate a project of this magnitude, for free, is very important and small businesses should consider doing so to be sure that they are using the systems that make the most sense for their business model and needs rather than simply using the solutions with which they are already familiar or are already in place.

There are many additional ways in which free and open source products, deployed using existing, excess server capacity, can be used to expand the IT infrastructure of small businesses.  Learning to seek out opportunities rather than seeking cost savings from IT is a new process for most small businesses and requires some relearning, but those that take the time to pursue these opportunities have many benefits to be gained.

In House Email for Small Businesses

In small businesses the primary concern with email is cost.  Email is a commodity and especially in smaller shops the biggest differentiating factor between email products and vendors is cost.  In larger companies factors beyond cost begin to come into play more significantly such as directory services, system integration, push email, extended client support, collaboration tools, presence and more.

Surprisingly, when small businesses decide to bring their email in-house they seem to immediately turn to Microsoft Exchange.  Now I don’t want to belittle Exchange’s place in the market.  Exchange is an extremely robust and feature rich product that has earned its reputation as the go-to enterprise collaboration and messaging server.  In the last decade Exchange came seemingly from nowhere to completely dominate the large business email market.  People simply assume that you run Exchange in the Fortune 500 and, for the most part, they are correct.

The features for which Exchange is best known, however, are not features often critical or even useful to small businesses.  In actuality, the weight of Exchange – necessary to support so many big-business features – can make it unwieldy for small businesses, even those with the financial and technological resources to support it.  Exchange focuses on collaboration and internal team communications.

Exchange brings with it many burdens.  The first is cost: up front purchasing, licensing and ongoing support.  Up front costs of Exchange include the Exchange email server purchase plus the licenses necessary for the Windows Servers – yes, that is multiple servers – on which it runs.  (You can mitigate some of this cost by purchasing Microsoft’s Small Business Server, which integrates these components together, but extra costs remain and flexibility is lost.)  Licensing costs for Exchange include the needed Windows Server CALs and Exchange CALs for every user, and in some cases fictional user accounts, that will need to access the system.  Ongoing support cost comes from the extra complexity arising from Exchange’s feature set and deployment architecture.

The second set of burdens with Exchange comes from the user interface, namely Outlook.  Technically Exchange requires no additional user interface as Outlook Web Access, or OWA, is included for free and is a very functional web interface for email.  This would be fine if all of Exchange’s functionality were exposed through OWA, but it is not, so OWA is often nothing more than a decent fall-back for remote users who are away from their corporate laptops.  To really achieve the benefits of Exchange a company needs to invest in Microsoft Outlook, which is a very robust and powerful email and collaboration client but also an expensive one.  The per-user cost of Outlook can be quite significant when added to the per-user costs already incurred through Exchange licensing.
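
As a sketch of how those per-user costs stack up, the figures below are invented placeholders, not Microsoft list prices; the point is simply that each user carries a Windows Server CAL, an Exchange CAL and an Outlook license on top of the server-side purchases.

    # Hypothetical licensing stack for an in-house Exchange deployment.
    # Every price below is an illustrative placeholder, not a list price.

    SERVER_SIDE = {
        "Windows Server licenses (x2)": 1800.0,
        "Exchange Server license": 700.0,
    }

    PER_USER = {
        "Windows Server CAL": 35.0,
        "Exchange CAL": 70.0,
        "Outlook client": 110.0,
    }

    def exchange_licensing(users):
        """Server-side licenses plus the per-user CAL and client stack."""
        return sum(SERVER_SIDE.values()) + users * sum(PER_USER.values())

    for users in (10, 25, 50):
        print(f"{users:3d} users: roughly ${exchange_licensing(users):,.0f} in licensing")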

The third set of burdens comes from the overhead of managing such a complex and powerful beast as Exchange.  Exchange is no simple system and, when secured according to best practices, spans multiple physical servers and operates in multiple roles.  Exchange administration is considered its own discipline within IT or, at least, a Windows Server specialty.  Qualified Exchange admins are costly and in demand from big business.  Small businesses looking to acquire good Exchange talent will either be paying top dollar, hiring consultants or attempting to make do with less experienced staff – a potential disaster on such a critical and publicly exposed system.  In addition to managing the Exchange system itself the staff will also need to contend with deploying and maintaining the Outlook clients which, while not complicated, does increase the burden on the IT department compared to other solutions.

More potential cost comes from the need to supply anti-virus and anti-spam technologies to support the Exchange installation.  It would be unfair to mention AV and AS technologies in relation to Exchange without pointing out that any in-house email system will absolutely need them as well – these costs are certainly not unique to Exchange.  However, the ecosystem surrounding Exchange has a very strong tendency to encourage the use of expensive, commercial third party tools to meet these needs.  Outside of Exchange, AV and AS technologies are often included with the email package and no further purchases are needed.

Vying for attention in the Exchange-alternative space are the open source entries Zimbra and Scalix as well as several commercial products such as IBM’s Lotus Notes, Novell’s GroupWise, Open-Xchange and Kerio MailServer.  Of these, Lotus Notes and GroupWise primarily target the large business space, bringing their own set of complex collaboration functionality and cost.  The other four, Zimbra, Scalix, Open-Xchange and Kerio MailServer, focus primarily on the small business space and bring leaner, more targeted solutions that will more likely fit the profile desired by the majority of small businesses looking to bring their email solution in-house.

Over the last few years Zimbra especially has been in the news with its advanced web interface, early sale to Yahoo! and very recent acquisition by VMware.  Zimbra has led, at least in the media, the charge of alternative vendors looking to open up the in-house email market.  What makes these products stand out is that they deliver the bulk of Exchange’s enterprise level features, including calendaring and other important corporate applications, but do so either for free or at very competitive prices and through robust web interfaces, removing the need for a local fat client like Outlook (while maintaining the option).

Zimbra and Scalix truly stand out as ideal candidates for the majority of small businesses looking to keep their email in-house.  Both offer a wide range of functionality, a robust AJAX-based web interface, a large commercial installation base, broad industry support and the paid option of full vendor support.  But the biggest benefit for many small businesses is that these packages are available in completely free editions, allowing an SMB on a budget to rely entirely upon their internal IT department or their IT vendor for support rather than buying expensive, per-user email system licenses.

In addition to being free themselves, Zimbra and Scalix offer a range of deployment scenarios including Red Hat Linux, its free alternative CentOS, and Novell’s SUSE Linux.  By being available on these platforms these vendors again lower the cost of deploying their solutions as no Windows Server license is required to support them.  This is another large potential cost savings over Exchange, which requires not one but at least two Windows Server licenses on which to run.  Linux also brings some cost and performance advantages in the virtualization space, with more, and more varied, virtualization options than most other platforms.

Caveats exist, of course.  Many shops are wary when looking at non-Microsoft solutions.  A lack of skilled Linux technicians in the SMB space is a real concern.  Windows admins are abundant, and it is a rare shop that would need to even seek one out, let alone fail to find one capable of supporting its systems.  Linux admins are not quite so thick on the ground, but they are widely available and tend, on average and in my opinion, to be more skilled – if only because there is a smaller, more senior pool of people from whom to draw talent.  This helps to balance the equation, making Linux support not nearly as scary as it may seem for small businesses, but it does mean that most SMBs will have to look to more experienced IT consulting firms to assist them – which can bring long term cost benefits as well.

Many users are addicted to the functionality and interfaces of Exchange.  This can be a significant factor in deciding whether to try an alternative product.  Once workers have become accustomed to their existing workflows and processes, changing them by replacing the email server architecture can be rather disruptive.  Exchange offers quite an extensive array of functionality, and users relying on functions not handled by competing products will not likely be pleased to lose them, even if alternatives are available.  So knowing your userbase and what features are actually being used is important.  Many companies never touch these features and can migrate easily.

Zimbra and Scalix bring their own features, of course.  One of the best is Zimbra’s built-in instant messaging and presence system, built on the standard XMPP protocol.  Putting secure instant messaging directly into the email interface is a huge win for Zimbra and a significant value-add over the status quo.

Obviously the ideal time to consider an alternative email product is at the very beginning, when email is first being deployed, or when a migration from another system is already underway.  But even companies with existing email systems can seek cost benefits by moving to a less costly system, although the savings will be recouped over a longer period of time and more work will be necessary to retrain users.

Small businesses should look first to products like Zimbra and Scalix as the de facto choice for their environments, while heavier, more expensive products like Microsoft Exchange should be considered a “special case” choice that requires careful cost analysis and justification.  Far too many SMB IT departments are picking the expensive route without being required to justify their actions.  If more small businesses were diligent about monitoring their IT spending they would likely find many places where their money is being spent somewhat liberally, sometimes on features that they cannot use at all and sometimes on systems that carry significant long term support costs as well.

RAID Revisited

Back when I was a novice service tech and barely knew anything about system administration, one of the few topics that we were always expected to know cold was RAID – Redundant Array of Inexpensive Disks.  It was the answer to all of our storage woes.  With RAID we could scale our filesystems larger, get better throughput and even add redundancy allowing us to survive the loss of a disk which, especially in those days, happened pretty regularly.  With the rise of NAS and SAN storage appliances the skill of getting down to the physical storage level and tweaking it to meet the needs of the system in question is rapidly disappearing.  This is not a good thing.  Just because we are offloading storage to external devices does not change the fact that we need to fundamentally understand our storage and configure it to meet the specific needs of our systems.

A misconception that seems to have entered the field over the last five to ten years is the belief that RAID somehow represents a system backup.  It does not.  RAID is a form of fault tolerance.  Backup and fault tolerance are very different conceptually.  Backup is designed to allow you to recover after a disaster has occurred.  Fault tolerance is designed to lessen the chance of disaster in the first place.  Think of fault tolerance as building a fence at the top of a cliff and backup as building a hospital at the bottom of it.  You never really want to be in a situation without both a fence and a hospital, but they are definitely different things.

Once we are implementing RAID for our drives, whether locally attached or on a remote appliance like a SAN, we have four key RAID solutions from which to choose today for business: RAID 1 (mirroring), RAID 5 (striping with parity), RAID 6 (striping with double parity) and RAID 10 (mirroring with striping).  There are others, like RAID 0, that should only be used in rare circumstances when you really understand your drive subsystem needs.  RAID 50 and 51 are used as well but far less commonly and are not nearly as effective.  Ten years ago RAID 1 and RAID 5 were common, but today we have more options.

Let’s step through the options and discuss some basic numbers.  In our examples we will use n to represent the number of drives in our array and we will use s to represent the size of any individual drive.  Using these we can express the usable storage space of an array making comparisons easy in terms of storage capacity.

RAID 1: Mirroring.  In this RAID type drives are mirrored: you have two drives and they do everything together at the same time, hence “mirroring”.  Mirroring is extremely stable as the process is so simple, but it requires you to purchase twice as many drives as you would need if you were not using RAID at all, as your second drive is dedicated to redundancy.  The benefit is the assurance that every bit you write to disk is written twice for your protection.  With RAID 1 our capacity is calculated to be (n*s/2).  RAID 1 provides minimal performance gains over non-RAID drives: write speeds are equivalent to a non-RAID system while read speeds are almost twice as fast in most situations, since during read operations the drives can be accessed in parallel to increase throughput.  RAID 1 is limited to a two-drive set.

RAID 5: Striping with Single Parity.  In this RAID type data is written in a complex stripe across all drives in the array with a distributed parity block that exists across all of the drives.  By doing this RAID 5 is able to use an arbitrarily sized array of three or more disks and only loses the storage capacity equivalent of a single disk to parity, although the parity is distributed and does not exist solely on any one physical disk.  RAID 5 is often used because of its cost effectiveness due to its minimal loss of storage capacity in large arrays.  Unlike mirroring, striping with parity requires that a calculation be performed for each write stripe across the disks and this creates some overhead.  Therefore the throughput is not always an obvious calculation and depends heavily upon the computational power of the system doing the parity calculation.  Calculating RAID 5 capacity is quite easy as it is simply ((n-1)*s).  A RAID 5 array can survive the loss of any single disk in the array.
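
At its core the parity block in a stripe is an XOR across the data blocks, which is what allows any single missing block to be reconstructed from the survivors.  A minimal, greatly simplified sketch of that idea (ignoring striping geometry and real controller behavior):

    from functools import reduce

    def xor_blocks(blocks):
        """XOR equal-length byte blocks together, column by column."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    # Three data blocks in one stripe (toy sizes; real stripes are far larger).
    data = [b"AAAA", b"BBBB", b"CCCC"]
    parity = xor_blocks(data)

    # Simulate losing the second disk: XOR of the surviving blocks and the
    # parity block reproduces the missing data.
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]
    print("recovered block:", recovered)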

RAID 6: Striping with Double Parity.  RAID 6 is practically identical to RAID 5 but uses two parity blocks per stripe rather than one to allow for additional protection against disk failure.  RAID 6 is a newer member of the RAID family, having been added several years after the other levels had become standardized.  RAID 6 is special in that it allows for the failure of any two drives within an array without suffering data loss.  But to accommodate the additional level of redundancy a RAID 6 array loses the storage capacity equivalent of two drives in the array and requires a minimum of four drives.  We can calculate the capacity of a RAID 6 array with ((n-2)*s).

RAID 10: Mirroring plus Striping.  Technically RAID 10 is a hybrid RAID type encompassing a set of RAID 1 mirrors existing in a non-parity stripe (RAID 0).  Many vendors use the term RAID 10 (or RAID 1+0) when speaking of only two drives in an array, but technically that is RAID 1 as striping cannot occur until there are a minimum of four drives in the array.  With RAID 10 drives must be added in pairs so only an even number of drives can exist in an array.  RAID 10 can survive the loss of up to half of the total set of drives, but at most one from each pair.  RAID 10 does not involve a parity calculation, giving it a performance advantage over RAID 5 or RAID 6 and requiring less computational power to drive the array.  RAID 10 delivers the greatest read performance of any common RAID type, as all drives in the array can be used simultaneously in read operations, although its write performance is considerably lower than its read performance.  RAID 10’s capacity calculation is identical to that of RAID 1, (n*s/2).
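
The capacity formulas above are easy to place side by side.  Here is a small sketch that simply mirrors the (n*s/2), ((n-1)*s) and ((n-2)*s) expressions used in this article:

    def raid_capacity(level, n, s):
        """Usable capacity for n drives of size s, per the formulas above."""
        if level == 1:
            assert n == 2, "RAID 1 is limited to a two-drive set"
            return n * s / 2
        if level == 5:
            assert n >= 3, "RAID 5 needs at least three drives"
            return (n - 1) * s
        if level == 6:
            assert n >= 4, "RAID 6 needs at least four drives"
            return (n - 2) * s
        if level == 10:
            assert n >= 4 and n % 2 == 0, "RAID 10 takes an even number of drives"
            return n * s / 2
        raise ValueError(f"level {level} not covered here")

    drives, size_tb = 8, 2  # for example, eight 2 TB drives
    for level in (5, 6, 10):
        print(f"RAID {level}: {raid_capacity(level, drives, size_tb)} TB usable")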

In today’s enterprise it is rare for an IT department to have a serious need to consider any drive configuration outside of the four mentioned here, regardless of whether software or hardware RAID is being implemented.  Traditionally the largest concern in a RAID array decision was usable capacity.  This was because drives were expensive and small.  Today drives are so large that storage capacity is rarely an issue, at least not like it was just a few years ago, and costs have fallen such that purchasing the additional drives necessary for better redundancy is generally a minor concern.  When capacity is at a premium RAID 5 is a popular choice because it loses the least storage capacity compared to other array types, and in large arrays the loss is nominal.

Today we generally have other concerns, primarily data safety and performance.  Spending a little extra to ensure data protection should be an obvious choice.  RAID 5 suffers from being able to lose only a single drive.  In an array of just three members this is only slightly more dangerous than the protection offered by RAID 1: we can survive the loss of any one of three drives, which is not too scary compared to losing either of two drives.  But what about a large array, say sixteen drives?  Being able to safely lose only one of sixteen drives should make us question our reliability a little more thoroughly.
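
One back-of-the-envelope way to see why the sixteen-drive case deserves suspicion is to estimate the chance that a second drive fails before the rebuild of the first completes.  The annual failure rate and rebuild window below are assumed values chosen purely to illustrate how the exposure grows with the number of drives:

    # Rough sketch: probability that at least one surviving drive fails during
    # the rebuild window after a first failure.  AFR and rebuild time are
    # assumptions for illustration only.

    AFR = 0.03            # assumed 3% annual failure rate per drive
    REBUILD_HOURS = 24.0  # assumed rebuild window

    def second_failure_probability(total_drives):
        """P(any surviving drive fails while the first failure rebuilds)."""
        p_window = AFR * (REBUILD_HOURS / (365 * 24))  # per-drive chance in window
        survivors = total_drives - 1
        return 1 - (1 - p_window) ** survivors

    for n in (3, 8, 16):
        print(f"{n:2d}-drive RAID 5: ~{second_failure_probability(n):.4%} "
              "chance of a second failure during rebuild")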

This is where RAID 6 steps in to fill the gap.  RAID 6, when used in a large array, introduces a very small loss of storage capacity and performance while providing the assurance of being able to lose any two drives.  Proponents of the striping-with-parity camp will often quote these numbers to assuage management that RAID 5/6 can provide adequate “bang for the buck” in storage subsystems, but there are other factors at play.

Almost entirely overlooked in discussions of RAID reliability, an all too seldom discussed topic as it is, is the question of parity computation reliability.  With RAID 1 or RAID 10 there is no “calculation” done to create a stripe with parity; data is simply written in a stable manner.  When a drive fails its partner picks up the load and performance is slightly degraded until the failed drive is replaced.  There is no rebuilding process that impacts existing drive members.  Not so with parity stripes.

RAID arrays with parity have operations that involve calculating what is and what should be on the drives.  While this calculation is very simple it provides an opportunity for things to go wrong.  An array controller that fails with RAID 1 or RAID 10 could, in theory, write bad data over the contents of the drives, but there is no process by which the controller makes drive changes on its own, so this is extremely unlikely to ever occur as there is never a “rebuild” process except in creating a mirror.

When arrays with parity perform a rebuild operation they step through the entire contents of the array and write missing data back to the replaced drive.  In and of itself this is relatively simple and should be no cause for worry.  What I and others have seen first hand is a slightly different scenario involving disks that have lost connectivity due to loose connectors to the array.  Drives can commonly “shake” loose over time as they sit in a server, especially after several years of service in an always-on system.

What can happen, in extreme scenarios, is that good data on the drives can be overwritten by bad parity data when an array controller believes that one or more drives have failed in succession and been brought back online for rebuild.  In this case the drives themselves have not failed and there is no data loss; in theory, all that is required is that the drives be reseated.  On hot swap systems, however, drive rebuilding is often triggered automatically by the removal and replacement of a failed drive, so this process of losing and replacing a drive may occur without any human intervention – and a rebuilding process can begin.  During this process the drive system is at risk, and should this same event occur again the drive array may, based upon the status of the drives, begin striping bad data across the drives, overwriting the good filesystem.  It is one of the most depressing sights a server administrator can see: a system with no failed drives losing an entire array to an unnecessary rebuild operation.

In theory this type of situation should not occur, and safeguards are in place to protect against it, but determining the current and previous status of a drive, and the quality of the data residing upon it, is not as simple for a low level drive controller as it may seem, and it is possible for mistakes to occur.  While this situation is unlikely it does happen, and it adds a nearly impossible to calculate risk to RAID 5 and RAID 6 systems.  We must consider this risk of parity failure in addition to the traditional risk calculated from the number of drive losses that an array can survive out of a pool.  As drives become more reliable, the relative significance of the parity failure risk becomes greater.

Additionally, RAID 5 and RAID 6 parity introduces system overhead due to the parity calculation, which is often handled by dedicated RAID hardware.  This calculation introduces latency into the drive subsystem that varies dramatically by implementation, both in hardware and in software, making it impossible to state blanket performance numbers for the RAID levels against one another as each implementation will be unique.

Possibly the biggest problem with RAID choices today is that the ease with which metrics for storage efficiency and drive loss survivability can be obtained masks the bigger picture of reliability and performance, for which statistics are almost entirely unavailable.  One of the dangers of metrics is that people will focus upon factors that can be easily measured and ignore those that cannot be easily measured, regardless of their potential for impact.

While all modern RAID levels have their place it is critical that they be considered within context and with an understanding as to the entire scope of the risks.  We should work hard to shift our industry from a default of RAID 5 to a default of RAID 10.  Drives are cheap and data loss is expensive.

[Edit: In the years since this was initially written, the rise of URE (Unrecoverable Read Error) risks during rebuild operations has shifted the primary risk for parity arrays from those listed here to URE-related risks.]
