Make Your Business Jealous

I have, in the past, discussed how home technology infrastructures and lab environments are among the keys to personal career success and how IT practitioners should set a high bar in their own lives and hold their businesses accountable to an even higher one.  I often speak with IT practitioners looking for ways to improve their education, skill set and résumé.  I propose an approach to building your home network and lab – “Make Your Business Jealous.”

I mean it quite literally.  Why not approach your home network as an opportunity to raise the bar, both on your own experience and on what you expect from businesses?  The approach is an excellent one for changing how you think about your home network and its goals.  Don’t just use your home to learn new skills in a “checkmark” fashion, such as “learned Active Directory management.”  That is an excellent and obvious use of your home network, but we should consider taking this even further.

In many ways this applies far more to those IT practitioners who work in small and medium businesses, where it is common to cut corners, eschew best practices, leave work half done and fail to approve realistic budgets, but it is valuable for anyone.  You can apply it to the smallest of features in your own home – the physical aspects like cabling, labeling and organization – or take it very large (servers, services, security, backup, etc.).

For a real world example, we should begin with the simplest component: cabling.  Few people, even IT practitioners, take the time to cable their homes for data, let alone cable them well.  This is a big missed opportunity.  Good cabling is not only utilitarian but adds value to the house as well.  And many businesses do a very poor job of this themselves.  Even though cabling is not strictly an IT discipline, it is a fringe area of electrical work that is related to IT and well worth using as a physically visible showcase of your home environment.

Good cabling at home – since it is your home and there is presumably nearly unlimited time to do things well – can really be taken to an extreme.  I recommend using forward looking cable, CAT 6 or better, so that you can handle flawless GigE today and faster speeds in the future.  You do not want your home infrastructure to become dated unnecessarily.  Once you are putting in the effort, it is all about doing it “right”.  This is a chance not just to run a few cables but to implement a complete cabling plant with over-provisioned cable runs to all parts of the house.  Of course you can do this in stages, doing just a room or a few at a time.

In my own home projects I ran four cable runs to nearly every room, with some, like the master bedroom and living room, getting more like six or eight.  This may sound like a lot but, for a well wired home, it is not at all, as we will see.  While you are putting in the effort, you want to run more cabling than you are ever likely to use, both because it is good practice and because it is simply impressive.  Having extra cabling means an easier time keeping things organized and more flexibility to change things in the future.

Running the cables in the walls, up through an attic or down through a basement – whatever is appropriate – with well managed cable runs is best.  In my own house this meant the attic, with large “J” hooks keeping cables in place.  Be sure to label all cabling very clearly.  This is another chance to go above and beyond.  Install nice wall jacks, and label every jack and every cable in every jack.  Make the physical cabling plant as organized, impressive and manageable as possible.  All it really takes is a little time and effort, and you can have something that you are really proud of and want to show off.
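
Labeling stays consistent if the labels are generated from a simple inventory rather than invented jack by jack.  A minimal sketch of the idea – the room list and the numbering scheme here are purely hypothetical:

    # Generate consistent labels for a home cabling plant.
    # Scheme (hypothetical): ROOM-JACK-PORT, e.g. "MBR-J2-B" means
    # master bedroom, wall jack 2, port B.  The same label goes on the
    # wall plate, both ends of the cable and the patch panel port.

    CABLE_RUNS = {      # room code -> number of runs pulled to that room
        "MBR": 6,       # master bedroom
        "LIV": 8,       # living room
        "OFF": 4,       # office
        "GST": 4,       # guest bedroom
    }

    PORTS_PER_JACK = 2  # two ports per wall plate in this scheme

    def labels(room, runs):
        """Yield one label per cable run in a room."""
        for run in range(runs):
            jack = run // PORTS_PER_JACK + 1
            port = chr(ord("A") + run % PORTS_PER_JACK)
            yield f"{room}-J{jack}-{port}"

    for room, runs in CABLE_RUNS.items():
        print(room, ", ".join(labels(room, runs)))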

Where the cables run to, of course, is a matter for some discussion.  Maybe you will want to have a networking closet or even a server room.  Having a good place to store your home networking gear can be a wonderful addition to a home lab.  A patch panel for your cable runs is ideal.  Do everything “right”, just like you would find in a dream office environment.

Good cabling allows us to move as much as possible from wireless to wired in the home.  Wired connections are faster, more stable, require less management, are potentially more secure and improve the performance of the remaining wireless devices by lowering wireless saturation in the home.  If you are like me, your home network consists of many devices that have regular Internet access and are often wireless but could be wired: desktops, video game consoles, Internet television devices, smart televisions and more.  Many or most of these devices can easily be removed from wireless, which can only be beneficial.  Save the wireless for mobile devices.

A cabling plant, of course, needs something to attach to.  A good switch will be the true heart of any home network – or any network at all.  If you have a large home you could easily need a forty-eight port switch or even need to look at stacked switches.  A small apartment might use sixteen ports.  Size will vary by need.
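
Sizing is simple arithmetic once the cable plan exists.  A back-of-the-envelope sketch, with illustrative figures only:

    import math

    # Total the wall jack runs terminating at the patch panel, add the
    # closet-local devices, then leave growth room.  All figures invented.
    cable_runs = 6 + 8 + 4 + 4 + 4 + 4   # wall jack runs at the panel
    closet_devices = 4                   # servers, AP uplinks, camera, etc.
    spare_fraction = 0.25                # headroom for future additions

    ports = math.ceil((cable_runs + closet_devices) * (1 + spare_fraction))
    print(f"ports needed: {ports}")      # 43 here, so a 48 port switch fits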

Acquiring a switch is a good time to consider not just size but features.  Adding PoE (Power over Ethernet) now is ideal, allowing for even cleaner cable management and a yet more impressive network.  Wait until you see what else we consider adding to our home network that might leverage PoE.

This is also the time to consider more advanced features on the switch.  Rather than a traditional unmanaged switch, we can look at nice rackmount switches that are either smart (web managed) or fully managed, which is excellent for getting broader switch experience.  We might want to add a guest VLAN, for example, for visitors who need Internet access.  If you have a guest bedroom in the house, maybe those Ethernet ports should be VLAN’d as guest access all of the time.
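
However your particular switch implements VLANs, it pays to record the intended port assignments somewhere that can be audited against the running configuration.  A toy sketch of such a record – the VLAN IDs and port numbers are hypothetical:

    # Intended VLAN assignment per switch port, kept as data so the
    # live switch config can be checked against it.  All IDs invented.
    VLANS = {1: "internal", 20: "guest", 30: "voip", 40: "cameras"}

    PORT_VLAN = {
        1: 1, 2: 1, 3: 1,    # office and living room drops
        10: 20, 11: 20,      # guest bedroom jacks: always guest access
        20: 30, 21: 30,      # PoE desk phones
        30: 40,              # PoE door camera
    }

    def ports_in(vlan_name):
        """List the ports assigned to a named VLAN."""
        vid = next(v for v, n in VLANS.items() if n == vlan_name)
        return sorted(p for p, v in PORT_VLAN.items() if v == vid)

    print("guest-only ports:", ports_in("guest"))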

You are going to want to monitor that closet too, no doubt.  Once you start to put nice equipment in there you will want to keep it safe.  Perhaps temperature and moisture sensors that communicate onto the network?
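
Many inexpensive network environmental sensors expose their readings over HTTP or SNMP.  A minimal polling sketch, assuming a hypothetical sensor that answers with JSON at a local address:

    import json
    import time
    import urllib.request

    SENSOR_URL = "http://192.168.1.50/status"  # hypothetical sensor endpoint
    MAX_TEMP_C = 30.0
    MAX_HUMIDITY_PCT = 60.0

    def read_sensor():
        """Fetch a reading, e.g. {"temp_c": 24.1, "humidity": 41.0}."""
        with urllib.request.urlopen(SENSOR_URL, timeout=5) as resp:
            return json.load(resp)

    while True:
        reading = read_sensor()
        if reading["temp_c"] > MAX_TEMP_C or reading["humidity"] > MAX_HUMIDITY_PCT:
            print("ALERT: closet out of range:", reading)  # wire up email/SMS here
        time.sleep(60)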

Most home networks live and die on their wireless capabilities.  This is a place to really shine.  Instead of cheap “all in one” networking gear with anemic wireless stuck in the most useless corner of the house, you can use your shiny new PoE cabling plant to put high quality, commercial wireless access points at the most useful points of your home (and consider placing them on the grounds too, if you have multiple buildings or open space).  Centrally managed units are very affordable, can make for a very fast, robust wireless network and make guest wireless access very easy as well.

Next to consider, since we have such a robust cabling system in place already, is security.  Why not add a camera or two to watch the outside of your house or the front door?  You can use PoE cameras that are centrally managed.  Look for business class solutions, of course, not consumer cameras and software available at the local store.

One of the bigger and more interesting home projects to consider is a full scale VoIP PBX.  This can be a great and interesting project, and one of the few really good uses for a home server that will actually be used as a “production” service.  A home VoIP PBX makes it easy to have separate extensions for family members, rooms or purposes.  You can have features like individual voicemail boxes, a house intercom, a front door intercom, room to room calling, wake up calls, video conferencing, free calling for family members outside of the home, guest lines, multiple lines for simultaneous calls and the ability to make and take calls while traveling!

Once we have a PBX in the home, installing physical phones throughout the house – on PoE, of course – is the next step.  Phones can be very useful around a home, especially a larger one.  Having real phones to manage can be very educational and certainly lets you take your home to another level of IT.

No server closet would be complete without features like a domain controller, a home web server (why not have a guest website for visitors and a wiki for tracking your home projects and information?) and the biggest of home systems – storage.  Traditional storage like a NAS or file server can be very useful for storing private photos and videos, music and document collections, movies and other media.  DLNA streaming can make an audio and video library available to the entire house.  Traditional protocols such as SMB and NFS can provide high speed, protected mapped drives to the computers in the home.  And more modern storage techniques like “cloud storage” can be hosted as well.

Of course all of those workloads can be virtualized and run on a server (or two) in the server closet.  If you are incredibly ambitious this could include features like high availability or fault tolerance, although these will generally push costs into a range impractical for home use by nearly any standard.

And the pièce de résistance is, of course, backups.  Use real backup software; several enterprise options are even free at home scale.  Taking good backups, testing restores and using different media, backup strategies and backup types (such as image and file-based backups) can really showcase the reliability of your home network.
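
Restore testing, in particular, is easy to partially automate.  A minimal sketch that compares a test restore against the live data by checksum – both paths are placeholders:

    import hashlib
    from pathlib import Path

    ORIGINAL = Path("/srv/data")           # live data (placeholder path)
    RESTORED = Path("/mnt/restore/data")   # test restore target (placeholder)

    def digest(path):
        """SHA-256 of a file, read in 1 MiB chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    bad = []
    for src in ORIGINAL.rglob("*"):
        if not src.is_file():
            continue
        dst = RESTORED / src.relative_to(ORIGINAL)
        if not dst.is_file() or digest(src) != digest(dst):
            bad.append(src)

    print("restore verified" if not bad else f"{len(bad)} files failed verification")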

Don’t forget to go beyond running systems into monitoring.  Log collection and analysis, bandwidth monitoring, system monitoring, load monitoring and more can be added for completeness.  Think of all the things that exist, or that you wish existed, in an ideal office setting.  There is rarely any reason not to bring these same technologies into your home.
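
A real monitoring stack is the right answer here, but the core idea fits in a few lines.  A toy TCP reachability check over a hypothetical inventory of home services:

    import socket

    # Hypothetical inventory: (service name, host, TCP port)
    CHECKS = [
        ("switch mgmt", "192.168.1.2", 80),
        ("PBX SIP", "192.168.1.10", 5060),
        ("NAS SMB", "192.168.1.20", 445),
    ]

    for name, host, port in CHECKS:
        try:
            with socket.create_connection((host, port), timeout=3):
                print(f"OK   {name}")
        except OSError:
            print(f"DOWN {name}")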

Beyond these steps there are many directions in which one could take a home network – whatever features might be interesting to you.  Go crazy.

Our goal here is to raise the bar – to do at home what few businesses do.  Building an amazing home network that is really used, beyond building a great home IT lab, is valuable for many reasons.  A great home network is more than just an amazing learning experience: it makes for a perfect interview conversation starter, it is a “portfolio” piece demonstrating skills in cradle-to-grave LAN design and management, it shows dedication and initiative in the field and it sets a bar to reference when speaking to businesses.

Go ahead, don’t be afraid to make your business jealous of your home network.

Virtualizing Even a Single Server

I find it very common, in conversations involving virtualization, for consolidation – which in the context of server virtualization means putting multiple formerly physical workloads onto a single physical box with separation handled by virtual machine barriers – to be treated as the core tenet and fundamental feature of virtualization.  Without a doubt, workload consolidation represents an amazing opportunity with virtualization, but it is extremely important that the value of virtualization and the value of consolidation not be confused.  Too often I have found that consolidation is viewed as the key value of virtualization and the primary justification for it, but this is not the case.  Consolidation is a bonus feature and should never be needed to justify virtualization.  Virtualization should be a nearly foregone conclusion, while consolidation must be evaluated and many times would not be used.  That workloads should not be consolidated should never lead to the belief that those workloads should not be virtual.  I would like to explore the virtualization decision space to see how we should be looking at this.

Virtualization should be thought of as hardware abstraction as that is truly what it is, in a practical sense.  Virtualization encapsulates the hardware and presents a predictable, pristine hardware set to guest operating systems.  This may sound like it adds complication but, in reality, it actually simplifies a lot of things both for the makers of operating systems and drivers as well as for IT practitioners designing systems.  It is because computers, computer peripherals and operating systems are such complex beasts that this additional layer actually ends up removing complexity from the system by creating standard interfaces.  From standardization comes simplicity.

This exact same concept of presenting a standard, virtual machine to a software layer exists in other areas of computing as well, such as with how many programming languages are implemented.  This is a very mature and reliable computing model.

Hardware abstraction, and the stability that it brings, is alone reason enough to standardize on virtualization across the board, but the practical nature of hardware abstraction as implemented by all enterprise virtualization products available today brings even more important features.  To be sure, most benefits of virtualization can be found in some other way, but rarely as completely, reliably, simply or freely as from virtualization.

The biggest set of additional features typically comes from the abstraction of storage and memory, which allows for snapshotting storage or even the entire running state of a virtual machine – that is, taking an image of the running system and storing it in a file.  This leads to many very important capabilities, such as the ability to take a system snapshot before installing new software, changing configurations or patching, allowing for extremely rapid rollback should anything go wrong.  This seemingly minor feature can bring great peace of mind and overall system reliability.  It also makes testing features, rolling back and repeating tests very easy in non-production environments.
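
On KVM with libvirt, for instance, the pre-patch snapshot and rollback can be scripted in a few lines with the libvirt Python bindings.  This is a sketch only – the VM name is hypothetical, and the guest’s disks must be in a snapshot-capable format such as qcow2:

    import libvirt

    SNAP_XML = """<domainsnapshot>
      <name>pre-patch</name>
      <description>safety point before applying updates</description>
    </domainsnapshot>"""

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("app-server")      # hypothetical VM name

    snap = dom.snapshotCreateXML(SNAP_XML, 0)  # capture the current state
    # ... patch the guest, test the results ...

    # If anything went wrong, roll the whole VM back in one call:
    dom.revertToSnapshot(snap, 0)
    conn.close()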

The ability to snapshot at the abstraction layer also enables “image-based backups” – backups taken via the snapshot mechanism at the block device layer rather than from within the operating system’s file system.  This allows for operating system agnostic backup mechanisms and backups that capture the entire system storage pool at once.  Image backups allow for what were traditionally known as “bare metal restores”: the entire system can be restored to a fully running state, without additional interaction, easily and very quickly.  Not all hypervisor makers include this capability, or include it to equal levels, so while it is conceptually a major feature, the extent to which it exists or is licensed must be considered on a case by case basis (notably Hyper-V includes this fully, XenServer includes it partially and VMware vSphere only includes it with non-free license levels).  When available, image-based backups allow for extremely rapid recovery at speeds unthinkable with other backup methodologies.  Restoring systems in minutes, from disaster to recovery, is possible!

The ability to treat virtual machines as files (at least when not actively running) provides additional benefits related to the backup benefits above – namely the ability to rapidly and easily migrate between physical hosts, even between disparate hardware.  Traditionally, hardware upgrades or replacements meant a complicated migration process fraught with peril.  With modern virtualization, moving from existing hardware to new hardware can be a reliable, non-destructive process with safe fallback options and little or possibly even zero downtime!  Tasks that are uncommon but were very risky previously can often become trivial today.

Often this is the true benefit of virtualization and abstraction mechanisms: not, necessarily, improving the day to day operations of a system, but reducing risk and providing flexibility and options in the future – preparing for unknowns that are either unpredictable or simply ignored in most common situations.  Rarely is such planning done at all, much to the chagrin of IT departments left with difficult and dangerous upgrades that could have been easily mitigated.

There are also features of virtualization that apply only to special scenarios.  Many virtualization products include live migration tools for moving running workloads between hosts, or possibly even between storage devices, without downtime.  High availability and fault tolerance options are often available, allowing some workloads to rapidly or even transparently recover from hardware failure, moving from failed hardware to redundant hardware without user intervention.  While more of a niche benefit, and certainly not to be included in a list of why “nearly all workloads” should be virtual, this is worth noting as a primary example of features that are often available and can be added later if a need arises – as long as virtualization is used from the beginning.  Otherwise, a migration to virtualization would be needed before such features could be leveraged.

Virtualization products typically come with extensive additional features that only matter in certain cases.  A great many of them fall into a large pool of “in case of future need.”  Possibly the biggest of all of these is the concept of consolidation, as I mentioned at the beginning of this article.  Like other advanced features such as high availability, consolidation is not a core value of virtualization but is often confused for it.  Workloads not intending to leverage high availability or consolidation should still be virtualized – without a doubt.  But these features are so potentially valuable as future options, even for scenarios where they will not be used today, that they are worth mentioning regardless.

Consolidation can be extremely valuable, and it is easy to understand why so many people simply assume that it will be used – it is so often so valuable.  Its availability, once an infrastructure is in place, is a key point of flexibility for handling the unknowns of future workloads.  Even when consolidation is completely unneeded today there is a very good chance, even in the smallest of companies, that it will be useful at some unknown time in the future.  Virtualization provides us with a hedge against the unknown by preparing our systems for maximum flexibility.  One of the most important aspects of any IT decision is managing and reducing risk.  Virtualization does this.

Virtualization is about stability, flexibility, standardization, manageability and following best practices.  Every major enterprise virtualization product is available, at least in some form, for free today.  Any purchase would, of course, require a careful analysis of value versus expenditure.  However, with excellent enterprise options available for free from all four key product lines in this space (Xen, KVM, Hyper-V and VMware vSphere), we need make no such analysis.  We need only show that the implementation is a non-negative.

What makes the decision easy is that when we consider the nominal case – the bare minimum that all enterprise virtualization provides, which is the zero cost, abstraction, encapsulation and storage based benefits – we find a small benefit in effectively all cases, no measurable downsides and a very large potential benefit in the areas of flexibility and hedging against future needs.  This leaves us with a clear win and a simple decision: virtualization, being free and with essentially no downsides of its own, should be used in any case where it can be (which, at this point, is essentially all workloads).  Additional, non-core features like consolidation and high availability should be evaluated separately, and only after the decision to virtualize has already been solidified.  No lack of need for those extended features in any way suggests that virtualization should not be chosen on its own merits.

This is simply an explanation of existing industry best practices which have been to virtualize all potential workloads for many years.  This is not new nor a change of direction.  Just the fact that across the board virtualization has been an industry best practice for nearly a decade shows what a proven and accepted methodology this is.  There will always be workloads that, for one reason or another, simply cannot be virtualized, but these should be very few and far between and should prompt a deep review to find out why this is the case.

When deciding whether or not to virtualize, the approach should always be to assume that virtualization is a foregone conclusion, varying from this only when a solid, defensible technical reason makes it impossible.  Nearly all arguments against virtualization come from a position of misunderstanding, with a belief that consolidation, high availability, external storage, licensing cost and other loosely related or unrelated concepts are somehow intrinsic to virtualization.  They are not and should not be included in a virtualization versus physical deployment decision.  They are separate and should be evaluated as separate options.

It is worth noting that, because consolidation is not part of our decision matrix in establishing the base value of virtualization, all of the reasons we are using apply equally to one to one deployments (a single virtual machine on a single physical device) and to consolidated workloads (multiple virtual machines on a single physical device).  There is no situation in which a workload is “too small” to be virtualized.  If anything, it is the opposite: only for the largest workloads, typically those with extreme latency sensitivity, does a niche non-virtualization scenario still exist as an edge case, and even those cases are rapidly disappearing as virtualization latency and total workload capacities improve.  These cases are so rare, and vanishing so quickly, that even taking the time to mention them is probably unwise, as it suggests that exceptions based on capacity needs are common enough to evaluate for, which they are not – especially not in the SMB market.  The smaller the workload, the more ideal for virtualization.  This is only to reinforce that small businesses, with singular workloads, are the most ideal case for virtualization across the board rather than an exception to best practices, not to suggest that larger businesses should be looking for exceptions themselves.

You Are Not Special

It is not my intention for this to sound harsh, but I think that it has to be said: “You are not special.”  And by “you” here, of course, I mean your business – the organization that you, as an IT practitioner, support.  For decades we have heard complaints about how modern education systems attempt to make every student feel unique and special; when awards are given out, schools attempt to find a way, especially with elementary students, to make sure that every student gets an award of some sort.  Awards for best attendance, posture, being quiet in class or whatever recognize completely irrelevant things in order to make every student feel not only like part of the group, but like a special, unique individual who has accomplished something better than anyone else.

This attitude – this belief that everyone is special and that all of those statistics, general rules and best practices apply to “someone else” – has become pervasive in IT as well, manifesting itself in the belief that each business, each company, is so special and unique that IT industry knowledge does not apply to its situation.  IT practitioners with whom I have spoken almost always agree that best practices and accumulated industry knowledge are good and apply in nearly every case – except their own.  All of those rules of thumb, all of those guidelines, are great for someone else, but not for them.  The problem is that nearly everyone feels this way, and that cannot be the case.

I have found this problem to be most pronounced and, in fact, almost exclusive to the small business market where, in theory, the likelihood of a company being highly unique is actually much lower than in the large enterprise space of the Fortune 100, where uniqueness is somewhat expected.  But instead of small businesses assuming uniformity and enormous businesses expecting uniqueness, the opposite appears to happen.  Large businesses understand that even at massive scale IT problems are mostly standard patterns and by and large should be solved using tried and true approaches.  Small businesses, seemingly driven by an emotional need to be “special”, claim a need to avoid industry patterns, often eschewing valuable knowledge to a ludicrous degree, and often while conforming to the most textbook example of the use case for the pattern.  It almost seems, from my experience, that the more “textbook” a small business is, the more likely its IT department is to avoid solutions designed exactly for it and to attempt to reinvent the wheel at any cost.

Common solutions and practices apply to the overwhelming majority of businesses and workloads – easily in excess of 99.9% of them.  Even in larger companies where there is opportunity for uniqueness, we expect to see only rare workloads that fall into a unique category.  Even in the world’s largest businesses the average workload is, well, average.  Large enterprises with tens of thousands of servers and workloads often find themselves with a handful of very unique situations for which there is no industry standard to rely on; even so, they have many thousands of very standard workloads that are not special in any way.  The smaller the business, the less opportunity for a unique workload and the less chance of one occurring on a workload by workload basis, simply because there are so many fewer workloads.

One of the reasons that small businesses, even very unique ones as small businesses go, are rarely actually unique is that when a small business has an extreme need for, say, performance, capacity, scale or security, it [almost] never needs that thing in excess of existing standards for larger businesses.  The standards for how to deal with large data sets or extreme security, for example, are already well established in the industry at large, and small businesses need only leverage the knowledge and practices developed for larger players.

What is surprising is when a small business with relatively trivial revenue believes that its data requires a level of secrecy and security in excess of the security standards of the world’s top financial institutions, military organizations, governments, hospitals or nuclear power facilities.  What makes the situation more absurd is that in pursuing these extremes of security, small businesses almost always end up with very low security standards.  They often cite a need for “extreme security” to justify insecure – or, as we often say, “tin foil hat” – procedures.

Security is one area where this behavior is very pronounced.  Often it is small business owners or small business IT “managers” who create this distrust of industry standards, not IT practitioners themselves, although the feeling that a business is unique often trickles down and is seen there as well.

Similar to security, the need for unlimited uptime and highly available systems – rarely needed even for high end enterprise workloads – seems an almost ubiquitous goal in small businesses.  Small businesses often spend orders of magnitude more money, relative to revenue, on high availability systems than their larger counterparts.  Often this is done with the mistaken belief that large businesses always use high availability and that a small business must do so to compete, that if it does not then it is not a viable business, or that any downtime equates to business collapse.  None of these are true.  Enterprises have a far lower cost of reliability compared to revenue and still do considerable cost analysis to see what reliability expenditures are justified by risk.  Small businesses rarely do that best practice analysis and jump, almost universally, to the very unlikely belief that their workloads are dramatically more valuable than even the largest enterprises’ and that they have no means of mitigating downtime.  Eschewing business best practices (doing careful cost and risk analysis before investing in risk mitigation), financial best practices (erring on the side of up front cost savings) and technology best practices (high availability only when needed and justified) leaves many businesses operating from the belief that they are “special” and that none of the normal rules apply to them.

By approaching all technology needs from the assumption of being special, businesses that do this are unable to leverage the vast accumulated knowledge of the industry.  This means continuously reinventing the wheel and attempting to forge new paths where well trodden, safe paths already exist.  Not only can this result in extreme overspending in some cases and dangerous risk in others, it effectively guarantees that the cost of any project is unnecessarily high.  Small businesses, especially, have the extreme advantage of being able to leverage the research and experience of larger businesses, allowing them to be more agile and lean.  This is a key component of how small businesses compete against the advantages of scale inherent to large businesses.  When small businesses ignore this advantage they are left with neither the scale of big business nor the advantages of being small.

There is no simple solution here – small business IT practitioners and small business managers need to step down from their pedestals, take a long, hard look at their companies and ask if they really are unique and special or if they are normal businesses with normal needs.  I guarantee you are not the first to face the problems that you have.  If there isn’t a standard solution approach available already, then perhaps the approach to the problem is itself wrong.  Take a step back and evaluate with an eye to understanding that many businesses share common problems and can tackle them effectively using standard patterns, approaches and, often, best practices.  If your immediate reaction to best practices, patterns and industry knowledge is “yes, but that doesn’t apply here”, you need to stop and reevaluate – because yes, it certainly does apply to you.  It is almost certainly true that you have misunderstood the uniqueness of your business, or misunderstood how the guidance is applied, resulting in the feeling that those guidelines are not applicable.  Even those rare businesses with very unique workloads have them for only a small number of their workloads, not the majority; the most extremely unique businesses and organizations still have many common workloads.

Patterns and best practices are our friends and allies, our trusted partners in IT.  IT, and business in general, is challenging and complex.  To excel as IT practitioners we can seek to stand on the shoulders of giants, walk the paths that have been mapped and trodden for us and leverage the work of others to make our solutions as stable, predictable and supportable as possible.  This allows us to provide maximum value to the businesses that we support.

Explaining the Lack of Large Scale Studies in IT

IT practitioners ask for these every day and yet none exist: large scale risk and performance studies of IT hardware and software.  This covers a wide array of possibilities, but common examples are failure rates between different server models, hard drives, operating systems, RAID array types, desktops, laptops – you name it.  And yet, regardless of the high demand for such data, there is none available.  How can this be?

Not all cases are the same, of course, but by and large there are three significant factors that keep this type of data from entering the field: the high cost of conducting a study, the long time scale necessary for a study and a lack of incentive to produce and/or share this data with other companies.

Cost is by far the largest factor.  If the cost of large scale studies could be overcome, solutions could be found for all of the other factors.  But sadly the nature of a large scale study is that it will be costly.  As an example we can look at server reliability rates.

In order to determine failure rates on a server we need a large number of servers from which to collect data.  This may seem like an extreme example, but server failure rates are one of the most commonly requested large scale study figures, so the example is an important one.  We would need perhaps a few hundred servers for a very small study, but to get statistically significant data we would likely need thousands of servers.  If we assume that a single server costs five thousand dollars, which would be a relatively entry level server, we are looking at easily twenty-five million dollars of equipment!  And that is just enough to do a somewhat small scale test (just five thousand servers) of a rather low cost device.  If we were talking about enterprise servers, we would easily jump to thirty or even fifty thousand dollars per server, taking the cost to a quarter of a billion dollars.
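
The arithmetic is trivial but worth seeing, using the figures from the example above:

    servers = 5_000            # a modest sample for statistical purposes
    entry_cost = 5_000         # dollars per entry level server
    enterprise_cost = 50_000   # dollars per high end enterprise server

    print(f"entry level:  ${servers * entry_cost:,}")       # $25,000,000
    print(f"enterprise:   ${servers * enterprise_cost:,}")  # $250,000,000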

Now that cost, of course, is for testing a single configuration of a single server model.  Presumably for a study to be meaningful we would need many different models of servers – perhaps several from each vendor to compare different lines and features, and perhaps many different vendors.  It is easy to see how quickly the cost of a study becomes impossibly large.

This is just the beginning of the cost, however.  A good study requires carefully controlled environments, on par with the best datacenters, to isolate environmental issues as much as possible.  This means highly reliable power, cooling, airflow, humidity control, and vibration and dust control.  Good facilities like this are very expensive – which is why many companies do not pay for them, even for valuable production workloads.  In a large study this cost could easily exceed the cost of the equipment itself over the course of the study.

Then, of course, we must address the need for special sensors and testing.  What exactly constitutes a failure?  Even in production systems there is often dispute about this.  Is a hard drive failing in an array a failure, even if the array does not fail?  Is a predictive failure a failure?  When dealing with drive failure in a study, how do you factor in human components, such as drive replacement, which may not be done in a uniform way?  There are ways to handle this, but they add complication and skew the studies away from real world data toward data contrived for the study.  Establishing study guidelines that are applicable and useful to end users is much harder than it seems.

And then there is the biggest cost: manual labor.  Maintaining the environment for a large study takes human capital that may equal the cost of the study itself.  It takes a large number of people to maintain a study environment, run the study, monitor it and collect the data.  All in all, the costs are generally, simply, prohibitive.

Of course we could greatly scale back the test, run only a handful of servers and only two or three models, but the value of the test rapidly drops and risks ending up with results that no one can use while still having spent a large sum of money.

The second insurmountable problem is time.  Most things need to be tested for failure rates over time, and as equipment in IT is generally designed to work reliably for decades, collecting data on failure rates requires many years.  Mean Time to Failure numbers are only so valuable; Mean Time Between Failures, along with failure types, modes and statistics about those failures, is what makes a study useful.  This means that for a study to be truly useful it must run for a very long time, creating greater and greater cost.

But that is not the biggest problem.  The far larger issue is that by the time a study has run long enough to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late.  The equipment in question would already be aging and nearing replacement in the production marketplace by the time the study produced truly useful early results.  Production equipment is often purchased for only a three to five year total lifespan; getting results even one year into that span would have little value.  And new products may replace those in the study even faster than the products age naturally, making the study valuable only as history, with no use in guiding production decisions – the results would be too old to be useful by the time they were available.

The final major factor is a lack of incentive to provide what data does exist to those who need it.  A few sources of data do exist, but nearly all are incomplete and exist so that large vendors can measure their own equipment quality, failure rates and the like.  These measurements are rarely made in controlled environments and often involve data collected from the field.  In many cases this data may even be private to customers and not legally shareable regardless.

But vendors who collect data do not collect it in an even, monitored way, so sharing it could be very detrimental to them because there is no assurance that equal data from their competitors would exist.  Uncontrolled statistics like that would offer no true benefit to the market, nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.

The rare exceptions are hardware studies from vendors such as Google and Backblaze, who have large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes.  They face little or no risk from competitors leveraging that data but do gain public relations value from releasing it, and so, occasionally, will publish a study of hardware reliability on a limited scale.  These studies are hungrily devoured by the industry even though they generally contain relatively little value: their data is old and gathered under unknown conditions and thresholds, often lacks statistical significance for product comparison and, at best, reveals general industry wide trends that are somewhat useful for predicting future reliability.

Most other companies large enough to have internal reliability statistics have them for only a narrow range of equipment and consider that information proprietary, a potential risk if divulged (it would give out important details of architectural implementations) and a competitive advantage.  So, for these reasons, they are not shared.

I have actually been fortunate enough to have been involved in, and to have run, a large scale storage reliability study, conducted somewhat informally but very valuably, on over ten thousand enterprise servers over eight years – eighty thousand server years of study, a rare opportunity.  What that study primarily showed is that, even on a set so large, we were still unable to observe a single failure!  The lack of failures was, itself, very valuable, but we were unable to produce any standard statistic like Mean Time to Failure.  To produce the kind of data that people expect, we know that we would have needed hundreds of thousands of server years, at a minimum, to approach statistical significance – and we cannot reliably state that even that would have been enough.  Perhaps millions of server years would have been necessary.  There is no way to truly know.
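
That intuition can be made concrete with the classic statistical “rule of three”: if zero failures are observed in N independent trials, the one-sided 95% upper confidence bound on the per-trial failure rate is roughly 3/N.  A quick sketch applying it to server years:

    def upper_bound(server_years, confidence=0.95):
        """Exact one-sided upper bound on the annual failure rate given
        zero observed failures: solve (1 - p) ** n = 1 - confidence."""
        return 1 - (1 - confidence) ** (1 / server_years)

    n = 80_000  # server years observed, zero failures
    print(f"rule of three approximation: {3 / n:.2e}")  # ~3.75e-05 / server-year
    print(f"exact 95% upper bound:       {upper_bound(n):.2e}")

Eighty thousand failure-free server years, in other words, can only bound the failure rate from above; it cannot produce an MTTF estimate at all.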

Where this leaves us is that large scale studies in IT simply do not exist and likely never will.  Where they do appear they will be isolated and almost certainly crippled by the necessities of reality.  There is no means of monetizing studies on the scale necessary to be useful – mostly because failure rates of enterprise gear are so low while the equipment is so expensive – so third party firms can never cover the cost of providing this research.  As an industry we must accept that this type of data does not exist and actively pursue alternatives.  It is surprising that so many people in the field expect this type of data to be available when it never has been historically.

Our only real options, given this vacuum, are to collect what anecdotal evidence exists (a very dangerous thing to do, requiring careful consideration of context) and to apply logic to assess reliability approaches and techniques.  This is a broad situation where observation necessarily fails us and only logic and intuition can fill the resulting gap in knowledge.