Category Archives: IT Management

Understanding Bias

I often write about the importance of alignment between the goals of IT and its vendors, and how critical it is to avoid taking advice from people whom you are not paying for that advice, because that makes them salespeople.  In short: get advice and guidance from a buyer’s agent rather than directly from the seller’s agent.  This leads to questions about bias; clearly the idea is that a salesperson is biased in a way that is likely unfavourable to you.  But it should be obvious that all people are biased.

This is true: all people have bias.  We cannot hope to escape or remove all bias; that is simply impossible.  In fact, in many ways, when we seek advice – whether from a paid consultant whose job it is to present us with a good option, from IT itself doing the same, or from a friend giving feedback on products that they have tested – it is actually their biases that we are seeking!

What we need to do is strive to understand the biases and motivations of the people with whom we speak and from whom we receive advice, be self-reflective enough to understand our own biases, have a good knowledge of which biases are good for us and attempt to get advice from people who have a general bias-alignment with us.

Biases come in many forms.  We can have good and bad biases, strong and weak ones.

The biggest biases typically come externally, in the form of monetary or near-monetary compensation.  This might be someone being paid as a salesperson to promote the products that they are able to sell; commission structures take this to an even more acute level.  Someone paid to do sales faces two of the strongest biases: monetary (they get money if they make the sale) and ethical (they made an agreement to sell this product if possible and are ethically bound to try to do so.)  These are the standard biases of the “seller’s agent” or salesperson.

On the other hand, a consultant is paid by the buyer or customer, is a buyer’s agent and has the same monetary and ethical biases, but in favour of the buyer rather than against them.  (I use the terms buyer and customer here mostly interchangeably to represent the business or IT department, the ones receiving advice or guidance on what to do or buy.)  These biases are fairly evident and easy to control, and I have covered them before – never get advice from the seller’s agent, always get your advice from the buyer’s agent.

If we assume that these big biases, those of alignment, are covered we still have a large degree of bias from our buyer’s agent that we need to uncover and understand.

One of the most common biases is the bias towards familiarity.  This is not a bad bias, but we must be aware of it and of how it colours recommendations.  This bias can run very deep and affect decision making in ways that we may not understand without investigation.  At the highest level, the idea is simply that almost anyone is going to favour, possibly unintentionally, solutions and products with which they are familiar; the stronger that familiarity, the stronger the bias towards those products tends to be.

This may seem obvious, but it is a bias that is commonly overlooked.  People turning to consultants will often seek advice from someone with a very small set of experiences, from which the resulting recommendations are likely to be drawn.  In a way, this is effectively the buyer preselecting the desired outcome and choosing a consultant that will deliver it.  An example would be choosing a network engineer to design a solution when that engineer only knows one product line; naturally the engineer will almost certainly design a solution from that product line.  In choosing someone with limited experience in that area we are, for all intents and purposes, directing the results by picking based on a strong bias.  This happens extremely often in IT, presumably because those hiring consultants base the decision on what they think are foregone conclusions about what the resulting advice will be, forgetting to step back and get advice at a higher level.

Of course, like with many things, there is also an offset bias to the familiarity bias, the exploration bias.  While we tend to be strongly biased towards things that we know, there is also a bias towards the unknown and the opportunity to explore and learn.  This bias tends to be extremely weak compared to the familiarity bias, but far from trivial in many IT practitioners.  It is a bias that should not be ignored and is important for helping broaden the potential scope of advice from a single consultant.

Of course there are more biases that stem from familiarity.  There is a natural, strong bias towards companies that we have found to have good products, have good support or interact well.  Companies with whom we have experienced product, support or interaction issues we tend to be strongly biased against.  These, of course, are highly valuable biases that we specifically want consultants to bring with them.

One of the worst biases, however, and one that affects everyone, is marketing bias.  Companies with large or well made marketing campaigns, or that align with industry marketing campaigns, can induce a large amount of bias that is not based on anything valuable to the end user.  Similarly, market share is an almost valueless and often negative factor (large companies often charge more for equal products – e.g. you “pay for the name”) but can be a strong bias, one often brought to the table by the customer.  Customers commonly either directly enforce this bias, by demanding that only well marketed, seemingly popular or large vendor recommendations be made, or fail to react properly to seemingly alternative solutions: both reactions heavily influence what a consultant is willing to recommend.  This is known as the “no one ever got fired for buying IBM” effect from the 1980s, and it is often an amazingly costly bias and a difficult one to overcome.  Of course it applies much more broadly than to IBM alone and does not primarily pertain to them today, but the phrase became famous during IBM’s heyday in IT.

Of course the main bias that we seek is the bias towards “what is the best option for the customer.”  This is, itself, a bias – one that we hope, when combined with other positive biases, overpowers the influence of negative biases.  And likewise there is a prestige bias, a desire to produce advice that is so good that it increases the respect for the consultant.

Biases come in many different types and are both the value in advice and the danger in it.  Leveraging bias requires an understanding of the major biases that are, or are likely to be, at play in any specific instance as well as having empathy for the people who give advice.  If you take the time to learn what their financial, ethical, experiential and objective biases are, you can understand their role far better and can filter their advice based on that knowledge.

Take the time to consider the biases of the people from whom you get advice.  Likely you already know which biases affect them significantly and may be able to guess at others.  Everyone has different biases and all people react to them differently.  What is a strong bias for one person is a weak one for another.  Consider talking to your consultants about their biases; they should be open to this conversation (and if not, be extra cautious) and hopefully have thought about it themselves, even if not in depth or in the same terms.

The people from whom you get advice should have biases that strongly align favourably towards you and your goals.


Types of IT Service Providers

A big challenge, both to IT Service Providers and to their customers, is in attempting to define exactly what an IT vendor is and how their customers should expect to interact with them.  Many people see IT Service Providers (we will call them ITSPs for short here) as a single type of animal but, in reality, ITSPs come in all shapes and sizes and need to be understood in order to leverage a relationship with them well.  Even if we lack precise or universally accepted terms, the concepts are universal.

Even within the ITSP industry there is little to no standardization of naming conventions, even though there are relatively tried and true company structures which are nearly always followed.  The really important aspect of this discussion is not to carefully define the names of service providers but to explain the competing approaches so that, when engaging a service provider, a meaningful discussion around these models can be had and an understanding of the appropriate resulting relationship can be reached.

It is also important to note that any given service provider may use a hybrid or combination of models.  Using one model does not preclude the use of another as well.  In fact it is most common for a few approaches to be combined, as multiple approaches make it easier to capture revenue, which is quite critical since IT service provisioning is a relatively low margin business.

Resellers and VARs:  The first, largest and most important category to identify is that of the reseller.  Resellers are the easiest to identify as they, as their name indicates, resell things.  Resellers range from pure resellers, companies that do nothing but purchase from vendors on one side and sell to customers on the other (vendors like NewEgg and Amazon fit into this category, though they do not focus on IT products), to the more common Value Added Resellers, who not only resell products but maintain some degree of skill or knowledge around them.

Value Added Resellers are a key component of the overall IT vendor ecosystem as they supply more than just a product purchasing supply chain: they maintain key skills around those products.  Commonly VARs will have skills around product integration, supply chain logistics, supported configurations, common support issues, licensing and other factors.  It is common for customers, and even other types of ITSPs, to lean on a VAR to get details about product specifics or insider information.

Resellers of any type, quite obviously, earn their money through markup and margins on the goods that they resell.  This creates an interesting relationship between customer and vendor, as the vendor is always in a position of needing to make a sale in order to produce revenue.  Resellers are often turned to for advice, but it must be understood that the relationship is one of sales and the reseller only gets compensated when a sale takes place.  This makes the use of a reseller somewhat complicated, as the advice or expertise sought may conflict with what is in the interest of the reseller.  The relationship with a reseller requires careful management to ensure that guidance and direction coming from the reseller are aligned with the customer’s needs, are isolated to areas in which the reseller is an expert and are handled in ways that are mutually beneficial to both parties.

Managed Service Providers or MSPs: The MSP is probably the best known title in this field.  In recent years the term MSP has come to be used so loosely that it is often simply used to denote any IT service provider, whether or not they provide something that would appropriately be deemed a “managed service.”  To understand what an MSP is truly meant to be we have to understand what a “managed service” means in the context of IT.

The idea of managed services is generally understood to be related to the concept of “packaging” a service.  That is, producing a carefully designed and designated service, or set of services, that can be sold at a fixed or relatively predictable price.  MSPs typically have very well defined service offerings and can often provide very predictable pricing.  MSPs take the time up front to develop predictable service offerings, allowing customers to plan and budget easily.

This heavy service definition process generally means that selecting an MSP is normally done very tightly around specific products or processes and nearly always requires customers to conform to the MSPs standards.  In exchange, MSPs can provide very low cost and predictable pricing in many cases.  Some of the most famous approaches from MSPs include the concepts of “price per desktop”, “price per user” or “price per server” packages where a customer might pay one hundred dollars per desktop per month and work from a fixed price for whatever they need.  The MSP, in turn, may define what desktops will be used, what operating system is used and what software may be run on top of it.  MSPs almost universally have a software package or a set of standard software packages that are used to manage their customers.   MSPs generally rely on scaling across many customers with shared processes and procedures in order to create a cost effective structure.
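The per-unit pricing described above is what makes MSP budgets so predictable: the monthly bill is just a sum of unit counts times fixed rates.  A minimal sketch of that arithmetic follows; the rates and unit counts are purely illustrative assumptions, not figures from any real MSP price list.

```python
# Hypothetical MSP per-unit pricing sketch.  All rates and counts below
# are assumed, illustrative values -- the point is the pricing shape.
def msp_monthly_bill(units: dict, rates: dict) -> float:
    """Sum the fixed per-unit charges for one month."""
    return sum(units[kind] * rates[kind] for kind in units)

# Assumed monthly rates in USD (e.g. the "price per desktop" model).
rates = {"desktop": 100.0, "user": 60.0, "server": 250.0}
# Assumed environment for a hypothetical small-business customer.
units = {"desktop": 40, "user": 55, "server": 3}

total = msp_monthly_bill(units, rates)
print(f"Predictable monthly spend: ${total:,.2f}")  # → $8,050.00
```

Because the bill depends only on unit counts, the customer can forecast next year's IT budget from a headcount plan alone, which is exactly the planning benefit the MSP model trades standardization for.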

MSPs typically focus on internal efficiencies to maximize profits.  The idea being that a set price service offering can be made to be more and more effective by adding more nearly identical customers and improving processes and tooling in order to reduce the cost of delivering the service.  This can be a great model with a high degree of alignment between the needs of the vendor and the customer as both benefit from an improvement in service delivery and the MSP is encouraged to undertake the investments to improve operational efficiency in order to improve profits.  The customer benefits from set pricing and improved services while the vendor benefits from improved margins.  The caveat here is that there is a risk that the MSP will seek to skirt responsibilities or to lean towards slow response or corner cutting since the prices are fixed and only the services are flexible.

IT Outsourcers & Consultants: IT Outsourcing may seem like the most obvious form of ITSP but it is actually a rather uncommon approach.  I lump together the ideas of IT Outsourcing and consulting because, in general, they are actually the same thing simply handled at two different scales.  The behaviours are essentially the same between them.  In contrast with MSPs, we could also think of this group as Unmanaged Service Providers.  IT Outsourcers do not develop heavily defined service packages but instead rely on flexibility and a behaviour much more akin to that of an internal IT department.  IT Outsourcers literally act like an external IT department or a portion thereof.  An IT Outsourcer will typically have a technological specialty or a range of specialties, but many are also very generalized and will handle nearly any technological need.

This category can act in a number of different ways when interacting with a business.  When brought in for a small project or a single technological issue they are normally thought of as a consultancy – providing expertise and advice around a single issue or set of issues.  Outsourcing can also mean using the provider as a replacement for the entire IT department, allowing a company to exist without any IT staff of their own.  And there is a lot of middle ground where the IT Outsourcer might be brought in only to handle specific roles within the larger IT organization, such as running and staffing the help desk, doing only network engineering or providing continuous management and oversight without doing hands on technical work.  IT Outsourcers are very hard to define because they are so flexible and can exist in so many different ways.  Each IT Outsourcer is unique as is, in most cases, every client engagement.

IT Outsourcing is far more common, and almost ubiquitous, within the large business and enterprise spaces.  It is a very rare enterprise that does not turn to outsourcing for at least some role within the organization.  Small businesses use IT Outsourcers heavily but are more likely to use the more well defined MSP model than their larger counterparts.  The MSP market is focused primarily on the small and medium business space.

It is imperative, of course, that the concept of outsourcing not be conflated with off-shoring which is the practice of sending IT jobs overseas.  These two things are completely unrelated.  Outsourcing often means sending work to a company down the street or at least in the same country or region.  Off-shoring means going to a distant country, presumably across the ocean.  It is off-shoring that has the bad reputation but sadly people often use the term outsourcing to incorrectly refer to it which leads to much confusion.  Many companies use internal staff in foreign markets to off-shore while being able to say that no jobs are outsourced.  The misuse of this term has made it easy for companies to hide off-shoring of labor and given the local use of outsourced experts a bad reputation without cause.

It is common for IT Outsourcing relationships to be based around a cost per hour or per “man day” or on something akin to a time and materials relationship.  These arrangements come in all shapes and sizes, to be sure, but generally the alignment of an IT Outsourcer to a business is the most like the relationship that a business has with its own internal IT department.  Unlike MSPs who generally have a contractual leaning towards pushing for efficiency and cutting corners to add to profits, Outsourcers have a contractual leaning towards doing more work and having more billable hours.  Understanding how each organization makes its money and where it is likely to “pad” or where cost is likely to creep is critical in managing the relationships.

Professional Services: Professional Services firms overlap heavily with the more focused consulting role within IT Outsourcing and this makes both of these roles rather hard to define.  Professional Services tend to be much more focused, however, on very specific markets whether horizontal, vertical or both.  Professional Services firms generally do not offer full IT department or fully flexible arrangements like the IT Outsourcer does but are not packaged services like the MSP model.  Typically a Professional Services firm might be centered around a small group of products that compete for a specific internal function and invest heavily in the expertise around those functions.  Professional Services tend to be brought in more on a project basis than Outsourcers who, in turn are more likely to be project based than MSPs.

Professional Services firms tend to bill based on project scope.  This means that the relationship with a PS firm requires careful scope management.  Many IT Outsourcers will do project based work as well, and when billing in this way the same applies equally to them; likewise, some PS firms will bill by the hour, in which case the IT Outsourcing relationship model applies.  In a project it is important that everyone be acutely aware of the scope and how it is defined.  A large amount of overhead must go into the scoping by both sides, as it is the scope document that will define profitability and cost.  PS firms are by necessity experts at ensuring that scopes are well defined and profitable to them.  It is very easy for a naive IT department to improperly scope a project and be left with a project that they feel is incomplete.  If scope management is, and you will excuse the pun, out of scope for your organization then it is wise to pursue Professional Services arrangements via a more flexible term such as hourly or time and materials.

All of these types of firms have an important role to play in the IT ecosystem.  Rarely can an internal IT department have all of the skills necessary to handle every situation on its own; it requires the careful selection and management of outside firms to round out the needs of a business and cover what is needed in the best ways possible.  At a minimum, internal IT must work with vendors and resellers to acquire the gear that they need for IT to exist.  Rarely does it stop there.  Whether an IT department needs advice on a project, extra hands when things get busy, oversight on something that has not been done before, support during holidays or off hours or just peers off of whom ideas can be bounced, IT departments of all sizes and types turn to IT Service Providers to fill in gaps both big and small.

Any IT role or function can be moved from internal to external staff.  The only role that ultimately can never be moved to an external team is the top level of vendor management.  At some point, someone internal to the business in question must oversee the relationship with at least one vendor (or else there must be a full internal IT staff fulfilling all roles.)  In many modern companies it may make sense for a single person internal to the company, often a highly trusted senior manager, to be assigned to oversee vendor relationships while allowing a vendor or a group of vendors to actually handle all aspects of IT.  Some vendors specialize in vendor relationship management and may bring experience with, and quality management of, other vendors as part of their skill set.  Often these are MSPs or IT Outsourcers who offer IT Management as part of their core skill set.  This can be a very valuable component, as these vendors often work with other vendors a great deal, have a better understanding of performance and cost expectations and can leverage more scale and reputation than the end customer can.

Just as an internal IT department is filled with variety, so are IT service and product vendors.  Your vendor and support ecosystem is likely to be large and unique and will play a significant role in defining how you function as an IT department.  The key to working well with this ecosystem is understanding what kind of organization it is that you are working with, considering their needs and motivations and working to establish relationships based on mutual business respect coupled with relational guidelines that promote mutual success.

Remember that as the customer you drive the relationship with the vendor; they are stuck in the position of delivering the service requested or declining to do so.  But as the customer, you are in a position to push for a good working relationship that makes everyone able to work together in a healthy way.  Not every relationship is going to work out for the best, but there are ways to encourage good outcomes and to put the best foot forward in starting a new relationship.

You Are Not Special

It is not my intention for this to sound harsh, but I think that it has to be said: “You are not special.”  And by “you” here, of course, I mean your business – the organization that you, as an IT practitioner, support.  For decades we have heard complaints about how modern education systems attempt to make every student feel unique and special; when awards are given out, schools find a way, especially with elementary students, to ensure that every student receives an award of some sort.  Awards for best attendance, posture or being quiet in class are created to reward completely irrelevant things so that every student not only feels like part of the group, but is a special, unique individual who has accomplished something better than anyone else.

This attitude, this belief that everyone is special and that all of those statistics, general rules and best practices apply to “someone else” has become pervasive in IT now as well, manifesting itself in the belief that each business, each company is so special and unique that IT industry knowledge does not apply in this situation.  IT practitioners with whom I have spoken almost always agree that best practices and accumulated industry knowledge are good and apply in nearly every case – except for their own.  All of those rules of thumb, all of those guidelines are great for someone else, but not for them.  The problem is that nearly everyone feels this way, but this cannot be the case.

I have found this problem to be most pronounced and, in fact, almost exclusive to the small business market where, in theory, the likelihood of a company being highly unique is actually much lower than in the large enterprise space of the Fortune 100, where uniqueness is somewhat expected.  But instead of small businesses assuming uniformity and enormous businesses expecting uniqueness, the opposite appears to happen.  Large businesses understand that even at massive scale IT problems are mostly standard patterns and by and large should be solved using tried and true, normal approaches.  And likewise, small businesses, seemingly driven by an emotional need to be “special,” claim a need to avoid industry patterns, often eschewing valuable knowledge to a ludicrous degree even while conforming to the most textbook example of the use case for the pattern.  It almost seems, from my experience, that the more “textbook” a small business is, the more likely its IT department will avoid solutions designed exactly for it and attempt to reinvent the wheel at any cost.

Common solutions and practices apply to the majority of businesses and workloads, easily in excess of 99.9% of them.  Even in larger companies where there is opportunity for uniqueness we expect to see only rare workloads that fall into a unique category.  Even in the world’s largest businesses the average workload is, well, average.  Large enterprises with tens of thousands of servers and workloads often find themselves with a handful of very unique situations for which there is no industry standard to rely on.  But even so, they have many thousands of very standard workloads that are not special in any way.  The smaller the business, the less opportunity there is for a unique workload and the lower the chance of one occurring at all, simply because there are so many fewer workloads.

One of the reasons that small businesses, even ones very unique as small businesses go, are rarely actually unique is that when a small business has an extreme need for, say, performance, capacity, scale or security, it almost never means that it needs that thing in excess of existing standards for larger businesses.  The standards for how to deal with large data sets or extreme security, for example, are already well established in the industry at large, and small businesses need only leverage the knowledge and practices developed for larger players.

What is surprising is when a small business with relatively trivial revenue believes that its data requires a level of secrecy and security in excess of the security standards of the world’s top financial institutions, military organizations, governments, hospitals or nuclear power facilities.  What makes the situation more absurd is that in pursuing these extremes of security, small businesses almost always end up with very low security standards.  They often cite needs for “extreme security” to justify insecure or, as we often say, “tin foil hat” procedures.

Security is one area where this behaviour is very pronounced.  Often it is small business owners or small business IT “managers” who create this distrust of industry standards, not IT practitioners themselves, although the feeling that a business is unique often trickles down and is seen there as well.

Similar to security, unlimited uptime and highly available systems, rarely needed even for high end enterprise workloads, seem to be almost ubiquitous goals in small businesses.  Small businesses often spend orders of magnitude more money, relative to revenue, on procuring high availability systems than their larger counterparts do.  Often this is done with the mistaken belief that large businesses always use high availability and that small businesses must do so to compete, that without it they are not a viable business or that any downtime equates to business collapse.  None of these are true.  Enterprises have a far lower cost of reliability compared to revenue and still do considerable cost analysis to see which reliability expenditures are justified by the risk.  Small businesses rarely do that best practice analysis and jump, almost universally, to the very unlikely belief that their workloads are dramatically more valuable than even the largest enterprises’ and that they have no means of mitigating downtime.  Eschewing business best practices (doing careful cost and risk analysis before investing in risk mitigation), financial best practices (erring on the side of up front cost savings) and technology best practices (high availability only when needed and justified) leaves many businesses operating from the belief that they are “special” and that none of the normal rules apply to them.
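The cost and risk analysis that enterprises do before buying high availability can be reduced to a simple expected-loss comparison: estimate the annual downtime cost with and without HA and weigh the difference against the yearly cost of the HA investment.  A toy sketch of that comparison follows; every figure in it is an assumed input for illustration, not industry data.

```python
# Toy expected-loss comparison for a high availability decision.
# All inputs are assumed, illustrative numbers -- the shape of the
# analysis is the point, not the specific values.
def expected_downtime_cost(annual_outage_hours: float, cost_per_hour: float) -> float:
    """Expected yearly loss from downtime."""
    return annual_outage_hours * cost_per_hour

# Assumed: a standard single-server setup vs. an HA cluster.
standard = expected_downtime_cost(annual_outage_hours=12.0, cost_per_hour=500.0)
with_ha = expected_downtime_cost(annual_outage_hours=1.0, cost_per_hour=500.0)
ha_annualized_cost = 25_000.0  # assumed extra yearly spend on HA gear/licensing

risk_reduction = standard - with_ha
print(f"Risk reduced: ${risk_reduction:,.0f}/yr vs HA cost ${ha_annualized_cost:,.0f}/yr")
print("HA justified" if risk_reduction > ha_annualized_cost else "HA not justified")
```

With these assumed numbers the HA spend is an order of magnitude larger than the risk it removes, which is exactly the outcome that a small business skipping the analysis never discovers.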

By approaching all technology needs from the assumption of being special, businesses are unable to leverage the vast accumulated knowledge of the industry.  This means continuously reinventing the wheel and attempting to forge new paths where well trodden, safe paths already exist.  Not only can this result in an extreme degree of overspending in some cases, and in dangerous risk in others, but it effectively guarantees that the cost of any project is unnecessarily high.  Small businesses, especially, have the extreme advantage of being able to leverage the research and experience of larger businesses, allowing them to be more agile and lean.  This is a key component of making small businesses competitive against the advantages of scale inherent to large businesses.  When small businesses ignore this advantage they are left with neither the scale of big business nor the advantages of being small.

There is no simple solution here – small business IT practitioners and small business managers need to step down from their pedestals and take a long, hard look at their companies and ask if they really are unique and special or if they are a normal business with normal needs.  I guarantee you are not the first to face the problems that you have.  If there isn’t a standard solution approach available already then perhaps the approach to the problem is wrong itself.  Take a step back and evaluate with an eye to understanding that many businesses share common problems and can tackle them effectively using standard patterns, approaches and often best practices.  If your immediate reaction to best practices, patterns and industry knowledge is “yes but that doesn’t apply here” you need to stop and reevaluate – because yes, it certainly does apply to you.  It is almost certainly true that you have misunderstood the uniqueness of your business or you have misunderstood how the guidance is applied resulting in the feeling that those guidelines are not applicable.  Even those rare businesses with very unique workloads only have them for a small number of their workloads and not the majority of them; the most extremely unique businesses and organizations still have many common workloads.

Patterns and best practices are our friends and allies, our trusted partners in IT.  IT, and business in general, is challenging and complex.  To excel as IT practitioners we can seek to stand on the shoulders of giants, walk the paths that have been mapped and trodden for us and leverage the work of others to make our solutions as stable, predictable and supportable as possible.  This allows us to provide maximum value to the businesses that we support.

Explaining the Lack of Large Scale Studies in IT

IT practitioners ask for these every day and yet none exist – large scale risk and performance studies for IT hardware and software.  This covers a wide array of possibilities, but common examples are failure rates between different server models, hard drives, operating systems, RAID array types, desktops, laptops, you name it.  And yet, regardless of the high demand for such data, none is available.  How can this be?

Not all cases are the same, of course, but by and large there are three really significant factors that come into play keeping this type of data from entering the field.  These are the high cost of conducting a study, the long time scale necessary for a study and a lack of incentive to produce and/or share this data with other companies.

Cost is by far the largest factor.  If the cost of large scale studies could be overcome, solutions could be found for all of the other factors.  But sadly, by its very nature, a large scale study will be costly.  As an example we can look at server reliability rates.

In order to determine failure rates on a server we need a large number of servers from which to collect this data.  This may seem like an extreme example, but server failure rate is one of the most commonly requested large scale study figures, so the example is an important one.  We would need perhaps a few hundred servers for a very small study, but to get statistically significant data we would likely need thousands.  If we assume that a single server costs five thousand dollars, which would be a relatively entry level server, we are looking at easily twenty five million dollars of equipment!  And that is just enough to do a somewhat small scale test (just five thousand servers) of a rather low cost device.  If we were to talk about enterprise servers we would easily jump to thirty or even fifty thousand dollars per server, taking the cost to as much as a quarter of a billion dollars.
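The arithmetic above can be sketched in a few lines.  The prices and quantities here are the illustrative assumptions from the example – five thousand units at entry level or enterprise prices – not figures from any real study.

```python
def study_hardware_cost(servers: int, unit_price: float) -> float:
    """Total equipment cost for one configuration of one server model."""
    return servers * unit_price

# Illustrative assumptions from the example above:
entry_level = study_hardware_cost(5_000, 5_000)    # 5,000 entry-level servers
enterprise = study_hardware_cost(5_000, 50_000)    # 5,000 high-end enterprise servers

print(f"Entry-level study hardware: ${entry_level:,.0f}")   # $25,000,000
print(f"Enterprise study hardware:  ${enterprise:,.0f}")    # $250,000,000
```

And remember, this is the hardware bill for a single model in a single configuration; every additional model or vendor multiplies it.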

Now that cost, of course, is for testing a single configuration of a single model server.  Presumably for a study to be meaningful we would need many different models of servers.  Perhaps several from each vendor to compare different lines and features.  Perhaps many different vendors.  It is easy to see how quickly the cost of a study becomes impossibly large.

This is just the beginning of the cost, however.  A good study is going to require carefully controlled environments, on par with the best datacenters, to isolate environmental issues as much as possible.  This means highly reliable power, cooling, airflow, humidity control, and vibration and dust control.  Good facilities like this are very expensive, which is why many companies do not pay for them even for valuable production workloads.  In a large study this cost could easily exceed the cost of the equipment itself over the course of the study.

Then, of course, we must address the need for special sensors and testing.  What exactly constitutes a failure?  Even in production systems there is often dispute on this.  Is a hard drive failing in an array a failure, even if the array does not fail?  Is a predictive failure a failure?  If dealing with drive failure in a study, how do you factor in human elements such as drive replacements, which may not be done in a uniform way?  There are ways to handle this, but they add complication and skew the study away from real world data toward contrived data.  Establishing study guidelines that are applicable and useful to end users is much harder than it seems.

And then there is one of the biggest costs: manual labor.  Maintaining an environment for a large study takes human capital that may equal the cost of the hardware itself.  It takes a large number of people to maintain a study environment, run the study, monitor it and collect the data.  All in all, the costs are, quite simply, prohibitive.

Of course we could greatly scale back the test – run only a handful of servers and only two or three models – but the value of the test drops rapidly, and we risk ending up with results that no one can use while still having spent a large sum of money.

The second insurmountable problem is time.  Most things need to be tested for failure rates over time, and as IT equipment is generally designed to work reliably for decades, collecting data on failure rates requires many years.  Mean Time to Failure numbers are only so valuable; Mean Time Between Failures, along with failure types, modes and statistics on those failures, is very important for a study to be useful.  What this means is that for a study to be truly useful it must run for a very long time, creating greater and greater cost.
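A quick sketch shows why the time scale is so punishing.  Assuming (purely for illustration – the MTTF figure here is invented, not from any study) an exponential lifetime model with a true Mean Time to Failure of fifteen years, only a small fraction of units would fail within the span of any realistic study:

```python
import math

# Assumed true MTTF for illustration only; real figures are unknown,
# which is exactly the problem the article describes.
mttf_years = 15.0

for study_years in (1, 3, 5):
    # Exponential lifetime model: P(failure before t) = 1 - exp(-t / MTTF)
    p_fail = 1.0 - math.exp(-study_years / mttf_years)
    print(f"{study_years}-year study: {p_fail:.1%} of units expected to fail")
```

With so few failures per unit, the only way to observe enough of them in a short window is to multiply the number of units – which brings us straight back to the cost problem.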

But that is not the biggest problem.  The far larger issue is that by the time a study had run long enough to generate useful failure numbers, even if those numbers were coming out “live” as they happened, it would already be too late.  The equipment in question would already be aging and nearing replacement time in the production marketplace by the time the study was producing truly useful early results.  Production equipment is often purchased for only a three to five year total lifespan; getting results even one year into that span would have little value.  And new products may replace those in the study even faster than the products age naturally, making the study valuable only in a historical context, with no use in guiding production decisions – the results would be too old to matter by the time they were available.

The final major factor is a lack of incentive to provide existing data to those who need it.  A few sources of data do exist, but nearly all are incomplete and exist so that large vendors can measure their own equipment quality and failure rates.  These measurements are rarely made in controlled environments and often involve data collected from the field.  In many cases this data may even be private to customers and not legally shareable regardless.

But vendors who collect data do not collect it in an even, monitored way, so sharing that data could be very detrimental to them: there is no assurance that comparable data from their competitors would exist.  Uncontrolled statistics like these would offer no true benefit to the market, nor to the vendors who hold them, so vendors are heavily incentivized to keep such data under tight wraps.

The rare exceptions are hardware studies from vendors such as Google and Backblaze, who run large numbers of consumer class hard drives in relatively controlled environments and collect failure rates for their own purposes.  They face little or no risk from competitors leveraging that data, but do gain public relations value from releasing it, and so will occasionally publish a hardware reliability study on a limited scale.  These studies are hungrily devoured by the industry even though they generally contain relatively little actionable value: their data is old, collected under unknown conditions and thresholds, and often not statistically meaningful for product comparison.  At best they capture general industry wide trends that are somewhat useful for predicting future reliability.

Most other companies large enough to have internal reliability statistics have them on a narrow range of equipment and consider that information to be proprietary, a potential risk if divulged (it would give out important details of architectural implementations) and a competitive advantage.  So for these reasons they are not shared.

I have actually been fortunate enough to have been involved in, and run, a large scale storage reliability study that was conducted somewhat informally, but very valuably, on over ten thousand enterprise servers over eight years – eighty thousand server years of observation, a rare opportunity.  What that study primarily showed is that, even on a set so large, we were unable to observe a single failure!  The lack of failures was, itself, very valuable, but it left us unable to produce any standard statistic such as Mean Time to Failure.  To produce the kind of data that people expect, we know that we would have needed hundreds of thousands of server years, at a minimum, to get any kind of statistical significance, and we cannot reliably state that even that would have been enough.  Perhaps millions of server years would have been necessary.  There is no way to truly know.
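Zero observed failures does at least bound the problem.  A standard sketch (assuming a Poisson failure process – a modeling assumption, not something the study itself established) gives a one-sided upper confidence bound on the annual failure rate from failure-free exposure:

```python
import math

def failure_rate_upper_bound(unit_years: float, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on the annual failure rate when
    zero failures were observed over `unit_years` of exposure, assuming
    a Poisson failure process: -ln(1 - confidence) / exposure."""
    return -math.log(1.0 - confidence) / unit_years

# Zero failures over 80,000 server years of observation:
rate = failure_rate_upper_bound(80_000)
print(f"95% upper bound on annual failure rate: {rate:.2e}")      # ~3.74e-05
print(f"Implied MTTF lower bound: {1 / rate:,.0f} server years")  # ~26,705
```

That is a floor, not an estimate: the true MTTF could be vastly higher, which is exactly why no conventional statistic could be produced and why even hundreds of thousands of server years might not have sufficed.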

Where this leaves us is that large scale studies in IT simply do not exist and likely never will.  Where they do appear, they will be isolated and almost certainly crippled by the necessities of reality.  There is no means of monetizing studies on the scale necessary to be useful – mostly because failure rates of enterprise gear are so low while the equipment is so expensive – so third party firms can never cover the cost of providing this research.  As an industry we must accept that this type of data does not exist and actively pursue alternatives to it.  It is surprising that so many people in the field expect this type of data to be available when it never has been historically.

Our only real options, given this vacuum, are to collect what anecdotal evidence exists (a dangerous thing to do, requiring careful consideration of context) and to apply logic to assess reliability approaches and techniques.  This is a broad situation where observation necessarily fails us and only logic and intuition can fill the resulting gap in knowledge.