
You Can’t Virtualize That!

We hear this all the time in IT: a vendor tells us that a system cannot be virtualized. The reasons given are numerous. On the IT side, we are always shocked that a vendor would make such an outrageous claim, and often we are just as shocked that a customer (or manager) believes them. Vendors have worked hard to perfect this sales pitch over the years and I think that it is important to dissect it.

The root cause of these problems is that vendors are almost always seeking ways to lower their own costs while increasing the profits they extract from customers. This drives a lot of what would otherwise be seen as odd behaviour.

One thing that many, many vendors attempt to do is limit the scenarios under which their product will be supported. By doing this, they position themselves to simply not provide support at all – support is expensive to deliver and difficult to do reliably. This is a common strategy. In some cases it is so aggressive that no acceptable, production-ready deployment scenario exists at all.

A very common means of doing this is to support only operating systems that are themselves no longer supported, de facto deprecating the vendor’s own software (today, for example, this would mean supporting only Windows XP and earlier). Another is supporting only products that are not licensed for the use case (for example, requiring that a desktop product like Windows 10 be used as a server). And one of the most common cases is forbidding virtualization.

These scenarios put customers into a difficult position: on one hand they have industry best practices, standard deployment guidelines, in-house tooling and policies to adhere to; on the other hand they have vendors forbidding proper system design, planning and management. These needs are at odds with one another.

Of course, no one expects every vendor to support every potential scenario; limits must be applied. But there is a giant chasm between supporting reasonable, well-deployed systems and actively requiring unacceptably bad deployments. We hope that our vendors will behave as business partners and share a common interest in our success or, at the very least, in the success of their product, rather than directly undermining both. At a very minimum, we would hope that best-effort support would be provided for any reasonable deployment scenario and that guaranteed support would likely be offered for properly engineered, best practice scenarios.

Imagine a world where driving the speed limit and wearing a seatbelt would void your car’s warranty, and you would only get support if you drove recklessly and unprotected!

Some important things need to be understood about virtualization. The first is that virtualization is a long-standing industry best practice and is expected to be used in any production deployment scenario for services. Virtualization is in no way new; even in the small business market it has been in the best practice category for well over a decade now, and for many decades in the enterprise space. We are long past the point where running systems non-virtualized is considered acceptable, and that includes legacy deployments that have been in place for a long time.

There are, of course, rare exceptions to nearly any rule. Some systems need access to very special case hardware and virtualization may not be possible, although with modern hardware passthrough this is almost unheard of today. And some ultra-low-latency systems cannot be virtualized, but these are normally limited to the biggest international investment banks and the most aggressive hedge funds, and even the majority of those traditional use cases have been eliminated by improvements in virtualization, making such situations rare. The bottom line is: if you cannot virtualize, you should be sad that you cannot, and you will know clearly why it is impossible in your situation. In all other cases, your server needs to be virtual.

Is it not important?

If a vendor does not allow you to follow standard best practices for healthy deployments, what does this say about the vendor’s opinion of their own product? If we were talking about any other deployment, we would immediately question why we were deploying a system so poorly if we plan to depend on it. If our vendor forces us to behave this way, we should react in the same manner – if the vendor does not take their product as seriously as we take the least of our IT services, why should we?

This is an “impedance mismatch”, as we say in engineering circles, between our needs (production systems) and how the vendor making that system appears to treat them (hobby or entertainment systems). If we need to depend on this product for our businesses, we need a vendor that is on board and understands business needs – one with a production mindset. If the product is not business targeted or business ready, we need to be aware of that. We need to question why we feel we should be using a service in production, on which we depend and require support, that is not intended to be used in that manner.

Is it supported?  Is it being tested?

Something that is often overlooked by customers is whether or not the necessary support resources for a product are in place. It is not uncommon for the team that supports a product to become lean, or even disappear, while the company keeps selling the product in the hopes of milking it for as much as it can, banking on either muddling through any problem or simply returning the customer’s funds should the vendor be caught in a situation where it is unable to provide support at all.

Most software contracts state that the maximum damages that can be extracted from the vendor are the cost of the product, or the amount spent to purchase it. In a case such as this, the vendor takes on no risk by offering a product that it cannot support – even if charging a premium for support. If the customer manages to use the product, great: the vendor gets paid. If the customer cannot, and the vendor cannot support it, the vendor only loses money that it would never have gotten otherwise. The customer takes on all of the risk, not the vendor.

This suggests, of course, that there is little or no continuing testing of the product either, and this should be an additional concern. Just because the product runs does not mean that it will continue to run. Getting up and running with an unsupported, or worse, unsupportable, product means depending ever more heavily on software with a steadily decreasing level of potential support, even as the need for support and the dependency on the software would be expected to increase.

If a proprietary product is deployed in production, and the decision is made to forgo best practice deployments in order to accommodate support demands, how can this fit into a decision matrix? Should this imply that proper support does not exist? Again, as before, this implies a mismatch with our needs.


Is It Still Being Developed?

If the deployment needs of the software follow old, out-of-date practices, or require out-of-date (or at least not reasonably current) software or designs, then we have to question the likelihood that the product is currently being developed. In some cases we can determine this by watching the software release cycle for some time, but not in all cases. There is a reasonable fear that the product may be dead, with no remaining development team working on it. The code may simply be old technical debt that is being sold in the hopes of making a last few dollars off of a code base that has been abandoned. This process is actually far more common than is often believed.

Smaller software shops often manage to develop an initial software package and get it on the market and available for sale, but then cannot afford to retain or restaff their development team after the initial release(s). This is, in fact, a very common scenario. It leaves customers with a product that can be expected to become less and less viable over time, with deployment scenarios becoming increasingly risky and data increasingly hard to extricate.


How Can It Be Supported If the Platform Is Not Supported?

A common paradox of some of the more extreme situations is software that, in order to qualify as “supported”, requires other software that is either out of support or was never supported for the intended use case. Common examples of this are requiring that a server system be run on top of a desktop operating system, or requiring versions of operating systems, databases or other components that are no longer supported at all. This last scenario is scarily common. In a situation like this, one has to ask whether there can ever be a deployment where the software can be considered “supported”. If part of the stack is always out of support, then the whole stack is unsupported; there would always be a reason that support could be denied, no matter what. The very justification for demanding that we avoid best practices – remaining supported – would equally rule out choosing the software itself in the first place.

Are Industry Skills and Knowledge Lacking?

Perhaps the issue that we face with software support problems of this nature is that the team(s) creating the software simply do not know how good software is made and/or how good systems are deployed. This is among the most reasonable and understandable explanations for what would drive us to this situation. But, like the other hypotheses, it leaves us concerned about the quality of the software and about whether support is truly available. If we cannot trust the vendor to properly handle the most visible parts of the system, why would we turn to them as our experts for the parts that we cannot verify?

The Big Problem

The big, overarching problem with software that demands questionable deployment and maintenance practices in exchange for unlocking otherwise withheld support is not, as we typically assume, a question of overall software quality, but one of viable support and development practices. That these issues suggest a significant concern for long term support should make us strongly question why we are choosing these packages in the first place while expecting strong support from them when, from the outset, we have very visible and very serious concerns.

There are, of course, cases where no other software products exist to fill a need, or none with any more reasonable viability. This situation should be extremely rare and, where it exists, should be seen as a major market opportunity for a vendor looking to enter that particular space.

From a business perspective, it is imperative that technical infrastructure best practices not be completely ignored in exchange for blind, or nearly blind, adherence to vendor requirements that, in any other instance, would be considered reckless or unprofessional. Why do we so often neglect to require excellence from core products on which our businesses depend in this way? It puts our businesses at risk, not just from the action itself, but far more so from the risks implied by the existence of such a requirement.

The End of the GUI Era

We should set the stage by looking at some historical context around GUIs and their role within the world of systems administration.

In the “olden days” we did not have graphical user interfaces on any computers at all, let alone on our servers. Long after GUIs began to become popular on end user equipment, servers still did not have them. In the 1980s and 1990s the computational overhead necessary to produce a GUI was significant relative to the total computing capacity of a machine, and spending what little there was on producing a GUI was rather impractical, if not outright impossible. The world of systems administration grew up in this context, working from command lines because there was no other option available to us. It was not common for people to desire GUIs for systems administration, perhaps because the idea had not yet occurred to them.

In the mid-1990s Microsoft, along with some others, began to introduce the idea of GUI-driven systems administration for the entry level server market. At first the approach was not that popular, as it did not match how experienced administrators were working at the time. But slowly, as new Windows administrators, and to some degree Novell NetWare administrators, began to “grow up” with access to GUI-based administration tools, an accepted place in the server market emerged for these systems. In the mid to late 1990s UNIX and other non-Windows servers completely dominated the market. Even VMS was still a major player, and on the small business and commodity server side Novell NetWare was the dominant player mid-decade and still a very serious contender late in the decade. NetWare offered a GUI experience, but one that was very light and should probably be considered only “semi-GUI” in comparison to the rich GUI experience Windows NT offered by at least 1996, and to some degree earlier with the NT 3.x family, although Windows NT was only just finding its place in the world before NT 4’s release.

Even then, the GUI-driven administration market remained primarily a backwater. Microsoft and Windows still had no major place on the server side but were beginning to make inroads via the small business market, where their low cost and easy to use products made a lot of sense. It was not until the late 1990s panic and market expansion – brought on by the combination of the Y2K scare, the dotcom market bubble and excellent product development and marketing by Microsoft – that a significant growth of, and shift to, a GUI-driven administration market occurred.

The massive expansion of the IT market in the late 1990s meant that there was not enough time or resources to train the new people entering IT. The learning curve for many systems, including Solaris and NetWare, was very steep, and the industry needed a truly epic number of people to go from zero to “competent IT professional” faster than was possible with the existing platforms of the day. The market growth was explosive and there was so much money to be made working in IT that there were no resources available to effectively train the newcomers; anyone qualified to handle educational duties could earn far more working in the industry than working in education. As the market grew, the value of mature, experienced professionals became extremely high, as they were more and more rare in the ever expanding field as a whole.

The market responded to this need in many ways, but one of the biggest was to fundamentally change how IT was approached. Instead of pushing IT professionals to overcome the traditional learning curves and develop the skills needed to effectively manage the systems on the market at the time, the market changed the tools being used to accommodate less experienced and less knowledgeable IT staff. Simpler and often more expensive tools, often with GUI interfaces, began to flood the market, allowing those with less training and experience to be useful and productive almost immediately, even without ever having seen a product previously.

This change coincided with the natural advancement of computer hardware performance. It was during this era that, for the first time, the power of many systems was such that, while the GUI still had a rather significant impact on performance, the lower cost of support staff and the speed at which systems could be deployed and managed generally offset the computing capacity consumed by the GUI. The GUI rapidly became a standard addition to systems that just a few years before would never have seen one.

To improve the capabilities of these new IT professionals and to rush them into the marketplace, the industry also shifted heavily towards certifications, more or less an innovation at the time, which allowed new IT pros, often with no hands-on experience of any kind, to establish some degree of competence, and to do so commonly without needing the significant interaction or investment from existing IT professionals that university programs would require. Both the GUI-based administration market and the certification industry boomed, and the face of IT changed significantly.

The result was certainly a flood of new, untrained or lightly trained IT professionals entering the market at a record pace. In the short term this change worked for the industry. The field went from dramatically understaffed to relatively well staffed years faster than it otherwise could have. But it did not take long before the penalties for this rapid intake of new people began to appear.

One of the biggest impacts on the industry was an industry-wide “baby boom”, with all of the growing pains that that entails. An entire generation of IT professionals grew up in the boot camps and rapid “certification training” programs of the late 1990s. The long term effect was that the rules of thumb and general approaches common in that era became codified to the point of near religious belief, in a way that earlier, as well as later, approaches did not. Often, because education was done quickly and shallowly, many concepts had to be learned by rote without an understanding of the fundamentals behind them. As the “Class of 1998” grew into the senior IT professionals of their companies over time, they became the mentors of new generations, and that old rote learning has very visibly trickled down through similar approaches in the years since, long after the knowledge became outdated or impractical – and in many cases it has been interpreted incorrectly and is wrong in predictable ways even for the era from which it sprang.

Part of the learning of that era was a general acceptance that GUIs were not just acceptable but practical and expected. The baby boom effect meant that there was little mentorship from the former era, and previously established practices and norms were often swept away; the industry did not so much reinvent itself as simply invent itself. Even the concept of Information Technology as a specific industry unto itself took its current form, and took hold in the public consciousness, during this changing of the guard. Instead of being a vestige of other departments or disciplines, IT came into its own; but it did so without the maturing and continuity of practices that would have come with more organic growth, leaving the industry in possibly a worse position than it might have occupied had it developed in a continuous fashion.

The lingering impact of the late 1990s IT boom will be felt for a very long time, as it will take many generations for the trends, beliefs and assumptions of that period to finally be swept away. Slowly, new concepts and approaches are taking hold, often only when old technologies disappear and new ones are introduced, breaking the stranglehold of tradition. One of these is the notion of the GUI as the dominant method by which systems administration is accomplished.

As we pointed out before, the GUI at its inception was a point of differentiation between old systems and the new world of the late 1990s. But since that time GUI administration tools have become ubiquitous. Every significant platform has, and has long had, graphical administration options, so the GUI no longer sets any platform apart in a significant way. This means that there is no longer any vendor with a clear agenda driving them to push the concept of the GUI; the marketing value of the GUI is effectively gone. Likewise, not only did systems that previously lacked a strong GUI nearly all develop one (or more), but the GUI-based systems that did not have strong command line tools went back and developed those as well, and developed new professional ecosystems around them. The tide has most certainly turned.

Furthermore, over the past nearly two decades the rhetoric of the non-GUI world has begun to take hold. System administrators working from a position of a mastery of the command line, on any platform, generally outperform their counterparts leading to more career opportunities, more challenging roles and higher incomes. Companies focused on command line administration find themselves with more skilled workers and a higher administration density which, in turn, lowers overall cost.

This alone was enough to make the position of the GUI begin to falter. But there was always the old argument that GUIs, even in the late 1990s, used only a small amount of system resources and added only a very small amount of additional attack surface. Even if they were not going to be used, why not have them installed “just in case”? As CPUs got faster, memory got larger, storage got cheaper and system design improved, the impact of the GUI became less and less, so this argument for having GUIs available got stronger. Especially strong was the proposal that GUIs allowed junior staff to do tasks as well, making them more useful. But it was far too common for senior staff to retain the GUI as a crutch in these circumstances.

With the advent of virtualization in the commodity server space, this all began to change. The cost of a GUI suddenly became noticeable again. A host running twenty virtual machines, each with its own GUI, uses twenty times the CPU, memory and storage capacity that a single GUI instance consumes. The footprint of the GUI was noticeable again, and as virtual machine densities began to climb, so did the relative impact of the GUI.
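
To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch in Python. The per-instance overhead figures are illustrative assumptions only (roughly the difference between a minimal install and a full desktop-style install), not measurements of any particular operating system.

```python
# Back-of-the-envelope sketch: aggregate cost of per-VM GUIs on one host.
# The per-instance overhead figures below are assumptions for illustration,
# not measurements of any specific product.

GUI_RAM_GB_PER_VM = 0.8    # assumed extra memory a GUI layer consumes per instance
GUI_DISK_GB_PER_VM = 4.0   # assumed extra storage a GUI layer consumes per instance
VM_COUNT = 20              # virtual machines consolidated onto a single host

total_ram_gb = GUI_RAM_GB_PER_VM * VM_COUNT
total_disk_gb = GUI_DISK_GB_PER_VM * VM_COUNT

print(f"Extra RAM used by GUIs across {VM_COUNT} VMs:  {total_ram_gb:.1f} GB")
print(f"Extra disk used by GUIs across {VM_COUNT} VMs: {total_disk_gb:.1f} GB")
```

Whatever the exact per-instance numbers, the overhead scales linearly with density, which is why consolidation made a cost that had been easy to ignore suddenly visible again.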

Virtualization gave rise to cloud computing. Cloud computing increased virtual machine deployment densities and exposed other performance impacts of GUIs, mostly in terms of longer instance build times and more complex remote console access. Systems requiring a GUI began to noticeably lag behind their GUI-less counterparts in adoption and capabilities.

But the far bigger factor was an artifact of cloud computing’s standard billing methodologies. Because cloud computing typically exposes per-instance costs in a raw, fully visible way, IT departments had no means of fudging or overlooking the costs of GUI deployments, whose additional overhead could even double the cost of a single cloud instance. Accounting could very clearly see bills for GUI systems costing far more than their GUI-less counterparts. Even non-technical teams could see the cost of GUIs adding up, before the cost of management was even considered.

This cost continues to increase as we move towards container technologies, where the scale of individual instances becomes smaller and smaller, meaning that the relative overhead of the GUI becomes even more significant.

But the real impact, possibly the biggest exposure of the issues around GUI-driven systems, is the industry’s move towards DevOps system automation models. Today only a relatively small percentage of companies are actively moving to a fully cloud-enabled, elastically scalable DevOps model of system management, but the trend is there, and the model leaves GUI administrators and their systems completely behind. With DevOps models, direct access to machines is no longer a standard mode of management, and systems have gone even further than being managed solely from the command line to being built completely in code, meaning that system administrators in the DevOps world must not only interact with their systems at a command line but must do so programmatically.

The market is rapidly moving towards fewer, more highly skilled systems administrators working with many, many more servers “per admin” than in any previous era. The idea that a single systems administrator can only manage a few dozen servers, a common belief in the GUI world, has long been challenged even in traditional “snowflake” command line systems administration, with numbers easily climbing into the few hundred range. The DevOps model and similar automation models take those numbers into the thousands of servers per administrator. The overhead of GUIs is becoming more and more obvious.

As new technologies like cloud, containers and DevOps automation models become pervasive so does the natural “sprawl” of workloads. This means that companies of all sizes are seeing an increase in the numbers of workloads that need to be managed. Companies that traditionally had just two or three servers today may have ten or twenty virtual instances! The number of companies that need only one or two virtual machines is dwindling.

This all hardly means that GUI administration is going to go away in the near, or even the distant, future. The need for “one off” systems administration will remain. But the ratio of administrators able to work in a GUI administration “one off” mode versus those that need to work through the command line and specifically through scripted or even fully automated systems (a la Puppet, Chef, Ansible) is already tipping incredibly rapidly towards non-GUI system administration and DevOps practices.
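
As a simple illustration of the gap between “one off” GUI administration and scripted management, here is a minimal sketch that runs the same command across a list of servers over SSH. It is only a stand-in for what tools like Puppet, Chef or Ansible do at far greater scale; the host names are placeholders and key-based SSH access is assumed to already be configured.

```python
#!/usr/bin/env python3
"""Minimal sketch of scripted, non-GUI administration: run one command on many
hosts over SSH. A stand-in for what Puppet, Chef or Ansible do at real scale.
Host names are placeholders; key-based SSH access is assumed to be in place."""

import subprocess

HOSTS = ["web01.example.com", "web02.example.com", "db01.example.com"]  # placeholders
COMMAND = "uptime"  # any check or change you want applied to every host


def run_on_host(host: str, command: str) -> str:
    """Run a single command on a remote host via ssh and return its output."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, command],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    for host in HOSTS:
        print(f"{host}: {run_on_host(host, COMMAND)}")
```

The same loop scales from three hosts to three thousand with no additional human effort, which is exactly the economics that machine-by-machine GUI administration cannot match.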

What does all of this mean for us in the trenches of the real world? It means that even roles, such as small business Windows administration, that traditionally have had little or no need to work at the command line need to reconsider their dependence on the local server GUI. Command line tools and processes are becoming increasingly powerful, increasingly well known and, increasingly, how we are expected to work. In the UNIX world the command line has always remained, and relying on GUI tools would almost always be seen as a major handicap. This same impression is beginning to apply to the Windows world as well. Slowly, those who rely exclusively on GUI tools are being seen as second class citizens and are increasingly relegated to more junior roles and smaller organizations.

The improvement in scripting and automation tools also means that the value of scale is growing; the cost of administering a small number of servers is becoming very high on a per-workload basis. This heavily encourages smaller companies to look towards management consolidation through outside vendors who specialize in large scale systems management and leverage scripting and automation techniques to bring their costs more in line with those of larger businesses. The ability to use outside vendors to establish scale, or an approximation of it, will be very important over time for smaller businesses to remain cost competitive in their IT needs while still getting the same style of computing advantages that larger businesses are beginning to experience today.

It should be noted that happening in tandem with this industry shift towards the command line and automation tools is a move to more modern, powerful and principally remote GUIs. This is a far less dramatic shift, but one that should not be overlooked. Tools like Microsoft’s RSAT and Server Administrator provide a GUI view that leverages command line and API interfaces under the hood. Likewise, Canonical’s Ubuntu world now has Landscape. These tools are less popular in the enterprise, but they are beginning to enable the larger SMB market to maintain a GUI dependency while managing a larger set of server instances. The advancement of these types of GUI tools may be the strongest force slowing the adoption of command line tools across the board.

Whether we view the move from the command line, to GUIs, and back to the command line as an interesting artifact of the history of Information Technology as an industry, or as a means of understanding how systems administration is evolving as a career path or business approach, it is good for us to appreciate the factors that caused it and why the ebb and flow of the industry is now taking us back out to the sea of the command line once again. By understanding these forces we can more practically assess where the future will take us, when the tide may change again, and how best to approach our own careers or choose both technology and human talent for our organizations.

Avoiding Local Service Providers

Inflammatory article titles aside, choosing a technology service provider based, in whole or in part, on the fact that they are located geographically near to you is almost always a very bad idea. Knowledge-based services are difficult enough to find at all, let alone with the best potential skills, experience and price, without introducing artificial and unnecessary constraints that limit the field of potential candidates.

With the rare exception of major global market cities like New York City and London, it is nearly impossible to find a full range of Information Technology skills in a single locality, at least not in conjunction with a great degree of experience and breadth. This is true of nearly all highly technical industries – expertise tends to concentrate around a handful of localities around the world, and the remaining skills are scattered in a rather unpredictable manner, often because the people in the highest demand can command the salary and location they desire and live where they want to, not where they have to.

IT, more than nearly any other field, has little value in being geographically near to the business that it is supporting.  Enterprise IT departments, even when located locally to their associated businesses and working in an office on premises are often kept isolated in different buildings away from both the businesses that they are supporting and the physical systems on which they work.  It is actually very rare that enterprise server admins would physically ever see their servers or network admins see their switches and routers.  This becomes even less likely when we start talking about roles like database administrators, software developers and others who have even less association with devices that have any physical component.

Adding a locality requirement when looking for consulting talent (and in many cases even internal IT staff) introduces an artificial constraint that eliminates nearly the entire possible field of talented people, while encouraging people to work on site even for work where it makes no sense. Working on site often causes a large increase in cost and a loss of productivity due to interruptions, lack of resources, a poor work environment, travel and the like. Working with exclusively or predominantly remote resources encourages a healthy investment in efficient working conditions that generally pays off very well. But it is important to keep in mind that just because a service company is remote does not mean that the work they do will be remote; in many cases that will make sense, but in others it will not.

Location-agnostic workers have many advantages. By not being tied to a specific location you gain far more flexibility in skill level (allowing you to pursue the absolute best people), in cost (allowing you to hire people living in low cost areas), in incentives (offering flexibility itself as a perk) and in access to broader skill sets, larger staff and so on. Choosing purely local services simply limits you in many ways.

Companies that are not based locally are not necessarily unable to provide local resources.  Many companies work with local resources, either local companies or individuals, to allow them to have a local presence.  In many cases this is simply what we call local “hands” and is analogous to how most enterprises work internally with centrally or remotely based IT staff and physical “hands” existing only at locations with physical equipment to be serviced.  In cases where specific expertise needs to be located with physical equipment or people it is common for companies to either staff locally in cases where the resource is needed on a very regular basis or to have specific resources travel to the location when needed.  These techniques are generally far more effective than attempting to hire firms with the needed staff already coincidentally located in the best location.  This can easily be more cost effective than working with a full staff that is already local.

As time marches forward, needs change as well. Companies that work local-only can find themselves facing new challenges when they expand to include other regions or locations. Do they choose vendors and partners only where they were originally located? Or where they are moving or expanding to? Do they choose local vendors for each location separately? The idea of working only with local resources is nearly exclusive to the smallest of businesses; typically, as businesses grow, the concept of local begins to change in interesting ways.

Locality and jurisdiction may represent different things. In many cases it may be necessary to work with businesses located in the same state or country as your business for legal or financial logistical reasons, and this can often make sense. Small companies especially may not be prepared to tackle the complexities of working with a foreign firm. Larger companies may find even these boundaries worth ignoring. But the idea that location should be ignored should not be taken to mean that jurisdiction, by extension, should also be ignored. Jurisdiction still plays a significant role – one that some IT service providers or other vendors may be able to navigate on your behalf, allowing you to focus on working with a vendor within your jurisdiction while getting the benefits of support from another jurisdiction.

As with many artificial constraints, not only do we generally eliminate the most ideal vendor candidates, but we also risk “informing” the remaining candidate pool that we care more about locality than about quality of service or other important factors. This can lead to a situation where the vendor, especially in a smaller market, feels that they have a lock on you as the customer and does not need to perform to a market standard level, price competitively (as there is no true competition given the constraints), or worse. A vendor who feels that they have a trapped customer is unlikely to perform as a good vendor long term.

Of course we do not want to avoid companies simply because they are local to our own businesses, but we should not be giving them undue preference for this reason either. Some work has advantages to being done in person; there is no denying this. But we must be careful not to extend this to rules and needs that do not have this advantage, nor should we confuse the location of a vendor with the location(s) where they do, or are willing to do, business.

In extreme cases, all IT work can, in theory, be done completely remotely, with only bench work (the physical “remote hands” aspects of IT) needing an on-premises presence. This is extreme, and of course there are reasons to have IT on site. Working with a vendor to determine how service can best be provided, whether locally, remotely or through a combination of the two, can be very beneficial.

In a broader context, the most important concept here is to avoid adding artificial or unnecessary constraints to the vendor selection process. Assuming that a local vendor will be able or willing to deliver value that a non-local vendor cannot or will not is just one way that we might bring assumptions or prejudice to a process such as this. There is every possibility that the local company will do the best possible job and be the best, most viable vendor long term – but the chances are far higher that you will find the right partner for your business elsewhere. It is a big world, and in IT, more than nearly any other field, it is becoming a large, flat playing field.

The Jurassic Park Effect

“If I may… Um, I’ll tell you the problem with the scientific power that you’re using here, it didn’t require any discipline to attain it. You read what others had done and you took the next step. You didn’t earn the knowledge for yourselves, so you don’t take any responsibility for it. You stood on the shoulders of geniuses to accomplish something as fast as you could, and before you even knew what you had, you patented it, and packaged it, and slapped it on a plastic lunchbox, and now …” – Dr. Ian Malcolm, Jurassic Park

When looking at building a storage server or NAS, there is a common feeling that what is needed is a “NAS operating system.” This is an odd reaction, I find, since the term NAS means nothing more than a “file server with a dedicated storage interface” – or, in other words, just a file server with limited exposed functionality. The reason that we choose physical NAS appliances is for the integrated support and sometimes for special, proprietary functionality (NetApp being a key example, offering extensive SMB and NFS integration and some really unique RAID and filesystem options, or Exablox, offering fully managed scale-out file storage and RAIN style protection). Using a NAS to replace a traditional file server is, for the most part, a fairly recent phenomenon, and one that I have found is often driven by misconception or the impression that managing a file server, one of the most basic IT workloads, is special or hard. File servers are generally considered the most basic form of server – traditionally what people meant when using the term “server” unless additional description was added – and the only form commonly integrated into the desktop (every Mac, Windows and Linux desktop can function as a file server and it is very common to do so).

There is, of course, nothing wrong with turning to a NAS instead of a traditional file server to meet your storage needs, especially as some modern NAS options, like Exablox, offer scale-out and storage options that are not available in most operating systems. But it appears that the trend to use a NAS instead of a file server has led to some odd behaviour when IT professionals turn back to considering file servers again. It is a cascading effect, I suspect, in which the reasons why a NAS is sometimes preferred, and the goal-level thinking behind them, are lost while the resulting idea of “I should have a NAS” remains; so when returning to look at file server options there is a drive to “have a NAS” regardless of whether there is a logical reason for feeling that this is necessary.

First we must consider that the general concept of a NAS is a simple one: take a traditional file server, simplify it by removing options, and package it with all of the necessary hardware to make a simplified appliance with all of the support included, from the interface down to the spinning drives and everything in between. Storage can be tricky when users need to determine RAID levels, choose drive types, monitor effectively, and so on; a NAS addresses this by integrating the hardware into the platform. This makes things simple but can add risk, as you have fewer support options and less ability to fix or replace things yourself. A move from a file server to a NAS appliance is almost exclusively about support and is generally a very strong commitment to a single vendor. You choose the NAS approach because you want to rely on one vendor for everything.

When we move to a file server we go in the opposite direction.  A file server is a traditional enterprise server like any other.  You buy your server hardware from one vendor (HP, Dell, IBM, etc.) and your operating system from another (Microsoft, Red Hat, Suse, etc.)  You specify the parts and the configuration that you need and you have the most common computing model for all of IT.  With this model you generally are using standard, commodity parts allowing you to easily migrate between hardware vendors and between software vendors. You have “vendor redundancy” options and generally everything is done using open, standard protocols.  You get great flexibility and can manage and monitor your file server just like any other member of your server fleet, including keeping it completely virtualized.  You give up the vertical integration of the NAS in exchange for horizontal flexibility and standardization.

What is odd, therefore, is returning to the commodity model but seeking what is colloquially known as a “NAS OS.” Common examples of these include NAS4Free, FreeNAS and OpenFiler. This category of products is generally nothing more than a standard operating system (often FreeBSD because it has ideal licensing, or Linux because it is well known) with a “storage interface” put onto it, and no special or additional functionality that would not exist with the normal operating system. In theory they are a “single function” operating system that does only one thing. But this is not the reality: they are general purpose operating systems with an extra GUI management layer added on top. One could say the same thing about most physical NAS products themselves, but those typically include custom engineering even at the storage level, special features and, most importantly, an integrated support stack and true isolation from the “generalness” of the underlying OS. A “NAS OS” is not a simpler version of a general purpose OS; it is a more complex, yet less functional, version of it.

What is additionally odd is that general OSes, with rare exception, already come with very simple, extremely well known and fully supported storage interfaces. Nearly every variety of Windows or Linux server, for example, has included simple graphical interfaces for these functions for a very long time. These included GUIs are often shunned by system administrators as too “heavy and unnecessary” for a simple file server. So it is even more unusual that adding a third party GUI – one that is not patched and tested by the OS team and is not commonly known and supported – would then be desired, as this goes against the common ideals and practices of running a server.

And this is where the Jurassic Park effect comes in: the OS vendors (Red Hat, Microsoft, Oracle, FreeBSD, Suse, Canonical, et al.) are giants with amazing engineering teams, code review, testing, oversight and enterprise support ecosystems. The “NAS OS” vendors, by contrast, are generally very small companies, some with just one part-time person, who stand on the shoulders of these giants and build something that they knew they could, without ever stopping to ask whether they should. The resulting products are a net negative compared to their pure OS counterparts: they do not make systems management easier, nor do they fill a gap in the market’s service offerings. Solid, reliable, easy to use storage is already available; more vendors are not needed to fill this place in the market.

The logic often applied to a NAS OS is that they are “easy to set up.” This may or may not be true, as “easy” here must be a relative term. For there to be any value, a NAS OS has to be easy in comparison to the standard version of the same operating system. So in the case of FreeNAS, this would mean FreeBSD: FreeNAS would need to be appreciably easier to set up than FreeBSD for the same, dedicated functions. And this is easily true; setting up a NAS OS is generally pretty easy. But this ease is deceptive, and it is something of which IT professionals need to be quite wary. Making something easy to set up is not the priority in IT; making something that is easy to operate and repair when there are problems is what is important. Easy to set up is nice, but if it comes at the cost of not understanding how the system is configured and makes operational repairs more difficult, it is a very, very bad thing. NAS OS products routinely make it dangerously easy to put a product into production in a storage role – almost always the most critical, or nearly the most critical, role of any server in an environment – that IT has no experience or likely skill to maintain, operate or, most importantly, fix when something goes wrong. We need exactly the opposite: a system that is easy to operate and fix. That is what matters. So we have a second case of “standing on the shoulders of giants” and building a system that we knew we could, without knowing whether we should.

What exacerbates this problem is that the very people who feel the need to turn to a NAS OS to “make storage easy” are, by the very nature of the NAS OS, the exact people for whom operational support and repair of the system are most difficult. System administrators who are comfortable with the underlying OS would naturally not see a NAS OS as a benefit and would, for the most part, avoid it. It is uniquely the people for whom running a not fully understood storage platform is most dangerous who are most likely to attempt it. And, of course, most NAS OS vendors earn their money, as we could predict, on post-installation support calls from customers who deployed, got stuck once they were in production, and are now at the mercy of the vendor’s exorbitant support pricing. It is in the interest of the vendors to make the product easy to install and hard to fix. Everything is working against the IT pro here.

If we take a common example and look at FreeNAS, we can see how this is a poor alignment of “difficulties.” FreeNAS is FreeBSD with an additional interface on top. Anything that FreeNAS can do, FreeBSD can do; there is no loss of functionality by going to FreeBSD. When something fails, in either case, the system administrator must have a good working knowledge of FreeBSD in order to effect repairs. There is no escaping this. FreeBSD knowledge is common in the industry and getting outside help is relatively easy. Using FreeNAS adds several complications, the biggest being that any and all customizations made by the FreeNAS GUI are special knowledge needed for troubleshooting, on top of the knowledge already needed to operate FreeBSD. So this is a larger knowledge set as well as more things that can fail. It is also a relatively uncommon knowledge set, as FreeNAS is a niche storage product from a small vendor while FreeBSD is a major enterprise IT platform (plus, all use of FreeNAS is FreeBSD use, but only a tiny percentage of FreeBSD use is FreeNAS). So we can see that using a NAS OS just adds risk, over and over again.

This same issue carries over into the communities that grow up around these products. If you look to the communities around FreeBSD, Linux or Windows for guidance and assistance, you deal with large numbers of IT professionals, skilled system admins and people with business and enterprise experience. Of course hobbyists, the uninformed and others participate too, but these are the enterprise IT platforms and all the knowledge of the industry is available to you when implementing them. Compare this to the community of a NAS OS. By its very nature, only people struggling with the administration of a standard operating system and/or storage basics would look at a NAS OS package, so this naturally filters the membership of those communities down to exactly the people from whom we would be best to avoid taking advice. This creates an isolated culture of misinformation and misunderstandings around storage and storage products. Myths abound, guidance often becomes reckless and dangerous, and industry best practices are ignored as if decades of accumulated experience had never happened.

A NAS OS also commonly introduces lags in patching and updates. A NAS OS will almost always, and almost necessarily, trail its parent OS on security and stability updates, and will very often follow months or years behind on major features. In one very well known case, OpenFiler, the product was built on an upstream non-enterprise base (rPath Linux) which lacked community and vendor support, failed and was abandoned, leaving downstream users, including everyone on OpenFiler, without the ecosystem needed to support them. Using a NAS OS means trusting not just the large, well known enterprise vendor that makes the base OS, but the NAS OS vendor as well – and the NAS OS vendor is orders of magnitude more likely to fail, even when basing its products on enterprise class base OSes.

Storage is a critical function and should not be treated carelessly, as if its criticality did not exist. NAS OSes tempt us to install quickly and forget, hoping that nothing ever goes wrong or that we can move on to other roles or companies entirely before bad things happen. They set us up for failure where failure is most impactful. When a typical application server fails, we can always copy the files off of its storage and start fresh. When storage fails, data is lost and systems go down.

“John Hammond: All major theme parks have delays. When they opened Disneyland in 1956, nothing worked!

Dr. Ian Malcolm: Yeah, but, John, if The Pirates of the Caribbean breaks down, the pirates don’t eat the tourists.”

When storage fails, businesses fail. Taking the easy route to setting up storage, ignoring the long term support needs and seeking advice from communities that have filtered out the experienced storage and systems engineers increases risk dramatically. Sadly, the nature of a NAS OS is that the very reason people turn to it (a lack of the deep technical knowledge needed to build these systems) is the very reason they must avoid it (an even greater need for support). The people for whom NAS OSes are effectively safe to use, those with very deep and broad storage and systems knowledge, would rarely consider these products because, for them, they offer no benefits.

At the end of the day, while the concept of a NAS OS sounds wonderful, it is not a panacea. The value of a NAS does not carry over from the physical appliance world to the installed OS world, and the value of standard OSes is far too great for NAS OSes to add real value.

“Dr. Alan Grant: Hammond, after some consideration, I’ve decided, not to endorse your park.

John Hammond: So have I.”