Understanding Technical Debt

From Wikipedia: “Technical debt (also known as design debt or code debt) is ‘a concept in programming that reflects the extra development work that arises when code that is easy to implement in the short run is used instead of applying the best overall solution’.

“Technical debt can be compared to monetary debt.  If technical debt is not repaid, it can accumulate ‘interest’, making it harder to implement changes later on.  Unaddressed technical debt increases software entropy.  Technical debt is not necessarily a bad thing, and sometimes (e.g., as a proof-of-concept) technical debt is required to move projects forward.  On the other hand, some experts claim that the ‘technical debt’ metaphor tends to minimize the impact, which results in insufficient prioritization of the necessary work to correct it.”

The concept of technical debt comes from the software engineering world, but it applies just as much to the world of IT and business infrastructure.  As in software engineering, we design our systems and our networks, and taking shortcuts in those designs, whether working with less than ideal designs, incorporating existing hardware or following other poor design practices, produces technical debt.  One of the more significant forms of this comes from investing in the “past” rather than in the “future” and is quite often triggered by the sunk cost fallacy (a.k.a. throwing good money after bad).

It is easy to see this happening in businesses every day.  New plans are made for the future, but before they are implemented, investments are made in keeping an old system design working, making it work better, expanding it or the like.  This investment then either turns into a nearly immediate financial loss or, more often, becomes an incentive to not invest in the future design as quickly, as thoroughly or, possibly, at all.  The investment in the past can become crippling in the worst cases.

This happens in numerous ways and is generally unintentional.  Often investments are needed to keep an existing system running properly and, under normal conditions, would simply be made.  But in a situation where a future change is needed or potentially planned, this investment can be problematic.  Better cost analysis and triage planning can remedy this in many cases, though.

In a non-technical example, imagine owning an older car that has served well but is due for retirement in three months.  In three months you plan to invest in a new car because the old one is no longer cost effective due to continuous maintenance needs, lower efficiency and so forth.  But before your three month plan to buy a new car comes around, the old car suffers a minor failure and now requires a significant investment to keep it running.  Putting money into the old car would be a new investment in the technical debt.  Rather than spending a large amount of money to make an old car run for a few months, moving up the timetable to buy the new one is obviously drastically more financially sound.  With cars, we see this easily (in most cases).  We save money, potentially a lot of it, by buying the new car sooner.  If we were to invest heavily in the old one, we either lose that investment in a few months or we risk changing the solid financial plan for the purchase of a new car that was already made.  Both cases are bad financially.

IT works the same way.  Spending a large sum of money to maintain an old email system six months before a planned migration to a hosted email system would likely be very foolish.  The investment is either lost nearly immediately when the old system is decommissioned, or it undermines our good planning processes and leads us to not migrate as planned, doing a sub-par job for our businesses because we allowed technical debt, rather than proper planning, to drive our decision making.

Often a poor triage process, or triage authority resting with the wrong people, is the factor that causes emergency technical debt investments rather than rapid, future looking investments.  This is only one area where major improvements may address issues, but it is a major one.  This can also be mitigated, in some cases, through “what if” planning: having investment plans in place contingent on common or expected emergencies that might arise, which may be as simple as capacity expansion needs driven by growth that occur before systems planning comes into play.

Another great example of common technical debt is server storage capacity expansion.  This is a scenario that I see with some frequency and it demonstrates technical debt well.  It is common for a company to purchase servers that lack large internal storage capacity.  Either immediately or sometime down the road more capacity is needed.  If this happens immediately, we can see that the server purchase itself was a form of technical debt arising from improper design and obviously represents a flaw in the planning and purchasing process.

But a more common example is needing to expand storage two or three years after a server has been purchased.  Common expansion choices include adding an external storage array to attach to the server or modifying the server to accept more local storage.  Both of these approaches tend to be large investments in an already old server, a server that is easily forty percent or more of the way through its useful lifespan.  In many cases the same, or only slightly higher, investment in a completely new server can deliver new hardware, faster CPUs, more RAM, the needed storage purpose designed and built, an aligned and refreshed support lifespan, a smaller datacenter footprint, lower power consumption, newer technologies and features, better vendor relationships and more, all while retaining the original server to reuse, retire or resell.  One path spends money supporting the past; the other often spends comparable money on the future.
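To make that comparison concrete, here is a minimal sketch of the amortization math behind it, using entirely hypothetical prices and lifespans:

# A minimal sketch comparing the cost of expanding an aging server's storage
# versus replacing the server outright.  All figures are hypothetical and
# exist only to illustrate the amortization argument.

def annualized_cost(upfront_cost: float, useful_years: float) -> float:
    """Spread an upfront investment over the years it will actually be used."""
    return upfront_cost / useful_years

# Option A: add an external storage array to a three-year-old server that has
# roughly two useful years left before its planned refresh.
expand_cost = 9_000.00          # hypothetical cost of the storage expansion
expand_useful_years = 2         # remaining lifespan of the old server

# Option B: buy a new server with the needed storage built in, giving a
# fresh five-year support lifespan.
replace_cost = 12_000.00        # hypothetical cost of the replacement server
replace_useful_years = 5        # full lifespan of the new server

print(f"Expand old server: ${annualized_cost(expand_cost, expand_useful_years):,.0f}/year")
print(f"Replace server:    ${annualized_cost(replace_cost, replace_useful_years):,.0f}/year")
# Expand old server: $4,500/year
# Replace server:    $2,400/year

Even when the replacement costs more up front, spreading that cost over a full, fresh lifespan often makes it the cheaper option per year of useful service.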

Technical debt is a crippling factor for many businesses.  It increases the cost of IT, sometimes significantly, and can lead to high levels of risk through a lack of planning and most spending being emergency based.


New Hyperconvergence, Old Storage

We all dream of the day that we get to build a new infrastructure from the ground up without any existing technical debt to hold us back.  A greenfield deployment where we pick what is best, roll it out fresh and enjoy.  But most of us live in the real world, where that is not very realistic and where we have to plan for the future while also working with what we already have.

Making do with what we have is a nearly inevitable fact of life in IT, and when moving from an existing architecture to hyperconvergence, storage is no different.  In a great many cases we will face a situation where an existing investment in storage is in place that we do not want to simply discard but that does not necessarily fit neatly into our vision of a hyperconverged future.

There are obvious options to consider, of course, such as returning leased gear, retiring older equipment or selling still useful equipment outright.  These are viable options and should be considered.  Eliminating old gear or equipment that does not fit well into the current plans can be beneficial, as we can simplify our networks, reduce power consumption and possibly even recoup some degree of our investments.

In reality, however, these options are rarely financially viable and we need to make more productive use of our existing technology investments.  Which options are available to us depends, of course, on a range of factors.  But we will look at some examples of how common storage devices can be re-purposed in a new hyperconverged system in order to maintain their utility either until they are ready to retire or even, in some cases, indefinitely.

The easiest re-purposing of existing storage, and this applies equally to both NAS and SAN in most cases, is to designate them as backup or archival targets.  Traditional NAS and SAN devices are excellent backup hardware and are generally usable by nearly any backup mechanism, regardless of approach or vendor.  And because they are generic backup targets, a mixture of backup mechanisms, such as agent based, agentless and custom scripts, can all work against the same target.  Backups so rarely get the attention and investment that they deserve that this is not just the easiest but often the most valuable use of pre-existing storage infrastructure.
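As a simple illustration, here is a minimal sketch of a script that copies nightly backup archives to a re-purposed NAS and prunes old copies.  It assumes the old NAS export is already mounted at a hypothetical path such as /mnt/nas-backup; all paths and the retention window are placeholders.

# A minimal sketch of using a re-purposed NAS as a generic backup target.
# Assumes the NAS export is already mounted at /mnt/nas-backup (NFS or SMB);
# all paths and the retention window are hypothetical.

import shutil
import time
from pathlib import Path

SOURCE_DIR = Path("/var/backups/nightly")     # where local backup jobs drop archives
NAS_TARGET = Path("/mnt/nas-backup/web01")    # per-host folder on the old NAS
RETENTION_DAYS = 30                           # keep a month of copies on the NAS


def copy_new_archives() -> None:
    """Copy any archive not yet present on the NAS share."""
    NAS_TARGET.mkdir(parents=True, exist_ok=True)
    for archive in SOURCE_DIR.glob("*.tar.gz"):
        destination = NAS_TARGET / archive.name
        if not destination.exists():
            shutil.copy2(archive, destination)


def prune_old_archives() -> None:
    """Delete copies on the NAS older than the retention window."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for archive in NAS_TARGET.glob("*.tar.gz"):
        if archive.stat().st_mtime < cutoff:
            archive.unlink()


if __name__ == "__main__":
    copy_new_archives()
    prune_old_archives()

Because the NAS is just a generic file target here, agent based tools, agentless tools and simple scripts like this one can all write to the same device side by side.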

Of course anything that is appropriate for backups can also be used for archival storage.  Archival needs are generally less common (only a percentage of firms need archival storage while all need backups) and are of lower priority, so this is more of an edge reuse case, but still one to consider, especially for organizations that may be working to re-purpose a large number of possibly disparate storage devices.  It is worth noting, however, that moving to hyperconvergence does tend to “flatten” the compute and storage space in a way that can easily create value for lower performance, lower priority archival storage, value that may not have existed, or at least not as obviously, prior to the rearchitecting of the environment.

NAS has the unique advantage of being usable as general purpose network storage, especially for the home directories of end users.  NAS storage can be used in so many places on the network that it is very easy to continue using after moving core architectures.  The most popular case is serving users’ own storage needs, with the NAS connected directly to end user devices, which allows storage capacity, performance and network traffic to be offloaded from the converged infrastructure to the NAS.  It would actually be very rare to remove a NAS from a hyperconverged network, as its potential utility is so high and so apparent.

Both SAN and NAS also have the potential to be attached directly to the virtual machines running on top of a hyperconverged infrastructure.  In this way they can continue to be utilized in a traditional manner until such time as they are no longer needed or appropriate.  While attaching network storage directly to a VM is not often the recommended approach, there are use cases for it, and it allows systems to continue to behave into the future just as they always have in the physical realm.  This is especially useful for mapped drives and user directories via a NAS, much as we mentioned for end user devices, but the cases are certainly not limited to this.

In some cases a SAN can provide much needed functionality for workloads that require shared block storage that is otherwise not available or exposed on a platform.  Workloads on a VM will use the SAN as they always have and not even be aware that they are virtualized or converged.  Of course we can also attach a SAN to a virtualized file server or NAS head running on our hyperconverged infrastructure, if that kind of workload tiering is deemed appropriate as well.
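As a concrete illustration, the sketch below shows a guest VM logging in to an existing SAN LUN over iSCSI using the standard open-iscsi tools.  The portal address and target IQN are hypothetical placeholders, and the commands assume the open-iscsi package is installed in the guest and the script runs with root privileges.

# A minimal sketch of attaching an existing SAN LUN directly to a guest VM
# over iSCSI, so a workload keeps using shared block storage exactly as it
# did on physical hardware.  Portal and target IQN are hypothetical.

import subprocess

PORTAL = "10.0.40.10:3260"                       # hypothetical SAN iSCSI portal
TARGET = "iqn.2001-04.com.example:cluster-lun1"  # hypothetical target IQN


def run(cmd: list[str]) -> None:
    """Run a command and fail loudly if it errors."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# Discover the targets that the SAN exposes at the portal address.
run(["iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", PORTAL])

# Log in to the specific target; the LUN then appears as a local block
# device inside the VM, ready for the workload to use as it always has.
run(["iscsiadm", "-m", "node", "-T", TARGET, "-p", PORTAL, "--login"])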

Working with existing infrastructure while implementing a new one does present a challenge, but one that we can tackle with creativity and a logical approach.  Storage is a nearly endless challenge, and having existing storage to re-purpose may easily end up being exceptionally advantageous.

You Can’t Virtualize That!

We hear this all of the time in IT: a vendor tells us that a system cannot be virtualized.  The reasons are numerous.  On the IT side, we are always shocked that a vendor would make such an outrageous claim, and often we are just as shocked that a customer (or manager) believes them.  Vendors have worked hard to perfect this sales pitch over the years and I think that it is important to dissect it.

The root cause of the problem is that vendors are almost always seeking ways to lower their own costs while increasing profits from customers.  This drives a lot of what would otherwise be seen as odd behaviour.

One thing that many, many vendors attempt to do is limit the scenarios under which their product will be supported.  By doing this, they set themselves up to be able to simply not provide support; support is expensive and unreliable.  This is a common strategy.  In some cases, this is so aggressive that no acceptable production deployment scenario even exists.

A very common means of doing this is to fail to support any operating system that is itself still supported, de facto deprecating the vendor’s own software (for example, today this would mean only supporting Windows XP and earlier).  Another example is only supporting products that are not licensed for the use case (such as requiring that a product like Windows 10 be used as a server).  And one of the most common cases is forbidding virtualization.

These scenarios put customers into difficult positions because on one hand they have industry best practices, standard deployment guidelines, in house tooling and policies to adhere to; and on the other hand they have vendors often forbidding proper system design, planning and management.  These needs are at odds with one another.

Of course, no one expects every vendor to support every potential scenario.  Limits must be applied.  But there is a giant chasm between supporting reasonable, well deployed systems and actively requiring unacceptably bad deployments.  We hope that our vendors will behave as business partners and share a common interest in our success or, at the very least, the success of their product, and not directly seek to undermine them.  We would hope that, at a very minimum, best effort support would be provided for any reasonable deployment scenario and that guaranteed support would likely be offered for properly engineered, best practice scenarios.

Imagine a world where driving the speed limit and wearing a seatbelt would violate your car warranty and that you would only get support if you drove recklessly and unprotected!

Some important things need to be understood about virtualization.  The first is that virtualization is a long standing industry best practice and is expected to be used in any production deployment scenario for services.  Virtualization is in no way new; even in the small business market it has been in the best practice category for well over a decade now, and in the enterprise space for several decades.  We are long past the point where running systems non-virtualized is considered acceptable, and that includes legacy deployments that have been in place for a long time.

There are, of course, always rare exceptions to nearly any rule.  Some systems need access to very special case hardware and virtualization may not be possible, although with modern hardware passthrough this is almost unheard of today.  And some super low latency systems cannot be virtualized, but these are normally limited to the biggest international investment banks and most aggressive hedge funds, and even the majority of those traditional use cases have been eliminated by improvements in virtualization, making those situations rare.  But the bottom line is: if you cannot virtualize, you should be sad that you cannot, and you will know clearly why it is impossible in your situation.  In all other cases, your server needs to be virtual.

Is it not important?

If a vendor does not allow you to follow standard best practices for healthy deployments, what does this say about the vendor’s opinion of their own product?  If we were talking about any other deployment, we would immediately question why we were deploying a system so poorly if we plan to depend on it.  If our vendor forces us to behave this way, we should react in the same manner: if the vendor does not take the product as seriously as we take the least of our IT services, why should we?

This is an “impedance mismatch”, as we say in engineering circles, between our needs (production systems) and how the vendor making that system appears to treat them (hobby or entertainment systems).  If we need to depend on this product for our businesses, we need a vendor that is on board and understands business needs, one with a production mindset.  If the product is not business targeted or business ready, we need to be aware of that.  We need to question why we feel we should be using a service in production, on which we depend and require support, that is not intended to be used in that manner.

Is it supported?  Is it being tested?

Something that is often overlooked from the customer’s perspective is whether or not the necessary support resources for a product are actually in place.  It is not uncommon for the team that supports a product to become lean, or even disappear, while the company keeps selling the product in the hopes of milking it for as much as it can, banking on either muddling through any problem or simply refunding the customer should the vendor be caught in a situation where it is unable to provide support.

Most software contracts state that the maximum damages that can be extracted from the vendor are limited to the cost of the product, the amount spent to purchase it.  In a case such as this, the vendor carries no risk from offering a product that it cannot support, even while charging a premium for support.  If the customer manages to use the product, great: the vendor gets paid.  If the customer cannot, and the vendor cannot support it, the vendor only loses money that it would never have gotten otherwise.  The customer takes on all the risk, not the vendor.

This suggests, of course, that there is little or no continued testing of the product either, and this should be of additional concern.  Just because the product runs does not mean that it will continue to run.  Getting up and running with an unsupported, or worse unsupportable, product means that you are depending more and more on a product whose potential support is likely decreasing, slowly getting worse over time even as the need for support and the dependency on the software would be expected to increase.

If a proprietary product is deployed in production, and the decision is made to forgo best practice deployments in order to accommodate support demands, how does this fit into a decision matrix?  Should we take this to imply that proper support does not exist?  Again, as before, this implies a mismatch with our needs.


Is It Still Being Developed?

If the deployment needs of the software follow old, out of date practices, or require out of date (or not reasonably current) software or designs, then we have to question the likelihood that the product is still being actively developed.  In some cases we can determine this by watching the software release cycle for some time, but not in all cases.  There is a reasonable fear that the product may be dead, with no remaining development team working on it.  The code may simply be old technical debt that is being sold in the hopes of making a last few dollars off of a code base that has been abandoned.  This happens far more often than is generally believed.

Smaller software shops often manage to develop an initial software package and get it on the market and available for sale, but then cannot afford to retain or restaff their development team after the initial release(s).  This is, in fact, a very common scenario.  It leaves customers with a product that can be expected to become less and less viable over time, with deployment scenarios becoming increasingly risky and data increasingly hard to extricate.


How Can It Be Supported If the Platform Is Not Supported?

A common paradox in some of the more extreme situations is software that, in order to qualify as “supported”, requires other software that is either out of support or was never supported for the intended use case.  Common examples of this are requiring that a server system run on top of a desktop operating system, or requiring versions of operating systems, databases or other components that are no longer supported at all.  This last scenario is scarily common.  In a situation like this, one has to ask whether there can ever be a deployment where the software can be considered “supported”.  If part of the stack is always out of support, then the whole stack is unsupported; there would always be a reason that support could be denied, no matter what.  The very reasoning that would demand we avoid best practices would equally rule out choosing the software itself in the first place.

Are Industry Skills and Knowledge Lacking?

Perhaps the issue that we face with software support problems of this nature is that the team(s) creating the software simply do not know how good software is made and/or how good systems are deployed.  This is among the most reasonable and valid explanations for what would drive us to this situation.  But, like the other hypothesized reasons, it leaves us concerned about the quality of the software and about whether support is truly available.  If we cannot trust the vendor to properly handle the most visible parts of the system, why would we turn to them as our experts for the parts that we cannot verify?

The Big Problem

The big, overarching problem with software that demands questionable deployment and maintenance practices in exchange for unlocking otherwise withheld support is not, as we typically assume, a question of overall software quality, but one of viable support and development practices.  That these issues suggest a significant concern for long term support should make us strongly question why we are choosing these packages in the first place while expecting strong support from them when, from the onset, we have very visible and very serious concerns.

There are, of course, cases where no other software products exist to fill a need, or at least none of any more reasonable viability.  This situation should be extremely rare and, where it does exist, should be seen as a major market opportunity for a vendor looking to enter that particular space.

From a business perspective, it is imperative that technical infrastructure best practices not be completely ignored in exchange for blind, or nearly blind, following of vendor requirements that, in any other instance, would be considered reckless or unprofessional.  Why do we so often neglect to require excellence from core products on which our businesses depend in this way?  It puts our businesses at risk, not just from the action itself, but vastly more so from the risks that are implied by the existence of such a requirement.

The Commoditization of Architecture

I often talk about the moving “commodity line”; this line affects essentially all technology, including designs.  Essentially, when any new technology comes out it starts highly proprietary, complex and expensive.  Over time the technology moves towards openness and simplicity, and becomes inexpensive.  At some point any given technology goes so far in that direction that it falls over the “commodity” line, where it moves from being unique and a differentiator to being a commodity, accessible to essentially everyone.

Systems architecture is no different from other technologies in this manner; it is simply a larger, less easily defined topic.  But if we look at systems architecture, especially over the last few decades, we can easily see servers, storage and complete systems moving from the highly proprietary towards the commodity.  Systems were complex and are becoming simple, they were expensive and are becoming inexpensive, they were proprietary and they are becoming open.

Traditionally we dealt with physical operating systems on bare metal hardware.  But virtualization came along and abstracted this.  Virtualization gave us many of the building blocks for systems commoditization.  Virtualization itself commoditized very quickly, and today we have a market flush with free, open and highly capable enterprise hypervisors and toolsets; virtualization was fully commoditized even several years ago.

Storage moved in a similar manner.  First there was independent local storage.  Then the SAN revolution of the 1990s brought us power through storage abstraction and consolidation.  Then the replicated local storage movement moved that complex and expensive abstraction to a more reliable, more open and more simple state.

Now we are witnessing this same movement in the orchestration and management layers of virtualization and storage.  Hyperconvergence is currently taking the majority of systems architectural components and merging them into a cohesive, intelligent singularity that allows for a reduction in the human understanding and labour required while improving system reliability, durability and performance.  The entirety of the systems architecture space is moving, quite rapidly, toward commoditization.  It is not fully commoditized yet, but the shift is very much in motion.

As in any space, it takes a long time for commoditization to permeate the market.  Just because systems have become commoditized does not mean that non-commodity remnants will not remain in use for a long time to come or that niche proprietary (non-commodity) aspects will not linger on.  Today, for example, systems architecture commoditization is largely limited to the SMB market space, as there are effective upper bound limits to hyperconvergence growth that have yet to be tackled; over time they will be.

What we are witnessing today is a movement from complex to simple within the overall architecture space, and we will continue to witness this for several years as the commodity technologies mature, expand, prove themselves and become well known.  The emergence of what we can tell will be the commodity technologies has happened, but the space has not yet commoditized.  It is an interesting moment: we have what appears to be a very clear vision of the future, some scope in which we can realize its benefits today, a majority of systems and thinking that still reside in the legacy, proprietary realm and a mostly clear path forward as an industry, both in technology focus and in education, that will allow us to commoditize more quickly.

Many feel that systems are becoming overly complex, but the opposite is true.  Virtualization, modern storage systems, cloud and hyperconverged orchestration layers are all coming together to commoditize first the individual architectural components and then architectural design as a whole.  The move towards simplicity, openness and effectiveness is happening, is visible and is proceeding at a very healthy pace.  The future of systems architecture is one that will clearly free IT professionals from spending so much time thinking about systems design, letting them spend more time thinking about how to drive competitive advantage for their individual organizations.