Feb 3rd, 2011
A question that comes up on a pretty regular basis is whether or not servers should be routinely rebooted, such as once per week, or if they should be allowed to run for as long as possible to achieve maximum “uptime.” To me the answer is simple – with rare exception, regular reboots are the most appropriate choice for servers.
As with any rule, there are cases when it does not apply. For example, some businesses run critical systems that have no allowance for downtime and must be available 24/7. Obviously systems like this cannot simply be rebooted in a routine way. However, if a system is so critical that it can never go down, that should raise a red flag that the system is a point of failure, and it is probably time to start considering how downtime, whether planned or unplanned, would be handled.
Another exception is that some AIX systems need significant uptime, greater than a few weeks, to reach maximum efficiency, as the system is self-tuning and needs time to gather usage information and adjust itself accordingly. This tends to be limited to large, seldom-changing database servers and similar use scenarios that are less common than other platforms.
In IT we often worship the concept of “uptime” – how long a system can run without needing to restart. But “uptime” is not a concept that brings value to the business, and IT needs to keep the business’ needs in mind at all times rather than focusing on artificial metrics. The business is not concerned with how long a server has managed to stay online without rebooting – it cares only that the server is available and ready when needed for business processing. These are very different concepts.
For most any normal business server, there is a window when the server needs to be available for business purposes and a window when it is not needed. These windows may be daily, weekly or monthly but it is a rare server that is actually in use around the clock without exception.
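As a rough illustration of the idea of availability windows, here is a minimal sketch of how a scheduled job might check whether it is currently safe to reboot. The window itself (Sundays, 02:00 to 06:00 local time) and the function name are assumptions chosen for the example, not a recommendation for any particular business.

```python
from datetime import datetime, time

# Hypothetical weekly maintenance window: Sundays, 02:00 to 06:00 local time.
# datetime.weekday() numbers Monday as 0 and Sunday as 6.
MAINTENANCE_WEEKDAY = 6
WINDOW_START = time(2, 0)
WINDOW_END = time(6, 0)


def in_maintenance_window(moment: datetime) -> bool:
    """Return True if `moment` falls inside the weekly maintenance window."""
    return (
        moment.weekday() == MAINTENANCE_WEEKDAY
        and WINDOW_START <= moment.time() < WINDOW_END
    )


if __name__ == "__main__":
    if in_maintenance_window(datetime.now()):
        print("Inside the maintenance window: a reboot is acceptable now.")
    else:
        print("Outside the maintenance window: leave the server alone.")
```

A real schedule would more likely live in cron or a configuration management tool, but the point is the same: the window is defined by when the business does not need the server, not by how long the server has been up.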
I often hear people state that because they run operating system X rather than Y they no longer need to reboot, but this is simply not true. There are two main reasons to reboot on a regular basis: to verify the ability of the server to reboot successfully and to apply patches that cannot be applied without rebooting.
Applying patches is why most businesses reboot. Almost all operating systems receive regular updates that require rebooting in order to take effect. As most patches are released for security and stability purposes, especially those requiring a reboot, the importance of applying them is rather high. Making a server unnecessarily vulnerable just to maintain uptime is not wise.
Testing a server’s capacity to reboot successfully is what is often overlooked. Most servers have changes applied to them on a regular basis. Changes might be patches, new applications, configuration changes, updates or similar. Any change introduces risk. Just because a server is healthy immediately after a change is applied does not mean that the server or the applications running on it will start as expected on reboot.
If the server is never rebooted then we never know if it can reboot successfully. Over time the number of changes applied since the last reboot will increase. This is very dangerous. What we fear is a large number of changes having been made, possibly many of them undocumented, and a reboot then failing. At that point, identifying which change is causing the system to fail could be an insurmountable task. There is no single change to roll back and no known path to recovery. This is when panic sets in. Of course, a box that is never rebooted intentionally is more likely to reboot unintentionally, meaning a failed reboot is both more likely to occur and more likely to occur while the server is in active use.
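To make the "changes since the last reboot" idea concrete, here is a small sketch that filters a change log down to the entries that have never been tested by a reboot. The `Change` record, the `untested_changes` function and the sample entries are all hypothetical; in practice the data might come from a ticketing system or a package manager's history.

```python
from datetime import datetime
from typing import List, NamedTuple


class Change(NamedTuple):
    """One recorded change to a server (hypothetical structure)."""
    applied_at: datetime
    description: str


def untested_changes(changes: List[Change], last_reboot: datetime) -> List[Change]:
    """Return the changes applied after the last reboot, oldest first.

    These are the changes that no reboot has verified yet, so they are
    the candidates to investigate if the next reboot fails.
    """
    pending = [c for c in changes if c.applied_at > last_reboot]
    return sorted(pending, key=lambda c: c.applied_at)


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    log = [
        Change(datetime(2011, 1, 20, 14, 0), "kernel patch"),
        Change(datetime(2011, 2, 1, 9, 30), "new backup agent installed"),
        Change(datetime(2011, 2, 2, 16, 45), "firewall rule change"),
    ]
    last_reboot = datetime(2011, 1, 30, 3, 0)
    for change in untested_changes(log, last_reboot):
        print(change.applied_at.isoformat(), change.description)
```

With a weekly reboot, this list stays short. Without one, it grows indefinitely, and it only covers the changes that were actually documented.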
Regular reboots are not intended to reduce the frequency of failed reboots; in fact, they increase the number of failures that occur. Their purpose is to make those failures easily manageable from a “known change” standpoint and, more importantly, to control when they occur. Reboots are scheduled for a time when the server is designated as available for maintenance and is meant to be stressed, so that problems are found when they can be mitigated without business impact.
I have heard many a system administrator state that they avoid weekend reboots because they do not want to be stuck working on Sundays due to servers failing to come back up after rebooting. I have been paged many a Sunday morning because of a failed reboot myself, but every time I receive that call I feel a sense of relief. I know that we just caught an issue at a time when the business is not impacted financially. Had that server not been restarted during off hours, it might not have been discovered to be “unbootable” until it had failed during active business hours and caused a loss of revenue.
Thanks to regular weekend reboots, we can catch pending disasters safely and, thanks to knowing that we only have one week’s worth of changes to investigate, we are routinely able to fix the problems with generally little effort and great confidence that we understand what changes had been made prior to the failure.
Regular reboots are about protecting the business from outages and downtime that can be mitigated through very simple and reliable processes.