It’s interesting to look back at our industry and our own customer experiences over the past 5 years and analyze what really causes downtime in our customer environments. First, we have to agree on what the term “downtime” means. An end user who launches Outlook but cannot connect to the mail server is suffering “downtime”, but it is restricted to one user and one application. For our purposes, we think of downtime as any condition that causes a service outage (mail, internet, line-of-business application) affecting more than a single user for a period of at least 5 minutes.
The first, and generally the most obvious, cause of downtime is equipment failure. Interestingly, our experience has been that over the last 5 years, core computing devices (PCs, laptops, servers) have become much more reliable and rarely, if ever, fail. Hard drives are still the most likely culprit, but machines often become obsolete before they ever suffer a hardware failure. Based on our definition, a single PC failure cannot create a down condition, as it would only affect one user. So really, this new level of hardware reliability matters mostly for servers. On the other hand, core networking devices (switches, firewalls, wireless access points), which definitely have the scope to affect the entire network, seem to be more of a problem. Networking device failures almost always affect more than one user, and generally cause a complete loss of connectivity. So, one has to ask why devices that by design can cause big headaches when they fail are seemingly less reliable. First, let’s consider the devices themselves and the markets they serve. With the expansion of wireless and internet access into almost every home in America, networking devices tend to fall into two different categories:
- Inexpensive (consumer grade) – Designed to be very price competitive
- Expensive (enterprise grade) – Designed to be very reliable and manageable
Within the small and medium business space, which is generally very price conscious, there is a great temptation to purchase less expensive consumer-grade networking equipment, as it seems to have the same basic features as the more expensive enterprise-grade gear. In our experience, this could not be further from the truth. Consumer-grade gear is definitely less reliable, has no built-in tools for diagnosing performance or reliability issues, and, most importantly, cannot be monitored. You might think the ability to monitor a network device is not that important, and when the device fails outright, it isn’t. But lots of “reliability issues” revolve around slow performance and/or intermittent connectivity problems, and without the ability to monitor the device, diagnosing those problems becomes nearly impossible. When a device is monitored, we collect lots of performance data. That data gives us a baseline for determining whether the device is behaving normally, plus trending data so we can see how the device has been operating for the last 24 hours, week, month, or even year.
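To make the idea of a baseline concrete, here is a minimal sketch in Python. It is an illustration only, not a description of any particular monitoring product: the function names, the sample latency readings, and the three-standard-deviation threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical readings as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def is_abnormal(value, baseline, threshold=3.0):
    """Return True if value is more than `threshold` standard deviations from the mean."""
    avg, sd = baseline
    if sd == 0:
        # Perfectly flat history: any different reading is a deviation.
        return value != avg
    return abs(value - avg) / sd > threshold

# Hypothetical latency readings (milliseconds) collected from a monitored switch.
history = [12.0, 11.5, 13.2, 12.8, 11.9, 12.4, 13.0, 12.1]
baseline = build_baseline(history)

print(is_abnormal(12.6, baseline))  # a reading close to the historical average
print(is_abnormal(85.0, baseline))  # a reading far outside the historical range
```

With no history to compare against, which is the situation with an unmonitored device, there is nothing to tell you whether 85 ms is unusual or just how that device has always behaved.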
Another major cause of “downtime” is loss of internet connectivity. As most businesses cannot operate without access to the internet (e.g., sending and receiving email, browsing, remote access, remote office connectivity, VoIP service), loss of or degraded internet access can cause major problems for a business. Internet outages can be caused by the loss of a carrier connection, a router, a firewall, or a web-filtering device, or by over-subscription of the purchased bandwidth, internal network issues, malware, etc. This is another instance in which diagnosing a cause can be virtually impossible if network devices have no built-in diagnostic tools and cannot be monitored. The more complex the environment, the more likely it is that issues like these will develop, and the less likely it is that they will be resolved in any kind of reasonable time frame, if ever. I cannot count the number of times that carriers have claimed that nothing was wrong with their connection, only for us to provide them with reports showing repeated periods of high latency and packet loss.
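As a rough illustration of the kind of figures such a report contains, the Python sketch below summarizes a series of probe results into packet loss and latency numbers. The function name, the 150 ms "high latency" cutoff, and the sample data are assumptions made for this example; a real monitoring system would collect the probes itself and over much longer periods.

```python
def summarize_link(samples, high_latency_ms=150.0):
    """Summarize probe results; each sample is a round-trip time in ms, or None if lost."""
    total = len(samples)
    replies = [s for s in samples if s is not None]
    lost = total - len(replies)
    return {
        "packet_loss_pct": 100.0 * lost / total if total else 0.0,
        "avg_latency_ms": sum(replies) / len(replies) if replies else None,
        "high_latency_count": sum(1 for s in replies if s > high_latency_ms),
    }

# Hypothetical probe results for a carrier connection (None = no reply).
probes = [22.0, 25.0, None, 310.0, 24.0, None, 280.0, 23.0, 26.0, 21.0]
report = summarize_link(probes)
print(report)
```

Numbers like these, tracked over days or weeks, are what turn "the internet feels slow" into something a carrier can be asked to explain.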
Years ago, 4IT adopted a basic technical management philosophy which goes like this:
If you can’t measure it, you can’t manage it.
If you aren’t monitoring your IT environment, then you can’t measure it.
Good luck managing it.