Systems Management has been around for a very long time, and its concepts are used by nearly all companies worldwide.
When you have applications running on an Operating System on a server, maybe even multiple servers working together, you need to track the health of those systems, from basic up/down monitoring to in-depth monitoring of the health of specific technologies (e.g., web server, database, etc.).
Several years ago, customers would ask application vendors to provide health metrics, such as log files and reports, on their applications. Customers learned that in many cases, these metrics were not enough and/or not reliable.
The information was helpful for understanding the general health of the application, but not for getting a deeper look at issues. For instance, the web server could log that a port was blocked, but it couldn’t give another application response time measurements when the CPU was pegged or when there was an intermittent network problem. With the complexity we deal with today, a corporate website, even one load balanced in a cluster, has many metrics from different silos of technology, including the actual web server application (e.g., IIS, Tomcat), the Operating System (Windows, Linux, etc.), a database (Oracle, MySQL, etc.) running on another server, the database storage system, routers, firewalls, switches, and so on.
IT departments realized they needed tools that focused on monitoring the health and availability of critical systems within IT. There are numerous tools available from big companies like IBM, CA, BMC and HP, and from smaller vendors (NetIQ, Zenoss, SolarWinds, etc.). The open-source community (OpenNMS, Nagios) offers even more.
The interesting thing is, each tool has a list of things it does well and things it doesn’t do well. I have worked with many large corporations, specifically in the area of monitoring the health of the enterprise, and they typically use several tools from multiple vendors. They have tools that focus on discovering new devices and reporting basic up/down status, and tools geared toward reporting on the health of Operating Systems and applications. Other tools focus on being the best at reporting the health of specific applications. Of course, they have a whole other set of tools for Network, Storage, Response Time and more.
I like to summarize it this way: There are seven layers in the OSI stack. While there may be a couple of tools out there able to monitor each individual layer, there is typically a “best of breed” for each. One vendor may be the best of breed for one or more layers, but no vendor is the best of breed for all of them. With that in mind, if you want to monitor the overall health and availability of your enterprise end to end, you typically need to monitor several (if not all) of the layers, and in turn, IT needs to leverage tools from multiple vendors, maybe even some custom-built tools.
Over time, due to uptime requirements, disaster recovery, fault tolerance and other challenges, IT departments started turning to third-party companies to provide an external location for their servers and equipment. The idea was, if there was a power outage in the office, critical IT systems could continue to run. These locations (Data Centers or CoLo) typically have generators and multiple connections to the Internet in case one of the connections fails. IT still wanted measurements and monitoring, and these companies provided some basic up/down monitoring, as well as local staff for hands-on assistance (think rebooting). But in the end, the IT department still owned the health and availability of the applications (e-mail, web server, etc.), so they put tools (instrumentation, agents, probes, etc.) on the systems to monitor them.
Then along came virtualization and the cloud, with its complex pricing based on CPU ticks, disk utilization, network traffic, etc. Some IT departments decided to trim down workloads in order to reduce the cost of using the cloud. In turn, many dropped the agents and instrumentation that provided visibility into health and availability. They felt that having a full-fledged agent on the server was too heavy for monitoring and visibility. Some run blind (no agents or probes), while others use remote monitoring, such as robotically accessing URLs to test up/down and responsiveness.
A further shift in the industry came from companies that provide a service such as email, sales tracking, help desk tracking, etc., via a Service Offering. With this approach, there are no servers or applications to install; you typically access these services via a browser. They tend to be critical to the business, and IT needs to monitor them, but IT can’t install an agent on these third-party services (e.g., Salesforce.com, Office 365, ServiceNow, etc.). So IT needs to set up some other way to test the basics: Can the employees of the company access the URL and log into email? Can they send an email? Robotic URL testing is a pretty common approach.
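To make the idea of robotic URL testing concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under assumptions, not how any particular monitoring product works: the names `ProbeResult`, `fetch_status` and `probe`, the 2-second "slow" threshold and the example URL are all hypothetical. It only covers the up/down and response-time basics; a real robotic test would also script the login and a transaction or two.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    url: str
    up: bool                 # did the service answer with a non-error status?
    status: Optional[int]    # HTTP status code, or None if unreachable
    seconds: float           # measured response time
    slow: bool               # up, but slower than the chosen threshold


def fetch_status(url: str, timeout: float) -> int:
    """Default fetcher: issue a GET and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (4xx/5xx).
        return err.code


def probe(
    url: str,
    timeout: float = 10.0,
    slow_after: float = 2.0,  # hypothetical threshold, tune per application
    fetch: Callable[[str, float], int] = fetch_status,
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Time one request and classify it as up, down, or up-but-slow."""
    start = clock()
    try:
        status = fetch(url, timeout)
    except OSError:
        # URLError and socket timeouts are OSError subclasses:
        # DNS failure, refused connection, timeout, and so on.
        return ProbeResult(url, False, None, clock() - start, False)
    elapsed = clock() - start
    up = 200 <= status < 400
    return ProbeResult(url, up, status, elapsed, up and elapsed > slow_after)


if __name__ == "__main__":
    # Hypothetical URL; substitute the login page of the service you care about.
    print(probe("https://example.com/"))
```

Run on a schedule from a few locations where employees actually sit, a probe like this gives IT its own measurement of availability and responsiveness without installing anything on the provider's side.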
I think it makes sense for vendors to provide a layer of health monitoring via logs, dashboards and reports, with even the Cloud Provider showing metrics and stats. But in the end, IT needs a wider picture of health; relying just on logs or a vendor-provided metrics dashboard is not enough. IT needs to instrument that system (server, application, etc.), do robotic URL testing, or both.
How much monitoring is enough? Do you need to monitor every layer of the OSI stack? Is URL testing enough? For some applications, should a full-fledged agent be installed? It comes down to how much risk IT is willing to sign up for.
The less visibility IT has into the health of the servers, applications, etc., the greater the risk of missing outages, potentially costing the business revenue and/or creating costs. I think at a minimum, the monitoring should do some type of response-time testing, exercising basic features of the application, as well as general testing of up/down.
For instance, suppose you’re monitoring a web-based time sheet application. Due to the number and/or geographic location of users, the web server is clustered. Just because a transaction to log into the website passed with a good response time, it doesn’t mean that all the nodes in the cluster are up and running. Maybe one geography is down. Maybe one or more nodes in the cluster are down. This may be fine during off-hours but will pose a problem soon enough.
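The time sheet example can be sketched in a few lines of Python. The point is to check every node behind the load balancer rather than trusting one successful login through the front door. Everything here is an assumption made for illustration: the function name `cluster_report`, the region names and the node addresses are hypothetical, and `is_up` stands in for whatever per-node check you already have (a ping, a URL probe, an agent query).

```python
from typing import Callable, Dict, List


def cluster_report(
    nodes: Dict[str, List[str]],
    is_up: Callable[[str], bool],
) -> Dict[str, dict]:
    """Check each node individually and summarize per geography.

    nodes maps a geography name to the addresses of its cluster nodes.
    is_up is any per-node health check that returns True when the node responds.
    """
    report = {}
    for region, addresses in nodes.items():
        down = [addr for addr in addresses if not is_up(addr)]
        report[region] = {
            "total": len(addresses),
            "down": down,
            # A whole geography is down only if it has nodes and none respond.
            "region_down": len(addresses) > 0 and len(down) == len(addresses),
        }
    return report


if __name__ == "__main__":
    # Hypothetical cluster layout and a canned health check for demonstration.
    cluster = {
        "us-east": ["10.0.1.10", "10.0.1.11"],
        "europe": ["10.0.2.10", "10.0.2.11"],
    }
    failed = {"10.0.1.11", "10.0.2.10", "10.0.2.11"}
    for region, summary in cluster_report(cluster, lambda a: a not in failed).items():
        print(region, summary)
```

In this canned example a login through the load balancer would still succeed via the one healthy us-east node, while the report shows a degraded us-east cluster and a Europe region that is completely down.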
How much monitoring for cloud is enough? What tools work best? Feel free to add your comments.
– Tobin Isenberg