A while back I blogged about monitoring. I covered a little bit of history and listed some of the challenges IT faces in deciding what to monitor, how much monitoring is enough, etc. Today, I want to talk about how the Agility Platform approaches monitoring.
The Agility Platform is able to monitor the instances that are provisioned; it just needs to be configured and turned on. The platform provides important information about the health and availability of the instance, but it is limited in its ability to instrument or probe all of the layers of the application for overall service health. Because of that limitation, we advise our customers to leverage the Systems Management tools they have already invested in to provide a clearer picture of health for the services they offer.
Some IT departments remain server-focused. While I see the value in that, they also need to understand that the service is just as important as, if not more important than, the servers. A single Web server going down in a cluster should be of interest to IT. But if the overall service is still operating and response times are good, it may matter less than, for instance, the loss of a single database server that supports multiple applications.
IT can test the user-facing side of the application by:
- Robotically sending an email as a test to ensure the corporate email service is up and running with a good end-user response time.
- Robotically issuing a trade against a test account to ensure all the layers of the Trading Application are working properly.
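To make the idea of a robotic test a little more concrete, here is a minimal sketch in Python. It is not taken from any particular monitoring product; the names `CheckResult` and `run_synthetic_check` are my own, and the `transaction` argument stands in for whatever scripted action you would run (sending the test email, placing the test trade). The check fails if the transaction raises an error or takes longer than the response time you consider acceptable.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    """Outcome of one scripted end-user transaction (hypothetical type)."""
    name: str
    ok: bool
    seconds: float
    detail: str = ""


def run_synthetic_check(name: str,
                        transaction: Callable[[], None],
                        max_seconds: float) -> CheckResult:
    """Run one scripted transaction, time it, and judge it against a threshold."""
    start = time.monotonic()
    try:
        transaction()  # e.g. send a test email or issue a test trade
    except Exception as exc:
        # Any error in the transaction means the service is not healthy.
        return CheckResult(name, False, time.monotonic() - start, str(exc))
    elapsed = time.monotonic() - start
    if elapsed > max_seconds:
        # The service answered, but too slowly for a good end-user experience.
        return CheckResult(name, False, elapsed, "too slow")
    return CheckResult(name, True, elapsed)
```

A scheduler would call something like `run_synthetic_check("corporate-email", send_test_email, max_seconds=10.0)` every few minutes, where `send_test_email` is a function you supply.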
(Hopefully this makes sense. If not, reach out to me and we can discuss it further, or add your question as a comment on the blog. Your choice.)
Now, back to Agility and how it can tie into the Systems Management realm.
Since Agility is the place to deploy new corporate Service Offerings, it is also a good place to add in some automation. What I mean by automation in this case is this: When a group of servers (or a single server) is deployed, Agility can reach out to the Systems Management Tools and have the server(s) added to the different monitoring groups.
Within Agility, I am talking about using a LifeCycle Policy, adding in the API calls from the monitoring tool and telling the tool the machine name and/or IP address of the new server that was deployed. It could be something like telling SiteScope to start running a standard set of robotic tests against the different servers, or telling Nagios to monitor the new servers. Then of course, if the service is scaled up, scaled down or undeployed, the policy can reach out and instruct the monitoring tool to monitor more servers, fewer servers or stop monitoring the servers altogether.
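Here is a rough sketch of the logic such a policy would carry, written in Python for illustration only. Everything in it is an assumption on my part: the event names, the `MonitoringClient` interface and the `on_lifecycle_event` function are hypothetical, not the Agility, SiteScope or Nagios API. The `InMemoryMonitoring` class is a stand-in so the example runs on its own; in practice you would replace it with calls to your monitoring vendor's actual API.

```python
from typing import Dict, Protocol


class MonitoringClient(Protocol):
    """Hypothetical interface; a real tool's API will look different."""
    def add_host(self, name: str, address: str) -> None: ...
    def remove_host(self, name: str) -> None: ...


class InMemoryMonitoring:
    """Stand-in for a real monitoring tool; keeps monitored hosts in a dict."""
    def __init__(self) -> None:
        self.hosts: Dict[str, str] = {}

    def add_host(self, name: str, address: str) -> None:
        self.hosts[name] = address

    def remove_host(self, name: str) -> None:
        self.hosts.pop(name, None)  # removing an unknown host is a no-op


def on_lifecycle_event(event: str, name: str, address: str,
                       monitoring: MonitoringClient) -> None:
    """Keep the monitoring tool in step with what has been deployed."""
    if event in ("deployed", "scaled_up"):
        monitoring.add_host(name, address)
    elif event in ("scaled_down", "undeployed"):
        monitoring.remove_host(name)
    else:
        raise ValueError(f"unhandled lifecycle event: {event}")


# Usage: a server is deployed, then later undeployed.
tool = InMemoryMonitoring()
on_lifecycle_event("deployed", "web01", "10.0.0.5", tool)
on_lifecycle_event("undeployed", "web01", "10.0.0.5", tool)
```

Keeping the monitoring tool behind a small interface like this means the same policy logic could be pointed at a different tool without being rewritten.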
There are some other options as well. For instance, before adding the server to the monitoring tool via an API call in the policy, the policy could make an additional API call to the CMDB to gather further details required for monitoring. These could include:
- The list of types of monitoring (http test, jdbc test, ftp test, etc.)
- The monitoring group to add the servers to
- Escalation/owner details and more
These details can be passed on to the monitoring tool and/or used to update properties within Agility so they are available to the administrators of the service.
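As a sketch of that CMDB step, the Python below merges a CMDB record into the details sent to the monitoring tool. The field names (`checks`, `monitoring_group`, `owner`), the defaults and the `build_registration` function are all assumptions for illustration; your CMDB's schema and API will differ, and here the CMDB lookup is represented by a plain dictionary.

```python
from typing import Any, Dict, Mapping


def build_registration(name: str, address: str,
                       cmdb_record: Mapping[str, Any]) -> Dict[str, Any]:
    """Combine server identity with monitoring details from a CMDB record."""
    return {
        "host": name,
        "address": address,
        # Fall back to a basic ping check if the CMDB lists no test types.
        "checks": list(cmdb_record.get("checks") or ["ping"]),
        "group": cmdb_record.get("monitoring_group", "unassigned"),
        "owner": cmdb_record.get("owner", "unknown"),
    }


# Usage with a made-up CMDB record for a newly deployed server.
record = {"checks": ["http", "jdbc"], "monitoring_group": "trading-web",
          "owner": "trading-ops"}
payload = build_registration("web01", "10.0.0.5", record)
```

The resulting `payload` is what the policy would hand to the monitoring tool, and the same values could be written back as properties on the service.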
As you can see, there are a lot of options. Typically the vendor of the systems management tool provides an API that makes it easy to add or remove monitored servers, so building the policy shouldn’t be a significant amount of work.
In turn, you will have automated monitoring of servers as they are deployed, which reduces IT's workload and, potentially, the number of “Oops, I forgot to add those servers to be monitored” moments. OUCH!
– Tobin Isenberg