What is RTAM and why is it important?
Real Time Alert Monitoring (RTAM) is an approach to monitoring systems and applications by interrogating the logs they produce and issuing alerts when defined thresholds are exceeded.
Other conditions are also worth monitoring and alerting on when thresholds are exceeded, such as Operational Level Agreement (OLA) failures and Service Level Agreement (SLA) failures, to name a few.
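As a rough illustration of interrogating logs against a threshold, the following Python sketch counts log lines per level and reports whether an alert is warranted. The log line format, the function names, and the threshold value are assumptions made for this example, not part of any particular RTAM product.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <message>"
LEVEL_PATTERN = re.compile(r"^\S+\s+(DEBUG|INFO|WARNING|ERROR|CRITICAL)\b")


def count_levels(lines):
    """Count how many log lines were written at each level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.match(line)
        if match:  # lines that do not fit the assumed format are ignored
            counts[match.group(1)] += 1
    return counts


def should_alert(lines, level="ERROR", threshold=3):
    """Return True when the number of lines at `level` exceeds `threshold`."""
    return count_levels(lines)[level] > threshold


# Made-up log lines for one capture interval.
sample = [
    "2024-01-05T10:00:01 INFO service started",
    "2024-01-05T10:00:07 ERROR database connection refused",
    "2024-01-05T10:00:09 ERROR database connection refused",
    "2024-01-05T10:00:12 INFO retrying",
    "2024-01-05T10:00:15 ERROR database connection refused",
    "2024-01-05T10:00:18 ERROR database connection refused",
]

if should_alert(sample, level="ERROR", threshold=3):
    print("ALERT: ERROR count exceeded threshold")
```

In practice the lines would come from the monitored system's log files for the most recent capture interval rather than from a hard-coded list.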
A model for RTAM is Microsoft's Event Viewer. Alerts are written to a database table and flagged as high, medium, or low based on thresholds. Actions are then taken based on the alert level, such as email notification to stakeholders, support groups, or other interested parties. All high alerts should generate an incident support ticket in your ticketing system, so that the appropriate support group(s) are notified immediately and the issue is tracked.
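The flow described above (flag by threshold, write to a table, act on the level) might look something like the following Python sketch. It uses an in-memory SQLite database for simplicity; the table layout, the threshold values, and the action names are all hypothetical and would depend on your own systems and ticketing tool.

```python
import sqlite3

# Hypothetical thresholds: a value at or above a bound is flagged at that level.
# Ordered from most to least severe so the first match wins.
THRESHOLDS = [("high", 90.0), ("medium", 75.0), ("low", 60.0)]


def classify(value):
    """Return 'high', 'medium', 'low', or None when no threshold is reached."""
    for level, bound in THRESHOLDS:
        if value >= bound:
            return level
    return None


def actions_for(level):
    """Map an alert level to follow-up actions (placeholder names only)."""
    actions = ["email_stakeholders"]
    if level == "high":
        # Every high alert also opens an incident support ticket.
        actions.append("open_incident_ticket")
    return actions


def record_alert(conn, source, metric, value):
    """Write an alert row if a threshold is reached; return (level, actions)."""
    level = classify(value)
    if level is None:
        return None, []
    conn.execute(
        "INSERT INTO alerts (source, metric, value, level) VALUES (?, ?, ?, ?)",
        (source, metric, value, level),
    )
    conn.commit()
    return level, actions_for(level)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE alerts ("
    "id INTEGER PRIMARY KEY, source TEXT, metric TEXT, value REAL, "
    "level TEXT, created_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)

print(record_alert(conn, "app-server-1", "cpu_percent", 95.0))
print(record_alert(conn, "app-server-1", "cpu_percent", 50.0))
```

Keeping the thresholds in one data structure, rather than scattered through the code, makes them easier to expose for editing later.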
A result of implementing a real time alert monitoring system is the timely identification of issues or potential issues. The sooner you can identify and resolve an issue or potential issue, the smaller the operational and/or financial impact will be.
The frequency for capturing data should vary based on the systems to be monitored. Alerts are written to a database table based on the nature of the alert. Generally speaking, alerts should be categorized as follows:
It’s important to provide a User Interface (UI) to view the alerts, display detailed information about each alert, update the thresholds, and generate reports. Your support team would have access to the data through the UI. It is also important to include as much detail as possible in the incident support ticket (i.e., the relevant information from the RTAM high alert).
RTAM systems cannot be developed and left unattended. There are a number of ways that monitoring and alerting systems fail:
It’s important to ensure that alerting does not overwhelm the support teams. Monitoring should capture all situations, but alerts should be raised only when conditions warrant it. For that reason, monitoring thresholds should be tuned regularly. Each time a high alert is issued and addressed by the support team is an opportunity to confirm whether or not the high alert was warranted.
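One way to make that confirmation step measurable is to record, for each high alert the support team closes, whether it was judged warranted, and then look at the share that was not. The Python sketch below assumes such a review flag exists; the record layout, function names, and the 25% tolerance are illustrative assumptions only.

```python
def unwarranted_rate(reviews):
    """Fraction of reviewed high alerts the support team judged unwarranted."""
    if not reviews:
        return 0.0
    unwarranted = sum(1 for review in reviews if not review["warranted"])
    return unwarranted / len(reviews)


def needs_threshold_review(reviews, tolerance=0.25):
    """Suggest revisiting a threshold when too many high alerts were noise."""
    return unwarranted_rate(reviews) > tolerance


# Made-up review outcomes for one threshold over a review period.
cpu_reviews = [
    {"alert_id": 101, "warranted": True},
    {"alert_id": 102, "warranted": False},
    {"alert_id": 103, "warranted": False},
    {"alert_id": 104, "warranted": True},
]

if needs_threshold_review(cpu_reviews):
    print("Half of the reviewed high alerts were unwarranted; review the threshold.")
```

A check like this only suggests where to look; the decision to raise or lower a threshold still rests with the people who know the system.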
An RTAM database can be used as a source for reporting. Trends and anomalies can be identified by generating reports on a regular basis over a period of time. These reports can and should be used for adjusting thresholds when and where appropriate.
Sample Report 1 shows an anomaly for the month of January, which has impacted the trend. A previous investigation would have identified the issue for January, which may simply be a time-of-year aberration.
Sample Report 2 shows a slight upward trend over a six-month period. Continued monitoring is required.
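The kind of trend and anomaly checks these reports support can be sketched in a few lines of Python. The example below fits a least-squares slope to monthly alert counts and flags months that sit far from the average. The monthly figures are invented for illustration and are not the data behind the sample reports, and the z-score limit of 2.0 is an assumed starting point rather than a recommended value.

```python
from statistics import mean, pstdev


def trend_slope(counts):
    """Least-squares slope of counts per period (positive means rising)."""
    n = len(counts)
    if n < 2:
        return 0.0
    x_mean = (n - 1) / 2
    y_mean = mean(counts)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(counts))
    denominator = sum((x - x_mean) ** 2 for x in range(n))
    return numerator / denominator


def anomalies(counts, z_limit=2.0):
    """Indexes of periods whose count is more than z_limit deviations from the mean."""
    if len(counts) < 2:
        return []
    spread = pstdev(counts)
    if spread == 0:
        return []  # all periods identical, nothing stands out
    centre = mean(counts)
    return [i for i, c in enumerate(counts) if abs(c - centre) / spread > z_limit]


# Invented monthly alert counts for two hypothetical reports.
with_spike = [40, 42, 41, 90, 43, 44]
rising = [50, 52, 51, 54, 55, 57]

print("anomalous periods:", anomalies(with_spike))
print("slope per month:", round(trend_slope(rising), 2))
```

A single large spike also pulls the fitted slope, which is why an anomaly is usually investigated and set aside before reading anything into the trend.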
There are a number of RTAM solutions in the marketplace. However, you could develop your own in-house RTAM solution. A number of factors contribute to the decision to build or buy, some of which will be unique to your specific requirements, including budget, complexity of alerts, monitoring scope, and platform, to name a few.
Whichever approach you take will ultimately help you address issues quickly and efficiently, ensuring that your OLAs and SLAs are met and that your internal and external clients remain satisfied with your service.
Cover-All Managed Cloud and IT Services is a Managed Services Provider (MSP) delivering cost-effective solutions for Managed IT, Cloud, Co-Location, Backup & Recovery/DR, and Cloud and IT Consulting Services. Contact Cover-All at 1-833-268-3788 or visit our website at www.msp.cover-all.ca