In SCOM there are several monitors that monitor agent health and alert accordingly. This series of blog posts aims to provide in-depth details on how these monitors work.
To start, I’ve decided to blog about two of the most intriguing monitors for agent health:
1. Event Collection Health.
2. Performance data collection health.
The target for both monitors is the “Health Service Watcher” class, and they live in the “System Center Core Monitoring” management pack.
These monitors are disabled by default.
These monitors have no Diagnostic or Recovery actions linked to them.
Parameters of interest for these monitors are:
1. Max Event Age in Hours – Any event or performance sample older than this number of hours is deemed old data and turns the monitor red ($Config/MaxEventAgeHr$ in the event query below and $Config/MaxPerfAgeHr$ in the performance query).
2. Query Timeout – The timeout for the SQL query the monitor runs.
3. Watch Period in Hours – The timespan the query looks back over when gathering event or performance details ($Config/WatchPeriodHr$ in the queries below, which are run by the OLEDB data source).
4. Interval – How often the monitor performs its check.
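The $Config/...$ tokens are placeholders that get replaced with the monitor's configured values before the query runs. If you want to try one of the queries by hand, the same substitution can be sketched roughly as below. This is only an illustration in Python: the function name and the example values (24 and 48 hours) are hypothetical, not the monitors' defaults.

```python
import re

# Hypothetical values for illustration; the real ones come from the monitor's
# configuration or overrides.
EXAMPLE_CONFIG = {"MaxEventAgeHr": 24, "WatchPeriodHr": 48}

# Matches tokens such as $Config/WatchPeriodHr$ and captures the parameter name.
PLACEHOLDER = re.compile(r"\$Config/(\w+)\$")


def fill_placeholders(query: str, config: dict) -> str:
    """Replace each $Config/Name$ token with the matching integer from config."""

    def substitute(match):
        name = match.group(1)
        if name not in config:
            raise KeyError(f"no value supplied for $Config/{name}$")
        # int() ensures that only a whole number ends up in the SQL text.
        return str(int(config[name]))

    return PLACEHOLDER.sub(substitute, query)


fragment = "TimeGenerated > dateadd(hh,-$Config/WatchPeriodHr$,getutcdate())"
print(fill_placeholders(fragment, EXAMPLE_CONFIG))
```

Note that the parameter names are matched case-sensitively here, so they must be spelled exactly as they appear in the query text.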
The Performance data collection health monitor checks whether performance data exists in the OperationsManager database for every agent for which the monitor is enabled.
The monitor makes use of an OLEDB data source with the following SQL query:
select CAST(ME.Path as nvarchar(255)),
       CAST(Max(TimeSampled) As nvarchar(50)) As 'LastSample',
       CASE WHEN Isnull(MAX(TimeSampled),'01-01-80') < DateAdd(hh,-$Config/MaxPerfAgeHr$,getutcdate()) Then 'KO' Else 'OK' END
from dbo.ManagedEntityGenericView ME
inner join dbo.ManagedTypeView MT on ME.MonitoringClassId=MT.Id AND MT.Name = 'Microsoft.SystemCenter.HealthService'
left join dbo.PerformanceCounterView C on ME.Id = C.ManagedEntityId
left join dbo.PerformanceDataAllView P on C.PerformanceSourceInternalId=P.PerformanceSourceInternalId and P.TimeSampled > dateadd(hh,-$Config/WatchPeriodHr$,getutcdate())
where ME.IsDeleted=0
group by ME.Path
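This query returns the time of the last performance sample received for each object of class “Microsoft.SystemCenter.HealthService”, with a status of “OK” if performance data was received within the specified time and “KO” if it was not. Because the join only considers samples newer than the watch period, an agent with no samples in that window gets the fallback date '01-01-80' and is also reported as “KO”.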
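To make the OK/KO decision easier to follow, here is a minimal Python sketch of the same logic as the CASE expression, including the watch period filter from the join. It is an illustration only, not the monitor's implementation; the function and variable names are my own.

```python
from datetime import datetime, timedelta, timezone

# Stand-in for the '01-01-80' fallback the query uses when no row was found.
NO_DATA = datetime(1980, 1, 1, tzinfo=timezone.utc)


def collection_status(samples, max_age_hr, watch_period_hr, now):
    """Return 'OK' or 'KO' for one agent, mirroring the query's CASE expression."""
    # The join only keeps samples newer than the watch period.
    watch_start = now - timedelta(hours=watch_period_hr)
    recent = [s for s in samples if s > watch_start]
    # MAX() over no rows is NULL in SQL; Isnull() then substitutes the fallback.
    last = max(recent) if recent else NO_DATA
    # 'KO' when the newest sample is older than the allowed age.
    return "KO" if last < now - timedelta(hours=max_age_hr) else "OK"


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
print(collection_status([now - timedelta(hours=2)], 24, 48, now))
print(collection_status([], 24, 48, now))
```

The first call reports an agent that sent data two hours ago and the second an agent with no data at all.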
Event collection health
This monitor also utilizes an OLEDB data source with the following SQL query:
select #T.Path,
       CAST(MAX(LastTime) as nvarchar(50)) As 'LastEvent',
       CASE WHEN Isnull(MAX(LastTime),'01-01-80') < DateAdd(hh,-$Config/MaxEventAgeHr$,getutcdate()) Then 'KO' Else 'OK' END As 'Status'
From (
    Select CAST(ME.Path as nvarchar(255)) As [Path],
           CASE WHEN IsNull([Path], '')='' THEN ''
                WHEN CHARINDEX('.', [Path]) = 0 Then [Path]
                ELSE SUBSTRING(Path,1,CHARINDEX('.', Path)-1) END As 'Netbios'
    From dbo.ManagedEntityGenericView ME
    Inner join dbo.ManagedTypeView MT on ME.MonitoringClassId=MT.Id AND MT.Name = 'Microsoft.SystemCenter.HealthService'
    where IsDeleted=0
) As #T
left join (
    select distinct LoggingComputer, MAX(TimeGenerated) As 'LastTime'
    from dbo.EventView
    where TimeGenerated > dateadd(hh,-$Config/WatchPeriodHr$,getutcdate())
    group by LoggingComputer
) As #E on #E.LoggingComputer = #T.Path or #E.LoggingComputer=#T.[Netbios]
group by Path
The query returns the date and time of the last event received for each agent. If that event was received within the specified time the status is “OK”; if not, the status returned is “KO”.
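One detail worth noting in the event query is that an event's LoggingComputer is matched against either the agent's full Path or the short name in front of the first dot. A small Python sketch of that matching follows. The host names are made up, and the case-insensitive comparison is my assumption about how the database collation compares strings.

```python
def netbios_name(path):
    """Mirror the inner CASE: '' for empty, else the text before the first dot."""
    if not path:
        return ""
    dot = path.find(".")
    return path if dot == -1 else path[:dot]


def matches_agent(logging_computer, agent_path):
    """True when the event's LoggingComputer equals the Path or its short name."""
    # Assumption: string comparison in the database is case-insensitive.
    candidates = {agent_path.casefold(), netbios_name(agent_path).casefold()}
    return logging_computer.casefold() in candidates


# Hypothetical host names, for illustration only.
print(netbios_name("srv01.contoso.com"))
print(matches_agent("SRV01", "srv01.contoso.com"))
```

This is why events logged under the short computer name still count towards an agent whose Path is a fully qualified name.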
Extract from the “Microsoft.SystemCenter.2007” management pack (friendly name: System Center Core Monitoring)
Example of the alert as it appears in the SCOM console
With these two monitors you can monitor and get alerted on any performance and event collection problems experienced by the SCOM agents. It’s really bad when, come the end of the month, you are unable to run availability or performance reports, or the reports do not contain all the servers, because some servers stopped sending performance and event data several weeks earlier.
By enabling these monitors the SCOM administrator can be proactively notified of agent problems and fix them.