This was definitely an odd one. I noticed that one of our systems was showing as having a healthy SCOM Agent yet it if you ran an availability report against the windows computer object it would show monitoring as being unavailable. After confirming that the data warehouse was not running behind I found that this was actually happening with more than one of our servers.
Running an availability report would look as follows:
Brody Kilpatrick has a nice post on his blog explaining one of the possible causes and solutions which involves running some unsupported scripts against the data warehouse. I highly recommend reading his post and all credit for this solution must go to him. With that said, I found that the SQL queries he posted have issues that caused them to fail, at least in my environment. (Brody responded that he is updating the queries so it is likely that by the time you read this they will be fixed.) There were also some slight discrepancies between the results of his queries and my results so I opted to use his work as a template, but to modify things ever so slightly so that it would actually work in my environment which is running OpsMgr 2012 SP1 with the datawarehouse running on a dedicated Server 2008 r2 box running SQL 2008 R2.
First on your datawarehouse server you are going to want to run the following query:
If nothing is returned, that is fantastic, and you aren’t experiencing the problem this post will solve. If you do get results they will look something like this:
The EndDateTime with Null is not necessarily indicative of a problem. In some cases it was just a server that had been shutdown for a period of time, but had not been removed from SCOM. However, some of these NULL’s were for the servers that were showing healthy SCOM agents with availability reporting showing monitoring unavailable.
As useful as HealthServiceOutageRowId is it can be helpful to actually know the name of the associated system. Run the following query to join in Name and DisplayName:
Your results should look like this with the right-most DisplayName column providing the FQDN of the affected system:
At this point Brody’s post recommends confirming that the systems are all experiencing the problem, backing up your datawarehouse, and at your own risk modifying the values of the EndDateTime column via custom SQL. I tend to be a little risk averse, at least in my production environments so the first thing I tried now that I had narrowed down the issue was to uninstall the SCOM agent from one of the misbehaving systems, and then immediately reinstalling it. For that system this resolved the issue immediately with the proper availability monitoring returning post reinstall:
However, one of my affected servers was a domain controller which had a manually installed agent. I had no way of uninstalling, and reinstalling the agent without bugging our domain administrator.
So for this case I backed up the datawarehouse and then did the following (Again you could do this via raw SQL, but sometimes I think it is easier to have a clear understanding of what you are doing to a database rather than just copying some code someone else wrote)
Please keep in mind this solution is not supported by Microsoft:
Right click the dbo.HealthServiceOutage table:
Select Edit Top 200 Rows:
In the right hand properties box hit the + sign next to Top Specification and increase the Expression value to include the value of the HealthServiceOutageRowID of the sytem you want to fix:
At the bottom of your query you will see query changed, right click and select Execute SQL:
Scroll down to the HealthServiceOutageRowID which matches your affected server. The EndDateTime should show Null. Copy the value from the StartDateTime, and paste it into the box for the EndDateTime and close out of the editor.
And then for good measure run this script again to confirm that the your modification worked and the server should no longer be returned:
So two fixes for this issue:
Recommended Fix Reinstall the SCOM agent
Optional Not Supported back up your datawarehouse first Fix:
Modify the EndDateTime value from Null to match the StartDateTime, either via management studio edit, or via SQL Query.
Just to reiterate, if you opt to use this post as a solution– read Brody’s post as well, he found the solution and presents a much deeper understanding of how availability is actually calculated and the extra info is extremely useful. His method of fixing this via SQL rather than a manual edit via management studio is also far more scalable if you happen to have this problem on more than a handful of servers.