Troubleshooting: SCOM Agent Healthy, but availability report for server shows monitoring unavailable

This was definitely an odd one. I noticed that one of our systems was showing a healthy SCOM agent, yet if you ran an availability report against its Windows Computer object, the report would show monitoring as unavailable. After confirming that the data warehouse was not running behind, I found that this was actually happening with more than one of our servers.

Running an availability report would look as follows:

[Screenshot: availability report showing monitoring as unavailable]

Brody Kilpatrick has a nice post on his blog explaining one of the possible causes and solutions, which involves running some unsupported scripts against the data warehouse. I highly recommend reading his post, and all credit for this solution must go to him. With that said, I found that the SQL queries he posted had issues that caused them to fail, at least in my environment. (Brody responded that he is updating the queries, so it is likely that by the time you read this they will be fixed.) There were also some slight discrepancies between the results of his queries and my results, so I opted to use his work as a template and modify things ever so slightly so that it would actually work in my environment, which is running OpsMgr 2012 SP1 with the data warehouse on a dedicated Server 2008 R2 box running SQL 2008 R2.

First, on your data warehouse server, run the following query:

[Screenshot: SQL query]

If nothing is returned, that is fantastic, and you aren’t experiencing the problem this post will solve. If you do get results they will look something like this:

[Screenshot: query results with NULL EndDateTime values]

An EndDateTime of NULL is not necessarily indicative of a problem. In some cases it was just a server that had been shut down for a period of time but had not been removed from SCOM. However, some of these NULLs were for the servers that were showing healthy SCOM agents while availability reporting showed monitoring as unavailable.
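To see why a missing EndDateTime matters, here is a deliberately simplified model in Python. It is not SCOM's actual availability calculation (Brody's post covers that); it only assumes that an outage row with no end is treated as still ongoing, which is consistent with the symptom described here. The function name and the sample dates are made up for illustration.

```python
from datetime import datetime


def unavailable_seconds(outages, window_start, window_end):
    """Sum the seconds of the report window covered by outage rows.

    Each outage is a (start, end) pair. end=None stands in for a NULL
    EndDateTime and is treated as "still ongoing" (an assumption of this
    sketch, not confirmed SCOM behaviour). Overlapping rows are not merged.
    """
    total = 0.0
    for start, end in outages:
        effective_start = max(start, window_start)
        effective_end = window_end if end is None else min(end, window_end)
        if effective_end > effective_start:
            total += (effective_end - effective_start).total_seconds()
    return total


# Hypothetical one-week report window.
window_start = datetime(2013, 6, 1)
window_end = datetime(2013, 6, 8)

# One hour of downtime that was properly closed out.
closed = [(datetime(2013, 6, 2, 3, 0), datetime(2013, 6, 2, 4, 0))]
# The same outage, but the end was never recorded.
open_ended = [(datetime(2013, 6, 2, 3, 0), None)]

closed_seconds = unavailable_seconds(closed, window_start, window_end)
open_seconds = unavailable_seconds(open_ended, window_start, window_end)
print(closed_seconds / 3600, "hours vs", open_seconds / 3600, "hours")
```

In this model the open-ended row swallows the rest of the report window, even though the agent itself may have been healthy the whole time.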

As useful as HealthServiceOutageRowId is, it can be helpful to actually know the name of the associated system. Run the following query to join in Name and DisplayName:

[Screenshot: SQL query joining in Name and DisplayName]

Your results should look like this with the right-most DisplayName column providing the FQDN of the affected system:

[Screenshot: query results including Name and DisplayName]
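As a rough illustration of what the two queries are doing, the sketch below builds a tiny stand-in database with Python's sqlite3 module and selects the outage rows whose EndDateTime is NULL, joined to the entity name. This is not the original query. HealthServiceOutage, HealthServiceOutageRowId, StartDateTime, EndDateTime, Name and DisplayName come from this post; the ManagedEntity table, the ManagedEntityRowId join column, and the sample rows are assumptions, and the real data warehouse query may need extra conditions.

```python
import sqlite3

# In-memory stand-in for the data warehouse; nothing here touches SCOM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HealthServiceOutage (
        HealthServiceOutageRowId INTEGER PRIMARY KEY,
        ManagedEntityRowId INTEGER,   -- assumed join column
        StartDateTime TEXT,
        EndDateTime TEXT              -- NULL means no end was recorded
    );
    CREATE TABLE ManagedEntity (      -- assumed table name
        ManagedEntityRowId INTEGER PRIMARY KEY,
        Name TEXT,
        DisplayName TEXT
    );
    INSERT INTO ManagedEntity VALUES
        (1, 'srv01.example.test', 'srv01.example.test'),
        (2, 'srv02.example.test', 'srv02.example.test');
    INSERT INTO HealthServiceOutage VALUES
        (100, 1, '2013-06-02 03:00:00', '2013-06-02 04:00:00'),
        (101, 2, '2013-06-03 09:30:00', NULL);
""")

# Open outages (NULL EndDateTime) with the name of the affected system.
OPEN_OUTAGES = """
    SELECT o.HealthServiceOutageRowId, o.StartDateTime, o.EndDateTime,
           e.Name, e.DisplayName
    FROM HealthServiceOutage AS o
    JOIN ManagedEntity AS e
      ON e.ManagedEntityRowId = o.ManagedEntityRowId
    WHERE o.EndDateTime IS NULL
"""
rows = conn.execute(OPEN_OUTAGES).fetchall()
for row in rows:
    print(row)
```

Only the row for the second server comes back, because the first server's outage has a recorded end.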

At this point Brody’s post recommends confirming that the systems are all experiencing the problem, backing up your data warehouse, and, at your own risk, modifying the values of the EndDateTime column via custom SQL. I tend to be a little risk averse, at least in my production environments, so the first thing I tried once I had narrowed down the issue was to uninstall the SCOM agent from one of the misbehaving systems and then immediately reinstall it. For that system this resolved the issue immediately, with proper availability monitoring returning after the reinstall:

[Screenshot: availability report after reinstalling the agent]

However, one of my affected servers was a domain controller with a manually installed agent. I had no way of uninstalling and reinstalling the agent without bugging our domain administrator.

So for this case I backed up the data warehouse and then did the following. (Again, you could do this via raw SQL, but sometimes I think it is better to have a clear understanding of what you are doing to a database than to just copy some code someone else wrote.)

Please keep in mind this solution is not supported by Microsoft:

Right-click the dbo.HealthServiceOutage table:

[Screenshot: dbo.HealthServiceOutage table in Management Studio]

Select Edit Top 200 Rows:

[Screenshot: Edit Top 200 Rows menu item]

In the right-hand properties box, hit the + sign next to Top Specification and increase the Expression value until it includes the HealthServiceOutageRowId of the system you want to fix:

[Screenshot: Top Specification Expression value]

At the bottom of the query window you will see a notice that the query has changed. Right-click and select Execute SQL:

[Screenshot: Execute SQL menu item]

Scroll down to the HealthServiceOutageRowId which matches your affected server. The EndDateTime should show NULL. Copy the value from StartDateTime, paste it into the box for EndDateTime, and close out of the editor.

[Screenshot: edited EndDateTime value]

And then, for good measure, run the second query again to confirm that your modification worked. The server should no longer be returned:

[Screenshot: SQL query joining in Name and DisplayName]
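For comparison, the same edit expressed as SQL would look roughly like the sketch below, again run against an in-memory sqlite3 stand-in rather than a real data warehouse. The close_outage helper and the sample rows are hypothetical, and this is not Brody's script. The point is only the shape of the change: set EndDateTime to StartDateTime for one specific row, then re-check for open rows.

```python
import sqlite3

# In-memory stand-in table; the row ids and dates are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE HealthServiceOutage (
        HealthServiceOutageRowId INTEGER PRIMARY KEY,
        StartDateTime TEXT,
        EndDateTime TEXT
    );
    INSERT INTO HealthServiceOutage VALUES
        (101, '2013-06-03 09:30:00', NULL),
        (102, '2013-06-04 11:00:00', NULL);
""")


def close_outage(conn, row_id):
    """Set EndDateTime to StartDateTime for one open outage row.

    The extra "EndDateTime IS NULL" guard keeps the update from touching
    a row that already has a recorded end. Returns the rows changed.
    """
    cur = conn.execute(
        """
        UPDATE HealthServiceOutage
        SET EndDateTime = StartDateTime
        WHERE HealthServiceOutageRowId = ?
          AND EndDateTime IS NULL
        """,
        (row_id,),
    )
    conn.commit()
    return cur.rowcount


changed = close_outage(conn, 101)

# Re-run the check: only rows that were not fixed should still be open.
still_open = [r[0] for r in conn.execute(
    "SELECT HealthServiceOutageRowId FROM HealthServiceOutage "
    "WHERE EndDateTime IS NULL ORDER BY HealthServiceOutageRowId"
)]
print(changed, still_open)
```

Targeting a single row id at a time mirrors the manual edit: each server gets confirmed individually before its row is changed.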

So, two fixes for this issue:

Recommended fix: reinstall the SCOM agent.

Optional fix (not supported; back up your data warehouse first): modify the EndDateTime value from NULL to match the StartDateTime, either via a Management Studio edit or via a SQL query.

Just to reiterate: if you opt to use this post as a solution, read Brody’s post as well. He found the solution and presents a much deeper understanding of how availability is actually calculated, and the extra info is extremely useful. His method of fixing this via SQL rather than a manual edit in Management Studio is also far more scalable if you happen to have this problem on more than a handful of servers.

Troubleshooting: Product Evaluation is expiring in 60 days (SCOM 2012)

With some System Center 2012 products, like Service Manager, the install GUI requires you to enter a license key. While this is annoying during the install process, it is nice in that it makes sure you don’t forget to enter one. With Operations Manager 2012 the installer does not prompt for a license key, and by default all installs are technically 180-day evaluation copies. This is fine, except eventually you will log into the OpsMgr console and see the following:

[Screenshot: evaluation expiring notice in the OpsMgr console]

This can be a little scary, especially when you are seeing it in a production environment.


The official Microsoft instructions for adding a license key can be found here.

You will need to run the following PowerShell commands on each of your SCOM management servers:

Launch PowerShell using Run as Administrator.

Type the following:

Import-Module OperationsManager

New-SCOMManagementGroupConnection

Set-SCOMLicense -ProductId "Enter your license key here"

When prompted to confirm, type Y and hit Enter.

[Screenshot: PowerShell session with the license key censored]

The Microsoft instructions then tell you to run:

Get-SCOMManagementGroup | ft skuforlicense, version, timeofexpiration -a

For me this would consistently return the following result, with the management server still appearing to be running an eval copy:

[Screenshot: output still showing an evaluation license]

However, if you reboot the management server and rerun the commands, you should see something like this:

[Screenshot: output no longer showing an evaluation license]

So the reboot seems to be key after running through all the steps above.

Troubleshooting: MsDtsServer100 IS Package Failed (SQL Management Pack)

Every once in a while an engineer will have this error pop up for one of their systems:

[Screenshot: MsDtsServer100 IS Package Failed alert]

If the engineer is a SQL DBA then there is no problem, as they will understand both the source and ultimately how to fix the problem.

Sadly, not everyone who has or manages a SQL server is a DBA. There are plenty of cases where a sysadmin acquires a few SQL servers and knows the basics of managing them, or at least how to point an app server at one, but has never had the time to dig deeper into SQL. For them, MsDtsServer100 IS Package Failed is not particularly useful.

The alert description offers a useful clue: “Maintenance Plan” failed. (This name will vary if the default plan name is not used.)

[Screenshot: alert description]

So how do you troubleshoot this error?

If you remote into the SQL server referenced in the error, you can launch SQL Server Management Studio and connect to the instance in question. If you expand the Management folder you will find a folder called Maintenance Plans:

[Screenshot: Maintenance Plans folder in Management Studio]

In this case the Maintenance Plan has been renamed to “Nightly Backups”

If you right-click the Maintenance Plan and select View History, you will be presented with the following:

[Screenshot: maintenance plan history]

This is where things get confusing. Everything looks like it ran perfectly, as per the little green check marks of success. You see a Rebuild Indexes, a History Cleanup, some generic Maintenance task, and a DB Backup. All successful.

So where is the error coming from?

If you navigate to the Application Event Log on the SQL server for the time the alert was generated you will find the answer:

[Screenshot: Application Event Log entry]

Subplan II actually had two components. One was a rebuild indexes task, which you can see from the Management Studio history completed successfully. The other, in this particular case, was a reorganize indexes task, which was failing. Reorganizing indexes immediately after rebuilding them doesn’t sound like a very good order of operations. For this specific issue I recommended that the engineer remove the reorganize indexes task from Subplan II, and the error has not recurred since. So if you see MsDtsServer100 IS Package Failed, go to the Application Event Log of the SQL server to figure out the source of the problem.

Building Better Service Level Dashboards

Microsoft has added a lot of functionality into SCOM 2012 to make creating dashboards easy. The only problem is they have given you a blank canvas without much in the way of guidance. This can be great, but it can also be problematic. The fact that you can make a nine-cell grid layout filled with graphs and data doesn’t mean that you should.

What you should do is strive to build effective dashboards. What is an effective dashboard? There is no right answer (I am making up the phrase), though I would argue that an effective dashboard is one designed to give insight into a service with a specific audience in mind.

A dashboard that is useful for your engineers or sysadmins is going to (or should) look very different from a dashboard for Tier I support, much like a dashboard for Tier I should look different from a dashboard for non-IT customers. I like to break down service level dashboards into specific subcategories based on audience.

For the sake of this post, let’s divide potential dashboards into three groups:

1. Dashboards for non-technical internal clients, often published on an internal SharePoint site.

2. Dashboards for Tier I support and upper IT management, published via a limited-rights login to the SCOM web console.

3. Dashboards for systems engineers and sysadmins.

Obviously this is going to vary greatly depending on what business you are in, but you get the idea.

I think in general we tend to do a pretty good job with 1 and 3. Service level dashboards for non-technical internal clients just need to provide basic information: is the service up or down, and, to the best of our monitoring ability, how well are we meeting the SLA?

The out-of-box Service Level Dashboard in SCOM 2012 does this quite effectively.

I say “to the best of our ability” above because even with synthetic transactions there is always the possibility that a complex service can be degraded or down in some respect without your monitors picking up on it. (Exchange servers are up and running perfectly, but your BES server for Blackberries is down.) Or, alternatively, your monitoring picks up a problem but isn’t smart enough to correlate it into a change in the dashboard. At best, service monitoring is an evolutionary process, not one that you set up and leave alone. IT managers may not want to hear it, but ultimately your ability to track a service depends on the accuracy of your monitors, and building accurate monitors requires iteration and time.

Dashboards for engineers and sysadmins are often built with very specific requirements in mind, or are redundant and aren’t needed, so they tend to not be a problem either.

Where I see the most potential for people to get into trouble is in creating dashboards for their Tier I support and for senior IT management. The easy answer is to just have them use the simple up/down service level dashboard. The problem is that while this is a perfectly acceptable level of transparency to provide to non-IT customers, it often isn’t enough info, especially for the occasional situation when your up/down dashboard says everything is fine and users are calling in with complaints.

Below is an example of a dashboard I would create for an e-mail or messaging service for Tier I operators and upper-level IT management that seeks to find the middle ground:

– In the upper left you have a state widget. It is pegged to a group which contains all servers related to the e-mail service. It should be made up of more than just Exchange servers. Mine contains BES and ISA servers to provide a more complete picture of the health of all related parts. Some would say to build a simple distributed app to do this, but that starts to get troublesome when dealing with load-balanced systems, or systems where a negative status of one system doesn’t need to roll up to the status of the entire app.

– Upper middle is a Service Level widget which is tied to the Exchange 2010 Application from the Exchange 2010 MP. It’s not perfect, but it does a decent job of generally showing when core e-mail functionality is up or down.

– Upper right: an alerts widget which looks at anything related to the health of the servers in the group on the left.

– Middle: graph of Outlook latency. Honestly, it is unlikely that Tier I is going to gain useful info from this graph. You can see noticeable shifts if one member of a load-balanced or clustered pair is down (I have), but this falls into the category of “behold the power of pretty graphs.” Sometimes it’s nice for your Tier I and upper IT management to feel empowered, and for whatever reason I have found that pretty graphs can do that even if they don’t know exactly what they are looking at.

– Bottom: again, empowerment via pretty graphs.

The contents of this site are provided “AS IS” with no warranties, or rights conferred. Example code could harm your environment, and is not intended for production use. Content represents point in time snapshots of information and may no longer be accurate. (I work @ MSFT. Thoughts and opinions are my own.)