Tag Archives: SCOM 2012

How do I: Generate a single report of all healthy agents + grey agents + timestamp of last recorded heartbeat?

This week is a training week, which means I have tiny windows of time to catch up on some blogging.

I have had this question a few times over the years. It seems like it should have a straightforward answer, but if there is one, I have not been able to find it.

When customers have asked this in the past I usually refer them to the following three posts:

https://blogs.msdn.microsoft.com/mariussutara/2008/07/24/last-contacted/

http://www.systemcentercentral.com/quicktricks-last-agent-heartbeat/

http://blog.scomskills.com/grey-agents-with-reason-gray-agents/

These do an excellent job in different ways of getting at the question of what agents are greyed out and when did heartbeats stop coming in.

Unfortunately, these do nothing to address the first part of the question: the customer wants all agents, both those that have stopped heartbeating and those that haven’t.

This is a little bit more tricky. It is easy enough to get a list of all agents, a list of grey agents, and to query for when health service heartbeat failures occur. But there is nothing easily accessible via the SDK or via the DW (at least that I am aware of) that allows us to capture a timestamp for when a non-grey agent’s last heartbeat came in.

So my natural question to my customer was: why do you need the healthy agents’ heartbeat timestamp? The answer was basically that they want to feed that data into other systems in their org and they don’t want to deal with two different lists/files. They want one file, but at the end of the day they don’t actually need an exact timestamp for the last heartbeat of a healthy agent.

This makes things a lot easier and lends itself to a relatively simple potential solution:

Import-Module OperationsManager

# Get every agent object in the management group
$Agent = Get-SCOMClass -Name "Microsoft.SystemCenter.Agent"
$MonitoringObjects = Get-SCOMMonitoringObject -Class $Agent

# Current date/time, used as the stand-in heartbeat value for healthy agents
$Date = Get-Date
$DateSString = $Date.ToShortDateString()
$TimeLString = $Date.ToLongTimeString()
$DateTimeCombine = $DateSString + " " + $TimeLString
$UserDesktop = [Environment]::GetFolderPath("Desktop")

function GenerateAgentReport
{
    foreach ($object in $MonitoringObjects)
    {
        $result = New-Object -TypeName PSObject
        $result | Add-Member -MemberType NoteProperty -Name DisplayName -Value $object.DisplayName
        $result | Add-Member -MemberType NoteProperty -Name Agent_Healthy -Value $object.IsAvailable
        if ($object.IsAvailable)
        {
            # Healthy agent: report the current date/time
            $result | Add-Member -MemberType NoteProperty -Name LastHeartbeat -Value $DateTimeCombine -PassThru
        }
        else
        {
            # Grey agent: report when its availability last changed
            $result | Add-Member -MemberType NoteProperty -Name LastHeartbeat -Value $object.AvailabilityLastModified -PassThru
        }
    }
}

GenerateAgentReport | Out-GridView


Basically this returns each agent in your management group. If the Agent is greyed out we use the AvailabilityLastModified property to pull an approximate timestamp. If the agent is still heartbeating as determined by the IsAvailable property then the AvailabilityLastModified property isn’t going to contain useful information, so in this case we substitute the current date/time for that field indicating that we have had a successful heartbeat within the past 5 minutes.

I said “approximate timestamp” when referring to agents with an IsAvailable value of false (greyed out agents) because, while in many cases AvailabilityLastModified should correspond to the moment a heartbeat failure flips the agent from healthy to critical, that is not guaranteed. If for some reason the agent was already in a critical state but was still heartbeating, the AvailabilityLastModified property would only capture when the agent went into the critical state, not the moment of last heartbeat. If you need a more or less exact moment of last heartbeat, I suggest using one of the links above. But if you need a quick PowerShell report to feed into other systems to help prioritize agent remediation, the above script or some modified form of it might be mildly useful.
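Since the customer wanted one file rather than a grid view, here is a minimal sketch of the same logic written as a pipeline function that can feed Export-Csv. The function name ConvertTo-HeartbeatRow and the output file name are hypothetical, and the [pscustomobject] syntax assumes PowerShell 3.0 or later.

```powershell
# Hypothetical helper: turns one agent object into a flat report row.
function ConvertTo-HeartbeatRow {
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)] $Agent,
        [datetime] $ReportTime = (Get-Date)
    )
    process {
        # Healthy agents get the report time; grey agents get the time their
        # availability last changed (only an approximation of the last heartbeat).
        if ($Agent.IsAvailable) {
            $last = $ReportTime
        } else {
            $last = $Agent.AvailabilityLastModified
        }
        [pscustomobject]@{
            DisplayName   = $Agent.DisplayName
            Agent_Healthy = [bool]$Agent.IsAvailable
            LastHeartbeat = $last
        }
    }
}
```

With the $MonitoringObjects and $UserDesktop variables from the script above, something like $MonitoringObjects | ConvertTo-HeartbeatRow | Export-Csv -Path (Join-Path $UserDesktop 'AgentHeartbeat.csv') -NoTypeInformation should produce a single file for the downstream systems.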


How do I: Create an Event View that excludes a particular Event ID

I had a large enterprise customer recently who was monitoring ADFS with the default management pack. They liked being able to glance at the event view which gave them a single place where they could look at the ADFS events occurring across their environment. They were using this event data as part of their correlation and tuning process to determine if there were additional actionable events that were being missed for their unique infrastructure. The eventual goal being to stop collecting the events altogether and only have alert generating rules/monitors in place for patterns of events that they cared about.


They quickly found that at least for their environment some of the events being collected were essentially noise, and they asked how to adjust the view so it would exclude one particular event.

This is one of those “sounds really easy, and of course the product should do this out of the box” questions that SCOM has never really had a great answer for.

If we take a look at the view it is populated by the following criteria:


And if we dig into the corresponding rule that collects the events we find a wildcard regex-style collection rule targeted at the ADFS log:


Since the collection rule is part of a sealed MP the best we could do at the rule level is to shut off this collection rule, and create a new collection rule with a modified wildcard expression such that it would collect everything the old rule did with the exception of the event ID the customer doesn’t like.

The problem with this solution is it isn’t particularly efficient/self-service friendly. If next week the customer realizes there is an additional event they want excluded the AD team has to contact the SCOM team and request further modifications.

In an ideal world the exclusion would be possible at the View level, but if you ever dig into modifying the classic OpsMgr views you will find that while you can use WildCards for some fields like Event Source to perform exclusions:


The same is not true for event IDs, where wildcard exclusions are not allowed:


I briefly toyed with the idea of making modifications to the MP at the XML level to allow exclusions as I have occasionally done in the past to hack a subscription into meeting a customer need, but in this case such a solution doesn’t really fit. The customer needed something that was easy for them to change as they gradually winnow down the list of events they see to only the ones they care about.

They needed something that was extremely easy to edit.

Enter PowerShell and the SCOM SDK.

The first solution I put together for them to test was the following:

PowerShell Grid Widget


with a Where-Object {$_.Number -ne 31552 -and $_.PublisherName -eq "Health Service Modules"} filter. I used a SCOM publisher name since I didn’t have any ADFS events in my test environment and I wanted something I could use to confirm that the exclusion was working as expected:


Everything looked good: the event I wanted excluded was dealt with properly. (The Description dataObject is commented out in the code for this screenshot to make it easier to view. With Description uncommented, each event takes up more lines of screen real estate. I recommend creating two views, one with Description commented out and one where it is uncommented, so customers can easily toggle between views.)


And if we remove the $_.Number -ne 31552 condition I get results as below, with the event present:


In theory this should be all we needed, but when my customer tested out the script nothing happened. After a little bit of head scratching it became apparent what the problem was.

We were calling Get-SCOMEvent | Where-Object

which means we were telling the OpsMgr SDK to please go retrieve every single event in the OpsDB, and then once you are done with that we are going to pipe the results to a Where-Object and tell you what we really need.

In my relatively small test environment this wasn’t that big of an ask and the results returned quickly.

In my customer’s environment with thousands of servers and friendly event generating MP’s like the Exchange 2010 MP, getting every event in the OpsDB was basically a great way to enter an endless loop of dashboard timeouts with nothing ever being displayed.

So we needed to filter things down a bit up front, before piping to the Where-Object.

If you search the blogs you will find that Stefan Stranger has a nice post describing how to deal with this issue when calling the Get-SCOMAlert cmdlet with a Where-Object. Basically you use Get-SCOMAlert -criteria and then pipe to a Where-Object if still needed.

Unfortunately, Get-SCOMEvent doesn’t have a -criteria parameter because that would make things too easy and intuitive.

It does, however, have a -rule parameter which looked promising:


First I tried passing it a rule Name, followed by a second try with a rule GUID for an event collection rule I was interested in. In both cases I got a nice red error message:


While a little cryptic, it is saying that I am passing a parameter of the type string, and it wants a special SCOM-specific rule type.

To give it what it wants we need to first retrieve the rule object using the Get-SCOMRule cmdlet and then pass it to Get-SCOMEvent as a variable:

$rule = Get-SCOMRule -DisplayName "Operations Manager Data Access Service Event Collector Rule"

Get-SCOMEvent -Rule $rule


So our final script would look something like this. (I have added some additional filtering in case you just want events from the past hour. *Keep in mind this date/time filtering doesn’t increase the efficiency of the script since it happens in the Where-Object, after the events have already been retrieved; the only thing making this script more efficient is that we are first only pulling back events collected by a specific rule.*)

$rule = Get-SCOMRule -DisplayName "Operations Manager Data Access Service Event Collector Rule"

$DateNow = Get-Date

#Modify the .AddMinutes below to determine how far back to pull events
$DateAgo = $DateNow.AddMinutes(-60)

#$_.Number -ne (not equals) is used to indicate the event number that you want to exclude from the view
$eventView = Get-SCOMEvent -Rule $rule | Where-Object {$_.Number -ne 17 -and $_.TimeGenerated -ge $DateAgo -and $_.TimeGenerated -le $DateNow} | Select-Object Id, MonitoringObjectDisplayName, Number, TimeGenerated, PublisherName, Description | Sort-Object TimeGenerated -Descending

foreach ($object in $eventView){
     $dataObject = $ScriptContext.CreateInstance("xsd://OpsConfig!sample/dashboard")
     $dataObject["Id"] = [String]($object.Id)
     $dataObject["Event Number"] = [Int]($object.Number)
     $dataObject["Source"] = [String]($object.MonitoringObjectDisplayName)
     $dataObject["Time Created"] = [String]($object.TimeGenerated)
     $dataObject["Event Source"] = [String]($object.PublisherName)
     $dataObject["Description"] = [String]($object.Description)
     $ScriptContext.ReturnCollection.Add($dataObject)
}

And then the ADFS code would look like this, though event 17 was not the event they wanted to exclude:

$rule = Get-SCOMRule -DisplayName "Federation server events collection"

$DateNow = Get-Date

#Modify the .AddMinutes below to determine how far back to pull events
$DateAgo = $DateNow.AddMinutes(-60)

#$_.Number -ne (not equals) is used to indicate the event number that you want to exclude from the view
$eventView = Get-SCOMEvent -Rule $rule | Where-Object {$_.Number -ne 17 -and $_.TimeGenerated -ge $DateAgo -and $_.TimeGenerated -le $DateNow} | Select-Object Id, MonitoringObjectDisplayName, Number, TimeGenerated, PublisherName, Description | Sort-Object TimeGenerated -Descending

foreach ($object in $eventView){
     $dataObject = $ScriptContext.CreateInstance("xsd://OpsConfig!sample/dashboard")
     $dataObject["Id"] = [String]($object.Id)
     $dataObject["Event Number"] = [Int]($object.Number)
     $dataObject["Source"] = [String]($object.MonitoringObjectDisplayName)
     $dataObject["Time Created"] = [String]($object.TimeGenerated)
     $dataObject["Event Source"] = [String]($object.PublisherName)
     $dataObject["Description"] = [String]($object.Description)
     $ScriptContext.ReturnCollection.Add($dataObject)
}
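Because the customer expected the list of noisy events to grow over time, it may be worth keeping the excluded IDs in one editable array rather than chaining more -ne conditions. This is only a sketch: Select-RecentEvent is a hypothetical helper name, and the event numbers in the usage note are placeholders.

```powershell
# Hypothetical helper: drops events whose Number is in $ExcludedIds and keeps
# only events generated within the last $Minutes minutes.
function Select-RecentEvent {
    param(
        [Parameter(ValueFromPipeline = $true)] $InputObject,
        [int[]] $ExcludedIds = @(),
        [int] $Minutes = 60
    )
    begin {
        # Oldest TimeGenerated value that is still kept
        $cutoff = (Get-Date).AddMinutes(-1 * $Minutes)
    }
    process {
        if (($ExcludedIds -notcontains $InputObject.Number) -and ($InputObject.TimeGenerated -ge $cutoff)) {
            $InputObject
        }
    }
}
```

In the widget script this would replace the Where-Object, for example Get-SCOMEvent -Rule $rule | Select-RecentEvent -ExcludedIds 17, 342. As with the original filter, this runs after the events are retrieved, so it makes the view easier to edit but not faster.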

Hopefully this helps save a little bit of time for anyone else who comes across a question like this one.


How do I: Add Exclusion Criteria to SCOM Notification Subscriptions

By default, all filtering criteria in SCOM notification subscriptions are specific to inclusion. There is no native ability via the GUI to indicate that the subscription should pick up every alert matching certain criteria with the exception of a specific subset of alerts. The only way to accomplish this via the GUI is for your inclusion criteria to specifically enumerate every other alert instance, leaving out the alerts you want to exclude.


An example where this would cause problems is if you want to have one notification subscription that notifies for every Alert of a Severity of Critical with the exception of alerts from a specific monitor.


The first part is easy with the above config, but the second part (excluding one specific monitor alert) is not possible. This type of scenario becomes important when you want to have two subscriptions:

 

  1. One that sends all critical alerts immediately with the exception of one specific monitor alert that has a recovery.
  2. And a second subscription that is on a 5 minute delay, gives the monitor a chance to recover, and only sends an alert if, after the recovery has run and health has been recalculated, the monitor still shows an unhealthy condition.

So to accomplish this we need to do a little custom work at the XML level.

First please note that according to the following TechNet article

https://technet.microsoft.com/en-us/library/hh212805.aspx


This is of course in reference to what you can and cannot do at the GUI level, but keep in mind that what you are doing is not officially supported and that you need to test carefully, because it is very easy to accidentally break your subscription. (Also note that future changes made in the GUI will blow away your manual XML changes, so if you need to tweak the subscription in the GUI at a later date you will have to run through the process below again to re-establish the exclusion.)

So to Add an Exclusion to a Notification Subscription

Administration Pane


Management Packs


Export Notifications Internal Library MP


Make a backup copy of this MP XML for safekeeping

Open the non-backup exported XML in your text editor of choice. I am using Visual Studio, but anything including Notepad will work.

The Channel Subscription and Subscriber info is all defined within this pack. You need to find the section of XML that corresponds to the subscription you are interested in.

In the case of my environment the Subscription is called Test Subscription:


If I scroll to the end of the Management Pack XML I will hit the <DisplayStrings></DisplayStrings> section, where I can find the corresponding ID that will allow me to find my subscription. (If you only have a few subscriptions you may be able to figure this out without the ID, but just to be safe it can be helpful to make sure you are editing the right subscription.)

I find my Test Subscription and see that it has a unique Element ID of: Subscription7adf1953_5ea7_4f20_85c9_67271662212a


If I then search the XML for references to this Element ID I will find the relevant portion of XML that we are going to want to edit.


 

The important part that we will need to modify is contained within the <AlertChangedSubscription></AlertChangedSubscription>

 

In the case of this particular notification subscription we will change:

 

<AlertChangedSubscription Property="Any">

<Criteria>

<Expression>

<SimpleExpression xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<ValueExpression>

<Property>Severity</Property>

</ValueExpression>

<Operator>Equal</Operator>

<ValueExpression>

<Value>2</Value>

</ValueExpression>

</SimpleExpression>

</Expression>

</Criteria>

<ExpirationStartTime>12/11/2015 22:12:44</ExpirationStartTime>

<PollingIntervalMinutes>1</PollingIntervalMinutes>

<UserSid>S-1-5-21-2573163049-3319608367-1007842708-1106</UserSid>

<LanguageCode>ENU</LanguageCode>

<ExcludeNonNullConnectorIds>false</ExcludeNonNullConnectorIds>

<RuleId>$MPElement$</RuleId>

<TargetBaseManagedEntityId>$Target/Id$</TargetBaseManagedEntityId>

<TimeZone>E001000000000000C4FFFFFF00000B0000000100020000000000000000000300000002000200000000000000|Pacific Standard Time</TimeZone>

</AlertChangedSubscription>

To:

<AlertChangedSubscription Property="Any">

<Criteria>

<Expression>

<And xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

<Expression>

<SimpleExpression>

<ValueExpression>

<Property>ProblemId</Property>

</ValueExpression>

<Operator>NotEqual</Operator>

<ValueExpression>

<Value>b59f78ce-c42a-8995-f099-e705dbb34fd4</Value>

</ValueExpression>

</SimpleExpression>

</Expression>

<Expression>

<SimpleExpression>

<ValueExpression>

<Property>Severity</Property>

</ValueExpression>

<Operator>Equal</Operator>

<ValueExpression>

<Value>2</Value>

</ValueExpression>

</SimpleExpression>

</Expression>

</And>

</Expression>

</Criteria>

<ExpirationStartTime>12/11/2015 19:50:38</ExpirationStartTime>

<PollingIntervalMinutes>1</PollingIntervalMinutes>

<UserSid>S-1-5-21-2573163049-3319608367-1007842708-1106</UserSid>

<LanguageCode>ENU</LanguageCode>

<ExcludeNonNullConnectorIds>false</ExcludeNonNullConnectorIds>

<RuleId>$MPElement$</RuleId>

<TargetBaseManagedEntityId>$Target/Id$</TargetBaseManagedEntityId>

<TimeZone>E001000000000000C4FFFFFF00000B0000000100020000000000000000000300000002000200000000000000|Pacific Standard Time</TimeZone>

</AlertChangedSubscription>

This is going to vary depending on the complexity of your existing subscription. You have to be careful to take into account existing <And> and <Or> tags when present.

<Value>b59f78ce-c42a-8995-f099-e705dbb34fd4</Value> Needs to be set to the appropriate ID for the alert you want to exclude.

For my example I am using the HealthService Heartbeat failure alert from my environment.

To determine the ID that is associated with a specific rule/monitor-generated alert that is currently present in the console you can use the following PowerShell from a Management server. Keep in mind if testing between a test and prod environment that ID values on custom monitors may be different. Run the PowerShell in both environments to be sure before implementing in prod.


For my test I will exclude any Health Service Heartbeat failure alerts which have the following ID:


If you don’t have an alert in the console to find the ID you could use the following query which will give you the ID of every Monitor in SCOM:

Get-SCOMMonitor | Select-Object @{Name="MP";Expression={ foreach-object {$_.GetSCOMManagementPack().DisplayName }}},DisplayName, Priority, Enabled, Id | Out-GridView

 


You can use the Add criteria button to filter things down further:


Once the modifications to the management pack are complete you can reimport the newly updated management pack.

*WARNING* Keep in mind that this UNSEALED MP will replace the existing MP on import, so if there is an error in your code you could potentially break all subscriptions in your environment. This is why having a backup copy is extremely important. It is also why you need to test this procedure in a test environment before trying it in prod. *WARNING* Again, keep in mind that future changes to this notification subscription via the GUI will break your exclusion criteria and require you to manually modify the subscription again.
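One cheap safeguard before importing is to confirm that the edited file still parses as XML. The sketch below uses the .NET XmlDocument class; the function name Test-XmlWellFormed is hypothetical. It only checks well-formedness (unbalanced tags, or curly quotes pasted from a web page around attribute values), not whether the criteria are valid for SCOM, so it does not replace testing in a test environment.

```powershell
# Hypothetical helper: returns $true when the text parses as well-formed XML.
function Test-XmlWellFormed {
    param([Parameter(Mandatory = $true)][string] $XmlText)
    try {
        $doc = New-Object System.Xml.XmlDocument
        $doc.LoadXml($XmlText)   # throws on malformed XML
        return $true
    } catch {
        Write-Warning "XML did not parse: $($_.Exception.Message)"
        return $false
    }
}
```

A call such as Test-XmlWellFormed -XmlText (Get-Content -Raw 'C:\Temp\EditedNotifications.xml') would be run against the edited export; the path here is a placeholder and -Raw assumes PowerShell 3.0 or later.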


Click Install


Once imported, generate one of the alerts that corresponds to the excluded ID to see if it is properly excluded from the notification. Also generate test alerts that should be picked up by the subscription to confirm they are still being sent and that the subscription is not broken. Also watch the console for any alerts specific to notification subscriptions. If there are any errors in your syntax it can create a situation where you break all notifications.

Now when I have critical alerts, the Health Service Heartbeat failures will be excluded from my subscription, but all other alerts with a severity of critical, including those from monitors that are created in the future, will get picked up:


Again, keep in mind the one caveat to modifying subscription XML in this way is that you lose the ability to safely edit that subscription via the GUI in the future. If I modify the subscription further in the GUI after making manual XML changes, it will blow away the exclusion/NotEqual XML that was added. If you need to edit the subscription via the GUI, just remember you need to go through the process above again to manually edit the XML.

 


Troubleshooting: SCOM DW Database is in a Suspect State


Comic Credit: Abstruse Goose

So we had some severe thunderstorms roll through this past week and it took out the power at the house. This in turn took out the power to my test servers. I generally have my servers plugged into a UPS so I could gracefully shut them down during a power outage, but I was a bit lax in my unpacking after the recent move and my office is still a work in progress so when the power went out my SCOM environment didn’t exactly take it well. {Insert ad for Azure here ;o)} Once the power came back on a day later and I booted everything up I was greeted by the messages below:


When I checked SQL Management Studio I found that DW was in a suspect state:


For a quick primer on the various states a database can be in check the chart below:

State: Definition
ONLINE: Database is available for access. The primary filegroup is online, although the undo phase of recovery may not have been completed.
OFFLINE: Database is unavailable. A database becomes offline by explicit user action and remains offline until additional user action is taken. For example, the database may be taken offline in order to move a file to a new disk. The database is then brought back online after the move has been completed.
RESTORING: One or more files of the primary filegroup are being restored, or one or more secondary files are being restored offline. The database is unavailable.
RECOVERING: Database is being recovered. The recovering process is a transient state; the database will automatically become online if the recovery succeeds. If the recovery fails, the database will become suspect. The database is unavailable.
RECOVERY PENDING: SQL Server has encountered a resource-related error during recovery. The database is not damaged, but files may be missing or system resource limitations may be preventing it from starting. The database is unavailable. Additional action by the user is required to resolve the error and let the recovery process be completed.
SUSPECT: At least the primary filegroup is suspect and may be damaged. The database cannot be recovered during startup of SQL Server. The database is unavailable. Additional action by the user is required to resolve the problem.
EMERGENCY: User has changed the database and set the status to EMERGENCY. The database is in single-user mode and may be repaired or restored. The database is marked READ_ONLY, logging is disabled, and access is limited to members of the sysadmin fixed server role. EMERGENCY is primarily used for troubleshooting purposes. For example, a database marked as suspect can be set to the EMERGENCY state. This could permit the system administrator read-only access to the database. Only members of the sysadmin fixed server role can set a database to the EMERGENCY state.

So suffice to say, Suspect is not a good state for your database to be in, particularly if you know it is the result of a loss of power. If you have a nice well oiled maintenance/DR plan this would typically be where you break out the backup files and restore the database to the most recent corruption free backup.

But sometimes for whatever reason you aren’t going to have a backup. Maybe your maintenance plan failed, you go to retrieve a backup and realize it too is corrupt, you forgot to set up backups for your test environment, etc. Generally if this were a production system and you didn’t have a backup I would consider it an RGE, or Resume Generating Event. RGEs should be avoided at all costs. (If this happens to a production system I would highly advise opening a support case with Microsoft so that you can get assistance from engineers who are experts in SQL.)

However, as this is just my test environment, and I spin up new test environments on a fairly frequent basis, I am going to show you another way to deal with Suspect databases. It isn’t pretty or recommended, it’s irreversible, and it will almost always result in data loss. But if you are in a pinch with a test environment without a viable backup and need to get a Suspect database back online you can use the following procedure:

First we need to put the Database in an Emergency state


SQL Management Studio sometimes takes a little bit to refresh and show the updated state, so I usually just query the state of the database directly to confirm that everything worked:


From here we need to set the database to single-user mode, and we also need to stop any of the management servers from trying to connect to the database. To do this I just stop the SCOM-related services on the management servers. I then run a DBCC CHECKDB (note the REPAIR_ALLOW_DATA_LOSS option; this is the not-so-pretty, irreversible part I was discussing earlier). *I would also advise splitting out the SET SINGLE_USER statement so that you can confirm it worked before kicking off the CHECKDB. I once waited a very long time thinking I was repairing the database when in reality I was stuck on the SET SINGLE_USER command and the CHECKDB hadn’t begun yet.*


This can take a while, so don’t be surprised if you have to wait; it might be a good time for a coffee break. Since you have the WITH ALL_ERRORMSGS option you aren’t going to miss anything important:


You will then get a nice series of messages that will look something like:


If we look at database status in the Object Explorer it still shows the DB as being in an emergency state, but if we query sys.databases we can see it is now online:


Hit refresh and you should now see that the DW is back online but it is still in Single User Mode


Run one more ALTER DATABASE and you will be all set. (Also restart the services on your Management Servers so they can reconnect to the DB)

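For reference, the sequence described above can be sketched as a small PowerShell function that only builds the T-SQL statements, in order, so each can be run and confirmed on its own. The function name is hypothetical, OperationsManagerDW is assumed as the data warehouse name (the usual default; substitute your own), and WITH ROLLBACK IMMEDIATE is my addition to keep SET SINGLE_USER from waiting on open connections. The same warnings apply: this is a last resort for test environments.

```powershell
# Hypothetical helper: returns the last-resort repair statements as strings.
# It does not connect to SQL Server or execute anything.
function Get-SuspectRepairStatement {
    param([string] $Database = 'OperationsManagerDW')   # assumed DW name

    $quoted  = '[' + $Database.Replace(']', ']]') + ']'  # bracket-quoted identifier
    $literal = $Database.Replace("'", "''")              # escaped string literal

    @(
        "ALTER DATABASE $quoted SET EMERGENCY;"
        "SELECT name, state_desc FROM sys.databases WHERE name = N'$literal';"
        "ALTER DATABASE $quoted SET SINGLE_USER WITH ROLLBACK IMMEDIATE;"
        "DBCC CHECKDB (N'$literal', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS;"
        "ALTER DATABASE $quoted SET MULTI_USER;"
    )
}
```

Running Get-SuspectRepairStatement prints the five statements; pasting them into a query window one at a time keeps the SET SINGLE_USER step separate from the CHECKDB, as advised above.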

For more info on this process I recommend checking out Paul Randal’s post. He wrote a lot of the code behind CHECKDB back when he worked at MSFT, and he articulates far better than I can why the above method should only be used as a last resort.


Troubleshooting: SCOM Web Console 500 – Internal Server Error

This is a problem that I occasionally see crop up in customer environments, but until now I had never bothered to document the issue.

Symptoms:

Customer is able to log into the web console successfully but when they click on certain views like the active alerts view they see the following error:


Anytime there is an issue with the web console my first suggestion is to attempt to recreate the problem as localhost on the machine that hosts the web console. If the server has the same problem locally you know where to start troubleshooting; if it is only happening on client machines it could be a firewall or some other problem at the client level.

When we tested from the SCOM web console server we got the same error, but this time with a little bit more information:


This may not seem useful, but it is actually telling us exactly where the problem is. Note the Version=2.0.0.0 in regards to the .NET Framework. This is telling us which version of the Framework is expected for this particular app pool.

If we navigate to IIS we see the following:


OperationsManagerMonitoringView is set to v4.0, but it should be set to v2.0. We need to edit the Basic Settings and select v2.0 from the drop-down.


After that we need to recycle the Application Pool via the Application Pool task


Then just log out of the web console and back in and all will be well.


Troubleshooting: The installed version of SQL Server is Not Supported (SCOM 2012)

A while back I rebuilt one of my test environments. Post rebuild something very strange happened: I could not for the life of me get SCOM reporting to install. All the initial pre-req checks would pass and everything else would install just fine, but I would keep hitting this error.

If you mouse over the little Red X you would get the following:

If you consult the install log files in %userprofile%\AppData\Local\SCOM\LOGS you will find:

Searching for the error online returns a number of posts which while well meaning offer solutions which are unfortunately ultimately not very helpful.

I then spun up a brand new all in one test environment just to try to narrow things down and found that once again the error was present even though the installed version of SQL was a supported version.

After more troubleshooting than I would like to admit this left me with one option: there was something wrong with the SQL media I was using. At first glance it looked just like any other SQL media I have downloaded from MSDN:

But then I looked at the entire name of the media file:

Somehow in a moment of test environment building delirium I had downloaded an x86 copy of SQL 2012 Enterprise, and apparently one of the little known side effects of accidentally installing 32-bit SQL on a 64-bit Operating System is that you will get an SRS Couldn’t Check Version Exception, but everything else will install and work just fine.

I have come across a few instances of other people reporting this problem on the forums, but never actually arriving at a solution. Hopefully this post will be of some use. Once 64-bit SQL was installed on 64-bit Windows Server 2012 everything installed fine, as it always had in the past.
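If you suspect the same mistake, the banner returned by SELECT @@VERSION on the instance usually reveals it. The sketch below assumes the common banner layout, where the build number is followed by (X64) for 64-bit SQL or (Intel X86) for 32-bit SQL; the function name is hypothetical and anything else is reported as unknown rather than guessed.

```powershell
# Hypothetical helper: classifies a SQL Server @@VERSION banner by architecture.
function Get-SqlArchitecture {
    param([Parameter(Mandatory = $true)][string] $VersionString)

    # The SQL Server architecture appears in parentheses after the build number.
    # The operating system architecture appears later in angle brackets and is
    # deliberately not matched here.
    if ($VersionString -match '\(X64\)')        { return 'x64' }
    if ($VersionString -match '\(Intel X86\)')  { return 'x86' }
    return 'unknown'
}
```

A result of x86 on a 64-bit operating system would point to the same media mix-up described above.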

 


Troubleshooting: SCOM Agent Healthy, but availability report for server shows monitoring unavailable

This was definitely an odd one. I noticed that one of our systems was showing as having a healthy SCOM Agent, yet if you ran an availability report against the Windows computer object it would show monitoring as being unavailable. After confirming that the data warehouse was not running behind I found that this was actually happening with more than one of our servers.

Running an availability report would look as follows:


Brody Kilpatrick has a nice post on his blog explaining one of the possible causes and solutions, which involves running some unsupported scripts against the data warehouse. I highly recommend reading his post, and all credit for this solution must go to him. With that said, I found that the SQL queries he posted have issues that caused them to fail, at least in my environment. (Brody responded that he is updating the queries, so it is likely that by the time you read this they will be fixed.) There were also some slight discrepancies between the results of his queries and my results, so I opted to use his work as a template but to modify things ever so slightly so that it would actually work in my environment, which is running OpsMgr 2012 SP1 with the data warehouse on a dedicated Server 2008 R2 box running SQL 2008 R2.

First, on your data warehouse server, you are going to want to run the following query:

02

If nothing is returned, that is fantastic, and you aren’t experiencing the problem this post will solve. If you do get results they will look something like this:

3

 

An EndDateTime of NULL is not necessarily indicative of a problem. In some cases it was just a server that had been shut down for a period of time but had not been removed from SCOM. However, some of these NULLs were for the servers that were showing healthy SCOM agents with availability reporting showing monitoring unavailable.
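The query above was posted as a screenshot. As a rough sketch of the kind of check described here (not necessarily the exact query from the screenshot), open outage rows can be listed from PowerShell. This assumes the default data warehouse database name OperationsManagerDW and that Invoke-Sqlcmd is available from the SqlServer or SQLPS module; the function name is made up for illustration.

```powershell
# Sketch only: list outage rows that were never closed (EndDateTime is NULL).
# Table and column names are the ones mentioned in this post.
$OpenOutageQuery = @'
SELECT HealthServiceOutageRowId, StartDateTime, EndDateTime
FROM dbo.HealthServiceOutage
WHERE EndDateTime IS NULL
ORDER BY StartDateTime;
'@

# Hypothetical helper name; not part of the OperationsManager module.
function Get-OpenHealthServiceOutage {
    param(
        [Parameter(Mandatory = $true)] [string] $ServerInstance,  # data warehouse SQL instance
        [string] $Database = 'OperationsManagerDW'                # assumed default DW name
    )
    # Read-only query; Invoke-Sqlcmd comes from the SqlServer / SQLPS module
    Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $OpenOutageQuery
}
```

Calling it would look something like Get-OpenHealthServiceOutage -ServerInstance 'DWSERVER01' | Format-Table -AutoSize, where DWSERVER01 is a placeholder for your own data warehouse server.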

As useful as HealthServiceOutageRowId is, it can be helpful to actually know the name of the associated system. Run the following query to join in Name and DisplayName:

04

Your results should look like this with the right-most DisplayName column providing the FQDN of the affected system:

05
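Since the join query was also a screenshot, here is a minimal sketch of what such a join might look like. It assumes the data warehouse exposes a dbo.vManagedEntity view with ManagedEntityRowId, Name, and DisplayName columns, and that HealthServiceOutage carries a matching ManagedEntityRowId; treat those names, the database name, and the helper function name as assumptions to verify against your own data warehouse.

```powershell
# Sketch only: open outages joined to the entity name columns.
# vManagedEntity and ManagedEntityRowId are assumptions about the DW schema.
$OutageWithNameQuery = @'
SELECT hso.HealthServiceOutageRowId,
       hso.StartDateTime,
       hso.EndDateTime,
       me.Name,
       me.DisplayName
FROM dbo.HealthServiceOutage AS hso
JOIN dbo.vManagedEntity AS me
  ON me.ManagedEntityRowId = hso.ManagedEntityRowId
WHERE hso.EndDateTime IS NULL
ORDER BY me.DisplayName;
'@

# Hypothetical helper name; not part of the OperationsManager module.
function Get-OpenOutageWithName {
    param(
        [Parameter(Mandatory = $true)] [string] $ServerInstance,  # data warehouse SQL instance
        [string] $Database = 'OperationsManagerDW'                # assumed default DW name
    )
    # Read-only query; nothing in the data warehouse is modified
    Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database -Query $OutageWithNameQuery
}
```

Piping the result to Format-Table -AutoSize keeps DisplayName as the right-most column, which makes the FQDN easy to spot.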

At this point Brody’s post recommends confirming that the systems are all experiencing the problem, backing up your data warehouse, and, at your own risk, modifying the values of the EndDateTime column via custom SQL. I tend to be a little risk averse, at least in my production environments, so the first thing I tried once I had narrowed down the issue was to uninstall the SCOM agent from one of the misbehaving systems and then immediately reinstall it. For that system this resolved the issue immediately, with proper availability monitoring returning after the reinstall:

06

However, one of my affected servers was a domain controller which had a manually installed agent. I had no way of uninstalling and reinstalling the agent without bugging our domain administrator.

So for this case I backed up the data warehouse and then did the following. (Again, you could do this via raw SQL, but sometimes I think it is easier to have a clear understanding of what you are doing to a database rather than just copying some code someone else wrote.)

Please keep in mind this solution is not supported by Microsoft:

Right click the dbo.HealthServiceOutage table:

07

Select Edit Top 200 Rows:

08

In the right-hand properties box, hit the + sign next to Top Specification and increase the Expression value so that it includes the HealthServiceOutageRowID of the system you want to fix:

09

At the bottom of your query you will see query changed, right click and select Execute SQL:

10

Scroll down to the HealthServiceOutageRowID which matches your affected server. The EndDateTime should show Null. Copy the value from the StartDateTime, and paste it into the box for the EndDateTime and close out of the editor.

11

And then for good measure run this query again to confirm that your modification worked; the server should no longer be returned:

 

04

So, two fixes for this issue:

Recommended fix: Reinstall the SCOM agent.

Optional, unsupported fix (back up your data warehouse first): Modify the EndDateTime value from NULL to match the StartDateTime, either via a Management Studio edit or via a SQL query.

Just to reiterate, if you opt to use this post as a solution, read Brody’s post as well. He found the solution and presents a much deeper understanding of how availability is actually calculated, and the extra info is extremely useful. His method of fixing this via SQL rather than a manual edit via Management Studio is also far more scalable if you happen to have this problem on more than a handful of servers.
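For reference, the manual edit described above boils down to a one-row update. The sketch below is my own illustration, not Brody’s script: it only builds the SQL text for a single HealthServiceOutageRowId so you can review it before running it yourself. The function name is hypothetical, and the same caveats apply: this is not supported by Microsoft, and you should back up the data warehouse first.

```powershell
# Sketch only: builds (does not run) an UPDATE for one outage row.
# Hypothetical helper name; review the generated SQL before executing it.
function New-OutageEndDateFixQuery {
    param(
        # Typed as [int] so only a numeric row ID can end up in the SQL text
        [Parameter(Mandatory = $true)] [int] $HealthServiceOutageRowId
    )
    # Copy StartDateTime into EndDateTime, as in the manual edit.
    # The IS NULL guard keeps an already-closed row from being overwritten.
    @"
UPDATE dbo.HealthServiceOutage
SET EndDateTime = StartDateTime
WHERE HealthServiceOutageRowId = $HealthServiceOutageRowId
  AND EndDateTime IS NULL;
"@
}
```

For example, New-OutageEndDateFixQuery -HealthServiceOutageRowId 1234 (1234 being a made-up row ID) prints the statement, which can then be pasted into Management Studio or passed to Invoke-Sqlcmd.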


Troubleshooting: Product Evaluation is expiring in 60 days (SCOM 2012)

With some System Center 2012 products, like Service Manager, the install GUI requires you to enter a license key. While this is annoying during the install process, it is nice in that it makes sure you don’t forget to enter a license key. With Operations Manager 2012 the installer does not prompt for a license key, and by default all installs are technically 180-day evaluation copies. This is fine, except eventually you will log into the OpsMgr console and see the following:

eval expiring

This can be a little scary especially when you are seeing this in a production environment.

SCOM

The official Microsoft instructions for adding a license key can be found here.

You will need to run the following PowerShell commands on each of your SCOM management servers:

Launch PowerShell using Run as Administrator

Type the following:

Import-Module operationsmanager

New-SCOMManagementGroupConnection

Set-SCOMLicense -ProductID "Enter your license key here"

When prompted to confirm, type Y and hit Enter

Full-censored

The Microsoft instructions then tell you to run:

Get-SCOMManagementGroup | ft skuforlicense, version, timeofexpiration -a

For me this would consistently return the following result, with the management server still appearing to be running an Eval copy:

Eval

However, if you reboot the management server and rerun the Get-SCOMManagementGroup command, you should see something like this:

yay!

So the reboot seems to be key after running through all the steps above.

 

 


Troubleshooting: MsDtsServer100 IS Package Failed (SQL Management Pack)

Every once in a while an engineer will have this error pop up for one of their systems:

01

If the engineer is a SQL DBA then there is no problem, as they will understand both the source and ultimately how to fix the problem.

Sadly, not everyone who has or manages a SQL server is a DBA. There are plenty of cases where a sysadmin acquires a few SQL servers which they know the basics of managing, or at least how to point an app server at, but they may never have had the time to dig deeper into SQL. For them, MsDtsServer100 IS Package Failed is not a particularly useful message.

The Alert Description offers a useful clue: “Maintenance Plan” Failed (though this title will vary depending on whether the default plan name is used).

02

So how do you troubleshoot this error?

If you remote into the SQL server referenced in the error, you can launch SQL Management Studio and connect to the instance in question. If you expand the Management folder you will find a folder called Maintenance Plans:

03

In this case the Maintenance Plan has been renamed to “Nightly Backups”.

If you right click and select view history for the Maintenance Plan you will be presented with the following:

view history

This is where things get confusing: everything looks like it ran perfectly, as per the little green check marks of success. You see a Rebuild Indexes, a History Cleanup, some generic Maintenance task, and a DB Backup. All successful.

So where is the error coming from?

If you navigate to the Application Event Log on the SQL server for the time the alert was generated you will find the answer:

event log

Subplan II actually had two components. One was a rebuild indexes task, which you can see from the SQL Management Studio history completed successfully. The other item in this particular case was a reorganize indexes task, which was failing. Reorganizing indexes immediately after rebuilding them doesn’t sound like a very good order of operations. For this specific issue I recommended that the engineer remove the reorganize indexes task from subplan II, and the error has not recurred since. So if you see MsDtsServer100 IS Package Failed, you are going to want to go to the Application Event Log of the SQL Server to figure out the source of the problem.
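If you would rather not click through Event Viewer, the same check can be sketched in PowerShell with Get-WinEvent. This is a minimal example under a few assumptions: the helper name, the 15-minute default window, and the server name and alert time in the usage note are all placeholders of mine, and the filter simply pulls errors and warnings from the Application log around the alert time rather than targeting a specific event source.

```powershell
# Sketch only: build a Get-WinEvent filter for the Application log
# covering a window of time on either side of the alert.
# Hypothetical helper name; the window size is an arbitrary default.
function New-AlertWindowFilter {
    param(
        [Parameter(Mandatory = $true)] [datetime] $AlertTime,  # when the SCOM alert was raised
        [int] $WindowMinutes = 15                               # minutes before and after the alert
    )
    @{
        LogName   = 'Application'
        Level     = 2, 3                                        # 2 = Error, 3 = Warning
        StartTime = $AlertTime.AddMinutes(-$WindowMinutes)
        EndTime   = $AlertTime.AddMinutes($WindowMinutes)
    }
}
```

To use it, build the filter with something like $filter = New-AlertWindowFilter -AlertTime $alertTime, then run Get-WinEvent -ComputerName 'SQL01' -FilterHashtable $filter -ErrorAction SilentlyContinue | Select-Object TimeCreated, ProviderName, Id, Message, where SQL01 stands in for the SQL server named in the alert and $alertTime is the time the alert was generated. The -ErrorAction switch is there because Get-WinEvent reports an error when no events match the filter.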


Talk: Tips & Tricks for Creating Custom Management Packs

I was perusing some of the talks from last year’s TechEd and came across this excellent talk by Mickey Gousset on creating custom management packs.

For more talks from TechEd 2012 click here.
