How do I: Create an Event View that excludes a particular Event ID

I recently had a large enterprise customer who was monitoring ADFS with the default management pack. They liked being able to glance at the event view, which gave them a single place to look at the ADFS events occurring across their environment. They were using this event data as part of their correlation and tuning process to determine whether any additional actionable events were being missed in their unique infrastructure. The eventual goal was to stop collecting the events altogether and keep only alert-generating rules/monitors for the patterns of events they cared about.

[Screenshot: ADFS event view]

They quickly found that, at least in their environment, some of the collected events were essentially noise, and they asked how to adjust the view so it would exclude one particular event.

This is one of those “sounds really easy, and of course the product does this out of the box” questions that SCOM has never really had a great answer for.

If we look at the view, it is populated by the following criteria:

[Screenshot: view criteria]

And if we dig into the corresponding rule that collects the events, we find a wildcard regex-style collection rule targeted at the ADFS log:

[Screenshots: the ADFS event collection rule and its wildcard expression]

Since the collection rule is part of a sealed MP, the best we could do at the rule level is to disable this collection rule and create a new one with a modified wildcard expression that collects everything the old rule did, except for the event ID the customer does not want.

The problem with this solution is that it isn’t particularly efficient or self-service friendly. If next week the customer realizes there is another event they want excluded, the AD team has to contact the SCOM team and request further modifications.

In an ideal world the exclusion would be possible at the view level, but if you ever dig into modifying the classic OpsMgr views, you will find that while you can use wildcards to perform exclusions on some fields, such as Event Source:

[Screenshot: wildcard exclusion on Event Source]

The same is not true for event IDs, where wildcard exclusions are not allowed:

[Screenshot: Event ID field rejecting a wildcard exclusion]

I briefly toyed with the idea of modifying the MP at the XML level to allow exclusions, as I have occasionally done in the past to hack a subscription into meeting a customer need, but in this case that solution doesn’t really fit. The customer needed something they could change easily as they gradually winnow down the list of events they see to only the ones they care about.

They needed something that was extremely easy to edit.

Enter PowerShell and the SCOM SDK.

The first solution I put together for them to test was the following:

PowerShell Grid Widget

[Screenshot: PowerShell Grid Widget]

The widget script used the filter Where-Object {$_.Number -ne 31552 -and $_.PublisherName -eq "Health Service Modules"}. I filtered on a SCOM publisher name because I didn’t have any ADFS events in my test environment and wanted something I could use to confirm that the exclusion was working as expected:

[Screenshot]

Everything looked good: the event I wanted excluded was filtered out properly. (The Description dataObject line is commented out in the code for this screenshot to make it easier to read; with Description uncommented, each event takes up more lines of screen real estate. I recommend creating two views, one with Description commented out and one with it uncommented, so customers can easily toggle between them.)

[Screenshot: view with event 31552 excluded]

And if we remove the $_.Number -ne 31552 condition, the results include the event again:

[Screenshot: view with event 31552 present]

In theory this should have been all we needed, but when my customer tested the script, nothing happened. After a bit of head scratching, the problem became apparent.

We were calling Get-SCOMEvent | Where-Object,

which means we were asking the OpsMgr SDK to retrieve every single event in the OpsDB, and only once it was done were we piping the results to Where-Object to say what we really needed.

In my relatively small test environment, this wasn’t much of an ask, and the results returned quickly.

In my customer’s environment, with thousands of servers and “friendly” event-generating MPs like the Exchange 2010 MP, retrieving every event in the OpsDB was a great way to enter an endless loop of dashboard timeouts with nothing ever being displayed.

So we needed to filter things down a bit up front, before piping to the Where-Object.

If you search the blogs, you will find that Stefan Stranger has a nice post describing how to deal with this issue when calling the Get-SCOMAlert cmdlet with a Where-Object. Basically, you use Get-SCOMAlert -Criteria and then pipe to a Where-Object if still needed.

Unfortunately, Get-SCOMEvent doesn’t have a -criteria parameter because that would make things too easy and intuitive.

It does, however, have a -rule parameter which looked promising:

[Screenshot: Get-SCOMEvent -Rule parameter]

First I tried passing it a rule name, and then a rule GUID, for an event collection rule I was interested in. In both cases I got a nice red error message:

[Screenshot: parameter type error]

While a little cryptic, the error is saying that I am passing a string, and the parameter wants a SCOM-specific rule object (the kind returned by Get-SCOMRule).

To give it what it wants, we first retrieve the rule object with the Get-SCOMRule cmdlet and then pass it to Get-SCOMEvent as a variable:

$rule = Get-SCOMRule -DisplayName "Operations Manager Data Access Service Event Collector Rule"

[Screenshot]

$rule = Get-SCOMRule -DisplayName "Operations Manager Data Access Service Event Collector Rule"
Get-SCOMEvent -Rule $rule

[Screenshot: events returned by Get-SCOMEvent -Rule]

So our final script looks something like this. (I have added some additional filtering in case you only want events from the past hour. *Keep in mind that this date/time filtering does not make the script more efficient, because it happens inside the Where-Object, after the events have been retrieved; the only thing making this script more efficient is that we first pull back only the events collected by a specific rule.*)

$rule = Get-SCOMRule -DisplayName "Operations Manager Data Access Service Event Collector Rule"
$DateNow = Get-Date
# Modify the .AddMinutes below to determine how far back to pull events
$DateAgo = $DateNow.AddMinutes(-60)
# $_.Number -ne (not equal) specifies the event number you want to exclude from the view
# Sort on TimeGenerated, which is one of the properties kept by Select-Object
$eventView = Get-SCOMEvent -Rule $rule |
    Where-Object {$_.Number -ne 17 -and $_.TimeGenerated -ge $DateAgo -and $_.TimeGenerated -le $DateNow} |
    Select-Object Id, MonitoringObjectDisplayName, Number, TimeGenerated, PublisherName, Description |
    Sort-Object TimeGenerated -Descending
foreach ($object in $eventView) {
    $dataObject = $ScriptContext.CreateInstance("xsd://OpsConfig!sample/dashboard")
    $dataObject["Id"] = [String]($object.Id)
    $dataObject["Event Number"] = [Int]($object.Number)
    $dataObject["Source"] = [String]($object.MonitoringObjectDisplayName)
    $dataObject["Time Created"] = [String]($object.TimeGenerated)
    $dataObject["Event Source"] = [String]($object.PublisherName)
    $dataObject["Description"] = [String]($object.Description)
    $ScriptContext.ReturnCollection.Add($dataObject)
}
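Because the whole point was to let the customer keep trimming events without help, one option is to keep the excluded IDs in an array so that adding another event is a one-line edit. The sketch below wraps that idea, plus the time window, in a small helper. Select-ViewEvent and its parameters are hypothetical names for illustration, not part of the SCOM module. Like the original Where-Object, it filters only after Get-SCOMEvent has returned, so it trims what is displayed, not what is retrieved.

```powershell
# Hypothetical helper (not part of the OperationsManager module): keeps events whose
# Number is not in $ExcludedIds and whose TimeGenerated falls within the last $Minutes.
function Select-ViewEvent {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory = $true, ValueFromPipeline = $true)]
        [object] $InputObject,

        # Event IDs to hide from the view; add IDs here as the list grows.
        [int[]] $ExcludedIds = @(),

        # How far back, in minutes, to keep events.
        [int] $Minutes = 60,

        # End of the time window; defaults to the current time.
        [datetime] $Now = (Get-Date)
    )
    begin {
        # Compute the start of the window once, before any events arrive.
        $since = $Now.AddMinutes(-$Minutes)
    }
    process {
        # Emit the event only if it is not excluded and falls inside the window.
        if ($ExcludedIds -notcontains $InputObject.Number -and
            $InputObject.TimeGenerated -ge $since -and
            $InputObject.TimeGenerated -le $Now) {
            $InputObject
        }
    }
}
```

In the widget script it could take the place of the Where-Object stage, for example Get-SCOMEvent -Rule $rule | Select-ViewEvent -ExcludedIds 17, 31552 -Minutes 60, followed by the same Select-Object and Sort-Object stages.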

The ADFS version looks like this (event 17 was not actually the event they wanted to exclude; substitute your own event ID):

$rule = Get-SCOMRule -DisplayName "Federation server events collection"
$DateNow = Get-Date
# Modify the .AddMinutes below to determine how far back to pull events
$DateAgo = $DateNow.AddMinutes(-60)
# $_.Number -ne (not equal) specifies the event number you want to exclude from the view
# Sort on TimeGenerated, which is one of the properties kept by Select-Object
$eventView = Get-SCOMEvent -Rule $rule |
    Where-Object {$_.Number -ne 17 -and $_.TimeGenerated -ge $DateAgo -and $_.TimeGenerated -le $DateNow} |
    Select-Object Id, MonitoringObjectDisplayName, Number, TimeGenerated, PublisherName, Description |
    Sort-Object TimeGenerated -Descending
foreach ($object in $eventView) {
    $dataObject = $ScriptContext.CreateInstance("xsd://OpsConfig!sample/dashboard")
    $dataObject["Id"] = [String]($object.Id)
    $dataObject["Event Number"] = [Int]($object.Number)
    $dataObject["Source"] = [String]($object.MonitoringObjectDisplayName)
    $dataObject["Time Created"] = [String]($object.TimeGenerated)
    $dataObject["Event Source"] = [String]($object.PublisherName)
    $dataObject["Description"] = [String]($object.Description)
    $ScriptContext.ReturnCollection.Add($dataObject)
}

Hopefully this helps save a little bit of time for anyone else who comes across a question like this one.

Best Practices: How to load test your website? Part I

Earlier this week I received a question about using SCOM to do large-scale load testing on a website. SCOM can be a great resource, with synthetic transactions to simulate a finite number of user transactions and APM to give you code-level instrumentation of how your .NET app is performing, but it isn’t designed to be a large-scale load-testing rig. You could certainly set up a large number of synthetic transactions executed from anywhere you have a SCOM agent, but when you need to simulate 10,000+ transactions against a single site, you are entering territory where SCOM isn’t the right tool for the job.

If you want to do some basic load testing, you can do it natively in Visual Studio Online:

https://www.visualstudio.com/en-us/products/what-is-visual-studio-online-vs.aspx

[Screenshot: Visual Studio Online]

First, enter the URL of the site you want to load test and give your test a name.

[Screenshots: test URL and test name]

Next, pick the Microsoft Azure datacenter you want the test to run from:

[Screenshot: datacenter selection]

Select the number of test users. Simple tests are capped at a maximum of 200 users (you get 20,000 free user minutes per month).

[Screenshot: number of test users]

Select how long you want the test to run:

[Screenshot: test duration]
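To get a rough feel for how far the free allowance goes, here is a quick calculation. It assumes that user minutes are counted as the number of simulated users multiplied by the test duration, and it uses a hypothetical 10-minute run at the 200-user cap:

```latex
200\ \text{users} \times 10\ \text{min} = 2{,}000\ \text{user minutes}
\qquad
\frac{20{,}000\ \text{user minutes/month}}{2{,}000\ \text{user minutes/run}} = 10\ \text{runs/month}
```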

Set the think time:

[Screenshot: think time]

Select a browser distribution:

[Screenshot: browser distribution]

Then just click Test now:

[Screenshot: Test now button]

It will take a few minutes to acquire the necessary resources and configure agents:

[Screenshot: acquiring resources and configuring agents]

Results will appear as below:

[Screenshots: load test results]

For more advanced load testing check out the Part II post.

Best Practices: Agent Remediation Tool

This is a proof-of-concept script, a mix of PowerShell and some .NET for the GUI, that can serve as an automated playbook for agent remediation.

Typically I prefer to remediate agents via the SCOM console, but there are instances where an agent is locked down so that remote management is not possible, and the SCOM team may not have remote access to the server to fix the agent. This script empowers non-SCOM sysadmins, DBAs, and others to perform basic troubleshooting on their agents without fear of accidentally deleting the wrong thing.

- The script must be run in PowerShell as an administrator.

- This script is compatible with PowerShell 2.0, 3.0 and 4.0.

*There are some dependencies on the .NET Framework. It is designed for .NET 3.5, but in testing it does work with .NET 2.0, though it will throw some fun red errors for certain non-critical display elements that cannot load.

This script is designed for SCOM 2012 R2 agents.

[Screenshot: OpsConfig agent remediation tool]

Functionality:

1. Restart SCOM Agent (This restarts the Microsoft Monitoring Agent)

2. Flush SCOM Agent Cache (This stops the SCOM agent, clears the Health Service State agent cache, starts the agent, and rebuilds the agent cache.)

3. Uninstall SCOM Agent (This queries WMI to determine the GUID associated with the SCOM agent installation and then passes this GUID to an automated uninstall.)

4. Install SCOM Agent (This is a placeholder for either manual agent install instructions, or it can be adapted to call a function to kick off a command-line based agent install assuming agent media is on an accessible UNC file share)
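For readers who want to see what item 2 involves, here is a minimal sketch of the cache flush. It is not the tool’s actual code: Clear-ScomAgentCache is a hypothetical name, and the service name (HealthService) and the cache folder path assume a default SCOM 2012 R2 agent install. SupportsShouldProcess lets you run it with -WhatIf first to preview what it would do.

```powershell
# Hypothetical sketch of "Flush SCOM Agent Cache"; not the remediation tool's code.
# Assumes a default SCOM 2012 R2 agent install: service "HealthService" and the
# cache folder below. Run elevated. Use -WhatIf to preview without changing anything.
function Clear-ScomAgentCache {
    [CmdletBinding(SupportsShouldProcess = $true)]
    param(
        [string] $ServiceName = 'HealthService',
        [string] $StatePath = "$env:ProgramFiles\Microsoft Monitoring Agent\Agent\Health Service State"
    )

    if ($PSCmdlet.ShouldProcess($StatePath, 'Stop agent, delete cache folder, start agent')) {
        # Stop the agent so no files in the cache folder are locked.
        Stop-Service -Name $ServiceName -ErrorAction Stop

        # Delete the cache folder; the agent rebuilds it when it starts again.
        if (Test-Path -LiteralPath $StatePath) {
            Remove-Item -LiteralPath $StatePath -Recurse -Force
        }

        # Start the agent again.
        Start-Service -Name $ServiceName -ErrorAction Stop
    }
}
```

Running Clear-ScomAgentCache -WhatIf first prints the action it would take without stopping the service or deleting anything.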

*It appears I neglected to include the link to the script; it can be downloaded here: TechNet Gallery*

 

Best Practices: SCOM Health Check Script/Report OpsConfig ed v1.0

This weekend I came across a fantastic SCOM HealthCheck Script/Report written by Tim Culham of http://www.culham.net

I would strongly encourage you to visit the site and check out his original script as he did all the heavy lifting.

http://www.culham.net/powershell/scom-2012-scom-2012-r2-daily-check-powershell-script-html-report/

I decided to extend/tweak his script a bit by adding a number of the more in-depth SQL queries that I frequently ask customers to run when troubleshooting performance issues with the OpsDB and DW. Many of the queries are modified versions of the KH Useful SQL Queries, though a few might be new to all of you. This sacrifices some of the speed and elegance of Tim’s script, but the information you get back is invaluable.

This script should be run as administrator from a SCOM Management Server by an account that has permissions to connect and read from the Ops & DW DBs. You can just run the script without inputting any parameters. It will open the report upon script completion. (My version can take anywhere from 30 seconds to 10 minutes to run depending on the size/performance of your environment)

*At times this script runs queries directly against the OpsDB. While this is a completely common practice for troubleshooting and diagnosing issues, it is also technically not supported. The script is provided AS-IS without warranty of any kind.*

My version of the script can be downloaded here:

https://gallery.technet.microsoft.com/SCOM-Health-Check-fd2272ec

What this version of the script will give you (some of these features are carried over from the original; many are new):

01. Version/Service Pack/Edition of SQL for each SCOM DB Server
02. Disk Space Info for Ops DB, DW DB, and associated TempDBs
03. Database Backup Status for all DBs except TempDB
04. Top 25 Largest Tables for Ops DB and DW DB
05. Number of Events Generated Per Day (Ops DB)
06. Top 10 Event Generating Computers (Ops DB)
07. Top 25 Events by Publisher (Ops DB)
08. Number of Perf Insertions Per Day (Ops DB)
09. Top 25 Perf Insertions by Object/Counter Name (Ops DB)
10. Top 25 Alerts by Alert Count
11. Alerts with a Repeat Count higher than 200
12. Stale State Change Data
13. Top 25 Monitors Changing State in the last 7 Days
14. Top 25 Monitors Changing State By Object
15. Ops DB Grooming History
16. Snapshot of DW Staging Tables
17. DW Grooming Retention
18. Management Server checks (works well on-premises, but seems to have some issues with gateways due to remote calls; if you see some errors flash by, have no fear, though depending on firewall settings I wouldn’t necessarily trust the results coming back from a gateway server in the report)
19. Daily KPI
20. MPs Modified in the Last 24 hours
21. Overrides in Default MP Check
22. Uninitialized Agents
23. Agent Stats (Healthy, Warning, Critical, Uninitialized, Total)
24. Agent Pending Management Summary
25. Alert Summary
26. Servers in Maintenance Mode

Report output (screenshots of only the first few pages, but you get the basic idea):

[Screenshot: report output, page 1]

[Screenshot: report output, page 2]

 

 

Troubleshooting: SCOM reports yield weird data/what the heck does 9.221E+07 mean?

Eventually, when running a report in SCOM, you are going to end up with a report like the one below.

[Screenshot: sample SCOM report]

At first glance everything looks okay. But then you start looking at the data that was returned and it can sometimes be a little confusing.

[Screenshot: report values in scientific notation]

Usually the questions I get from customers range from “I think this report is broken” to “what the heck does 9.221E+07 mean?”

Fear not: reporting is not broken, and 9.221E+07 is not nearly as confusing as it may seem. The values returned have so many digits that, to display them in a meaningful way, the report presents them in a shorthand commonly known as scientific notation. All you need to understand is that E+07 means “times 10 to the 7th power,” which is the number of places the decimal point would need to be moved to the right to display the full number.

So 9.221E+07 = 92210000
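Written out as a power of ten, the conversion is:

```latex
9.221 \times 10^{7} = 9.221 \times 10{,}000{,}000 = 92{,}210{,}000
```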

And if we look at the top of the chart, we note that the performance counter we are reporting on is returned in bytes, so we are dealing with:

92210000 Bytes

For those of you who, like me, are not particularly mathematically inclined and prefer to leave conversions to someone else, I recommend using PowerShell’s wonderful built-in functionality.

If you type 9.221E+07 at the PowerShell prompt and press Enter, it automatically outputs the full value for you:

[Screenshot: PowerShell expanding 9.221E+07]

If the original unit (in this case bytes) is not your unit of choice and you want to know the value in MB, just enter the value in scientific notation and divide by 1MB:

9.221E+07  / 1MB

[Screenshot: value converted to MB]

The same goes for GB:

9.221E+07  / 1GB

[Screenshot: value converted to GB]
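One detail worth knowing: PowerShell’s 1MB and 1GB constants are binary multiples (1,048,576 and 1,073,741,824 bytes), so 92,210,000 bytes comes out to roughly 87.94 MB rather than 92.21. The sketch below repeats the two divisions and rounds the results so they are easier to read in a ticket or email; the rounding precision is an arbitrary choice.

```powershell
# The report value in bytes, entered in scientific notation.
$bytes = 9.221E+07

# 1MB and 1GB are built-in PowerShell constants for 1,048,576 and 1,073,741,824 bytes.
# Round to keep the output readable (2 places for MB, 3 for GB).
$mb = [math]::Round($bytes / 1MB, 2)
$gb = [math]::Round($bytes / 1GB, 3)

# Show the three values side by side.
"{0} bytes = {1} MB = {2} GB" -f $bytes, $mb, $gb
```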

 

The contents of this site are provided “AS IS” with no warranties, or rights conferred. Example code could harm your environment, and is not intended for production use. Content represents point in time snapshots of information and may no longer be accurate. (I work @ MSFT. Thoughts and opinions are my own.)