Monthly Archives: March 2013

Talk: Why We Fail – An Architect’s Journey to the Private Cloud

I forgot how much I enjoyed this talk from MMS last year: “If you talk to senior management in IT organizations, it’s very rare to find one that doesn’t want cloud. I want some of that cloud stuff. What is that? I don’t know, but I want it. OK, here’s some crayons, there you go, draw a cloud.” -Alex Jauch

Alex Jauch’s Blog can be found here.

Ars Technica write up from the conference which mentions the talk.


Talk: Tips & Tricks for Creating Custom Management Packs

I was perusing some of the talks from last year’s TechEd and came across this excellent talk by Mickey Gousset on creating custom management packs:

For more talks from TechEd 2012 click here.


Building Better Service Level Dashboards

Microsoft has added a lot of functionality to SCOM 2012 to make creating dashboards easy. The only problem is that they have given you a blank canvas without much in the way of guidance. This can be great, but it can also be problematic. The fact that you can make a nine-cell grid layout filled with graphs and data doesn’t mean that you should.

What you should do is strive to build effective dashboards. What is an effective dashboard? There is no right answer (I am making up the phrase), though I would argue that an effective dashboard is one designed to give insight into a service with a specific audience in mind.

A dashboard that is useful for your engineers or sysadmins is going to (or should) look very different from a dashboard for Tier I support, much as a dashboard for Tier I should look different from a dashboard for non-IT customers. I like to break down service level dashboards into specific subcategories based on audience.

For the sake of this post, let’s divide potential dashboards into three groups:

1. Dashboards for non-technical internal clients, often published on an internal SharePoint site.

2. Dashboards for Tier I support and upper IT management, published via a limited-rights login to the SCOM web console.

3. Dashboards for systems engineers and sysadmins.

Obviously this is going to vary greatly depending on what business you are in, but you get the idea.

I think in general we tend to do a pretty good job with 1 and 3. Service level dashboards for non-technical internal clients just need to provide basic information: is the service up or down, and, to the best of our monitoring ability, how well are we meeting the SLA?

The out-of-the-box Service Level Dashboard in SCOM 2012 does this quite effectively:

I say “to the best of our ability” above because even with synthetic transactions there is always the possibility that a complex service is degraded or down in some respect without your monitors picking up on it (for example, the Exchange servers are up and running perfectly, but your BES server for BlackBerry devices is down). Alternatively, your monitoring picks up a problem but isn’t smart enough to translate it into a change in the dashboard. At best, service monitoring is an evolutionary process, not one that you set up and leave alone. IT managers may not want to hear it, but ultimately your ability to track a service depends on the accuracy of your monitors, and building accurate monitors requires iteration and time.
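To make the SLA question concrete, here is a minimal Python sketch of the arithmetic behind an availability figure. It is not how SCOM calculates service levels; the function names and the downtime numbers are made up for illustration, and the result is only as good as the downtime your monitors actually detected.

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Percentage of the period during which the service was observed as up."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must be between zero and the period")
    # Dividing two timedeltas yields a plain float ratio.
    return 100.0 * (1 - downtime / period)


def meets_sla(period: timedelta, downtime: timedelta, target_percent: float) -> bool:
    """True when the observed availability reaches the SLA target."""
    return availability_percent(period, downtime) >= target_percent


# Hypothetical numbers: a 30-day month with 90 minutes of detected downtime.
month = timedelta(days=30)
detected = timedelta(minutes=90)

print(round(availability_percent(month, detected), 3))  # 99.792
print(meets_sla(month, detected, 99.9))                  # False
```

A 99.9% target over 30 days leaves only 43.2 minutes of downtime, so 90 detected minutes misses it. Any outage the monitors never saw is not in the number at all, which is the caveat above.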

Dashboards for engineers and sysadmins are often either built with very specific requirements in mind or are redundant and not needed, so they tend not to be a problem either.

Where I see the most potential for people to get into trouble is in creating dashboards for Tier I support and for senior IT management. The easy answer is to just have them use the simple up/down service level dashboard. The problem is that while this is a perfectly acceptable level of transparency to provide to non-IT users, it often isn’t enough information for Tier I and management, especially in the occasional situation when your up/down dashboard says everything is fine and users are calling in with complaints.

Below is an example of a dashboard I would create for an e-mail or messaging service for Tier I operators and upper-level IT management that seeks to find the middle ground:

– In the upper left you have a state widget. It is pegged to a group which contains all servers related to the e-mail service. It should be made up of more than just Exchange servers; mine contains BES and ISA servers to provide a more complete picture of the health of all related parts. Some would say to build a simple distributed app to do this, but this starts to get troublesome when dealing with load-balanced systems, or systems where a negative status of one system doesn’t need to roll up to the status of the entire app.

– Upper middle is a Service Level Widget which is tied to the Exchange 2010 Application from the Exchange 2010 MP. It’s not perfect, but it does a decent job of generally showing when core e-mail functionality is up or down.

– Upper right: An alerts widget which looks at anything related to the health of the servers in the group on the left.

– Middle: Graph of Outlook latency. Honestly, it is unlikely that Tier I is going to gain useful info from this graph. You can see noticeable shifts (and I have) if one member of a load-balanced or clustered pair is down, but this falls into the category of “behold the power of pretty graphs.” Sometimes it’s nice for your Tier I and upper IT management to feel empowered, and for whatever reason I have found that pretty graphs can do that, even if the viewers don’t know exactly what they are looking at.

– Bottom: Again empowerment via pretty graphs.
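The rollup concern in the first bullet can be illustrated with a small Python sketch. This is not SCOM code or its API; the Health enum, the function names, and the 50% threshold are all hypothetical. The point is only to contrast a worst-of rollup with a percentage-based one for a load-balanced pool.

```python
from enum import IntEnum


class Health(IntEnum):
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def worst_of(states):
    """Service is as bad as its worst member."""
    return max(states, default=Health.HEALTHY)


def percentage_rollup(states, critical_fraction=0.5):
    """Critical only when at least critical_fraction of members are critical.

    Any unhealthy member below that threshold yields a warning instead.
    """
    states = list(states)
    if not states:
        return Health.HEALTHY
    critical = sum(1 for s in states if s == Health.CRITICAL)
    if critical / len(states) >= critical_fraction:
        return Health.CRITICAL
    if any(s != Health.HEALTHY for s in states):
        return Health.WARNING
    return Health.HEALTHY


# Hypothetical load-balanced pool: one of four nodes is down.
pool = [Health.HEALTHY, Health.HEALTHY, Health.HEALTHY, Health.CRITICAL]

print(worst_of(pool).name)           # CRITICAL
print(percentage_rollup(pool).name)  # WARNING
```

With a worst-of rollup, a single failed node in the pool turns the whole service red even though users are still being served. A percentage-based rollup keeps the service at warning until enough members fail.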

 


Talk: Randy Pausch – Time Management

Many people probably know Randy Pausch from his Last Lecture. However, the talk below, given in November 2007 at the University of Virginia, was truly his last lecture, and remains one of the most useful talks that I have ever seen. I am always surprised that while many have seen the first lecture, this talk remains relatively unknown.

I think for anyone, particularly those who work in IT, this talk is worth taking the time to watch:

If you have a favorite talk, lecture, podcast, et cetera, let me know with a comment. I would like to start posting a new one every Friday. I have quite a few of my own, but eventually I will run out.


System Center Operations Manager Blogs you should read:

Most of my neck of the woods is shut down and in snow-day mode, so it seems like a good day to catch up on some of the OpsMgr blogs.

In no particular order:

01. TechNet – Kevin Holman’s Blog

02. Thoughts on OpsMgr and System Center 2012 – Marnix Wolf’s Blog

03. SCOM, SCCM, Powershell – Tao Yang’s Blog

04. Kevin Greene’s Blog

05. Cameron Fuller’s Blog

06. TechNet – Operations Manager Engineering Blog

07. Bob Cornelissen’s Blog

08. Everything System Center – Tim McFadden’s Blog

09. Matthew Long’s Blog

These are a few of my favorites. If you have any suggestions, leave a comment and I will add them to the list.

UPDATE:

Rod Trent kindly pointed out that I had left out myITforum, which, while not dedicated to SCOM, is certainly an excellent resource for all things Microsoft & System Center.

 


Critical Alert: WMI is unhealthy

Over the past few months I have had a few alerts related to a failure of WMI on servers. The Product Knowledge within SCOM recommends the following:

Unfortunately, I haven’t had much luck with the recommended fixes. In the past, with other systems, I used to just reboot the server in question, but I hate having to rely on a reboot to fix a problem, as it’s not a particularly good long-term solution.

When I try running winmgmt /verifyrepository I get a failure message:

If I try searching for anything that might be hogging all the threads, nothing obvious stands out.

If I run the handy WMI diagnosis tool, I get more or less the same result, along with the useful information that, other than the thread-creation issue, all seems to be well.

I am 99% certain that rebooting the system would resolve the issue, but my guess is that this would only be a temporary fix. The particular system I am having the problem on now happens to be of the mission-critical, cannot-reboot-under-any-circumstances variety: no reboot without change management and a team of skilled surgeons on hand to bring it back to life should it decide to crash post-reboot.

In the interest of a long-term solution, I am going to try applying the hotfixes that Marnix Wolf recommends on his excellent OpsMgr blog to make WMI more robust.

Direct Link to Microsoft Recommended WMI Hotfixes

I will continue to update this post with any further info related to WMI troubleshooting that I come across in the future.


Management Pack Tuning: Logical Disk Fragmentation is High

One of the first floods of warnings new SCOM admins often get in their inbox is Logical Disk Fragmentation is High.

Here is how it usually goes down:

1. You import the Microsoft Windows Server management packs on Monday.

2. You spend a day or two tuning out the memory and CPU spikes that fall into a range that you would consider noise.

3. By the end of the week you are feeling pretty good about yourself and decide to set up some notification channels and subscriptions for some peace of mind over the weekend.

4. You wake up Saturday morning to find your inbox or console full of Disk Fragmentation warnings.

What have you done wrong? Nothing. By default the Windows Server MP runs its disk fragmentation check every Saturday at 3:00 AM, so unless you preemptively made the necessary overrides for your environment you will be treated to this nice little surprise. Certainly not the end of the world, but here is where many admins screw up. SCOM offers the wonderful functionality of being able to fix the fragmentation problem with two clicks via the Logical Disk Defragmentation task.

You have fragmented disks, you can defrag them in two clicks, how is this not a good thing?

The first question you have to ask before even thinking about defragging a server via SCOM is whether the fragmented server is virtual or physical. If the server is physical, the answer to whether you should defrag is maybe. If the server is virtual, the answer is no.

For a good explanation of the implications of defragging a virtual server I would recommend Cormac Hogan’s September 20, 2011 post on the VMware vSphere Blog.

The post is specific to why this is a really bad idea for VMware shops, but much of the reasoning is applicable to any virtualized server.

The key points that apply to any virtualized server are:

1. You are unlikely to see any benefit from defragging virtual disks, since in an enterprise environment multiple VMs generally run on any given datastore/storage pool.

2. If any of your disks are thin provisioned, the defrag process will cause your VMDK/VHD to expand and unnecessarily chew up space. (If you were to defrag a large number of thin provisioned servers at the same time you could theoretically cause an outage if any of your datastores/storage pools are oversubscribed.)

3. Defragging creates more I/O which could result in a temporary drop in performance for the duration of the defrag.
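The decision rule and the thin provisioning risk above can be sketched roughly in Python. Everything here is illustrative: the Disk fields, function names, and sizes are assumptions, and the growth figure is a worst-case upper bound (every thin disk inflating to its full provisioned size), not a prediction of what a real defrag would do.

```python
from dataclasses import dataclass


@dataclass
class Disk:
    name: str
    provisioned_gb: float  # size the guest OS sees
    allocated_gb: float    # space currently consumed on the datastore
    thin: bool             # True for thin provisioned virtual disks


def defrag_advice(is_virtual: bool) -> str:
    """The rule of thumb from this post: virtual is 'no', physical is 'maybe'."""
    return "no" if is_virtual else "maybe"


def worst_case_growth_gb(disks) -> float:
    """Upper bound on extra datastore space if every thin disk fully inflated."""
    return sum(d.provisioned_gb - d.allocated_gb for d in disks if d.thin)


def could_fill_datastore(disks, datastore_free_gb: float) -> bool:
    """True when worst-case thin disk growth exceeds the free space left."""
    return worst_case_growth_gb(disks) > datastore_free_gb


# Hypothetical datastore with two thin disks and one thick disk.
disks = [
    Disk("web01", provisioned_gb=100, allocated_gb=40, thin=True),
    Disk("db01", provisioned_gb=500, allocated_gb=200, thin=True),
    Disk("file01", provisioned_gb=200, allocated_gb=200, thin=False),
]

print(defrag_advice(is_virtual=True))    # no
print(worst_case_growth_gb(disks))       # 360
print(could_fill_datastore(disks, 250))  # True
```

In this made-up example the two thin disks could grow by up to 360 GB against 250 GB of free space, which is the oversubscription scenario described in point 2.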

 
