Critical Alert: WMI is unhealthy

Over the past few months I have had a few alerts related to a failure of WMI on servers. The Product Knowledge within SCOM recommends the following:

Unfortunately, I haven’t had much luck with the recommended fixes. In the past, with other systems, I used to just reboot the server in question, but I hate having to rely on a reboot to fix a problem, as it’s not a particularly good long-term solution.

When I try running winmgmt /verifyrepository I get a failure message:

If I try searching for anything that might be hogging all the threads, nothing obvious stands out.

If I run the handy WMI diagnosis tool, I get more or less the same result, along with some useful information indicating that, apart from the thread-creation issue, all seems to be well.

I am 99% certain that rebooting the system would resolve the issue, but my guess is that this would only be a temporary fix. The particular system I am having the problem on now happens to be of the mission-critical, cannot-reboot-under-any-circumstances variety: a reboot requires change management and a team of skilled surgeons on hand to bring it back to life should it decide to crash post reboot.

In the interest of a long-term solution, I am going to try applying the hotfixes that make WMI more robust, as recommended by Marnix Wolf on his excellent blog on OpsMgr.

Direct Link to Microsoft Recommended WMI Hotfixes

I will continue to update this post with any further info related to WMI troubleshooting that I come across in the future.

Management Pack Tuning: Logical Disk Fragmentation is High

One of the first floods of warnings new SCOM admins often get in their inbox is Logical Disk Fragmentation is High.

Here is how it usually goes down:
1. You import the Microsoft Windows Server Management Packs on Monday.

2. You spend a day or two tuning out the memory and CPU spikes that fall into a range that you would consider noise.

3. By the end of the week you are feeling pretty good about yourself, and you decide to set up some notification channels and subscriptions for some peace of mind over the weekend.

4. You wake up Saturday morning to find your inbox or console full of Disk Fragmentation warnings.
What have you done wrong? Nothing. By default, the Windows Server MP runs its disk fragmentation check every Saturday at 3:00 AM, so unless you preemptively made the necessary overrides for your environment, you will be treated to this nice little surprise.

Certainly not the end of the world, but here is where many admins screw up. SCOM offers the wonderful functionality of being able to fix the fragmentation problem with two clicks via the Logical Disk Defragmentation task.

You have fragmented disks, and you can defrag them in two clicks. How is this not a good thing?

The first question you have to ask before even thinking about defragging a server via SCOM is whether the fragmented server is virtual or physical. If the server is physical, the answer to whether you should defrag is maybe. If the server is virtual, the answer is no.
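That rule of thumb can be written down as a small triage helper. This is only a sketch: the `Server` record, its `is_virtual` flag, and the server names are hypothetical stand-ins for however your inventory tells virtual and physical machines apart, and nothing here talks to SCOM.

```python
from dataclasses import dataclass


@dataclass
class Server:
    # Hypothetical inventory record; not a SCOM object.
    name: str
    is_virtual: bool


def defrag_advice(server: Server) -> str:
    """Apply the rule of thumb: virtual -> "no", physical -> "maybe"."""
    if server.is_virtual:
        return "no"
    # Physical servers still need a case-by-case look before defragging.
    return "maybe"


if __name__ == "__main__":
    for s in (Server("sql-phys-01", False), Server("web-vm-07", True)):
        print(f"{s.name}: defrag? {defrag_advice(s)}")
```

Running the advice over the list of servers that raised the alert gives a quick way to separate the machines worth a closer look from the ones where the task should stay untouched.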

For a good explanation of the implications of defragging a virtual server, I would recommend Cormac Hogan’s September 20, 2011 post on the VMware vSphere Blog.

The post is specific to why this is a really bad idea for VMware shops, but much of the reasoning is applicable to any virtualized server.

The key points that apply to any virtualized server are:

1. You are unlikely to see any benefit from defragging virtual disks, since multiple VMs generally run on any given datastore/storage pool in an enterprise environment.

2. If any of your disks are thin-provisioned, the defrag process will cause your VMDK/VHD to expand and chew up space unnecessarily. (If you were to defrag a large number of thin-provisioned servers at the same time, you could theoretically cause an outage if any of your datastores/storage pools are oversubscribed.)

3. Defragging creates more I/O which could result in a temporary drop in performance for the duration of the defrag.
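The oversubscription risk in point 2 is easy to check with some back-of-the-envelope arithmetic. The sketch below uses made-up numbers and a hypothetical `ThinDisk` record. It assumes the worst case, in which every thin-provisioned disk grows to its full provisioned size; a real defrag may expand the disks by less than that.

```python
from dataclasses import dataclass


@dataclass
class ThinDisk:
    # Hypothetical record for illustration; sizes are in GB.
    name: str
    provisioned_gb: float
    used_gb: float


def worst_case_growth_gb(disks):
    """Extra space consumed if every thin disk expands to its provisioned size."""
    return sum(d.provisioned_gb - d.used_gb for d in disks)


def datastore_at_risk(capacity_gb, disks):
    """True if worst-case expansion would exceed the datastore's capacity."""
    used_now = sum(d.used_gb for d in disks)
    return used_now + worst_case_growth_gb(disks) > capacity_gb


if __name__ == "__main__":
    # Made-up example: 1000 GB provisioned on an 800 GB datastore.
    disks = [
        ThinDisk("vm-a", 200, 60),
        ThinDisk("vm-b", 300, 120),
        ThinDisk("vm-c", 500, 150),
    ]
    print("Worst-case growth (GB):", worst_case_growth_gb(disks))
    print("Datastore at risk:", datastore_at_risk(800, disks))
```

In this example the three disks use 330 GB today but are provisioned for 1000 GB in total, so full expansion would not fit on the 800 GB datastore.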

 

The contents of this site are provided “AS IS” with no warranties, or rights conferred. Example code could harm your environment, and is not intended for production use. Content represents point in time snapshots of information and may no longer be accurate. (I work @ MSFT. Thoughts and opinions are my own.)