Reading List: SCOMzilla/scom_atlas

Continuing this week’s series on blogs you may not know about but should read, I present Matt T’s blog:

http://blogs.technet.com/b/scom_atlas/

Best known for his awesome scheduled maintenance mode SCORCH runbook solution, Matt T is an excellent source of SCOM/SCORCH-related knowledge.


Reading List: The OMX Blog

For whatever reason there are a handful of amazing SCOM blogs that have never gotten the attention that they deserve. I am going to call out a few over the next week, but I wanted to start out with this one:

http://blogs.technet.com/b/omx/

There aren’t a ton of posts, but it is well worth subscribing to, as each post represents a truly unique contribution to SCOM monitoring knowledge. I have had the pleasure of sitting in a room and listening to Michael Repperger talk about SCOM and monitoring twice now, and each time I learn an enormous amount. Last week I sat and watched as his session’s end time rolled past; while it was clear his staff had places to go, a handful of us would have been happy to order dinner and keep him talking for as long as possible. He is without question one of the brightest minds working with SCOM.

How do I: Alert on SQL errors that aren’t logged to the Windows event log

This is one of those questions where, if you ask a SQL DBA, they will probably know the answer, but it is less common knowledge within the SCOM community.

First, if you want to get a sense of all the errors that SQL can generate to its own internal logs, run the following from your server (the language ID will of course vary):

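The original post showed the query as a screenshot; a minimal equivalent, assuming the US English language ID 1033, is:

SELECT message_id, severity, is_event_logged, text
FROM sys.messages
WHERE language_id = 1033;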

For my SQL 2014 server this returns 11,548 rows of messages. The relevant columns of sys.messages are:

message_id (int): ID of the message. Unique across the server. Message IDs less than 50000 are system messages.
language_id (smallint): Language ID for which the text in text is used, as defined in syslanguages. This is unique for a specified message_id.
severity (tinyint): Severity level of the message, between 1 and 25. This is the same for all message languages within a message_id.
is_event_logged (bit): 1 = message is event-logged when an error is raised. This is the same for all message languages within a message_id.
text (nvarchar(2048)): Text of the message used when the corresponding language_id is active.

For the most part the SQL MPs will give you access to any of the events you might care about in both the SQL and Windows Application event logs. In those cases where this doesn’t happen, there is a built-in stored procedure in SQL that lets you write SQL errors to the Windows Application log so you can pick them up in other systems like SCOM:

sp_altermessage

If you dive into the code for the SQL replication MPs you will find that this is exactly how replication monitoring is implemented in SCOM: a series of sp_altermessage commands turns on Application log logging for the various replication errors, followed by corresponding event ID targeted alert generating rules.
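The call itself is simple; as an illustration (message ID 14157 is just an example of a replication related message, not necessarily one the MP alters):

EXEC sp_altermessage 14157, 'WITH_LOG', 'true';

From there a standard event rule targeting the resulting Application log event picks it up in SCOM.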


https://msdn.microsoft.com/en-us/library/ms175094.aspx

The effect of sp_altermessage with the WITH_LOG option is similar to that of the RAISERROR WITH LOG parameter, except that sp_altermessage changes the logging behavior of an existing message. If a message has been altered to be WITH_LOG, it is always written to the Windows application log, regardless of how a user invokes the error. Even if RAISERROR is executed without the WITH_LOG option, the error is written to the Windows application log.

If a message is written to the Windows application log, it is also written to the Database Engine error log file.


Troubleshooting: Database Status Monitor generates warning for secondary database in log shipped pair

This is a question that I worked on for one of my customers a while back, but I was giving a talk on extending the SQL MP this week out in Redmond, so it seems like a good time to get this one on the blog.

The root of this problem is that the Database Status Monitor in the SQL Management Packs is not Log Shipping aware. For those who don’t live and breathe SQL, the basic definition of Log Shipping, lifted from the MSFT documentation, is:

SQL Server Log shipping allows you to automatically send transaction log backups from a primary database on a primary server instance to one or more secondary databases on separate secondary server instances.

-Provides a disaster-recovery solution for a single primary database and one or more secondary databases, each on a separate instance of SQL Server.

-Supports limited read-only access to secondary databases (during the interval between restore jobs).

-Allows a user-specified delay between when the primary server backs up the log of the primary database and when the secondary servers must restore (apply) the log backup. A longer delay can be useful, for example, if data is accidentally changed on the primary database. If the accidental change is noticed quickly, a delay can let you retrieve still unchanged data from a secondary database before the change is reflected there.

The Database Status Monitor can return the following possible states within SQL:

[Screenshot: the possible database states]

These roll up under the following three health states:

[Screenshot: the three health states]

So in the case of Log Shipping the issue is that “Restoring” is normal behavior for the secondary database of a log shipped pair, but it would not be typical behavior for a standard database. You get Warning alerts because the monitor isn’t smart enough to detect the difference between a log shipped and a non log shipped DB.

To understand this better it can help to look at the history of the embedded T-SQL code in the MP for this monitor. The original code in the SQL MPs looked as follows:

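The original screenshots are lost, but a minimal sketch of a status query of that shape (an approximation, not the MP’s verbatim code) is:

SELECT name, DATABASEPROPERTYEX(name, 'Status') AS [Status]
FROM master.sys.databases;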

It is very simple, but it isn’t mirroring or log shipping aware.

Then in later revisions of the SQL MP, mirroring awareness was added to the query. This first appeared in just the SQL 2008 MP, but around the 6.5+ releases all versions of the SQL MP except the SQL 2005 MPs were made mirroring aware via the following code modifications:

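Again as an approximation rather than the MP’s verbatim code, the mirroring aware version joins sys.database_mirroring so the monitor can tell a restoring mirror from a genuinely unhealthy database:

SELECT d.name,
       DATABASEPROPERTYEX(d.name, 'Status') AS [Status],
       m.mirroring_role_desc
FROM sys.databases d
LEFT OUTER JOIN sys.database_mirroring m
    ON d.database_id = m.database_id;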

This code, plus a little XPathQuery logic at the end of the MP, fixes the false warning alerts for mirroring, but a secondary log shipped database still shows up as a Warning in the current release of the SQL MP.

I will put up the full sample code for the custom monitor soon on TechNet Gallery, but the basic modifications are as follows:

Add a left outer join for msdb.dbo.log_shipping_secondary_databases, as sketched below.

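A sketch of that join (my approximation; the column alias IsLogShipping is illustrative):

SELECT d.name,
       DATABASEPROPERTYEX(d.name, 'Status') AS [Status],
       CASE WHEN ls.secondary_database IS NOT NULL THEN 'true' ELSE 'false' END AS [IsLogShipping]
FROM sys.databases d
LEFT OUTER JOIN msdb.dbo.log_shipping_secondary_databases ls
    ON d.name = ls.secondary_database;

The IsLogShipping flag is what gets pushed into the property bag in the next step.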

Add the additional property bag value:

[Screenshot: the IsLogShipping property bag value being added in the data source script]

Modify the XPathQuery so that a database flagged as IsLogShipping maps to a Healthy condition, as in the sketch below.

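The shape of that condition in MP XML, assuming the property bag value is named IsLogShipping (a sketch, not the pack’s verbatim markup):

<Expression>
  <SimpleExpression>
    <ValueExpression>
      <XPathQuery Type="String">Property[@Name='IsLogShipping']</XPathQuery>
    </ValueExpression>
    <Operator>Equal</Operator>
    <ValueExpression>
      <Value Type="String">true</Value>
    </ValueExpression>
  </SimpleExpression>
</Expression>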

And then you have a nice custom Log Shipping and Mirroring aware Database Status Monitor. I tend to isolate the monitor and any dependencies into a standalone custom MP and then disable the out-of-box monitor.


Troubleshooting: SCOM DW Database is in a Suspect State


Comic Credit: Abstruse Goose

So we had some severe thunderstorms roll through this past week and they took out the power at the house, which in turn took out the power to my test servers. I generally have my servers plugged into a UPS so I can gracefully shut them down during a power outage, but I was a bit lax in my unpacking after the recent move and my office is still a work in progress, so when the power went out my SCOM environment didn’t exactly take it well. {Insert ad for Azure here ;o)} Once the power came back on a day later and I booted everything up, I was greeted by the messages below:

[Screenshot: the error messages]

When I checked SQL Management Studio I found that the DW was in a suspect state:

[Screenshot: the DW database marked Suspect]

For a quick primer on the various states a database can be in, check the chart below:

ONLINE: Database is available for access. The primary filegroup is online, although the undo phase of recovery may not have been completed.
OFFLINE: Database is unavailable. A database becomes offline by explicit user action and remains offline until additional user action is taken. For example, the database may be taken offline in order to move a file to a new disk. The database is then brought back online after the move has been completed.
RESTORING: One or more files of the primary filegroup are being restored, or one or more secondary files are being restored offline. The database is unavailable.
RECOVERING: Database is being recovered. The recovering process is a transient state; the database will automatically become online if the recovery succeeds. If the recovery fails, the database will become suspect. The database is unavailable.
RECOVERY PENDING: SQL Server has encountered a resource-related error during recovery. The database is not damaged, but files may be missing or system resource limitations may be preventing it from starting. The database is unavailable. Additional action by the user is required to resolve the error and let the recovery process be completed.
SUSPECT: At least the primary filegroup is suspect and may be damaged. The database cannot be recovered during startup of SQL Server. The database is unavailable. Additional action by the user is required to resolve the problem.
EMERGENCY: User has changed the database and set the status to EMERGENCY. The database is in single-user mode and may be repaired or restored. The database is marked READ_ONLY, logging is disabled, and access is limited to members of the sysadmin fixed server role. EMERGENCY is primarily used for troubleshooting purposes. For example, a database marked as suspect can be set to the EMERGENCY state. This could permit the system administrator read-only access to the database. Only members of the sysadmin fixed server role can set a database to the EMERGENCY state.

Suffice it to say, Suspect is not a good state for your database to be in, particularly if you know it is the result of a loss of power. If you have a nice, well-oiled maintenance/DR plan, this would typically be where you break out the backup files and restore the database to the most recent corruption-free backup.

But sometimes, for whatever reason, you aren’t going to have a backup. Maybe your maintenance plan failed, you go to retrieve a backup and realize it too is corrupt, you forgot to set up backups for your test environment, etc. Generally, if this were a production system and you didn’t have a backup, I would consider it an RGE, or Resume Generating Event. RGEs should be avoided at all costs. (If this happens to a production system I would highly advise opening a support case with Microsoft so that you can get assistance from engineers who are experts in SQL.)

However, as this is just my test environment, and I spin up new test environments on a fairly frequent basis, I am going to show you another way to deal with Suspect databases. It isn’t pretty or recommended; it’s irreversible and will almost always result in data loss. But if you are in a pinch with a test environment without a viable backup and need to get a Suspect database back online, you can use the following procedure.

First we need to put the database into an Emergency state:

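The screenshot showed the ALTER DATABASE command; assuming the default data warehouse name OperationsManagerDW, it is:

ALTER DATABASE [OperationsManagerDW] SET EMERGENCY;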

SQL Management Studio sometimes takes a little while to refresh and show the updated state, so I will usually just query the database states directly to confirm that everything worked:

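The query in the screenshots was along these lines:

SELECT name, state_desc FROM sys.databases;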

From here we need to set the database to Single User Mode, and we also need to stop the management servers from trying to connect to the database; to do this I just stop the SCOM related services on each of the management servers. I then run a DBCC CHECKDB. Note the REPAIR_ALLOW_DATA_LOSS option: this is the not so pretty, irreversible part I was discussing earlier. I would also advise splitting up the SET SINGLE_USER so that you can confirm it worked before kicking off the CHECKDB. I once waited a very long time thinking that I was repairing the database when in reality I was stuck on the SET SINGLE_USER command and the CHECKDB hadn’t even begun.

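A sketch of the sequence, again assuming the default OperationsManagerDW name (run the two statements separately, per the note above):

ALTER DATABASE [OperationsManagerDW] SET SINGLE_USER;

DBCC CHECKDB ('OperationsManagerDW', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS;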

This can take a while, so don’t be surprised if you have to wait; it might be a good time for a coffee break. Since you have the WITH ALL_ERRORMSGS option you aren’t going to miss anything important.


You will then get a nice series of messages that will look something like:

[Screenshots: the CHECKDB output and repair messages]

If we look at database status in the Object Explorer it still shows the DB as being in an emergency state, but if we query sys.databases we can see it is now online:

[Screenshot: sys.databases showing the DW ONLINE]

Hit refresh and you should now see that the DW is back online, but it is still in Single User Mode:

[Screenshot: the DW online in Single User Mode]

Run one more ALTER DATABASE and you will be all set. (Also restart the services on your management servers so they can reconnect to the DB.)

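The final command, again assuming the OperationsManagerDW name:

ALTER DATABASE [OperationsManagerDW] SET MULTI_USER;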

For more info on this process I recommend checking out Paul Randal’s post; he wrote a lot of the code behind CHECKDB back when he worked at MSFT, and he articulates far better why the above method should only be used as a last resort.


Comic Relief: Computer Programming 101

 


I tend to try to hold myself to the standard that if I can’t explain something simply I don’t understand it well enough, but there are those days where this seems like a better option:

[Comic: “under the hood”]

Comic Credit: Abstruse Goose

Happy Friday all, more to come next week.


Management Packs: SharePoint 2013 Extended MP

I had a question recently about whether or not the SharePoint 2013 Management Pack monitors when services are stopped.

As with most questions, this is not quite as straightforward as you would expect. With SharePoint there are two kinds of services. There are your standard services that you access via services.msc, which are easy to make custom monitors for:

[Screenshot: SharePoint services in services.msc]

And there are also those services that are only accessible via the SharePoint Central Administration console:

[Screenshot: services managed via SharePoint Central Administration]

So the native monitoring available in the pack differs depending on what type of SharePoint service we are talking about. The services of the .msc persuasion are easy enough to create custom monitoring around, but does the pack monitor the specialized internal services? If we look at the 2013 pack we find the SharePoint specific services are split up into three state views: Service Front End, Services, & Shared Services:

[Screenshot: the three state views]

Each is populated as follows:

[Screenshots: how each of the three state views is populated]

So this is normally the point where I would put up my feet and bask in the glory of the product team. The SharePoint services that can only be accessed via the Central Administration interface are all enumerated above and they are healthy. Question answered, all is well, and there was much rejoicing throughout the land. Unfortunately, I am prone to a nasty habit of testing things. So before breaking into celebratory dance I stopped one of these services and waited impatiently for the inevitable alert to roll in.

[Screenshot: stopping the service in Central Administration]

I waited for a long time. I refreshed many times. No alert came. I was sad. I also noticed that the state view for the relevant service now showed “Not Monitored” whereas before I stopped the service it showed healthy.

[Screenshot: the state view showing Not Monitored]

{Insert appropriate Expletive Here}

What you find if you open the Health Explorer for any of these SharePoint specific services is that while there are monitors associated with them, they don’t actually monitor the status of the service. So if a service is stopped you lose monitoring around it, and you will never know unless you go to the view and note the Not Monitored state.

[Screenshot: Health Explorer for one of these services]

Part of me really wants to know why this is the case. My completely uninformed guess is that since SharePoint services can exist on multiple members of the Farm, and by design you may only want them running on certain members, it is hard to code a solution that isn’t going to be extremely noisy when looking at service status: as the MP author you have no idea whether a service being stopped on one server is a critical situation, whereas on an identical server it could be by design and non-actionable.

So how do we fix this so that you can selectively monitor service status for certain SharePoint servers and get alerts? First we have to know a little bit about how SharePoint services work, which at the time I didn’t. I started by stopping services via the Central Administration site and then trawling all relevant event logs on the SharePoint servers hoping for a nice event ID that I could use.

I will hopefully save you some time by telling you that there is no such event ID. There are some related events that get generated and will say hopeful things like X Service is Stopping or Starting, but you will find these don’t actually correlate with the time you stopped the service.

So this left me with PowerShell.

The basic mechanics of dealing with SharePoint services via PowerShell are below. First, to do anything you need to add the relevant snap-in. Then you can use Get-SPServiceInstance to get a list of the services:

Add-PSSnapin -Name "Microsoft.SharePoint.PowerShell" -ErrorAction SilentlyContinue

Get-SPServiceInstance | ft TypeName, Identity

[Screenshot: Get-SPServiceInstance output]

Then you can use the script below to check status/start/stop the service as you wish:

#To Stop a Service:

$ServiceName = "Visio Graphics Service"

Get-SPServiceInstance -server $env:COMPUTERNAME | where-object {$_.TypeName -eq $ServiceName} | Stop-SPServiceInstance -confirm:$false > $null

#To get Service Status:

Get-SPServiceInstance -server $env:COMPUTERNAME | where-object {$_.TypeName -eq $ServiceName} | Select Status

#To Start a Service:

Get-SPServiceInstance -server $env:COMPUTERNAME | where-object {$_.TypeName -eq $ServiceName} | Start-SPServiceInstance -confirm:$false > $null

From my testing I found that there are four different states that a SharePoint service can be in:

-When a service is initially stopped it will temporarily be listed as Unprovisioning.

-Once fully stopped, the service status will be Disabled.

-When a service is initially started it will be listed as Provisioning.

-Once fully started, the service status will be Online.

The reason this is important is that if you create a custom script-based PowerShell monitor, you need to either account for each of these states in your script, or alternatively treat any state other than Online as Unhealthy. From here I had enough info to start writing a custom management pack. My original script treated ‘Provisioning’ and ‘Online’ as Healthy and ‘Unprovisioning’ and ‘Disabled’ as Unhealthy. This was changed in a later revision so that only Online is considered Healthy, since you could theoretically have a situation where a service is stuck in a Provisioning state for a protracted period of time. I then explicitly gave ‘Provisioning’, ‘Unprovisioning’, and ‘Disabled’ a status of Unhealthy. This allowed me to add one additional status whereby, if the script returns anything other than an expected state, it flips to a Warning state indicating that either the Run As account is not properly configured or the service isn’t actually present on the server you are targeting.

The basic script is below:

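The screenshot is gone, but a minimal sketch of the logic described above looks like this (my approximation; the property bag values and the $ServiceName parameter are illustrative, not the pack’s exact script):

param($ServiceName)
Add-PSSnapin -Name "Microsoft.SharePoint.PowerShell" -ErrorAction SilentlyContinue
# SCOM script monitors return their results via a property bag
$api = New-Object -ComObject "MOM.ScriptAPI"
$bag = $api.CreatePropertyBag()
$instance = Get-SPServiceInstance -Server $env:COMPUTERNAME |
    Where-Object { $_.TypeName -eq $ServiceName }
if ($instance -and "$($instance.Status)" -eq "Online") {
    $bag.AddValue("State", "Healthy")
}
elseif ($instance) {
    # Provisioning, Unprovisioning, and Disabled are all treated as Unhealthy
    $bag.AddValue("State", "Unhealthy")
}
else {
    # Unexpected result: the Run As account may lack rights, or the service is not on this server
    $bag.AddValue("State", "Warning")
}
$bag

The monitor’s condition detection then maps the State value in the property bag to the corresponding health state.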

I will skip over the details of how I converted this into a functional management pack; the end result is an extended MP. By default all monitors are disabled and targeted at the SharePoint 2013 Server class. You will need to override and enable them on a case by case basis, as it is likely that your SharePoint team will only care about monitoring certain services on certain members of the Farm.

[Screenshot: the monitors in the extended MP]

Since most SharePoint environments are not fully accessible via the default action account, I have added a custom Run As Profile:

[Screenshot: the custom Run As Profile]

Just add/associate your Run As account that has SharePoint Farm Admin privileges and you should be all set.

If you neglect to configure the Run As or you enable the monitor on a server that doesn’t have the target service you will get the following:

[Screenshot: the Run As warning]

I tweaked the associated alert severity for the Run As warning to Critical just so it won’t get missed.

[Screenshot: the severity override]

Other than that the pack is fairly self explanatory; when a service is stopped you will get an alert within 15 minutes, as below:

[Screenshot: the service stopped alert]

As with all MPs of this nature, this is a proof of concept “As-Is” pack. Script based monitoring (particularly PowerShell based monitoring) can be some of the most intensive monitoring you can perform on a system in terms of overhead, so you should test thoroughly first and ensure that the extra scripts running are not negatively impacting performance.

Currently the pack does not implement cookdown. Normally you would want to limit impact by having a module that runs a single script querying the status of all relevant services at once, which is then referenced by multiple monitors. Currently there is a 1:1 relationship for each monitor, so there is a separate script for each monitor and thus a higher impact on the system. If time allows I will rewrite the pack to properly implement cookdown, but in the meantime feel free to take apart the MP and build something better/more efficient; this should at least help get you started.

Download from: TechNet Gallery


Best Practices: How to load test your website, Part I

Earlier this week I received a question about using SCOM to do large scale load testing on a website. While SCOM can be a great resource, with synthetic transactions to simulate a finite number of user transactions and APM to give you code level instrumentation of how your .NET app is performing, it isn’t really designed as a large scale load test rig. You could certainly set up a large number of synthetic transactions executed from anywhere you have a SCOM agent, but when you start talking about the need to simulate 10,000+ transactions against a single site you are entering territory where SCOM isn’t the right tool for the job.

If you want to do some basic load testing you can do this natively inside Visual Studio Online:

https://www.visualstudio.com/en-us/products/what-is-visual-studio-online-vs.aspx

[Screenshot: the load test option in Visual Studio Online]

First enter the URL of the site you want to load test & give your test a name:

[Screenshots: the URL and test name fields]

Next pick which Microsoft Azure Datacenter you want your test to be executed from:

[Screenshot: datacenter selection]

Select the number of test users. Simple tests are capped at a max of 200 users (you get 20,000 free user minutes per month):

[Screenshot: user load selection]

Select how long you want the test to run for:

[Screenshot: test duration]

Set the think time:

[Screenshot: think time setting]

Select a browser distribution:

[Screenshot: browser distribution]

Then just click Test now:

[Screenshot: the Test now button]

It will take a few minutes to acquire the necessary resources and configure the agents:

[Screenshot: test initializing]

Results will appear as below:

[Screenshots: the load test results]

For more advanced load testing check out the Part II post.


Get-NewHome


Comic Credit: Abstruse Goose

Apologies for the blog going dark for a little bit. After ten years living in the Washington DC area we decided to give up the amenities of city life (goodbye 80 Mbps Comcast connection, hello 8 Mbps DSL) and bought a small farm in the mountains. In the interest of full disclosure, I may from time to time break from the regularly scheduled programming to blog about life on the farm. I will do my best to mark these posts clearly or to figure out some way for them to circle back to the central theme of this blog.

But anyway, the blog is back. More posts to come, but for now, the view of the backyard:

[Photo: the view of the backyard]