Category Archives: Troubleshooting

Troubleshooting: Database Status Monitor generates warning for secondary database in log shipped pair

This is a question that I worked on for one of my customers awhile back, but I was giving a talk on extending the SQL MP this week out in Redmond so it seems like a good time to get this one on the blog.

The root of this problem is that the Database Status Monitor in the SQL Management Packs is not Log Shipping aware. For those who don’t live and breath SQL the basic definition of Log Shipping lifted from the MSFT documentation is:

SQL Server Log shipping allows you to automatically send transaction log backups from a primary database on a primary server instance to one or more secondary databases on separate secondary server instances.

-Provides a disaster-recovery solution for a single primary database and one or more secondary databases, each on a separate instance of SQL Server.

-Supports limited read-only access to secondary databases (during the interval between restore jobs).

-Allows a user-specified delay between when the primary server backs up the log of the primary database and when the secondary servers must restore (apply) the log backup. A longer delay can be useful, for example, if data is accidentally changed on the primary database. If the accidental change is noticed quickly, a delay can let you retrieve still unchanged data from a secondary database before the change is reflected there.

The Database Status Monitor can return the following possible states within SQL:

01

Which roll up under the following three health states:

02

So in the case of Log Shipping the issue is that for the secondary database “Restoring” is normal behavior for a log shipped DB, but it would not be typical behavior for a standard database. So you get Warning Alerts because the monitor isn’t smart enough to detect the difference between a log shipped and non log shipped DB.

To understand this better it can help to look at the history of the embedded T-SQL code in the MP for this monitor. The original code in the SQL MPs looked as follows:

03

04

It is very simple, but it isn’t mirroring or log shipping aware.

Then in later revisions of the SQL MP Mirroring Awareness was added to the query. This was first added for just SQL 2008, but then around the 6.5+ releases all versions of the SQL MP were made Mirroring aware with the exception of the SQL 2005 MPs via the following code modifications:

05 06 07

This code + a little XPathQuery logic at the end of the MP fixes the false warning alerts for mirroring, but a secondary log shipped database still shows up as a Warning today in the current release of the SQL MP.

I will put up the full sample code for the custom monitor soon on TechNet Gallery, but the basic modifications are as follows:

Add a left outer join for msdb.dbo.log_shipping_secondary_databaseses

09

Add the additional property bag value

10

Modify the XPathQuery so that -IsLogShipping is a Healthy condition.

11 12 13

And then you have a nice custom LogShipping & Mirroring aware Database Status Monitor. I tend to isolate the monitor and any dependencies into a standalone custom MP and then disable the out of box Monitor.

Tagged , , ,

Troubleshooting: SCOM DW Database is in a Suspect State

324

Comic Credit: Abstruse Goose

So we had some severe thunderstorms roll through this past week and it took out the power at the house. This in turn took out the power to my test servers. I generally have my servers plugged into a UPS so I could gracefully shut them down during a power outage, but I was a bit lax in my unpacking after the recent move and my office is still a work in progress so when the power went out my SCOM environment didn’t exactly take it well. {Insert ad for Azure here ;o)} Once the power came back on a day later and I booted everything up I was greeted by the messages below:

01

When I checked SQL Management Studio I found that DW was in a suspect state:

02

For a quick primer on the various states a database can be in check the chart below:

State Definition
ONLINE Database is available for access. The primary filegroup is online, although the undo phase of recovery may not have been completed.
OFFLINE Database is unavailable. A database becomes offline by explicit user action and remains offline until additional user action is taken. For example, the database may be taken offline in order to move a file to a new disk. The database is then brought back online after the move has been completed.
RESTORING One or more files of the primary filegroup are being restored, or one or more secondary files are being restored offline. The database is unavailable.
RECOVERING Database is being recovered. The recovering process is a transient state; the database will automatically become online if the recovery succeeds. If the recovery fails, the database will become suspect. The database is unavailable.
RECOVERY PENDING SQL Server has encountered a resource-related error during recovery. The database is not damaged, but files may be missing or system resource limitations may be preventing it from starting. The database is unavailable. Additional action by the user is required to resolve the error and let the recovery process be completed.
SUSPECT At least the primary filegroup is suspect and may be damaged. The database cannot be recovered during startup of SQL Server. The database is unavailable. Additional action by the user is required to resolve the problem.
EMERGENCY User has changed the database and set the status to EMERGENCY. The database is in single-user mode and may be repaired or restored. The database is marked READ_ONLY, logging is disabled, and access is limited to members of the sysadmin fixed server role. EMERGENCY is primarily used for troubleshooting purposes. For example, a database marked as suspect can be set to the EMERGENCY state. This could permit the system administrator read-only access to the database. Only members of the sysadmin fixed server role can set a database to the EMERGENCY state.

So suffice to say, Suspect is not a good state for your database to be in, particularly if you know it is the result of a loss of power. If you have a nice well oiled maintenance/DR plan this would typically be where you break out the backup files and restore the database to the most recent corruption free backup.

But sometimes for whatever reason you aren’t going to have a backup. Maybe your maintenance plan failed, you go to retrieve a backup and realize it too is corrupt, you forget to setup backups for your test environment etc. Generally if this were a production system and you didn’t have a backup I would consider it a RGE or Resume Generating Event. RGE’s should be avoided at all costs. (If this happens to a production system I would highly advise opening a support case with Microsoft so that you can get assistance from engineers who are experts in SQL.)

However, as this is just my test environment, and I spin up new test environments on a fairly frequent basis I am going to show you another way to deal with Suspect databases. It isn’t pretty, or recommended, it’s irreversible, and will almost always result in data-loss. But if you are in  a pinch with a test environment without a viable backup and need to get a Suspect database back online you can use the following procedure:

First we need to put the Database in an Emergency state

03

SQL Management Studio sometimes takes a little bit to refresh and show the updated state so I will just usually query the states of the tables directly to confirm that everything worked:

04

05

From here we need to Set the Database in Single User Mode, we also need to stop any of the management servers from trying to connect to the database. To do this I just stop the SCOM related services on any of the management servers. ( I also then run a DBCC CheckDB -note the REPAIR_ALLOW_DATA_LOSS – this is that not so pretty irreversible part I was discussing earlier) *I would also advise splitting up the Set Single_User such that you can confirm that worked first before kicking off the CheckDB. I once waited for a very long time thinking that I was repairing the database where in reality I was stuck on the Set Single_User command and the CheckDB hadn’t begun yet*

06

This can take awhile so don’t be surprised if you have to wait, might be a good time for a coffee break. Since you have the WITH ALL_ERRORMSGS you aren’t going to miss anything important:

07

You will then get a nice series of messages that will look something like:

08

09

If we look at database status in the Object Explorer it still shows the DB as being in an emergency state, but if we query sys.databases we can see it is now online:

10

Hit refresh and you should now see that the DW is back online but it is still in Single User Mode

11

Run one more ALTER DATABASE and you will be all set. (Also restart the services on your Management Servers so they can reconnect to the DB)

12

For more info on this process I recommend checking out Paul Randal’s post, he wrote a lot of the code behind CHECKDB back when he worked at MSFT and far better articulates why the above method should only be used as a last resort.

Tagged , , ,

Troubleshooting: SCOM reports yield weird data/what the heck does 9.221E+07 mean?

Eventually when running a report in SCOM you are going to end up with a report like the one below.

01

At first glance everything looks okay. But then you start looking at the data that was returned and it can sometimes be a little confusing.

02

Usually the questions I get from customers ranges from “I think this report is broken” to “what the heck does 9.221E+07 mean?”

Fear not, reporting is not broken and 9.221E+07 is not nearly as confusing as it may seem. Basically, what is going on is that the dataset you have returned is so large in regards to the number of digits that in order to display it in a meaningful way the system is presenting the data using some shorthand commonly known as scientific notation. All you need to understand is that +07 indicates the number of times the decimal point would need to be moved to the right to display the full number.

So 9.221E+07 = 92210000

And if we look at the top of the chart we will note that the particular performance counter that we are reporting on is being returned in Bytes so we are dealing with:

92210000 Bytes

For those of you who like me are not particularly mathematically inclined and prefer to leave conversions to someone else I recommend using the wonderful built-in functionality of PowerShell.

If you enter 9.221E+07 and hit enter it will automatically output the full value for you:

03

If the original unit–in this case Bytes–is not your unit of choice and you want to know what the value is in MB  just enter the value in scientific notation form and then divide by 1 MB:

9.221E+07  / 1MB

04

Same goes for GB

9.221E+07  / 1GB

05

 

Tagged , , , , , ,

Troubleshooting: SCOM Web Console 500 – Internal Server Error

This is a problem that I occasionally see crop up in customer environments, but until now I had never bothered to document the issue.

Symptoms:

Customer is able to log into the web console successfully but when they click on certain views like the active alerts view they see the following error:

01

Anytime there is an issue with the web console my first suggestion is to attempt to recreate the problem as localhost on the machine that hosts the webconsole. If the server has the same problem locally you know where to start troubleshooting, if it is only happening on client machines it could be firewall or some problem at the client level.

When we tested from SCOM web console server we get the same error, but this time with a little bit more information:

02

This may not seem useful, but this is actually telling us exactly where the problem is. Note the Version=2.0.0.0 in regards to .Net Framework. This is telling what version of the Framework is expected for this particular app pool.

If we navigate to IIS we see the following:

03

OperationsManagerMonitoringView is set to v.40, but it should be set to v.2.0. We need to edit the Basic Setting and select the drop down with v2.0

04

After that we need to recycle the Application Pool via the Application Pool task

05

Then just logout of the webconsole and back in and all will be well.

Tagged , , ,

Troubleshooting: Office 365 Management Pack (Could not load file or assembly ‘Microsoft.SystemCenter.O365.UI)

This will be a short post.

First, at long last the Office 365 Management Pack was released today! Yay & much rejoicing throughout the land!

The official announcement can be found here:

http://blogs.office.com/2014/07/29/new-office-365-admin-tools/

The download can be found here:

http://www.microsoft.com/en-us/download/details.aspx?id=43708

Naturally I immediately downloaded, read the MP Guide, and then imported into my test environment. The initial import seemed to go fine.

office 365

But then when I clicked the new Office 365 Admin Pane Item

office 365 2

I get a nice “Could not load file or assembly ‘Microsoft.SystemCenter.O365.UI” error

office 365 3

SOLUTION:

To make this error go away, close your OpsMgr console and reopen. Then you will get:

office 365 4

Full error text for interested parties:

Date: 7/29/2014 7:03:12 PM
Application: Operations Manager
Application Version: 7.1.10226.0
Severity: Error
Message:

System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.SystemCenter.O365.UI, Version=7.0.5000.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35’ or one of its dependencies. The system cannot find the file specified.
File name: ‘Microsoft.SystemCenter.O365.UI, Version=7.0.5000.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35’
   at System.Reflection.RuntimeAssembly._nLoad(AssemblyName fileName, String codeBase, Evidence assemblySecurity, RuntimeAssembly locationHint, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks)
   at System.Reflection.RuntimeAssembly.InternalLoadAssemblyName(AssemblyName assemblyRef, Evidence assemblySecurity, RuntimeAssembly reqAssembly, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean throwOnFileNotFound, Boolean forIntrospection, Boolean suppressSecurityChecks)
   at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, IntPtr pPrivHostBinder, Boolean forIntrospection)
   at System.Reflection.RuntimeAssembly.InternalLoad(String assemblyString, Evidence assemblySecurity, StackCrawlMark& stackMark, Boolean forIntrospection)
   at System.Reflection.Assembly.Load(String assemblyString)
   at Microsoft.EnterpriseManagement.ConsoleFramework.Wpf.ConsoleViewDisplayService.OpenViewJobArguments.LoadType()
   at Microsoft.EnterpriseManagement.ConsoleFramework.Wpf.ConsoleViewDisplayService.OpenViewJob(Object sender, ConsoleJobEventArgs args)
   at Microsoft.EnterpriseManagement.Mom.Internal.UI.Console.ConsoleJobExceptionHandler.ExecuteJob(IComponent component, EventHandler`1 job, Object sender, ConsoleJobEventArgs args)

WRN: Assembly binding logging is turned OFF.
To enable assembly bind failure logging, set the registry value [HKLM\Software\Microsoft\Fusion!EnableLog] (DWORD) to 1.
Note: There is some performance penalty associated with assembly bind failure logging.
To turn this feature off, remove the registry value [HKLM\Software\Microsoft\Fusion!EnableLog].

Tagged , ,

Troubleshooting: SQL MP 6.5.1 with SCOM 2007 R2 (SQL 2012 MP Imports Fail)

So on 6/30 a new SQL MP was released. It has some great new features and fixes:

http://www.microsoft.com/en-us/download/details.aspx?id=10631

and some awesome new instance level dashboards for both SQL 2008 and 2012 when used with SCOM 2012+

SQL MP

Unfortunately, the MP also doesn’t play nice with SCOM 2007 R2 and its predecessor 6.4.1 MP. When you try to import it into your 2007 R2 environment with the previous version of the SQL MP installed you will get failures of all the 2012 MP components

SQL01

This error is not particularly helpful but if you go to the OpsMgr Event log it helps narrow things down:

SQL02

So the workaround is to remove the SQL 2012 6.4.1 MPs and then reimport 6.5.1:

sql03

This should fix the issue, but ultimately you should really upgrade to 2012 R2 as 2007 R2 fell out of mainstream support on 7/8/2014 so any new MP’s going forward will likely be the 2.0 schema and completely incompatible.

Tagged , , ,

Troubleshooting: The installed version of SQL Server is Not Supported (SCOM 2012)

Awhile back I rebuilt one of my test environments. Post rebuild something very strange happened- I could not for the life of me get SCOM reporting to install. All the initial pre-req checks would pass, everything else would install just fine, but I would keep hitting this error.

If you mouse over the little Red X you would get the following:

If you consult the install log files in %userprofile%\AppData\Local\SCOM\LOGS I would find:

Searching for the error online returns a number of posts which while well meaning offer solutions which are unfortunately ultimately not very helpful.

I then spun up a brand new all in one test environment just to try to narrow things down and found that once again the error was present even though the installed version of SQL was a supported version.

After more troubleshooting than I would like to admit this left me with one option, there was something wrong with my SQL media I was using. At first glance it looks just like any other SQL media I have downloaded from MSDN:

But then I looked at the entire name of the media file:

Somehow in a moment of test environment building delirium I had downloaded an x86 copy of SQL 2012 Enterprise, and apparently one of the little known side effects of accidentally installing 32-bit SQL on a 64-bit Operating System is that you will get an SRS Couldn’t Check Version Exception, but everything else will install and work just fine.

I have come across a few instances of other people reporting this problem on the forums, but never actually arriving at a solution. Hopefully this post will be of some use. Once 64-bit SQL was installed on 64-bit Windows Server 2012 everything installs fine as it always has in the past.

 

Tagged , , , ,

Troubleshooting: Server 2012 Error Copying File to Folder

I have noticed that Server 2012 has a bug where copying large files between an RDP session fails:

server 2012

Anything over a few gig and I get an “Error Copying File or Folder.” File copy of large files over RDP has always been a little shaky, and eventually Microsoft will have a fix, but in the meantime if you find yourself running into this problem you can get around it by copying the files using UNC paths.

In Windows Explorer type: \\servername\c$ and then copy your files from there. \\Ipaddress\c$ should work as well you just have to make sure windows firewall is open to file and print sharing between the two systems.

Tagged , , ,

Troubleshooting: WSUS with Server 2012

A few weeks ago I had a request to setup a new WSUS server running on server 2012. The setup was easy, but once I had turned it over to our client systems group they were trying to figure out how to have it run on port 80 for testing purposes.  By default WSUS on Server 2012 uses port 8530 for Windows Updates. They quickly discovered that modifying the bindings in IIS won’t work in this case.

To modify WSUS to use port 80 the wsusutil tool is the preferred method.

The tool is located in c:\Program Files\Update Services\Tools

The command to change the WSUS website to use port 80 is: wsusutil usecustomwebsite false

wsus

Tagged , , ,

Troubleshooting: SCOM Agent Healthy, but availability report for server shows monitoring unavailable

This was definitely an odd one. I noticed that one of our systems was showing as having a healthy SCOM Agent yet it if you ran an availability report against the windows computer object it would show monitoring as being unavailable. After confirming that the data warehouse was not running behind I found that this was actually happening with more than one of our servers.

Running an availability report would look as follows:

01

Brody Kilpatrick has a nice post on his blog explaining one of the possible causes and solutions which involves running some unsupported scripts against the data warehouse. I highly recommend reading his post and all credit for this solution must go to him. With that said, I found that the SQL queries he posted have issues that caused them to fail, at least in my environment. (Brody responded that he is updating the queries so it is likely that by the time you read this they will be fixed.) There were also some slight discrepancies between the results of his queries and my results so I opted to use his work as a template, but to modify things ever so slightly so that it would actually work in my environment which is running OpsMgr 2012 SP1 with the datawarehouse running on a dedicated Server 2008 r2 box running SQL 2008 R2.

First on your datawarehouse server you are going to want to run the following query:

02

If nothing is returned, that is fantastic, and you aren’t experiencing the problem this post will solve. If you do get results they will look something like this:

3

 

The EndDateTime with Null is not necessarily indicative of a problem. In some cases it was just a server that had been shutdown for a period of time, but had not been removed from SCOM. However, some of these NULL’s were for the servers that were showing healthy SCOM agents with availability reporting showing monitoring unavailable.

As useful as HealthServiceOutageRowId is it can be helpful to actually know the name of the associated system. Run the following query to join in Name and DisplayName:

04

Your results should look like this with the right-most DisplayName column providing the FQDN of the affected system:

05

At this point Brody’s post recommends confirming that the systems are all experiencing the problem, backing up your datawarehouse, and at your own risk modifying the values of the EndDateTime column via custom SQL. I tend to be a little risk averse, at least in my production environments so the first thing I tried now that I had narrowed down the issue was to uninstall the SCOM agent from one of the misbehaving systems, and then immediately reinstalling it. For that system this resolved the issue immediately with the proper availability monitoring returning post reinstall:

06

However, one of my affected servers was a domain controller which had a manually installed agent. I had no way of uninstalling, and reinstalling the agent without bugging our domain administrator.

So for this case I backed up the datawarehouse and then did the following (Again you could do this via raw SQL, but sometimes I think it is easier to have a clear understanding of what you are doing to a database rather than just copying some code someone else wrote)

Please keep in mind this solution is not supported by Microsoft:

Right click the dbo.HealthServiceOutage table:

07

Select Edit Top 200 Rows:

08

In the right hand properties box hit the + sign next to Top Specification and increase the Expression value to include the value of the HealthServiceOutageRowID of the sytem you want to fix:

09

At the bottom of your query you will see query changed, right click and select Execute SQL:

10

Scroll down to the HealthServiceOutageRowID which matches your affected server. The EndDateTime should show Null. Copy the value from the StartDateTime, and paste it into the box for the EndDateTime and close out of the editor.

11

And then for good measure run this script again to confirm that the your modification worked and the server should no longer be returned:

 

04

So two fixes for this issue:

Recommended Fix Reinstall the SCOM agent

Optional Not Supported back up your datawarehouse first Fix:

Modify the EndDateTime value from Null to match the StartDateTime, either via management studio edit, or via SQL Query.

Just to reiterate, if you opt to use this post as a solution– read Brody’s post as well, he found the solution and presents a much deeper understanding of how availability is actually calculated and the extra info is extremely useful. His method of fixing this via SQL rather than a manual edit via management studio is also far more scalable if you happen to have this problem on more than a handful of servers.

Tagged , ,