Tag Archives: T-SQL

How do I: Create An Advanced SQL Database Backup Monitor In Visual Studio?

My favorite part of my job is getting to work with customers and understand exactly how they use, and also need a product to work. Some of the time this just means listening closely, answering questions, and relaying information back to the product group. But often there are those tiny changes and use cases that fall outside the realm of things the product group is likely to address.

A change that might be hugely valuable for Customer A, won’t happen if it breaks backwards compatibility for Customers B-Z.

One of my customers has a SCOM environment for SQL that monitors over 20,000 databases. Due to their size there are often cases where the out-of-box monitoring provided by the SQL Product Groups management packs isn’t able to completely meet their needs. While the pack provides fantastic monitoring for SQL, there are times where they need a more granular view of the health of their environment.

An instance of this from the past year was monitoring SQL Database backups. The SQL Product Group’s pack gives you the ability to alert if a database hasn’t been backed up in a certain number of days. By default it is set to 7 days, but you can override that to any integer value of days.

For my customer this wasn’t really good enough. They wanted to be able to drill down and alert on hours since last backup. They also wanted multiple severities so if it had been 20 hours since a backup, an e-mail could go out, but at 30 hours we would generate a page. The 20 & 30 hours would be customizable at the individual database level, and they also wanted some added logic that would check database backups for databases that had a status of “ONLINE”. We have other monitors that look at DB Status in general so in this case if a database was OFFLINE they either knew about it from the other monitors and were fixing it, or it was intentional in which case they didn’t want a backup alert.

The basic logic behind the SQL PG’s MP is a simple T-SQL query wrapped in a fairly complex vbscript. The unwrapped T-SQL is below:

The T-SQL modifications we need to make are relatively simple swap DAY to HOUR, and add in a line to only return database backup info for databases with a status of ONLINE.

To get this into a working Management Pack is a little bit more complex and requires isolating and cloning the Product Groups Database Backup Monitor in Visual Studio, and then making a few changes to the XML for our custom iteration.  To prevent screenshot overload I did a quick step-by-step walkthrough of the process. For this video I opted to leave out three-state severity request, and will show how to add that functionality in a follow up video.

If you have any questions or need any help, just leave a comment.

Tagged , , , , , ,

Troubleshooting: SCOM DW Database is in a Suspect State

324

Comic Credit: Abstruse Goose

So we had some severe thunderstorms roll through this past week and it took out the power at the house. This in turn took out the power to my test servers. I generally have my servers plugged into a UPS so I could gracefully shut them down during a power outage, but I was a bit lax in my unpacking after the recent move and my office is still a work in progress so when the power went out my SCOM environment didn’t exactly take it well. {Insert ad for Azure here ;o)} Once the power came back on a day later and I booted everything up I was greeted by the messages below:

01

When I checked SQL Management Studio I found that DW was in a suspect state:

02

For a quick primer on the various states a database can be in check the chart below:

State Definition
ONLINE Database is available for access. The primary filegroup is online, although the undo phase of recovery may not have been completed.
OFFLINE Database is unavailable. A database becomes offline by explicit user action and remains offline until additional user action is taken. For example, the database may be taken offline in order to move a file to a new disk. The database is then brought back online after the move has been completed.
RESTORING One or more files of the primary filegroup are being restored, or one or more secondary files are being restored offline. The database is unavailable.
RECOVERING Database is being recovered. The recovering process is a transient state; the database will automatically become online if the recovery succeeds. If the recovery fails, the database will become suspect. The database is unavailable.
RECOVERY PENDING SQL Server has encountered a resource-related error during recovery. The database is not damaged, but files may be missing or system resource limitations may be preventing it from starting. The database is unavailable. Additional action by the user is required to resolve the error and let the recovery process be completed.
SUSPECT At least the primary filegroup is suspect and may be damaged. The database cannot be recovered during startup of SQL Server. The database is unavailable. Additional action by the user is required to resolve the problem.
EMERGENCY User has changed the database and set the status to EMERGENCY. The database is in single-user mode and may be repaired or restored. The database is marked READ_ONLY, logging is disabled, and access is limited to members of the sysadmin fixed server role. EMERGENCY is primarily used for troubleshooting purposes. For example, a database marked as suspect can be set to the EMERGENCY state. This could permit the system administrator read-only access to the database. Only members of the sysadmin fixed server role can set a database to the EMERGENCY state.

So suffice to say, Suspect is not a good state for your database to be in, particularly if you know it is the result of a loss of power. If you have a nice well oiled maintenance/DR plan this would typically be where you break out the backup files and restore the database to the most recent corruption free backup.

But sometimes for whatever reason you aren’t going to have a backup. Maybe your maintenance plan failed, you go to retrieve a backup and realize it too is corrupt, you forget to setup backups for your test environment etc. Generally if this were a production system and you didn’t have a backup I would consider it a RGE or Resume Generating Event. RGE’s should be avoided at all costs. (If this happens to a production system I would highly advise opening a support case with Microsoft so that you can get assistance from engineers who are experts in SQL.)

However, as this is just my test environment, and I spin up new test environments on a fairly frequent basis I am going to show you another way to deal with Suspect databases. It isn’t pretty, or recommended, it’s irreversible, and will almost always result in data-loss. But if you are in  a pinch with a test environment without a viable backup and need to get a Suspect database back online you can use the following procedure:

First we need to put the Database in an Emergency state

03

SQL Management Studio sometimes takes a little bit to refresh and show the updated state so I will just usually query the states of the tables directly to confirm that everything worked:

04

05

From here we need to Set the Database in Single User Mode, we also need to stop any of the management servers from trying to connect to the database. To do this I just stop the SCOM related services on any of the management servers. ( I also then run a DBCC CheckDB -note the REPAIR_ALLOW_DATA_LOSS – this is that not so pretty irreversible part I was discussing earlier) *I would also advise splitting up the Set Single_User such that you can confirm that worked first before kicking off the CheckDB. I once waited for a very long time thinking that I was repairing the database where in reality I was stuck on the Set Single_User command and the CheckDB hadn’t begun yet*

06

This can take awhile so don’t be surprised if you have to wait, might be a good time for a coffee break. Since you have the WITH ALL_ERRORMSGS you aren’t going to miss anything important:

07

You will then get a nice series of messages that will look something like:

08

09

If we look at database status in the Object Explorer it still shows the DB as being in an emergency state, but if we query sys.databases we can see it is now online:

10

Hit refresh and you should now see that the DW is back online but it is still in Single User Mode

11

Run one more ALTER DATABASE and you will be all set. (Also restart the services on your Management Servers so they can reconnect to the DB)

12

For more info on this process I recommend checking out Paul Randal’s post, he wrote a lot of the code behind CHECKDB back when he worked at MSFT and far better articulates why the above method should only be used as a last resort.

Tagged , , ,