Prevent SQL Cluster Failover For Patching/Reboots (Powershell)

At my current shop we have almost 200 Windows servers and VMs that are rebooted every 90 days as part of a regular Windows patching process.  Each time we execute this process, our DBA team must validate all SQL services, databases, and clusters after the patching and reboots are complete.  Our strong preference is to avoid any manual intervention by DBAs while still making sure that the reboot process doesn't result in a clustered SQL Server instance being hosted by a different node in the cluster than before the reboot.

To simplify this process we put together a pair of powershell scripts:

1) Pre Process – sets the possible owners of the SQL Server network name cluster resources to only the node currently hosting them (this prevents the instance from failing over to another node as waves of reboots happen)

2) Post Process – sets the possible owners of the SQL Server clustered network name resources back to all nodes in the cluster (our standard configuration for clustered SQL instances).

We have a control table (and stored procedure) that is used to store the various cluster names in our environment.  The powershell scripts adhere to this basic high-level process flow:

  1. Open a result set of all SQL cluster names
  2. Loop through the cluster names one at a time
    1. For each cluster name, pull all Network Name resources
    2. Loop through each Network Name resource
    3. For each Network Name resource, set the possible owners to only the current owner (pre) or to all nodes in the cluster (post)

Please note: the code must be run as a domain account that has the permissions to manage the cluster resources (in our case we use dedicated administrative level domain accounts when we run this).

First let’s look at the outer loop that gets a list of clusters:

[Screenshot: prescript – the outer loop that gets the list of clusters]

The code first imports the FailoverClusters powershell module so that we can call cluster-specific commands later in the code.  The remaining lines in the code above set up a new SQL Server connection and fill a DataSet with a list of clusters.  In the example above, I have a hardcoded, single cluster name, but you could have a table of all your clusters and have them returned as a single-column resultset.
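The original post showed this as a screenshot; a minimal sketch of what that outer loop might look like is below (the connection string, control database, and hardcoded cluster name are placeholders, not necessarily what we use internally):

Import-Module FailoverClusters

# connect to the DBA control database that stores the cluster names (names here are examples)
$connString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"
$conn = New-Object System.Data.SqlClient.SqlConnection($connString)

# hardcoded single cluster name; a control table/procedure could return these as a single-column resultset
$query = "SELECT 'MySqlCluster01' AS ClusterName"

$dataSet = New-Object System.Data.DataSet
$dataAdapter = New-Object System.Data.SqlClient.SqlDataAdapter($query, $conn)
$dataAdapter.Fill($dataSet) | Out-Null

foreach ($row in $dataSet.Tables[0].Rows)
{
    $clusterName = $row.ClusterName
    Write-Host "Working on cluster: " $clusterName
    # per-cluster network name work happens here (see the next snippet)
}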

The code below is the most important part that connects to the cluster, finds the network name resources and sets their possible owners.

[Screenshot: prescriptloop – the per-cluster loop that finds network name resources and sets possible owners]

Our approach to preventing failovers of SQL instances to other nodes in the cluster is to temporarily set the SQL Server network name resource so that its current hosting cluster node is its only possible owner.  With this configuration in place on the network name resource, rebooting the host will NOT cause the clustered SQL instance to fail over to a different node in the cluster.  The code above does the following (a rough sketch of this loop follows the list):

1) Prints the name of the cluster

2) Lists the current owner node for the network name

3) Lists all possible owner nodes in the cluster for the network name resource

4) Sets the possible owners for the network name to ONLY THE CURRENT cluster node (so after this runs, the network name resource, and all other SQL-related resources in the cluster group, will NOT fail over in the event of an owner node reboot or failure)
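The exact production script isn't reproduced inline, but a rough sketch of the per-cluster logic, assuming the standard FailoverClusters cmdlets and the $clusterName variable from the outer loop, would be something like:

# find all Network Name resources in the cluster
$netNameResources = Get-ClusterResource -Cluster $clusterName |
    Where-Object { $_.ResourceType.Name -eq "Network Name" }

foreach ($res in $netNameResources)
{
    Write-Host "Cluster:            " $clusterName
    Write-Host "Network name:       " $res.Name
    Write-Host "Current owner node: " $res.OwnerNode
    Write-Host "Possible owners:    " ((Get-ClusterOwnerNode -InputObject $res).OwnerNodes -join ", ")

    # PRE: restrict the possible owners to the node that currently hosts the resource
    Set-ClusterOwnerNode -InputObject $res -Owners $res.OwnerNode.Name
}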

The POST version of the script simply sets the possible owners of the network name resources back to ALL nodes in the cluster (our standard configuration).  The code below shows similar logic to the PRE script, but includes a different call to Set-ClusterOwnerNode in the powershell (to set all nodes as possible owners):

[Screenshot: postscriptloop – the POST version of the loop]
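Again as a sketch rather than the exact script, and assuming the same loop and $netNameResources variable as above, the POST variation simply passes every node in the cluster back to Set-ClusterOwnerNode:

# POST: allow every node in the cluster to own the network name resources again
$allNodes = Get-ClusterNode -Cluster $clusterName | Select-Object -ExpandProperty Name

foreach ($res in $netNameResources)
{
    Set-ClusterOwnerNode -InputObject $res -Owners $allNodes
    Write-Host "Possible owners for" $res.Name "reset to:" ($allNodes -join ", ")
}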

 

Comment or email me if you need copies of the PS1 files for this.

Collecting xEvents Query Stats From Multiple SQL Instances to a Central Database Part 3

In this post on collecting xEvent query execution data to a central database, I will cover the logic in the powershell script (provided as a link below) that determines the instances to collect from and then connects to each one and collects the data into the central database.

First a refresher with an image from my previous post.

[Figure 1 – overview of the central xEvents collection process]

As you can see above, the powershell script first queries the central database (DB_ServerMain table) to determine what instances to collect from.  Then it loops through those instances and executes the xEvents tSQL queries from my previous posts.  Let’s take a quick look at some parts of the PS1 file.

Determine What Instances to Collect From

At the top of the powershell file is a query and then the beginning of a loop:

[Screenshot: ps1_part1 – the query against DB_ServerMain and the start of the instance loop]

The first part of the PS1 file opens a connection to the central database and executes the query against DB_ServerMain to determine which SQL instances to collect xEvent query data from.  At the bottom of the snippet above, you can see that it assumes all .XEL files will be in “C:\DBAMonitoring\traces\” on the remote machine.  It then goes into a foreach loop iterating through the list of remote SQL Server instances to collect from.
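The screenshot isn't reproduced here; the general shape of that first section is roughly the sketch below (the last_xel_file and last_offset column names are illustrative stand-ins for whatever your control table actually stores):

# connect to the central collection database and get the list of instances to collect from
$centralConnString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"
$centralConn = New-Object System.Data.SqlClient.SqlConnection($centralConnString)

# illustrative query -- the real one also returns the last .xel file name and offset collected per instance
$query = "SELECT ServerName, last_xel_file, last_offset FROM DB_ServerMain"

$dataSet = New-Object System.Data.DataSet
$dataAdapter = New-Object System.Data.SqlClient.SqlDataAdapter($query, $centralConn)
$dataAdapter.Fill($dataSet) | Out-Null

# all monitored instances are assumed to write their .xel files to C:\DBAMonitoring\traces\

foreach ($a in $dataSet.Tables[0].Rows)
{
    # per-instance collection logic (connection strings, dynamic SQL, bulk copy) shown below
}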

Within the loop, connection strings for the source (remote server) and destination (central collection server) are defined, some remote-instance-specific variables are set (for incremental loading), and then the SQL statement is defined dynamically:

[Screenshot: ps1_part2 – the connection strings and dynamic SQL]

Note that the xel filename and file offset are embedded in the dynamic SQL prior to its execution.  Then the actual call to run the query on the remote server and save the results to the central server is executed in a TRY/CATCH block:
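For illustration, the connection strings and dynamic SQL might be assembled along these lines (a much-simplified version of the query from the earlier posts, again using the illustrative last_xel_file/last_offset names):

# source = remote monitored instance, destination = central collection server
$SourceConnectionString      = "Data Source=" + $a.ServerName + ";Initial Catalog=master;Integrated Security=SSPI;"
$DestinationConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"

# embed the last collected file name and offset so only new events are read on this pass
$sql = "SELECT object_name, event_data, file_name, file_offset " +
       "FROM sys.fn_xe_file_target_read_file('C:\DBAMonitoring\traces\sql2012*.xel', NULL, '" +
       $a.last_xel_file + "', " + $a.last_offset + ");"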

[Screenshot: ps1_part3 – the TRY/CATCH execution block]

The code first defines the destination table name in the central DB instance ("XEvents_Queries").  Then it opens the connection to the remote instance, creates a SqlCommand object (with the query AND the source connection) and executes the query (with the .ExecuteReader() method of the command object).  At this point, nothing has been written to the central collection database.  If there is an error in the execution of the query on the remote instance, the CATCH block will fire.  In the powershell script, my catch block basically assumes there is a file rollover condition by checking for the phrase "The offset" in the caught exception: $ex1.Contains("The offset").  If the named xel file and offset are no longer valid (due to a file rollover), it resets the sql query to call the system function with NULLs instead of a named incremental starting point (like this:   FROM sys.fn_xe_file_target_read_file('C:\DBAMonitoring\traces\sql2012*.xel', NULL, NULL, NULL)).
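A simplified sketch of that TRY/CATCH pattern (not the full production script) looks like this:

$tableName = "XEvents_Queries"

try
{
    $sourceConnection = New-Object System.Data.SqlClient.SqlConnection($SourceConnectionString)
    $sourceConnection.Open()

    $cmd = New-Object System.Data.SqlClient.SqlCommand($sql, $sourceConnection)
    $cmd.CommandTimeout = 300
    $reader = $cmd.ExecuteReader()
}
catch
{
    $ex1 = $_.Exception.ToString()
    if ($ex1.Contains("The offset"))
    {
        # the saved file/offset is no longer valid (file rollover) -- re-read from the start of the current files
        $sql = "SELECT object_name, event_data, file_name, file_offset " +
               "FROM sys.fn_xe_file_target_read_file('C:\DBAMonitoring\traces\sql2012*.xel', NULL, NULL, NULL);"
        $cmd = New-Object System.Data.SqlClient.SqlCommand($sql, $sourceConnection)
        $reader = $cmd.ExecuteReader()
    }
}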

The final actionable piece of the script uses the SqlBulkCopy object to connect to the central collection database and then write the resultset from the remote instance to it:

[Screenshot: ps1_part4 – the SqlBulkCopy call]

Basically, it connects to the central collection server with the connection string declared at the beginning of the script.  It also defines the destination table, batch size, and timeout values.  The .WriteToServer method is called with the $Reader variable that contains the resultset from the remote instance, and it performs a bulk insert operation in chunks of 5000 rows.
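The bulk copy portion is a standard SqlBulkCopy call; a sketch of it:

# push the remote resultset into the central collection table in batches of 5000 rows
$bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy($DestinationConnectionString)
$bulkCopy.DestinationTableName = $tableName
$bulkCopy.BatchSize = 5000
$bulkCopy.BulkCopyTimeout = 0
$bulkCopy.WriteToServer($reader)

$reader.Close()
$sourceConnection.Close()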

That’s pretty much it for the powershell script.  You can download the full version of it here FetchQueryMonitorxEvents.ps1.

Call a Powershell PS1 File from a Stored Procedure

Last week I was working on a project to help us better identify over 600 databases we have in production.  The final solution wound up including the following components:

  • Store various contact people and internal information on the databases themselves as Extended Properties.  We are tagging all of our production databases with a business owner, system owner, and development lead, as well as information describing the application(s) that the database supports.  While extended properties are a rarely used functionality within SQL Server, they are a good fit for our purposes:
    1. Extended properties travel with the database, so when we restore production databases to our development and testing SQL instances (either weekly or nightly in most cases) the non-production database copies will automatically have the same extended properties.  This is beneficial for us because we only have to maintain the content on the production databases and the information will automatically “trickle down” to the non-production environments with our existing restore processes.
    2. Extended properties are simple, expandable and native to SQL Server.  Extended Properties are implemented as simple name/value pairs so they are extremely flexible.  We will easily be able to alter the types and number of tags we want to assign to the databases without extensive development efforts.  You can also view the DB extended properties by right clicking on any database in SSMS, which is a nice-to-have for DBAs and SysAdmins troubleshooting issues either on the machine itself or with remote access via SSMS.
    3. Extended properties are backwards compatible to SQL 2005.  We still have a handful of SQL 2005 instances and a few dozen SQL 2008 instances in our environment, so we wanted a mechanism that would work the same for everything we are running today.
  • Establish a centralized dataset of the database properties to support a Reporting Services report that is current and usable by application support and other less technical or privileged teams (no SSMS access in most cases).  We often make database server performance information available to non-DBA team members with custom SSRS reports that render various monitoring data (session activity, wait stats, etc.).  A report that allows users to review the internal contact people and applications associated with specific databases is very useful for our application support team to determine who to contact if there are issues or questions.

In the initial phase of this effort, we were provided a spreadsheet that was exported from our disaster recovery planning system with the application name, system owner, business owner, and lead developer.  This information didn’t include a direct link to individual databases, so we went through and updated the spreadsheet with servernames (instances) and DB names.  We then developed a powershell script to go through the spreadsheet, connect to the various remote servers and assign DB level extended properties to the database in question.  This was great for our initial effort, but as you might imagine, there were several applications that we couldn’t map directly to a database and we needed to work with various business users to identify the appropriate app/db mappings.  We got the initial dataset done and propagated out to the various databases, but we needed a simple mechanism to allow users to assign an app/app owners set to a database easily from a web based front end.

In the end, we decided to provide a stored procedure on our central monitoring database instance that would take application/owner information as parameters and add the database level extended properties on the remote servers automatically.  Since we already had a decent set of working code to do this in powershell (from the initial import), we decided to implement the procedure by having it shell out and call a powershell ps1 script with parameters, using xp_cmdshell.

The powershell file is attached at the end of this post for reference, but here is the syntax for calling the PS1 script file with arguments(parameters) from within a SQL stored procedure.

create procedure [dbo].[set_DatabaseProperties]
(@ServerName varchar(100) = NULL,
 @DatabaseName varchar(100) = NULL,
 @App varchar(500) = 'UNKNOWN',
 @AppDesc varchar(500) = 'UNKNOWN',
 @ApprovalUser varchar(500) = 'UNKNOWN',
 @BusinessFunction varchar(500) = 'UNKNOWN',
 @BusinessOwner varchar(500) = 'UNKNOWN',
 @Dept varchar(500) = 'UNKNOWN',
 @LeadDeveloper varchar(500) = 'UNKNOWN',
 @ManagedBy varchar(500) = 'UNKNOWN',
 @SystemOwner varchar(500) = 'UNKNOWN',
 @Type varchar(500) = 'UNKNOWN'
 )
/************************************************************************************
**
** 07/25/13 MRaftree Initial Creation
**
*/
AS
BEGIN

 DECLARE @CmdExecStr varchar(5000)
if (@ServerName is not null and @DatabaseName is not null and @App <> 'UNKNOWN')
 BEGIN

-- build a command line string to execute the PS1 file with each argument in double quotes and separated by a space.
-- Also, check the location of the powershell.exe file on your database server itself and make sure the path below is right.
 Select @CmdExecStr = 'C:\WINDOWS\system32\WindowsPowerShell\v1.0\powershell.exe -File "C:\Powershell\SetDBProperties.ps1" "' + isnull(@ServerName,'Unknown') +'" "' + isnull(@DatabaseName,'Unknown') +'" "' + @App +'" "' + @AppDesc +'" "' + @ApprovalUser +'" "' + @BusinessFunction+'" "' + @BusinessOwner +'" "' + ISNULL(@Dept,'Unknown') +'" "' + @LeadDeveloper +'" "' + @ManagedBy +'" "' + @SystemOwner +'" "' + ISNULL(@Type,'Unknown') +'" '

 print @CmdExecStr
create table #returncode (value varchar(1000))
insert into #returncode exec master..xp_cmdshell @CMDExecStr
select * from #ReturnCode
 END
END

Here is the ps1 file itself for reference:  SetDBProperties.ps1
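If you can't grab the attachment, the overall shape of the script is roughly the sketch below.  The parameter order matches the command string built in the procedure above, but the property names and the add-or-update logic are illustrative rather than a copy of the attached file:

# SetDBProperties.ps1 (simplified sketch)
param (
    [string]$ServerName,
    [string]$DatabaseName,
    [string]$App,
    [string]$AppDesc,
    [string]$ApprovalUser,
    [string]$BusinessFunction,
    [string]$BusinessOwner,
    [string]$Dept,
    [string]$LeadDeveloper,
    [string]$ManagedBy,
    [string]$SystemOwner,
    [string]$Type
)

$connString = "Data Source=$ServerName;Initial Catalog=$DatabaseName;Integrated Security=SSPI;"
$conn = New-Object System.Data.SqlClient.SqlConnection($connString)
$conn.Open()

# write each value as a database-level extended property, adding it if new or updating it if it already exists
$props = @{ "App" = $App; "AppDesc" = $AppDesc; "ApprovalUser" = $ApprovalUser;
            "BusinessFunction" = $BusinessFunction; "BusinessOwner" = $BusinessOwner; "Dept" = $Dept;
            "LeadDeveloper" = $LeadDeveloper; "ManagedBy" = $ManagedBy; "SystemOwner" = $SystemOwner; "Type" = $Type }

foreach ($name in $props.Keys)
{
    $sql = "IF EXISTS (SELECT 1 FROM sys.extended_properties WHERE class = 0 AND name = @name) " +
           "  EXEC sys.sp_updateextendedproperty @name = @name, @value = @value " +
           "ELSE " +
           "  EXEC sys.sp_addextendedproperty @name = @name, @value = @value"
    $cmd = New-Object System.Data.SqlClient.SqlCommand($sql, $conn)
    $cmd.Parameters.AddWithValue("@name", $name) | Out-Null
    $cmd.Parameters.AddWithValue("@value", $props[$name]) | Out-Null
    $cmd.ExecuteNonQuery() | Out-Null
}

$conn.Close()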

Baby Steps 2 – Central Collection of WaitStats

The next step in the process will be to build upon my earlier post (“Baby Steps”) by expanding the components in the following ways:

  1. Add a new table called DB_WaitStats to the DBAMonitoring database (structure detailed below)
  2. Build on the existing powershell script to loop through our SQL instances, run a WaitStats oriented DMV query and save the results to the DBAMonitoring..DB_WaitStats table.
  3. Schedule the powershell script to run every 30 minutes in SQL Server Agent.

This post will discuss the collection and storage mechanisms of the solution and my next post will cover an SSRS report that allows you to easily visualize this data (per server) to assist in troubleshooting and debugging.

DB_WaitStats Table to Persist Wait Stat Collections

The following is a table definition to store a fairly standard set of Wait Stat information from DMV calls:

CREATE TABLE [dbo].[DB_WaitStats](
[wait_type] [nvarchar](60) NOT NULL,
[waiting_tasks_count] [bigint] NOT NULL,
[wait_time_ms] [bigint] NOT NULL,
[max_wait_time_ms] [bigint] NOT NULL,
[signal_wait_time_ms] [bigint] NOT NULL,
[ServerName] [varchar](128) NOT NULL,
[capture_time] [datetime] NOT NULL,
[increment_id] [int] NOT NULL
) ON [PRIMARY]

GO

CREATE CLUSTERED INDEX [CI_WaitStats_ServerName] ON [dbo].[DB_WaitStats]
(
[capture_time] ASC,
[ServerName] ASC,
[increment_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90) ON [PRIMARY]
GO
So it is just a basic table with a clustered index on capture_time/ServerName/increment_id (more on this later).

Also create a simple table to hold the wait types that we want to ignore (there may be cases where certain wait stats can be disregarded when reviewing the data later):
CREATE TABLE [dbo].[DB_ignorewaits](
[wait_type] [varchar](250) NULL
) ON [PRIMARY]

With that in place, it is time to revisit the powershell script from my previous post.  Two key areas of the script are enhanced in this iteration: a smarter loop over the servers to collect from, and the logic to execute the DMV query on each remote server and save the results to the local DBAMonitoring..DB_WaitStats table.

Note On The Cumulative nature of WaitStats DMVs

The queries we will be using to collect wait stats use the sys.dm_os_wait_stats DMV.  This DMV maintains cumulative wait stats by type since the last time the SQL instance was started.  In order to more effectively (and easily) determine the actual wait stats between collection intervals (every 30 minutes in our case), the DB_WaitStats table has an increment_id int column that stores an integer value that is one greater than the last collection's increment_id for the same server.  In a later post I will show you how to use this field to perform comparisons across collections to get the real “wait stats for the last 30 minutes”.

PowerShellCollectWaitStats.ps1 Script

The code below sets up the foreach loop to go through each SQL instance we want to collect from, connecting to it and running the DMV query to gather information.  It has a few additional elements compared to my original script:

  1. The $query variable that selects from DBAMonitoring..DB_ServerMain has been updated with a left join to DB_WaitStats so that we can pull back each SQL instance name AND the highest increment_id value we have for previous wait stat collections on a given SQL instance.
  2. I also declare a new SQL Server connection variable ($SourceConnection) that will be used within the loop to connect directly to each monitored sql instance and execute the wait stats DMV query.
#---- BEGIN CODE ----
try
{
# create a connection string to connect to the DBAMonitoring Instance and Database
$ServerAConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"
$ServerAConnection = new-object system.data.SqlClient.SqlConnection($ServerAConnectionString);

#Create a DataSet to hold the DataTable from the DB_ServerMain table
# we will loop through this recordset next
$dataSet = new-object "System.Data.DataSet" "DBServers"
$query = "SET NOCOUNT ON;"
$query = $query + "Select a.DBServerID, a.ServerName, max(isnull(b.increment_id,0))+1 as increment_id "
$query = $query + "from DB_ServerMain a "
$query = $query + "left outer join DB_WaitStats b "
$query = $query + "on a.ServerName = b.ServerName "
$query = $query + "group by a.DBServerID,a.ServerName "

#Create a DataAdapter which we will use to populate a data set (like a recordset)
$dataAdapter = new-object "System.Data.SqlClient.SqlDataAdapter" ($query, $ServerAConnection)
$dataAdapter.Fill($dataSet) | Out-Null

#create a new connection object we will use to connect dynamically to each sql instance
$sourceConnection = New-Object System.Data.SqlClient.SqlConnection

#the "$a" is one row from the resultset, and you can reference specific fields with $a.fieldname from within the loop
foreach($a in $dataSet.Tables[0].Rows)
{

Within the foreach loop (below), the updated script is performing a few new operations.  Each time through the loop (once for each SQL Instance) the code will:

  1. Set Variables
    • Set variables for source/target connection strings (to connect to the remote sql instance AND the central monitoring instance with the DBAMonitoring database)
    • Set the $sql variable which contains the select statement that will be executed against the remote instance (in our case against the sys.dm_os_wait_stats DMV)
  2. Connect to the remote instance and execute the DMV query
  3. Use the BulkCopy object to copy the results of the query from the remote sql instance to the central DBAMonitoring database

#the "$a" is one row from the resultset, and you can reference specific fields with $a.fieldname from within the loop
foreach($a in $dataSet.Tables[0].Rows)
{

# This is the area of code where "per server" queries can be executed and then saved back to the central database
write-host "DBServer: " $a.ServerName ;

# open a connection for the server/database
$SourceConnectionString = "Data Source=" + $a.ServerName + ";Integrated Security=SSPI;"
$DestinationConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=True"

#the name of the target table we are going to insert into in the DBAMonitoring database
$tableName = "DB_WaitStats"

#the wait stats query we will run against each sql instance in DBAMonitoring..DB_ServerMain
$sql = "DECLARE @DT DATETIME ; "
$sql = $sql + "SET @DT = GETDATE() ; "
$sql = $sql + "SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [max_wait_time_ms], "
$sql = $sql + " [signal_wait_time_ms], '" + $a.ServerName + "' AS ServerName, @DT as capture_time, " + $a.increment_id + " as increment_id "
$sql = $sql + "FROM sys.dm_os_wait_stats; "
#write-host $sql

$sourceConnection.ConnectionString = $SourceConnectionString
$sourceConnection.Open()
$commandSourceData = New-Object System.Data.SqlClient.SqlCommand($sql,$sourceConnection)
$commandSourceData.CommandTimeout = 300

$reader = $commandSourceData.ExecuteReader()

$bulkCopy = New-Object System.Data.SqlClient.SqlBulkCopy($DestinationConnectionString)
$bulkCopy.DestinationTableName = $tableName
$bulkCopy.BatchSize = 5000
$bulkCopy.BulkCopyTimeout = 0
$bulkCopy.WriteToServer($reader)

$sourceConnection.Close()

}

You can download the whole script powershellcollectwaitstats .ps1  (rename to .ps1 as the file extension)

Once you have the script in the PowerShell ISE, you can run it and you should see simple output in the console pane listing each sql instance that was collected from:

[Screenshot: waitstatsrun – the script output in the ISE console]

Confirm it is working properly by performing a “Select * from DB_WaitStats” in your DBAMonitoring database.  You should have a few hundred rows per SQL instance, per collection.

Once the powershell script is working from the ISE, create a SQL Agent Job to have it run every 30 minutes.  You can use this code to create the job on your SQL Server instance: WaitStatsSqlAgentJob.txt (rename the file extension to .sql).

The job is a simple 2 step job:

[Screenshot: waitstatsjobsteps – the two job steps]

The first step deletes aging records out of the DBAMonitoring..DB_WaitStats table so it doesn't grow uncontrollably.  The script is configured to delete collections older than 3 days, but you can easily alter this to fit your needs.

The second step calls the ps1 powershell script file using the Operating System (CmdExec) job step type.  You basically call the powershell environment and give it the path to your ps1 file as shown below:

[Screenshot: waitstatsjobsteps2 – the CmdExec job step command]
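The exact text in the screenshot isn't reproduced here, but the CmdExec command is essentially a one-liner along these lines (the path is an example; adjust it to wherever you keep your ps1 files):

powershell.exe -ExecutionPolicy Bypass -File "C:\Powershell\PowerShellCollectWaitStats.ps1"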

The job is scheduled to run every 30 minutes to collect a new snapshot of remote server wait stats, and it will delete aging collections.  Run it manually a couple of times from SQL Agent to make sure it is executing properly.

Meanwhile…. back in DBAMonitoring..DB_WaitStats

Now go back to your central instance and query the DB_WaitStats table.  You should see a few hundred rows per instance/per collection.  Don’t worry that in most cases, the majority of the wait types have 0 values.  For our purposes we collect everything so we can compare across collections to determine what is happening with all wait stats.  The data that is contained in the table is very raw (just wait type, the cumulative wait time and a server and time stamp).  In my next post I’ll cover a stored procedure that can refactor this raw data into the actual wait times by wait types across collection intervals (so you can easily see the biggest wait type(s) for the last 30 minutes or major changes in the wait profile over the course of a few hours).  I’ll also cover a basic SSRS report that can sit on top of the DB_waitStats table to provide a more easily interpreted view of the information.

Baby Steps – Central Monitoring Database

Central Database
The first thing to do is set up a dedicated central monitoring database (which I will refer to as “DBAMonitoring” moving forward).  This database will be used to house things like:
1)      A central table of all SQL Server instances to monitor
2)      Various data tables to hold performance data collections
3)      Supporting lookup/translation tables (if necessary)
Once you make a new database called DBAMonitoring, create a table called DB_ServerMain:

CREATE TABLE [dbo].[DB_ServerMain](
[DBServerID] [smallint] IDENTITY(1,1) NOT NULL,
[ServerName] [varchar](128) NOT NULL,
[ServerDescr] [varchar](200) NULL,
[SessionCollectionInterval] [int] NULL,
CONSTRAINT [PK_DBServerID] PRIMARY KEY CLUSTERED
(
[ServerName] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]

Next let’s add a few rows to the table for SQL instances that we want to monitor.

insert into [DB_ServerMain] (ServerName,ServerDescr,SessionCollectionInterval) Values ('localhost','My local server', 5)
insert into [DB_ServerMain] (ServerName,ServerDescr,SessionCollectionInterval) Values ('SQLServer2','Test SQL Server Environment', 5)

Now we have a basic table in place to keep a list of SQL Server instances we want to monitor and a couple of records that map to existing SQL Servers in our environment. We can start developing a powershell script template that can loop through all of the servers in the table (and eventually perform some action like capturing the results from DMV queries).  The basic powershell functions that I’ll describe next include:

1)      Connect to the DBAMonitoring Database and fetch a recordset of the SQL servers from DB_ServerMain.
2)      Loop through each server and for each:

  1. Connect to the sql instance
  2. Run a dynamic query to collect some pertinent set of monitoring data
  3. Commit that data to a local table in the DBAMonitoring database (e.g. DB_WaitStats, DB_DBSpaceUsed, etc.)

My Thoughts on Powershell

When powershell first came out several years ago I was hesitant to embrace it.  It seemed like yet another language and framework to learn that didn't offer me any game-changing functionality (i.e. there weren't too many things I couldn't get done with some combination of tSQL scripts and VBScript/.bat files).  That said, I have steadily built up what I call a “working capability” in powershell programming and have found it fairly straightforward.  Also, over the last few years the amount of information available online (message board posts, blogs, and even the MSDN help) has made it much simpler to cobble together working powershell scripts and get answers to issues that you encounter.  The powershell scripts I will be elaborating on in this series of posts were developed based on fragments of other scripts I found online.

Consider the powershell script code below.  It connects to my localhost SQL Server instance and my new DBAMonitoring database using Windows authentication (Integrated Security=SSPI) with a standard SQL connection string syntax:

$ServerAConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"

These links have more information on SQL Server connection strings in powershell:
http://sqlnovice.blogspot.com/2009/08/sql-server-connection-string-format-for.html
http://www.connectionstrings.com/sql-server-2008

#----   BEGIN  CODE   ----
try
{
# create a connection string to connect to the DBAMonitoring Instance and Database
$ServerAConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"
$ServerAConnection = new-object system.data.SqlClient.SqlConnection($ServerAConnectionString);

#Create a DataSet to hold the DataTable from the DB_ServerMain table
# we will loop through this recordset next
$dataSet = new-object "System.Data.DataSet" "DBServers"
$query = "SET NOCOUNT ON;"
$query = $query + "Select * from DB_ServerMain"

#Create a DataAdapter which we will use to populate the data set (like a recordset)
$dataAdapter = new-object "System.Data.SqlClient.SqlDataAdapter" ($query, $ServerAConnection)
$dataAdapter.Fill($dataSet) | Out-Null

#the "$a" is one row from the resultset, and you can reference specific fields with $a.fieldname from within the loop
foreach($a in $dataSet.Tables[0].Rows)
{
# This is the area of code where "per server" queries can be executed and then saved back to the central database
write-host "DBServer: " $a.ServerName   ;
}
}
catch
{
# if there is an error in the powershell execution, catch it and return it to the screen
$ex = $_.Exception
Write-Host "Error: $($ex.Message)"
}
#----   END CODE ----

If you can, download and install powershell 3.0 (http://www.microsoft.com/en-us/download/details.aspx?id=34595).  It has a nice Integrated Scripting Environment (ISE) that makes it easier to develop and run powershell scripts.  Start up the Powershell ISE (Start -> All Programs -> Accessories -> Windows Powershell -> Windows Powershell ISE).

[Screenshot: launching the Windows Powershell ISE]

Paste the powershell code above into the code window of the ISE and then click the Run Script button (F5) to execute it.  You will see all of the code appear in the execution window at the bottom along with the results:

[Screenshot: ISEGrab2 – the code and results in the ISE]

As you can see above, the sample code simply opens a recordset of SQL Server names and loops through them, printing each to the console (powershell “write-host” calls).  Now save the code as a .ps1 powershell script file (just use the regular Save button in the ISE) and then run it (green triangle button).  Now the execution just runs the ps1 file instead of all the code line by line in the console:

[Screenshot: ISEGrab3 – running the saved ps1 file]

Having the code in a ps1 file makes it a lot easier to schedule SQL Server Agent jobs that simply execute various powershell scripts from a directory (instead of having LOTS of powershell code embedded in SQL Agent job steps).

Simple test case complete!  In my next post we’ll build on this powershell looping script to collect SQL Server Wait stats for troubleshooting purposes.

Central SQL Monitoring and Performance Data Collection: Buy vs. Build

I work at a mid-sized firm with a team of DBAs managing more than 150 SQL Server instances.  The instances range from SQL Server 2005 to SQL Server 2012 and vary considerably in overall storage and usage profile (we have several sql environments hosting 3rd party applications on <100GB of storage, and we have internally developed data warehouses that are >50TB).  The company has historically used a third party monitoring solution to provide centralized performance data collection and alerting.  While these tools typically work well and can collect just about any data you might be interested in from SQL Server, we have some reservations about sticking with the 3rd party monitoring tool.  Some of these concerns include:

1)      The recurring licensing costs can be significant since you are paying per sql instance and we monitor production, QA, and development environments.  At 150 SQL instances monitored, the yearly costs just for licensing are in the tens of thousands.

2)      Enabling data collections is simple (file and wait stats, session collection, perfmon counters, etc.) but the reporting provided is often generic and not terribly helpful in performing common DBA activities like researching and documenting a root cause analysis.

3)      Adding new collections not supported “out of the box” can range in difficulty from complicated to impossible.

4)      There is a tendency to start using these 3rd party monitoring tools as a crutch.  “The monitoring tool says everything is ok, so it must not be a database issue…” is a common response and indicates an inability or unwillingness to perform the extra effort to debug and troubleshoot an issue to its conclusion.

With these considerations in mind, over the last few weeks we have begun to develop a set of replacement monitoring routines internally that will eventually allow us to remove the 3rd party SQL Server monitoring software we use today.  We had a few discussions on the best way to implement a central SQL Server monitoring and performance data collection solution and came up with the following requirements:

  1. The solution should not require any software on the monitored SQL Server instances (an “agentless” solution)
  2. The solution should be developed using standard technical components that new team members will be able to easily ramp up on and assist in the ongoing development efforts relating to improving and enhancing the custom monitoring system
  3. We should be able to enable/disable some or all of the monitoring components on a server by server basis.

The phase 1 functionality that we decided to implement includes:

  • Session Viewer – capture active session information every 2 minutes and store it for 7 days.  This information is extremely helpful in responding to product managers' questions like “Was anything going on in SQL Server between 10:12PM and 10:30PM last night?”.   This report will show running queries, blocking chains, etc.
  • Wait Stats – capture instance level wait stats every 1-2 hours to assist in identifying high level bottlenecks in SQL instance performance.
  • Aggregated dashboard of our disaster recovery status – we use log shipping, mirroring, and AlwaysOn Availability Groups to handle disaster recovery, and we wanted to implement a central dashboard report (and supporting data collection routine) to track disaster recovery delays and issues.

With the scope limited to these 3 key areas for our initial implementation, we began to think about the simplest way to develop code to handle this.  In the interest of sticking with what we know (and what we assume other future team members will know when they are hired and brought into the mix) we decided on a technical approach that includes the following components:

  1. PowerShell Scripts – to connect to a set of sql servers and collect information through executing TSQL queries (DMVs, system views, and any other app-specific tables that are meaningful)
  2. SQL Server Database – we want a central SQL Server 2012 database that can house our master server list, configuration settings for each instance, and the raw data collected
  3. SQL Server Agent  – to schedule the execution of various scripts

To reiterate, our goal is to keep the implementation of our central monitoring and data collection system as simple as possible and based on technologies that future hires will either already know or easily ramp up on.  We didn’t want to trade a black box 3rd party solution for an overly complicated home grown solution that only a few existing DBAs on the team would know how to expand and enhance.

The most basic components of the solution are the central DB with a control table to maintain the list of SQL Servers we need to monitor, as well as some configuration flags (additional columns) to enable/disable/configure certain areas of monitoring (e.g. wait stats collection interval, session collection interval and retention, etc.).  The other piece of “infrastructure” from a code perspective is a working PowerShell harness loop to pull the list of SQL Servers in question and iterate through each one.  Once we have this generic looping code in place, we can generate a series of PowerShell scripts that loop through the servers and collect SQL performance information to the central database.  My next post will look at these initial components.

When a login failure isn’t a login failure

I put together a powershell script that connects to a set of sql server instances (~100 total with 2005/2008/2012 in the mix) and collects some information about the current state of TempDB.  It works great for 98% of the instances, but then I hit this error for a couple of our older SQL 2005 servers and couldn't collect data:

Exception calling “Open” with “0” argument(s): Cannot open database “TempDB” requested by the login. The login failed.  Login failed for user ‘domain\someadminaccount’.” —> System.Data.SqlClient.SqlException: Cannot open database “TempDB” requested by the login. The login failed. 

Login failed… should be simple enough to fix, right?  I went to the problem 2005 instances and was surprised to see the domain account I am using is already in the sysadmin role.  I could log in to the instances using SSMS under the domain account.  I could even run my query against TempDB in a query window, so it wasn't a SQL 2005 vs. SQL 2008+ DMV difference.  Then I noticed… case sensitivity!

On these particular SQL 2005 instances, we are using a case-sensitive server collation required by some of the databases they host.  My powershell connection string was attempting to connect to the “TempDB” database and not the “tempdb” database.  So on 98% of our SQL instances it worked fine (i.e. “TempDB” = “tempdb”), but for the case-sensitive instances it failed with the unhelpful error message above.  It would have been nice to get a ‘The database “TempDB” does not exist’ type of message so I could have figured this out in a couple of minutes instead of an hour.  Hope this saves someone else that hour.
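For anyone hitting the same thing, the fix is just to match the case of the actual database name in the connection string (the server name below is a placeholder):

# fails on a case-sensitive server collation -- there is no database literally named "TempDB"
$connString = "Data Source=SomeSql2005Server;Initial Catalog=TempDB;Integrated Security=SSPI;"

# works regardless of collation -- matches the real system database name
$connString = "Data Source=SomeSql2005Server;Initial Catalog=tempdb;Integrated Security=SSPI;"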