Nested Views And Bad Plans – Not Pushing Predicates

I came across an issue in one of our production SQL Servers yesterday that had to do with nested views and sudden, terrible performance of once highly responsive queries.  First a little background….

Views on Top of Views

In one of our application databases, we have a complex view that is used to denormalize financial transaction information to make it easier for users to access all the various dates, amounts, and other identifying flags  in a single “selectable” object.  This view works well for our purposes and has been reliably quick to return results (we maintain the underlying indexes and statistics and have optimized the tSQL in the view to a reasonable level).

In an effort to extend this approach of giving users access to a simplified version of the underlying data through a view,  a second view was developed to provide users with aggregate data.  This second view  performs a select on the first view with several sum and count functions and a long group by list.  So we have a “nested view” scenario with fairly complicated query logic in each view.

View 1 – Denormalized Data
create view vwCustomerHoldings as
Select custid, acctid, fundID, fundName, code1, code2, price1, price2, price3, tradedate, effectivedate
From <~20 tables with various join conditions>
View 2 – Aggregates of the denormalized data
create view vwCustomerHoldingsAggregate as
Select custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate, sum(price1) as price1, sum(price2) as price2, sum(price3) as price3
From vwCustomerHoldings
GROUP BY custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate

For months and months we had no issues with various users pulling data with queries like this:

Select *
from vwCustomerHoldingsAggregate
Where effectivedate = '6/21/2013'
and code1 in ('val1','val2','val3','val4','val5','val6','val7','val8','val9')

The typical response time was 3-5 seconds which is well within acceptable bounds for this application.  Yesterday, execution duration jumped up to ~2 minutes for the same call.

We checked the statistics on the underlying tables first, and they were fresh within the last 6 hours.  (I have a stats age script from Glenn Berry that I have been using for a long time, which I got here.)  There were no index or underlying table structure changes, and there hadn’t been a significant change to the underlying data (other than the usual overnight load of yesterday’s transactions).  So we started executing the query in Management Studio and looking at the IO stats and the query plans.  As you might imagine, the query plans associated with calls like this are very large and difficult to dissect online, so I’m going to focus on the results of SET STATISTICS IO ON, which were just as revealing of the issue.
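For reference, a minimal statistics-age check can be put together from sys.stats and STATS_DATE, along the lines of the simplified sketch below (this is just a stand-in, not Glenn’s script):

-- When was each statistic on the user tables last updated?
SELECT o.name AS table_name,
       s.name AS stats_name,
       STATS_DATE(s.object_id, s.stats_id) AS stats_last_updated
FROM sys.stats AS s
JOIN sys.objects AS o ON o.object_id = s.object_id
WHERE o.is_ms_shipped = 0
ORDER BY stats_last_updated;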

First we ran the query against a dev system with a smaller data set, and got a good result (4 second response time).  The Statistics IO information is below:

It has a high read profile (~400K reads on average for the good plan), but that is OK in our environment. When we run the same query on the production data set, the statistics IO looks a LOT different and has a few key elements that reveal what is happening:

[Screenshot: SET STATISTICS IO output from the production run, showing the Worktable reads]

The first thing that stands out is the HUGE IO on the Worktable object (>100 million reads).  Obviously, SQL Server has chosen a less than optimal plan to execute the TSQL on our production instance.  My initial reaction was “How can there be so many reads against that Worktable? It’s almost like it’s reading the entire underlying data set with no where clause at all!?”

It turns out that is exactly what is happening.  This is an example of the query optimizer failing to pick an execution plan that applies the WHERE clause first (or at least earlier in the execution).  The result is that the entire data set from vwCustomerHoldings is being aggregated by vwCustomerHoldingsAggregate before the where effectivedate = '6/21/2013' filter is applied to the underlying data set.  That massive Worktable entry in the statistics IO is evidence that all of the data is being processed without any where clauses being applied until the end.  This is a failure to perform something called “predicate pushing” in the execution plan.  Ideally the predicates (where clause conditions) are applied as early in the execution as possible to lower the row count that the rest of the query has to work on and join to.  SQL Server usually does this, but in some cases, like this one, it stopped doing it.
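To make the ordering concrete, the two plans behave as if SQL Server had rewritten the query into the two logically different shapes sketched below (the column list is trimmed way down here purely to illustrate the point):

-- Good plan: behaves as if the filter runs first, then the aggregate
select custid, acctid, sum(price1) as price1
from vwCustomerHoldings
where effectivedate = '6/21/2013'
group by custid, acctid;

-- Bad plan: behaves as if the entire view is aggregated into a worktable, then filtered
select *
from
(
    select custid, acctid, effectivedate, sum(price1) as price1
    from vwCustomerHoldings
    group by custid, acctid, effectivedate
) x
where effectivedate = '6/21/2013';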

The other red flag in the output is that the bad plan includes this warning:

Warning: Null value is eliminated by an aggregate or other SET operation.

The good plan doesn’t have this in the output.  This warning is issued whenever you perform aggregate functions on columns that contain NULLs.  SQL Server disregards them so you don’t have a sum() getting voided by doing a “+ NULL”.  The fact that we see this on the bad plan and not the good one is a further indication that the entire vwCustomerHoldingsAggregate view is being run (with all of its aggregates and group bys) BEFORE the where clause is applied.  In the good plan, the where clauses (effective date and code filters) are applied first (via predicate pushing), so that when the aggregate logic from the vwCustomerHoldingsAggregate view is finally executed, there aren’t any NULL values to eliminate and the message is not output.
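You can reproduce the warning in isolation with a tiny aggregate over a NULL (this repro uses a table value constructor, so it needs SQL 2008 or later):

-- Aggregating a column that contains a NULL raises the same warning
select sum(val) as total
from (values (1), (2), (null)) as t(val);
-- Warning: Null value is eliminated by an aggregate or other SET operation.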

There are a few approaches that we investigated to help eliminate this problem.

  • Use an Indexed View to materialize the columns and simplify the plan – This won’t work because you can’t have an indexed view that references another view (see here)
  • Convert the top level view (vwCustomerHoldingsAggregate) to a stored procedure – This would allow us a lot more flexibility in using query hints or even a plan guide to force a known good plan on calls to the procedure.
  • Don’t use the top level view at all; instead, rewrite all the places in the application that use the vwCustomerHoldingsAggregate view and replace them with inline SQL that performs the aggregates against the underlying view.

The last 2 options are not great for us because the application uses the vwCustomerHoldingsAggregate view in over 50 places, and it would be difficult to make such a sweeping change without a lot of development and testing effort.  While the view is referenced in many places by the application, it is only in a handful of places that we are seeing this bad query plan issue.

We tested a few different query hints and found that OPTION (Force Order) always results in the optimal execution plan for single day calls like the one in question (<5 seconds response time).  So in a couple of places in the application, we are having our developers tweak the sql to look like this:

Select *
from vwCustomerHoldingsAggregate
Where effectivedate = '6/21/2013'
and code1 in ('val1','val2','val3','val4','val5','val6','val7','val8','val9')
OPTION (FORCE ORDER)

This query hint works in our case because the underlying tables that our where clauses apply to are among the first few tables in the FROM clause of the lower level vwCustomerHoldings view.  Forcing the order in which the tables are joined and filtered in the execution plan has resulted in a consistent, good plan for our problem query, so we have a stopgap in place to keep things performing within acceptable ranges.  However, this is not a good long term solution for us.  We will need to address the areas where we have nested views and put in the effort to convert them to parameterized queries or stored procedures so that we can better control query plan generation in the future.
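As an example of that longer term direction, a stored procedure version of the aggregate call might look roughly like the sketch below; the procedure name and parameter are made up for illustration, and the hint (or a plan guide) would only stay in place while the optimizer misbehaves:

create procedure dbo.uspCustomerHoldingsAggregate
    @EffectiveDate datetime
as
begin
    set nocount on;

    -- same aggregate logic as vwCustomerHoldingsAggregate, but with the filter (and any hints) under our control
    select custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate,
           sum(price1) as price1, sum(price2) as price2, sum(price3) as price3
    from vwCustomerHoldings
    where effectivedate = @EffectiveDate
    group by custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate
    option (force order);
end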

Here are a few links about this scenario as well:

http://stackoverflow.com/questions/7183198/does-sql-server-propagate-where-conditions-in-complex-views
http://jahaines.blogspot.com/2010/03/performance-tuning-101-what-you-will_17.html
http://stackoverflow.com/questions/6653979/why-does-putting-a-where-clause-outside-view-have-terrible-performance

Baby Steps – Central Monitoring Database

Central Database
The first thing to do is set up a dedicated central monitoring database (which I will refer to as “DBAMonitoring” moving forward).  This database will be used to house things like:
1) A central table of all SQL Server instances to monitor
2) Various data tables to hold performance data collections
3) Supporting lookup/translation tables (if necessary)
Once you create a new database called DBAMonitoring, create a table called DB_ServerMain:

CREATE TABLE [dbo].[DB_ServerMain](
    [DBServerID] [smallint] IDENTITY(1,1) NOT NULL,
    [ServerName] [varchar](128) NOT NULL,
    [ServerDescr] [varchar](200) NULL,
    [SessionCollectionInterval] [int] NULL,
    CONSTRAINT [PK_DBServerID] PRIMARY KEY CLUSTERED
    (
        [ServerName] ASC
    ) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY]

Next let’s add a few rows to the table for SQL instances that we want to monitor.

insert into [DB_ServerMain] (ServerName, ServerDescr, SessionCollectionInterval) Values ('localhost', 'My local server', 5)
insert into [DB_ServerMain] (ServerName, ServerDescr, SessionCollectionInterval) Values ('SQLServer2', 'Test SQL Server Environment', 5)

Now we have a basic table in place to keep a list of SQL Server instances we want to monitor and a couple of records that map to existing SQL Servers in our environment. We can start developing a powershell script template that can loop through all of the servers in the table (and eventually perform some action like capturing the results from DMV queries).  The basic powershell functions that I’ll describe next include:

1) Connect to the DBAMonitoring database and fetch a recordset of the SQL servers from DB_ServerMain.
2) Loop through each server and, for each one:

  1. Connect to the sql instance
  2. Run a dynamic query to collect some pertinent set of monitoring data
  3. Commit that data to a local table in the DBAMonitoring database (e.g. DB_WaitStats, DB_DBSpaceUsed, etc.)

My Thoughts on Powershell
When powershell first came out several years ago, I was hesitant to embrace it.  It seemed like yet another language and framework to learn that didn’t offer me any game changing functionality (i.e., there weren’t too many things I couldn’t get done with some combination of tSQL scripts and VBScript or .bat files).  That said, I have steadily built up what I call a “working capability” in powershell programming and have found it fairly straightforward.  Also, over the last few years the amount of information available online (message board posts, blogs, and even the MSDN help) has made it much simpler to cobble together working powershell scripts and get answers to issues that you encounter.  The powershell scripts I will be elaborating on in this series of posts were developed based on fragments of other scripts I found online and adapted for our purposes.

Consider the powershell script code below.  It connects to my localhost SQL Server instance and my new DBAMonitoring database using Windows authentication (Integrated Security=SSPI) with a standard SQL connection string syntax:

$ServerAConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"

These links have more information on SQL Server connection strings in powershell:
http://sqlnovice.blogspot.com/2009/08/sql-server-connection-string-format-for.html
http://www.connectionstrings.com/sql-server-2008

#----   BEGIN CODE   ----
try
{
    # create a connection string to connect to the DBAMonitoring instance and database
    $ServerAConnectionString = "Data Source=localhost;Initial Catalog=DBAMonitoring;Integrated Security=SSPI;"
    $ServerAConnection = new-object System.Data.SqlClient.SqlConnection($ServerAConnectionString);

    # create a DataSet to hold the DataTable from the DB_ServerMain table
    # we will loop through this recordset next
    $dataSet = new-object "System.Data.DataSet" "DBServers"
    $query = "SET NOCOUNT ON;"
    $query = $query + "Select * from DB_ServerMain"

    # create a DataAdapter which we will use to populate the data set (like a recordset)
    $dataAdapter = new-object "System.Data.SqlClient.SqlDataAdapter" ($query, $ServerAConnection)
    $dataAdapter.Fill($dataSet) | Out-Null

    # "$a" is one row from the result set; reference specific fields with $a.fieldname from within the loop
    foreach($a in $dataSet.Tables[0].Rows)
    {
        # This is the area of code where "per server" queries can be executed and then saved back to the central database
        write-host "DBServer: " $a.ServerName
    }
}
catch
{
    # if there is an error in the powershell execution, catch it and return it to the screen
    $ex = $_.Exception
    Write-Host "Error: $($ex.Message)"
}
#----   END CODE   ----

If you can, download and install powershell 3.0 (http://www.microsoft.com/en-us/download/details.aspx?id=34595).  It has a nice Integrated Scripting Environment (ISE) that makes it easier to develop and run powershell scripts.  Start up the Powershell ISE (Start -> All Programs -> Accessories -> Windows Powershell -> Windows Powershell ISE).


Paste the powershell code above into the code window of the ISE and then click the Run Script button (F5) to execute it.  You will see all of the code appear in the execution window at the bottom, followed by the results:

[Screenshot: ISE console output listing each server name from DB_ServerMain]

As you can see above, the sample code simply opens a recordset of SQL Server names and loops through them, printing each to the console (powershell “write-host” calls).  Now save the code as a .ps1 powershell script file (just use the regular Save button in the ISE) and then run it (green triangle button).  This time the execution just runs the .ps1 file instead of all the code line by line in the console:

[Screenshot: ISE console output after executing the saved .ps1 file]

Having the code in a .ps1 file makes it a lot easier to schedule SQL Server Agent jobs that simply execute various powershell scripts from a directory (instead of having LOTS of powershell code embedded in SQL Agent job steps).
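As a rough example, a job that runs a saved script could be created with something like the T-SQL below (the job name, step name, and script path are placeholders; a schedule can then be attached with sp_add_schedule/sp_attach_schedule or through the Agent GUI):

use msdb;
go
-- Create a job with a single CmdExec step that launches the saved .ps1 file
exec dbo.sp_add_job @job_name = N'DBA - Monitoring Collection';
exec dbo.sp_add_jobstep
     @job_name  = N'DBA - Monitoring Collection',
     @step_name = N'Run powershell collection script',
     @subsystem = N'CmdExec',
     @command   = N'powershell.exe -ExecutionPolicy Bypass -File "C:\DBAMonitoring\Scripts\ServerLoop.ps1"';
exec dbo.sp_add_jobserver @job_name = N'DBA - Monitoring Collection', @server_name = N'(local)';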

Simple test case complete!  In my next post we’ll build on this powershell looping script to collect SQL Server Wait stats for troubleshooting purposes.

Central SQL Monitoring and Performance Data Collection: Buy vs. Build

I work at a mid-sized firm with a team of DBAs managing more than 150 SQL Server instances.  The instances range from SQL Server 2005 to SQL Server 2012 and vary considerably in overall storage and usage profile (we have several sql environments hosting 3rd party applications on <100GB of storage, and we have internally developed data warehouses that are >50TB).  The company has historically used a third party monitoring solution to provide centralized performance data collection and alerting.  While these tools typically work well and can collect just about any data you might be interested in from SQL Server, we have some reservations about sticking with the 3rd party monitoring tool.  Some of these concerns include:

1) The recurring licensing costs can be significant since you are paying per sql instance and we monitor production, QA, and development environments.  At 150 SQL instances monitored, the yearly costs just for licensing are in the tens of thousands.

2) Enabling data collections is simple (file and wait stats, session collection, perfmon counters, etc.), but the reporting provided is often generic and not terribly helpful in performing common DBA activities like researching and documenting a root cause analysis.

3) Adding new collections not supported “out of the box” can range in difficulty from complicated to impossible.

4) There is a tendency to start using these 3rd party monitoring tools as a crutch.  “The monitoring tool says everything is ok, so it must not be a database issue…” is a common response and indicates an inability or unwillingness to put in the extra effort to debug and troubleshoot an issue to its conclusion.

With these considerations in mind, over the last few weeks we have begun to develop a set of replacement monitoring routines internally that will eventually allow us to remove the 3rd party SQL Server monitoring software we use today.  We had a few discussions on the best way to implement a central SQL Server monitoring and performance data collection solution and came up with the following requirements:

  1. The solution should not require any software on the monitored SQL Server instances (an “agentless” solution)
  2. The solution should be developed using standard technical components so that new team members will be able to ramp up easily and assist in the ongoing work of improving and enhancing the custom monitoring system
  3. We should be able to enable/disable some or all of the monitoring components on a server by server basis.

The phase 1 functionality that we decided to implement includes:

  • Session Viewer – capture active session information every 2 minutes and store it for 7 days.  This information is extremely helpful in responding to questions from product managers like “Was anything going on in SQL Server between 10:12PM and 10:30PM last night?”.  This report will show running queries, blocking chains, etc.
  • Wait Stats – capture instance level wait stats every 1-2 hours to assist in identifying high level bottlenecks in SQL instance performance (a minimal example of the collection query appears after this list).
  • Aggregated dashboard of our disaster recovery status – we use log shipping, mirroring, and AlwaysOn availability groups to handle disaster recovery, and we wanted to implement a central dashboard report (and supporting data collection routine) to track disaster recovery delays and issues.
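To give a sense of the wait stats item above, the query the PowerShell loop could run against each instance is just a snapshot of the waits DMV, along the lines of this sketch (the capture_time column is our own addition for the central table):

-- Cumulative instance-level waits; deltas between snapshots show what the instance has been waiting on
select getdate() as capture_time,
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
from sys.dm_os_wait_stats
where wait_time_ms > 0;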

With the scope limited to these 3 key areas for our initial implementation, we began to think about the simplest way to develop code to handle it.  In the interest of sticking with what we know (and what we assume future team members will know when they are hired and brought into the mix), we settled on a technical approach that includes the following components:

  1. PowerShell scripts – to connect to a set of sql servers and collect information by executing TSQL queries (DMVs, system views, and any other app-specific tables that are meaningful)
  2. SQL Server database – we want a central SQL Server 2012 database that can house our master server list, configuration settings for each instance, and the raw data collected
  3. SQL Server Agent – to schedule the execution of the various scripts

To reiterate, our goal is to keep the implementation of our central monitoring and data collection system as simple as possible and based on technologies that future hires will either already know or easily ramp up on.  We didn’t want to trade a black box 3rd party solution for an overly complicated home grown solution that only a few existing DBAs on the team would know how to expand and enhance.

The most basic components of the solution are the central DB with a control table to maintain the list of SQL Servers we need to monitor, along with some configuration flags (additional columns) to enable/disable/configure certain areas of monitoring (e.g. wait stats collection interval, session collection interval and retention, etc.).  The other piece of “infrastructure” from a code perspective is a working PowerShell harness loop to pull the list of SQL Servers in question and iterate through each one.  Once we have this generic looping code in place, we can build a series of PowerShell scripts that loop through the servers and collect SQL performance information to the central database.  My next post will look at these initial components.
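Purely as an illustration of those configuration flags, the control table could carry columns along these lines (the names here are placeholders, not our final schema):

-- Hypothetical per-server monitoring switches and intervals
alter table dbo.DB_ServerMain add
    CollectWaitStats         bit not null default (1),  -- enable/disable wait stats collection for this server
    WaitStatsIntervalMinutes int null,                  -- e.g. 60-120
    SessionRetentionDays     int null;                  -- e.g. keep 7 days of session history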

When a login failure isn’t a login failure

I put together a powershell script that connects to a set of sql server instances (~100 total with 2005/2008/2012 in the mix) and collects some information about the current state of TempDB.  It works great for 98% of the instances, but then I hit this error for a couple of our older SQL 2005 servers and couldn’t collect data:

Exception calling "Open" with "0" argument(s): "Cannot open database "TempDB" requested by the login. The login failed.  Login failed for user 'domain\someadminaccount'." ---> System.Data.SqlClient.SqlException: Cannot open database "TempDB" requested by the login. The login failed.

Login failed… should be simple enough to fix, right?  I went to the problem 2005 instances and was surprised to see that the domain account I am using is already in the sysadmin role.  I could log in to the instances using SSMS under the domain account.  I could even run my query against TempDB in a query window, so it wasn’t a SQL 2005 vs. SQL 2008+ DMV difference.  Then I noticed… case sensitivity!

On these particular SQL 2005 instances, we are using a case sensitive server collation required by some of the databases they host.  My powershell connection string was attempting to connect to the “TempDB” database and not the “tempdb” database.  So on 98% of our SQL instances it worked fine (i.e. “TempDB” = “tempdb”), but for the case sensitive instances it failed with the unhelpful error message above.  It would have been nice to get a ‘The database “TempDB” doesn’t exist’ type of message so I could have figured this out in a couple of minutes instead of an hour.  Hope this saves someone else that hour.
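A quick way to spot these instances ahead of time is to check the server collation before building the connection string:

-- A collation name containing CS (e.g. Latin1_General_CS_AS) means server-level name comparisons, including database names, are case sensitive
select serverproperty('Collation') as server_collation;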

Introduction

I am not really sure where to begin with this, so I am going to start adding posts in the next couple of weeks covering a series of tsql/powershell scripts that collect SQL Server performance metrics to a central database, along with several supporting SSRS reports that render the data in meaningful ways to help you diagnose and troubleshoot common sql server issues.  More to come…