I came across an issue in one of our production SQL Servers yesterday that had to do with nested views and sudden, terrible performance of once highly responsive queries. First a little background….
Views on Top of Views
In one of our application databases, we have a complex view that is used to denormalize financial transaction information to make it easier for users to access all the various dates, amounts, and other identifying flags in a single “selectable” object. This view works well for our purposes and has been reliably quick to return results (we maintain the underlying indexes and statistics and have optimized the tSQL in the view to a reasonable level).
In an effort to extend this approach of giving users access to a simplified version of the underlying data through a view, a second view was developed to provide users with aggregate data. This second view performs a select on the first view with several sum and count functions and a long group by list. So we have a “nested view” scenario with fairly complicated query logic in each view.
View 1 – Denormalized Data
create view vwCustomerHoldings as
Select custid, acctid, fundID, fundName, code1, code2, price1, price2, price3, tradedate, effectivedate
From <~20 tables with various join conditions>
View 2 – Aggregates of the denormalized data
create view vwCustomerHoldingsAggregate as
Select custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate,sum(price1), sum(price2), sum(price3)
From vwCustomerHoldings
GROUP BY custid, acctid, fundID, fundName, code1, code2, tradedate, effectivedate
For months and months we had no issues with various users pulling data with queries like this:
Select *
from vwCustomerHoldings
Where effectivedate = ‘6/21/2013’
and code1 in (‘val1′,’val2′,’val3′,’val4′,’val5′,’val6′,’val7′,’val8′,’val9’)
The typical response time was 3-5 seconds which is well within acceptable bounds for this application. Yesterday, execution duration jumped up to ~2 minutes for the same call.
We checked the statistics on the underlying tables first and they were fresh within the last 6 hours. (I have a stats age script I have been using for a long time from Glenn Berry that I got here.) There were no index or underlying table structure changes and there hadn’t been a significant change to the underlying data (other than the usual overnight load of yesterday’s transactions). So we started executing the query in management studio and looking at the IO stats and the query plans. As you might imagine the query plans associated with calls like this are very large and difficult to dissect online, so I’m going to focus on the results of SET STATISTICS IO ON which were just as revealing of the issue.
First we ran the query against a dev system with a smaller data set, and got a good result (4 second response time). The Statistics IO information is below:
It has a high read profile (~400K reads on average for the good plan), but that is OK in our environment. When we run the same query on the production data set, the statistics IO looks a LOT different and has a few key elements that reveal what is happening;
The first thing that stands out is the HUGE IO on the Worktable object (>100 Million Reads). Obviously, SQL Server has chosen a less than optimal plan to execute the TSQL on our production instance. My initial reaction is “How can there be so many reads with that Worktable? Its almost like it’s reading the entire underlying data set with no where clause at all….!??!”…
It turns out that is exactly what is happening. This is an example of a failure of the query optimizer to pick an execution plan that applies the WHERE clause first (or at least earlier in the execution). The result is that the entire data set from vwCustomerHoldings is being aggregated by vwCustomerHoldingsAggregate before the where effectivedate=’6/21/2013′ is applied to the underlying data set. So that massive Worktable entry in the statistics IO is evidence that all of the data is being run without any where clauses being applied until the end. This is a failure to execute something call “Predicate Pushing” in the execution plan. Ideally the predicates (where clause conditions) are applied as early in the execution as possible to lower the row count that needs to be worked on and joined to by the rest of the query. SQL Server usually does this, but in some cases, like this one, it stopped doing it.
The other red flag in the output is that the bad plan has a:
Warning: Null value is eliminated by an aggregate or other SET operation.
The good plan doesn’t have this in the output. This warning is issued whenever you perform aggregate functions on fields that have a null. SQL Server disregards them so you don’t have a sum() getting voided by doing a “+ NULL”. The fact that we see this on the bad plan and not the good is a further indication that the entire vwCustomerHoldingsAggregate view is being run (with all of it’s aggregates and group bys) BEFORE the where clause is applied. In the good plan, the where clauses (effective date and code filters) are applied first (via predicate pushing) so that when the aggregate logic from the vwCustomerHoldingsAggregate view is finally executed, there aren’t any NULL values to eliminate so the message is not output.
There are a few approaches that we investigated to help eliminate this problem.
- Use an Indexed View to materialize the columns and simplify the plan – This won’t work because you can’t have an indexed view that references another view (see here)
- Convert the top level view (vwCustomerHoldingsAggregate) to a stored procedure – This would allow us a lot more flexibility in using query hints or even a plan guide to force a known good plan on calls to the procedure.
- Don’t use the top level view, but instead re-write all the places of the application using the vwCustomerHoldingsAggregate view and replace them with inline SQL to perform the aggregates against the underlying view.
The last 2 options are not great for us because the application is using the vwCustomerHoldingsAggregate view in over 50 places and would be difficult to make such a sweeping change without a lot of development and testing effort. While the view is referenced in many places by the application, it is only in a handful of places where we are seeing this bad query plan issue.
We tested a few different query hints and found that OPTION (Force Order) always results in the optimal execution plan for single day calls like the one in question (<5 seconds response time). So in a couple of places in the application, we are having our developers tweak the sql to look like this:
Select *
from vwCustomerHoldings
Where effectivedate = ‘6/21/2013’
and code1 in (‘val1′,’val2′,’val3′,’val4′,’val5′,’val6′,’val7′,’val8′,’val9’)
OPTION (FORCE ORDER)
This query hint works in our case, because the underlying tables that our where clauses apply to are among the first few tables in the FROM clause of the vwCustomerHoldings lower level view. Forcing the order in which the tables are joined and filtered in the execution plan has resulted in a consistent, good plan for our problem query so we have a stop gap in place to keep things performing within acceptable ranges. However, this is not a good long term solution for us. We will need to address areas where we have nested views and put in the effort to convert them to parameterized queries or stored procedures so that we can better control query plan generation in the future.
Here are a few links about this scenario as well:
http://stackoverflow.com/questions/7183198/does-sql-server-propagate-where-conditions-in-complex-views
http://jahaines.blogspot.com/2010/03/performance-tuning-101-what-you-will_17.html
http://stackoverflow.com/questions/6653979/why-does-putting-a-where-clause-outside-view-have-terrible-performance