I’m sure there are a thousand posts on the best way to partition a table and manage it, but I wanted to share a “quick and dirty” solution we put together in my shop in the last week. Here’s a little background….
We have a table called ApplicationLog that is a centralized table for logging application activity across multiple apps within our environment. The table typically gets ~200K rows per month but in some heavier testing/release months, that number can climb into the tens of millions (depending on how verbose the logging is set to and if there are any botched releases that result in a flood of applog messages). At the time the DBA team got involved, there was a monthly purging job that just ran a “delete from applicationlog where date<getdate()-180” to trim out entries older than 180 days. This worked for a couple years without incident, but then more applications started logging their messages to the table and the volume of data just got too high. The problem presented itself when the monthly purging job started getting deadlocks as it attempted to delete rows that were (presumably) on the same pages as new messages being logged. We wanted to use partitioning on this table to prevent deadlocks and blocking during the purge process.
First, let’s take a look at the table to be partitioned:
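The table looked roughly like this. Only the identity ID column, the Date column, and the clustered index on ID are certain; the other columns are illustrative assumptions for a typical application log:

```sql
-- Sketch of the original table. ID (identity) and Date are real;
-- Application/Severity/Message are assumed columns for illustration.
CREATE TABLE dbo.ApplicationLog (
    ID          BIGINT IDENTITY(1,1) NOT NULL,
    Date        DATETIME             NOT NULL,
    Application VARCHAR(100)         NULL,  -- assumed
    Severity    VARCHAR(20)          NULL,  -- assumed
    Message     VARCHAR(MAX)         NULL,  -- assumed
    CONSTRAINT PK_ApplicationLog PRIMARY KEY CLUSTERED (ID)
);
```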
Pretty basic table, no funky datatypes, and the ID is an identity. It also had a single index, clustered on the identity ID column… even though no one ever queries by the ID… classic developer stuff. The basic steps we followed to set up the table for partitioning include:
- Define a partition function
- Define a partition scheme
- Define a new table with the same structure (ApplicationLog_Part)
- Define a new clustered index on the Date field (using the partition scheme)
- Populate the new partitioned table with the data from the existing table
- Rename the table (ApplicationLog becomes ApplicationLog_Backup and ApplicationLog_Part becomes ApplicationLog)
Let’s take a look at these one by one. The partition function is the piece of code that defines the boundaries between partitions. In our case this is simply a series of dates on the first of each month. The code below illustrates the partition function and scheme we used:
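A condensed sketch of the two statements. The scheme name ps_dbo_ApplicationLog is the one we actually used; the function name, the starting year, and the RANGE RIGHT choice are assumptions for illustration (RANGE RIGHT is the natural fit for dates, since each boundary date belongs to the month it starts):

```sql
-- Partition function: one boundary on the first of each month.
-- RANGE RIGHT means partition N holds rows where
-- boundary(N-1) <= Date < boundary(N).
-- Condensed here; the real version lists boundaries out to 12/1/2025.
CREATE PARTITION FUNCTION pf_dbo_ApplicationLog (DATETIME)
AS RANGE RIGHT FOR VALUES
('2016-01-01', '2016-02-01', '2016-03-01', '2016-04-01' /* ... '2025-12-01' */);

-- Partition scheme: maps every partition to a filegroup (all to PRIMARY here)
CREATE PARTITION SCHEME ps_dbo_ApplicationLog
AS PARTITION pf_dbo_ApplicationLog ALL TO ([PRIMARY]);
```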
The first statement creates the partition function with the historical and future partition boundaries predetermined. The version shown is condensed for readability, but the real version has partition boundaries going out to 12/1/2025. We added all the boundaries up front because we wanted to keep things simple and not have to dynamically regenerate the partition function with different boundary definitions each month.
Also notice that at the time of definition, a partition function has nothing to do with any particular table or data set. It is merely a series of boundary definitions with a single input parameter (of the same datatype as the partition boundary values), sort of like a big CASE statement.
The partition scheme is just the construct that applies the partition function as rows are inserted into an underlying table, mapping each partition to a filegroup.
The next step was to define a new table with a structure identical to ApplicationLog and then create a clustered index (one that uses the partition scheme/function) on it. One of our requirements was that we couldn’t change the structure of the table at all, so that the existing logging code would continue to work without any changes. So we made a new version of the table called ApplicationLog_Part:
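The new table is the same definition, minus the clustered primary key (the partitioned clustered index comes next). As before, the columns other than ID and Date are assumptions:

```sql
-- Same structure as ApplicationLog (non-key columns assumed),
-- created as a heap for now -- the partitioned clustered index is added next
CREATE TABLE dbo.ApplicationLog_Part (
    ID          BIGINT IDENTITY(1,1) NOT NULL,
    Date        DATETIME             NOT NULL,
    Application VARCHAR(100)         NULL,
    Severity    VARCHAR(20)          NULL,
    Message     VARCHAR(MAX)         NULL
);
```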
After the table is created, we define a new clustered index on the Date field (not the identity ID) and use the ON clause of the CREATE INDEX statement to specify how the data should be stored as it is inserted. In that index definition we reference the partition scheme ps_dbo_ApplicationLog(Date), passing it the column it needs to determine which partition to route each row to. The next step is to populate the new table with the last 120 days of data from the current table:
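Put together, the index and the load looked roughly like this (the index name and the non-key columns are assumptions; the scheme name and the 120-day window are from the steps above):

```sql
-- Clustered index on Date, stored ON the partition scheme so rows are
-- routed to the correct monthly partition
CREATE CLUSTERED INDEX CIX_ApplicationLog_Part_Date
ON dbo.ApplicationLog_Part (Date)
ON ps_dbo_ApplicationLog (Date);

-- Copy the last 120 days, preserving the original identity values
SET IDENTITY_INSERT dbo.ApplicationLog_Part ON;

INSERT INTO dbo.ApplicationLog_Part (ID, Date, Application, Severity, Message)
SELECT ID, Date, Application, Severity, Message
FROM dbo.ApplicationLog
WHERE Date >= GETDATE() - 120;

SET IDENTITY_INSERT dbo.ApplicationLog_Part OFF;

-- Second query: sanity-check the distribution of rows across partitions
SELECT $PARTITION.pf_dbo_ApplicationLog(Date) AS partition_number,
       COUNT(*)                               AS row_count
FROM dbo.ApplicationLog_Part
GROUP BY $PARTITION.pf_dbo_ApplicationLog(Date)
ORDER BY partition_number;
```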
In our case we had to preserve the identity column, so you see the IDENTITY_INSERT SET operation at the top. Then you simply perform an insert from the original table into the new (_part) table. Once the insert finishes, you can evaluate the distribution of data with the second query. The final step is to rename the tables and reset the identity value.
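The swap itself is just two renames plus an identity check, something along these lines:

```sql
-- Swap the tables: the old table becomes the backup,
-- the partitioned copy takes over the original name
EXEC sp_rename 'dbo.ApplicationLog',      'ApplicationLog_Backup';
EXEC sp_rename 'dbo.ApplicationLog_Part', 'ApplicationLog';

-- Verify/correct the identity value so new inserts continue
-- from the highest ID that was copied over
DBCC CHECKIDENT ('dbo.ApplicationLog');
```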
As shown above, you can simply rename the tables and then reset the identity configuration on the _part table (now named ApplicationLog). Now you have a working partitioned table called ApplicationLog that accepts inserts from all the existing code without issues.
The final piece I’ll cover is handling the routine deletion of aging data. Since the table is now partitioned, we don’t need to run delete statements that can contend with applications writing to the table. Instead we can take advantage of the SWITCH operation. The SWITCH PARTITION option allows you to instantaneously move a whole partition of data out of the original partitioned table and into its own separate table.
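A sketch of what the monthly purge can look like with SWITCH. The staging table name, index name, and the way the target partition number is computed are assumptions; the key requirement is that the staging table has the same structure and clustered index as the partitioned table and sits on the same filegroup:

```sql
-- One-time setup: a non-partitioned staging table matching the
-- structure and clustered index of ApplicationLog
CREATE TABLE dbo.ApplicationLog_Switch (
    ID          BIGINT IDENTITY(1,1) NOT NULL,
    Date        DATETIME             NOT NULL,
    Application VARCHAR(100)         NULL,
    Severity    VARCHAR(20)          NULL,
    Message     VARCHAR(MAX)         NULL
);
CREATE CLUSTERED INDEX CIX_ApplicationLog_Switch_Date
ON dbo.ApplicationLog_Switch (Date);

-- Monthly purge: find the partition that has aged past 180 days...
DECLARE @p INT = $PARTITION.pf_dbo_ApplicationLog(DATEADD(DAY, -180, GETDATE()));

-- ...and switch it out. This is a metadata-only operation, so it is
-- effectively instantaneous and doesn't block the apps writing new rows.
ALTER TABLE dbo.ApplicationLog
SWITCH PARTITION @p TO dbo.ApplicationLog_Switch;

-- The aged rows can now be discarded without touching the live table
TRUNCATE TABLE dbo.ApplicationLog_Switch;
```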
The great thing about it is that the operation is practically instantaneous regardless of the number of rows in the partition, because it is a metadata-only change. There are countless overviews of partitioning out there, and many of them handle the complexity of dynamically managing the partition function boundaries by recreating the partition function on a regular schedule. My “quick and dirty” approach doesn’t handle those complexities. Instead I simply define the partition function with boundaries predefined and going out to the end of 2025 (long past when I’d expect this ApplicationLog table to actually be in service). Another element we did not have to address was placing partitions on different physical files/drives. That can be useful when you have tiered storage and want to keep your recent partitions on your fastest underlying disks and your aging partitions on slower, cheaper disks.
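For completeness, a tiered-storage layout just means listing filegroups explicitly in the partition scheme instead of using ALL TO. This is purely illustrative (we didn’t do it), and the filegroup names are made up; note the scheme needs one filegroup entry per partition, which is one more than the number of boundaries:

```sql
-- Hypothetical tiered scheme: older partitions on cheap storage,
-- recent ones on fast storage. FG_SLOW / FG_FAST must already exist.
-- With 4 boundaries in the function there are 5 partitions to map.
CREATE PARTITION SCHEME ps_ApplicationLog_Tiered
AS PARTITION pf_dbo_ApplicationLog
TO (FG_SLOW, FG_SLOW, FG_SLOW, FG_FAST, FG_FAST);
```

The mapping is fixed at creation time, which is one more thing a fully dynamic approach would have to manage as partitions age.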
PS> sorry for all the images of code, but it looks nicer with the SSMS formatting. If you need any of the code, I can send it along.