Data movement when partitioning existing table in SQL Server

The other day, Jørgen Guldmann (@jorgenguldmann) asked if data would move if we would partition an existing table. The table was large, billions of rows, so moving data could take a long time. He wanted all the existing table in one partition, and then new data loaded to be placed in new partitions. I have seen this with some Data Warehouses, where the partitioning strategy for the fact tables was designed and applied after the database had been put into production and been running for a while. So, I thought to test it out to make sure I would provide a right answer. 

In one of my previous blog posts, I wrote about when a partition split move data. So the question is, would we see the same behavior where we see LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log? 

tl;dr: We don’t see data movement with LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log. However, the creation of the clustered index copy the data pages from the old structure (heap or existing clustered index) to the new structure – even if it is on the same file group. So yes, data will move and it can have a significant performance impact. 

In my test, I have a small-ish heap called Facts.Sales, with 1,000,000 rows, in a database called PartitonTest. I have then created a right range partition function SalesPF on the TransactionDate column and a partition scheme SalesPS, which are using the same file group as the heap. The partitioning function would have the boundary value right of the existing values in TransactionDate, making sure that all existing data will end up in the first partition. 

To partition the existing table, we need to create a clustered index on the heap. If the table were a clustered index, we would need to do the same thing using WITH (DROP_EXISTING = ON)

However, before we create the clustered index, and partition our table, let’s look at the data pages in our heap by using DBCC IND:

DBCC IND ('PartitionTest', 'Facts.Sales', 1);

The output shows that the data pages contain the data between page 8 and 3131. 

Now, let’s create the clustered index on the partition scheme SalesPS:

CREATE UNIQUE CLUSTERED INDEX CIX_FactsSales
ON Facts.Sales(SalesID, TransactionDate)
ON SalesPS(TransactionDate);

After this, DBCC IND shows us that the data has been moved to the data pages between page 24,600 and page 27,599. So, data has been moved. If we look at the Books Online article for index disk space, it states that:

Whenever an index is created, rebuilt, or dropped, disk space for both the old (source) and new (target) structures is required in their appropriate files and file groups. The old structure is not deallocated until the index creation transaction commits. Additional temporary disk space for sorting operations may also be needed.

This aligns with what we saw in the above test. 

If we look at the transaction log using fn_dblog for the entries related to the transaction for the clustered index creation, we only get 3,123 entries. These entries are related to metadata operations and extent allocation. This is a minimal footprint on the transaction log, compared to the when data is moved due to a partition split. I reran the test where data is placed in several partitions, and the result was the same. That being said, large-scale index operations can generate large data loads that can cause the transaction log to fill quickly

In conclusion, when designing Data Warehouses and large fact tables, it is a good idea to design the partition strategy upfront rather than apply it after the database is already in production and the fact table contains a significant amount of data. If applied afterward, partitioning an existing table will cause data movement when creating the clustered index and have a significant performance impact.

Data movement when partitioning existing table in SQL Server