Data movement when partitioning existing table in SQL Server

The other day, Jørgen Guldmann (@jorgenguldmann) asked if data would move if we would partition an existing table. The table was quite large, billions of rows, so moving data could take quite a bit of time. He wanted all the existing table in one partition, and then new data loaded in the future to be placed in new partitions. I have seen this with a number of Data Warehouses, where the partitioning strategy for the fact tables was designed and applied after the database had been put into production and been running for a while. So, I thought to test it out to make sure I would provide a good answer.

In one of my previous blog posts, I wrote about when a partition split move data. So the question is, would we see the same behavior where we see LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log?

tl;dr: We don’t see data movement with LOB_INSERT_ROWS and LOB_DELETE_ROWS entries in the transaction log. However, the creation of the clustered index copy the data pages from the old structure (heap or existing clustered index) to the new structure – even if it is on the same filegroup. So yes, data will move and it can have a significant performance impact.

In my test, I have a small-ish heap called Facts.Sales, with 1,000,000 rows, in a database called PartitonTest. I have then created a right range partition function SalesPF on the TransactionDate column and a partition scheme SalesPS, which are using the same filegroup as the heap. The partitioning function would have the boundary value right of the existing values in TransactionDate, making sure that all existing data will end up in the first partition.

To partition the existing table, we need to create a clustered index on the heap. If the table was a clustered index, we would need to do the same thing using WITH (DROP_EXISTING = ON).

But before we create the clustered index, and partition our table, let’s take a look at the pages the data is located on in our heap by using DBCC IND:

DBCC IND ('PartitionTest', 'Facts.Sales', 1);

This shows that data is placed on the data pages between page 8 and page 3131.

Now, let’s create the clustered index on the partition scheme SalesPS:

CREATE UNIQUE CLUSTERED INDEX CIX_FactsSales
ON Facts.Sales(SalesID, TransactionDate)
ON SalesPS(TransactionDate);

After this, DBCC IND shows us that the data has been moved to the data pages between page 24,600 and page 27,599.

So, data has been moved.

If we take a look at the Books Online article for index disk space, it states that:

Whenever an index is created, rebuilt, or dropped, disk space for both the old (source) and new (target) structures is required in their appropriate files and filegroups. The old structure is not deallocated until the index creation transaction commits. Additional temporary disk space for sorting operations may also be needed.

This aligns with what we saw in the above test.

If we take a look at the transaction log using fn_dblog for the entries related the transaction for the clustered index creation, we only get 3,123 entries. These entries are related to metadata operations and extend allocation. This is a very small footprint on the transaction log, compared to the when data is moved due to a partition split. I reran the test where data is placed in several partitions, and the result was the same. That being said, large-scale index operations can generate large data loads that can cause the transaction log to fill quickly.

In conclusion, when designing Data Warehouses and large fact tables, it is a good idea to design the partition strategy up front rather than apply it after the database is already in production and the fact table contains a large amount of data. If applied afterwards, partitioning an already existing table will cause data movement when creating the clustered index and have a significant performance impact.

Data movement when partitioning existing table in SQL Server

Upgrading to SQL Server 2014 – Campus Days talk (slides in Danish)

Below you will find slides, demos, and references for my Campus Days talk on “Upgradring to SQL Server 2014”. The slides are in Danish, as it was a Danish conference.

Slides

Demos

Demos can be found at github.

Further reading / references

Upgrading to SQL Server 2014 – Campus Days talk (slides in Danish)

Integration Services (SSIS) for the DBA

Below you will find slides, demos, and references for my PASS Summit 2014 talk on “Integration Services (SSIS) for the DBA”.

Slides

Demos

Demos can be found at github.

Further reading / references

Integration Services (SSIS) for the DBA

SQLRally Nordic in Copenhagen – Full day training on SSIS internals and performance

SQLRally 2015On 2 March 2015, I am giving a full day pre-conference training session on SSIS internals and performance at SQLRally Nordic in Copenhagen

You have worked with SQL Server Integration Services (SSIS) for a few years, but more often than not, you see performance problems with your packages. Some packages are running slow, while others are taking up more memory and CPU than before.

The first part of this full-day session will cover the internals of SSIS. Starting with an overview, we will move into the details of the control flow engine and the data flow engine. At the end of the internals section, you will have a deep understanding of the strengths and weaknesses of SSIS and what is the best way to design your packages.

The second part will cover the architecture of the SSIS Server, which came with SQL Server 2012. We will dive into the SSIS catalog and the host process used to execute the packages.

The third part will cover how we can use the knowledge about the internals of SSIS to gain better performance. This part will show practical examples of common performance problems, and how these can be solved. Furthermore we will look at how designing and tuning your data flows can increase performance significantly.

Join me for a day full of fun with SSIS. Super-early bird price is ending on 27 October 2014. Register before your colleague.

SQLRally Nordic in Copenhagen – Full day training on SSIS internals and performance

Regional and language settings for Oracle client used with SQL Server linked server

Today, I ran into a problem while migrating a Data Warehouse solution to another server. The solution have linked servers for extracting data from an Oracle source system.

The query would go something like this:

INSERT INTO dbo.MyTable
SELECT *
FROM OPENQUERY(UNDPFSP1ICC, 'SELECT
	TO_CHAR(COL1_DT, ''YYYY'') AS Col1Year
	, COL2
	, COL3
	, SUM(AMOUNT) AS Amount
	, SYSDATE
FROM MYTABLE
GROUP BY
	TO_CHAR(COL1_DT, ''YYYY'')
	, COL2
	, COL3
')

Everything had been migrated, but while running the extract job, which ran queries like the above, we would get errors like these:

Conversion failed when converting the nvarchar value '123456,78' to data type int.

Error converting data type nvarchar to numeric.

The problem was that instead of using a dot (.) as the decimal symbol, a comma (,) was being returned. This would make the data convert to nvarchar, and an implicit conversion to a NUMERIC or INT would fail.

Everything was running fine on the old server. The collation of the server and databases were identical (Latin1_General_CI_AS), but the regional and language settings were different. On the old server they were set to English (United States) and on the new server to Danish. Changing the settings on the new server to English (United States) did not help.

After troubleshooting a bit on #sqlhelp with Amit Banerjee (twitter | blog) and William Durkin (twitter | blog), as well as via Skype with Jørgen Guldmann (twitter | blog), Jørgen ended up pointing me in the right direction.

The problem as that the NLS_LANG parameter was set to Danish (where we use comma (,) as decimal symbol). It is stored in the registry under the following subkey:

HKEY_LOCAL_MACHINESOFTWAREORACLEHOMEIDNLS_LANG

Where ID is the unique number identifying the Oracle home.

The trick was to change this subkey from:

DANISH_DENMARK.WE8MSWIN1252

To:

AMERICAN_AMERICA.WE8MSWIN1252

And then everything worked fine, and the decimal symbol was now a dot (.) instead of a comma (,).

Regional and language settings for Oracle client used with SQL Server linked server

SQLSaturday Johannesburg & Cape Town pre-con: Understanding SSIS internals and performance tuning

I am very happy and super excited to announce that I will be doing a full day pre-con the day before SQLSaturday #327 Johannesburg on 29 August, as well as SQLSaturday #301 Cape Town on 5 September 2014. Having lived in South Africa, I have had the pleasure to visit Johannesburg, Cape Town and Stellenbosch on a number of occasions. I must say, I can’t wait to go back!

Two years ago, I started speaking at SQL Server conferences, when Jody Roberts (twitter) pinged me on Twitter and asked if I didn’t want to give a talk at SQLSaturday #134 Johannesburg. I gave my first talk on Layered Partitioning, and it was such an awesome experience that I decided to go all in; which has resulted in three big conferences (and quite a few smaller ones) this year.

So, this year I am going back to South Africa where it all started to give a full day pre-con on Understanding SSIS internals and performance tuning.

Abstract:

You have worked with SQL Server Integration Services (SSIS) for a few years, but more often than not, you see performance problems with your packages. Some packages are running slow, while others are taking up more memory and CPU than before. But why are they slow or taking up a lot of resources? What is actually going on behind the scene? Join this full-day session to find out.

The first part will cover the internals of SSIS. Starting with an overview, we will move into the details of the control flow engine and the data flow engine. At the end of the internals section, you will have a deep understanding of the strengths and weaknesses of SSIS and what is the best way to design your packages.

The second part will cover how we can use the knowledge about the internals of SSIS to gain better performance. This part will show practical examples of common performance problems, and how these can be solved. Furthermore we will look at how designing and tuning your data flows can increase performance significantly.

If you are in Johannesburg or Cape Town (or feel like visiting the beautiful South Africa), I would be happy to see you at my pre-con.

Sign up for a full day of understanding SSIS internals and performance tuning in Johannesburg on 28 August.

Sign up for a full day of understanding SSIS internals and performance tuning in Cape Town on 5 September.

SQLSaturday Johannesburg & Cape Town pre-con: Understanding SSIS internals and performance tuning