Category Archives: XML

Poor Execution Plan While Ripping Large XML in SQL Server 2008

A few days ago, I ran into a very frustrating problem while trying to rip a large XML document into a table-structure in T-SQL on 2008.  I am setting up a new procedure to process large files (Download Sample File) that we get from the market into our databases.  These files are each about 92K lines of XML that I need to get into a table-structure, for further processing.  So, I start with my normal X-Path style SQL:

DECLARE @priceDataString		VARCHAR(MAX),
		@priceDataXML			XML
 
SELECT
	@priceDataString = BulkColumn
FROM
	OPENROWSET(BULK'C:\Temp\EnergyPrices92KLines.xml', SINGLE_BLOB) fileData
 
SET @priceDataXML = @priceDataString
 
IF OBJECT_ID('TempDB..#extractedData') IS NOT NULL
		DROP TABLE [#extractedData]
 
CREATE TABLE [#extractedData]
	(
	[deliveryDate]			CHAR(10),
	[hourEnding]			CHAR(5),
	[settlementPoint]		VARCHAR(50),
	[price]					DECIMAL(20,13),
	[dstFlag]				CHAR(1)
	)
 
INSERT INTO
	[#extractedData]
	(
	[deliveryDate],
	[hourEnding],
	[settlementPoint],
	[price],
	[dstFlag]
	)
SELECT
	ref.value('DeliveryDate[1]', 'CHAR(10)') AS deliveryDate,
	ref.value('HourEnding[1]', 'CHAR(5)') AS hourEnding,
	ref.value('SettlementPoint[1]', 'VARCHAR(50)') AS settlementPoint,
	ref.value('SettlementPointPrice[1]', 'DECIMAL(20,13)') AS price,
	ref.value('DSTFlag[1]', 'CHAR(1)') AS dstFlag
FROM
	@priceDataXML.nodes ('/EnergyPrices/EnergyPrice') R(ref)

Normally, in my experience, this type of code would execute in a few seconds, making it an efficient way to rip through XML.  This time was different though.  This time, the clock kept ticking, after playing with it a few times, assuming I had done something dreadfully wrong, I finally just let it go all the way until it finished.  It took 3 hours to complete this query on my work desktop.  I’m sure it is clear that this is unacceptable.  So, I start troubleshooting.

My first step in troubleshooting was to remove the “INSERT INTO [#extractedData]” clause, just to see how slowly it is returning each row.  To my surprise, the entire process returns in 3 seconds.  That is quite a change from 3 hours to 3 seconds, so I at least know that the problem is that a bad plan is being chosen due to the insert.  I’m confusing the optimizer somehow.

Now, I know I could have just went into SSIS, or had the developers rip the XML in C# before passing it on to me, but this SHOULD work, and I wanted to know why it wasn’t. So, I jump on Twitter and send out a “#sqlhelp” for anyone who can help me think this one through.  Right away, Paul Randal (Blog | Twitter), Jonathan Kehayias (Blog | Twitter), Robert Davis (Blog | Twitter) all jump in to try to help me sort out why this strange behavior is occurring.  They each gave me some great suggestions to try, but nothing was working.  Jonathan tried out my exact code, to try to replicate my problem, but it wasn’t working the same way for him, his was returning in just a few seconds.

At this point, I was really confused, we had tried every thing we could think of, and could only surmise that it was something with my configuration that was causing the problem.  But, I went ahead and tried my exact query on one of my 2008 R2 instances and it worked perfectly!  Just like on Jonathan’s server.  This made me a bit relieved, since it didn’t seem to be my code that was bad, but this left me even more baffled as to why the optimizer would interpret my code so poorly on 2008.  Jonathan is opening a case for this and I will post a link to that once he has done so.

Ultimately, Jonathan had another suggestion to try, one that made the query perform optimally on 2008 as well, I will post that here, in case anyone else runs into the same problem:

DECLARE @priceDataString		VARCHAR(MAX),
		@priceDataXML			XML
 
SELECT
	@priceDataString = BulkColumn
FROM
	OPENROWSET(BULK'C:\Temp\EnergyPrices92KLines.xml', SINGLE_BLOB) fileData
 
SET @priceDataXML = @priceDataString
 
IF OBJECT_ID('TempDB..#sourceData') IS NOT NULL
	DROP TABLE [#sourceData]
 
CREATE TABLE [#sourceData]
	(
	[xmlData] XML
	)
 
INSERT INTO
	[#sourceData]
	(
	[xmlData]
	)
VALUES
	(
	@priceDataXML
	)
 
IF OBJECT_ID('TempDB..#extractedData') IS NOT NULL
	DROP TABLE [#extractedData]
 
CREATE TABLE [#extractedData]
	(
	[deliveryDate]			CHAR(10),
	[hourEnding]			CHAR(5),
	[settlementPoint]		VARCHAR(50),
	[price]					DECIMAL(20,13),
	[dstFlag]				CHAR(1)
	)
 
INSERT INTO
	[#extractedData]
	(
	[deliveryDate],
	[hourEnding],
	[settlementPoint],
	[price],
	[dstFlag]
	)
SELECT
	ref.value('(EnergyPrice/DeliveryDate)[1]', 'CHAR(10)') AS deliveryDate,
	ref.value('(EnergyPrice/HourEnding)[1]', 'CHAR(5)') AS hourEnding,
	ref.value('(EnergyPrice/SettlementPoint)[1]', 'VARCHAR(50)') AS settlementPoint,
	ref.value('(EnergyPrice/SettlementPointPrice)[1]', 'DECIMAL(20,13)') AS price,
	ref.value('(EnergyPrice/DSTFlag)[1]', 'CHAR(1)') AS dstFlag
FROM
	(
	SELECT
		ref.query('.') AS ref
	FROM
		[#sourceData]
 
			CROSS APPLY [xmlData].nodes ('/EnergyPrices/EnergyPrice') R(ref)
	) AS SourceData

Ripping XML in T-SQL can be a huge chore sometimes, but it makes it even worse when the database engine is not behaving the way that you expect it to.  I hope this post can save someone else from the headaches I had with this sort of code.  Huge thanks to Jonathan for his continued help on troubleshooting this annoying problem.