A few days ago, I ran into a very frustrating problem while trying to rip a large XML document into a table-structure in T-SQL on 2008. I am setting up a new procedure to process large files (Download Sample File) that we get from the market into our databases. These files are each about 92K lines of XML that I need to get into a table-structure, for further processing. So, I start with my normal X-Path style SQL:
DECLARE @priceDataString VARCHAR(MAX), @priceDataXML XML SELECT @priceDataString = BulkColumn FROM OPENROWSET(BULK'C:\Temp\EnergyPrices92KLines.xml', SINGLE_BLOB) fileData SET @priceDataXML = @priceDataString IF OBJECT_ID('TempDB..#extractedData') IS NOT NULL DROP TABLE [#extractedData] CREATE TABLE [#extractedData] ( [deliveryDate] CHAR(10), [hourEnding] CHAR(5), [settlementPoint] VARCHAR(50), [price] DECIMAL(20,13), [dstFlag] CHAR(1) ) INSERT INTO [#extractedData] ( [deliveryDate], [hourEnding], [settlementPoint], [price], [dstFlag] ) SELECT ref.value('DeliveryDate[1]', 'CHAR(10)') AS deliveryDate, ref.value('HourEnding[1]', 'CHAR(5)') AS hourEnding, ref.value('SettlementPoint[1]', 'VARCHAR(50)') AS settlementPoint, ref.value('SettlementPointPrice[1]', 'DECIMAL(20,13)') AS price, ref.value('DSTFlag[1]', 'CHAR(1)') AS dstFlag FROM @priceDataXML.nodes ('/EnergyPrices/EnergyPrice') R(ref) |
Normally, in my experience, this type of code would execute in a few seconds, making it an efficient way to rip through XML. This time was different though. This time, the clock kept ticking, after playing with it a few times, assuming I had done something dreadfully wrong, I finally just let it go all the way until it finished. It took 3 hours to complete this query on my work desktop. I’m sure it is clear that this is unacceptable. So, I start troubleshooting.
My first step in troubleshooting was to remove the “INSERT INTO [#extractedData]” clause, just to see how slowly it is returning each row. To my surprise, the entire process returns in 3 seconds. That is quite a change from 3 hours to 3 seconds, so I at least know that the problem is that a bad plan is being chosen due to the insert. I’m confusing the optimizer somehow.
Now, I know I could have just went into SSIS, or had the developers rip the XML in C# before passing it on to me, but this SHOULD work, and I wanted to know why it wasn’t. So, I jump on Twitter and send out a “#sqlhelp” for anyone who can help me think this one through. Right away, Paul Randal (Blog | Twitter), Jonathan Kehayias (Blog | Twitter), Robert Davis (Blog | Twitter) all jump in to try to help me sort out why this strange behavior is occurring. They each gave me some great suggestions to try, but nothing was working. Jonathan tried out my exact code, to try to replicate my problem, but it wasn’t working the same way for him, his was returning in just a few seconds.
At this point, I was really confused, we had tried every thing we could think of, and could only surmise that it was something with my configuration that was causing the problem. But, I went ahead and tried my exact query on one of my 2008 R2 instances and it worked perfectly! Just like on Jonathan’s server. This made me a bit relieved, since it didn’t seem to be my code that was bad, but this left me even more baffled as to why the optimizer would interpret my code so poorly on 2008. Jonathan is opening a case for this and I will post a link to that once he has done so.
Ultimately, Jonathan had another suggestion to try, one that made the query perform optimally on 2008 as well, I will post that here, in case anyone else runs into the same problem:
DECLARE @priceDataString VARCHAR(MAX), @priceDataXML XML SELECT @priceDataString = BulkColumn FROM OPENROWSET(BULK'C:\Temp\EnergyPrices92KLines.xml', SINGLE_BLOB) fileData SET @priceDataXML = @priceDataString IF OBJECT_ID('TempDB..#sourceData') IS NOT NULL DROP TABLE [#sourceData] CREATE TABLE [#sourceData] ( [xmlData] XML ) INSERT INTO [#sourceData] ( [xmlData] ) VALUES ( @priceDataXML ) IF OBJECT_ID('TempDB..#extractedData') IS NOT NULL DROP TABLE [#extractedData] CREATE TABLE [#extractedData] ( [deliveryDate] CHAR(10), [hourEnding] CHAR(5), [settlementPoint] VARCHAR(50), [price] DECIMAL(20,13), [dstFlag] CHAR(1) ) INSERT INTO [#extractedData] ( [deliveryDate], [hourEnding], [settlementPoint], [price], [dstFlag] ) SELECT ref.value('(EnergyPrice/DeliveryDate)[1]', 'CHAR(10)') AS deliveryDate, ref.value('(EnergyPrice/HourEnding)[1]', 'CHAR(5)') AS hourEnding, ref.value('(EnergyPrice/SettlementPoint)[1]', 'VARCHAR(50)') AS settlementPoint, ref.value('(EnergyPrice/SettlementPointPrice)[1]', 'DECIMAL(20,13)') AS price, ref.value('(EnergyPrice/DSTFlag)[1]', 'CHAR(1)') AS dstFlag FROM ( SELECT ref.query('.') AS ref FROM [#sourceData] CROSS APPLY [xmlData].nodes ('/EnergyPrices/EnergyPrice') R(ref) ) AS SourceData |
Ripping XML in T-SQL can be a huge chore sometimes, but it makes it even worse when the database engine is not behaving the way that you expect it to. I hope this post can save someone else from the headaches I had with this sort of code. Huge thanks to Jonathan for his continued help on troubleshooting this annoying problem.