As part of T-SQL Tuesday, I wanted to discuss one of the experiences I have had with dealing with Big Data in SQL.
Over the last few years, our development group has created a new flagship product that our entire enterprise now uses for day-to-day activities. It is a system that gives business users direct control over the way their data is calculated and manipulated, without us having to change compiled code or database procedures. As a result, we had to create an extremely abstracted data model, in which data cannot really be grouped by any kind of “business” concept, only by general types such as “inputs”, “calculated data”, and “configuration”. All of this is fed into our system from various sources, then recorded, versioned, calculated, and reported out through various methods.
It is also a requirement that we keep every version of any data we receive or calculate, as well as every version of the “equations” our business users have created that reference those data. This lets them explain and revisit business decisions based on the data that was available at the time operations made those decisions. It also makes audits and compliance checks completely traceable and transparent.
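A minimal sketch of what such an abstract, versioned model might look like follows; the table and column names here are illustrative guesses, not the actual production schema.

```sql
-- Hypothetical illustration only; the real schema is not shown in this post.
CREATE TABLE dbo.DataPoint
(
    DataPointId   BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    DataTypeId    INT            NOT NULL,  -- "inputs", "calculated data", "configuration", ...
    SourceId      INT            NOT NULL,  -- which feed or calculation produced the value
    EffectiveDate DATE           NOT NULL,  -- business date the value applies to
    VersionNumber INT            NOT NULL,  -- rows are never updated in place;
                                            -- each resettlement adds a new version
    Value         DECIMAL(28,10)     NULL,
    ReceivedUtc   DATETIME2(3)   NOT NULL DEFAULT SYSUTCDATETIME()
);
```

The key property is that nothing is ever overwritten: a resettlement inserts a new `VersionNumber` for the same `EffectiveDate`, which is what makes audits fully traceable.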
As you may realize, this makes for an enormously complex, but flexible and powerful, system. It also generates a very quickly growing amount of data. The system has only been in production for 13 months, and we have already added 3.2 TB of data (with about 6 billion rows in our largest table), and the growth rate is increasing as we incorporate more areas of the business into this model. We are currently seeing growth of over 350 GB per month in our production database, and we expect that to exceed 500 GB per month by the end of the year. Due to the nature of our industry, the data is continually resettled by the market, and heavily so for the first 6 months; the data for a day 6 months ago will end up at least 4 times as large as the data processed for a day today. This is by no means the largest database or the largest growth rate out there, but for us it’s a huge jump from our legacy systems.
So, we have had to rethink the way we address this data. Our old ways of optimizing and handling the database didn’t translate 1:1 to this new VLDB. This showed up most clearly in one particular design consideration early in development.
When I was still developing the database, we were only receiving 5 to 10 GB of inputs and calculated data per day. However, I knew these were still the early stages, that the amount of data was going to grow significantly, and I wanted to be ready before we got there. So, I started testing partitioning schemes that would divide up the data, keeping it easily queried but still sufficiently separated.
Initially, I tried to create a variation of the sliding window scenario: months in the future dividing into weeks, weeks dividing into days for the days surrounding “today”, then collapsing back into weeks and months in the past. The scheme was very complex to set up, but I got the management script created and working. However, it caused an issue on the switch-out: data was unavailable for too long while the weeks were being collapsed into months. With our resettlement timeline, that was unacceptable. So, I called Kimberly Tripp (Blog | Twitter), the partitioning expert.
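In rough T-SQL terms, one cycle of that sliding-window maintenance might look like the following; the partition function, scheme, and table names are hypothetical, not taken from the actual script.

```sql
-- Illustrative sliding-window cycle, assuming a RANGE RIGHT partition
-- function on the date key. Names are hypothetical.

-- 1. Add a new boundary out in the future so incoming data has a home.
ALTER PARTITION SCHEME psInputData NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pfInputData() SPLIT RANGE ('20110801');

-- 2. Switch an aging daily partition out to a staging table so its rows
--    can be reloaded into a coarser weekly/monthly partition.
ALTER TABLE dbo.InputData SWITCH PARTITION 2 TO dbo.InputData_Staging;

-- 3. Remove the now-empty daily boundary.
ALTER PARTITION FUNCTION pfInputData() MERGE RANGE ('20110501');
```

The SWITCH itself is a metadata-only operation, but the collapse step, moving the switched-out rows back into the coarser partitions, is where the data stayed unavailable for too long.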
After a conversation about what I was trying to accomplish, she suggested that I stop trying to do a rolling-window scenario for this particular case. Instead, since I wanted the granularity of a single day, she suggested partitioning by day, but creating a separate table for each year and then wrapping those yearly tables up in a partitioned view. I implemented a variation of this, based on a BIGINT key that encodes both the day in question and an indicator for which market I was dealing with (so that my data could be split between markets as well).
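A sketch of that design might look like the following; the packing scheme, table, and view names are my illustration of the idea, not the production objects.

```sql
-- Hypothetical combined key: (YYYYMMDD * 100) + MarketId,
-- e.g. 2011-07-15 in market 2 -> 2011071502.

CREATE TABLE dbo.CalcData_2011
(
    PartitionKey BIGINT         NOT NULL,  -- day + market
    Value        DECIMAL(28,10)     NULL,
    -- The CHECK constraint is what lets the optimizer eliminate whole
    -- yearly tables when querying through the partitioned view.
    CONSTRAINT CK_CalcData_2011
        CHECK (PartitionKey BETWEEN 2011010100 AND 2011123199)
)
-- In production this would sit on a partition scheme whose function has
-- one boundary per day of the year (created separately).
ON psCalcData_2011 (PartitionKey);
GO

CREATE VIEW dbo.CalcData
AS
SELECT PartitionKey, Value FROM dbo.CalcData_2010
UNION ALL
SELECT PartitionKey, Value FROM dbo.CalcData_2011;
```

A query that filters on `PartitionKey` can then touch only the relevant year’s table, and within it only the relevant daily partitions.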
This has proven to be a very successful approach: database round-trips of 7 to 12 million rows each remain very quick and completely within the scope of our users’ needs. It has also made my partition maintenance very simple. I only have to add another year out into the future, once a year, to keep the machine rolling. And, once my data has gotten to a place where the business users can allow me to archive off old data, I’ll have a clear delineation along which to divide it. It has been a clear win for my situation and shows promise of scalability far beyond where we are even today.
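The once-a-year maintenance then reduces to something like the following; again, these names are illustrative rather than the real objects.

```sql
-- Create next year's table (on its own daily partition scheme, created
-- separately) and fold it into the partitioned view.
CREATE TABLE dbo.CalcData_2012
(
    PartitionKey BIGINT         NOT NULL,
    Value        DECIMAL(28,10)     NULL,
    CONSTRAINT CK_CalcData_2012
        CHECK (PartitionKey BETWEEN 2012010100 AND 2012123199)
);
GO

ALTER VIEW dbo.CalcData
AS
SELECT PartitionKey, Value FROM dbo.CalcData_2010
UNION ALL
SELECT PartitionKey, Value FROM dbo.CalcData_2011
UNION ALL
SELECT PartitionKey, Value FROM dbo.CalcData_2012;
```

Archiving old data is equally clean under this scheme: an entire year’s table can be dropped from the view and moved off without touching current data.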