Introduction
During the implementation of my last project, I came across a POC (Proof of Concept) situation where we transferred data from the source system to an MS SQL stage DB and built a Tabular model directly on top of the stage to show the first results. As the fact table contained a lot of data with repeated values, we used a clustered columnstore index table as the target storage. I was surprised by how good a compression ratio we achieved, and after some time a heretical question popped up in my mind: should I even look up dimension ID keys? Should I use surrogate keys at all? What benefit would I get?
Topic Description
I think I first need to describe the project situation in a bit more detail. We had a fact table with 70 million records: 30 dimension code columns (string codes) and one value column. Our target was to have an SSAS model on top of this data, and we chose Tabular for its simplicity for a POC and for giving MSBI rookies a better understanding of the concept.
For the POC I simply retrieved the fact table data from the data source and transferred it to MSSQL with no transformation. As data space was an issue, I created the target MSSQL table as a clustered columnstore index table (a minimal sketch of such a table is shown below). I expected a high compression ratio, but the reality was much better than my expectations: I was able to fit 70 million rows into 1 GB of table space.
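To illustrate, here is a minimal sketch of such a stage table. The schema, table, and column names (stage.FactMeasure, DimCode01, MeasureValue) are invented for this post, not the actual project objects, and only two of the roughly 30 dimension code columns are shown:

```sql
-- Illustrative sketch only: object and column names are invented for this post.
CREATE TABLE stage.FactMeasure
(
    DimCode01    NVARCHAR(50)   NOT NULL,  -- one of the ~30 dimension code (string) columns
    DimCode02    NVARCHAR(50)   NOT NULL,
    -- ... remaining dimension code columns omitted ...
    MeasureValue DECIMAL(18, 4) NOT NULL   -- the single value column
);

-- Store the whole table as a clustered columnstore index to get the columnar compression.
CREATE CLUSTERED COLUMNSTORE INDEX CCI_FactMeasure ON stage.FactMeasure;
```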
The original plan was to go with the “classic” DWH approach, which means loading dimensions with a surrogate key (SK) and then looking up this key during the fact load (sketched below). But as we continued with the implementation of the POC (Tabular model, DAX, ETL) and I got to know the data and the project better, an idea popped up: what impact would not using a surrogate key have in this case? Of course, I know a bunch of reasons to use SKs, but this implementation was somewhat special, and none of those reasons really fit this specific project.
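For contrast, here is a hedged sketch of that classic pattern for a single dimension, again with invented object names (dw.DimProduct, dw.FactMeasure); the point is the extra lookup join in the fact load:

```sql
-- Illustrative sketch of the classic SK approach; names are invented for this example.
CREATE TABLE dw.DimProduct
(
    ProductSK   INT IDENTITY(1, 1) NOT NULL PRIMARY KEY,  -- surrogate key
    ProductCode NVARCHAR(50)       NOT NULL UNIQUE        -- business (natural) code from the source
);

-- Fact load with the lookup step: the string code is replaced by the integer SK.
INSERT INTO dw.FactMeasure (ProductSK, MeasureValue)
SELECT d.ProductSK,
       s.MeasureValue
FROM   stage.FactMeasure AS s
JOIN   dw.DimProduct     AS d
  ON   d.ProductCode = s.DimCode01;
```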
Here are a few thoughts on this topic from a technical perspective:
- An SK would be necessary if we implemented Multidimensional SSAS, but we use Tabular, which gains no benefit from this type of design.
- Using an integer instead of an nvarchar would definitely save space in a rowstore table with this amount of data, but is that also the case with a clustered columnstore table?
- From an implementation perspective, looking up and replacing the dimension code with a surrogate ID is more work to implement, and the ETL is of course slower as well due to the lookups.
- A clustered columnstore table is slower for bulk data loads and slower for mass retrieval of complete rows.
In fact, I was not sure about the second point above (int vs. nvarchar in a columnstore), and this was a great opportunity to check it out. So I tested it.
Test Steps:
- I used the system views sys.column_store_segments and sys.column_store_dictionaries to get the amount of data stored for one compressed column of my fact table (a sketch of such a query is shown after this list).
- I chose one of the dimension columns with the highest cardinality, with 70k distinct values in it.
- Then I cloned the fact table, changed the type of the selected column to int, and inserted the fact data with a lookup to the dimension table to retrieve the integer dimension key.
- After that I had two identical fact tables; the only difference was the data type (int vs. nvarchar) of the selected column.
- Then I compared the column sizes based on the metadata.
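A query along these lines can sum the compressed segment and dictionary sizes per column. It is a sketch only: the table name is illustrative, and the direct join to sys.columns assumes a clustered columnstore index. You would run it against both variants of the fact table and compare the figures for the tested column:

```sql
-- Sketch of a per-column size query over the columnstore metadata views.
SELECT  col.name                               AS column_name,
        SUM(sz.on_disk_size) / 1024.0 / 1024.0 AS on_disk_size_mb
FROM
(
    -- compressed segment data ...
    SELECT hobt_id, column_id, on_disk_size FROM sys.column_store_segments
    UNION ALL
    -- ... plus the dictionaries built for the column
    SELECT hobt_id, column_id, on_disk_size FROM sys.column_store_dictionaries
)       AS sz
JOIN    sys.partitions AS part
        ON  part.hobt_id = sz.hobt_id
JOIN    sys.columns AS col
        ON  col.object_id = part.object_id
        AND col.column_id = sz.column_id
WHERE   part.object_id = OBJECT_ID(N'stage.FactMeasure')  -- illustrative table name
GROUP BY col.name
ORDER BY on_disk_size_mb DESC;
```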
Conclusion
The result of the data volume comparison was as expected: the integer-typed column consumed less space, but only slightly. The difference was 2 MB out of 22 MB. I consider such a space saving minor, so it cannot serve as an argument in favor of surrogate keys. Weighing this against the implementation effort and the performance cost of the lookup steps, I decided to go, for now, with the simplified scenario without surrogate keys. I understand the risk taken by this decision, but since we decided to generate the whole DW from metadata and keep the DW under tests, I accept the risk of a possible rework.