In Business Intelligence and data science there has always been this debate about whether a star schema is better than a snowflake schema or vice versa. This post, is not about that argument. This is a discussion that comes after the choice has been made.
Once you have chosen whether to go star or snowflake, there is the next question which is: "Is there a case whereby lumping everything into the fact with a few dimensions is better than having a streamlined fact table with many dimensions? "
In this discussion, I will try arguing for both cases from my experience thus far. For my debate regarding star vs snowflake, you may want to look at:
Big Fact, few dimensions
Many people can argue to put everything into the fact table. This is not necessarily wrong per se. It takes all the hard work out of modelling correctly and making sure the data links together properly.
There is a case for this, the case to join everything into one gigantic, monstrous Fact table.
For example: If the de-normalising of an ultra-normalised database makes it hard to de-normalise, the pressure is there to just lump almost everything into the Fact and have a few dimensions. There are advantages and disadvantages.
- All needed data is found in the Fact
- It is easier to query one table for fields instead of having many hops through dimensions.
- Increases the size of the Fact table
- Reduces performance of a dashboard
- Reload time increases due to the huge volumes of the Fact table
- Un-used fields are loaded into the dashboard
- Incorrect measures could be calculated due to potential duplicates arising from joins
Small Fact, Many Dimensions
As mentioned above, best practice is to have a Fact table with links/keys to dimensions and measures that will be used on the dashboard. The Business Intelligence tool, Power BI, does this well. Although other tools and cubes should also be built in this manner. Building your model like this also promotes a separation of concern.
We should always keep in mind that dimensions are for describing entities. Just as we have in development, a class should describe an object so too we have in Data warehousing that dimensions describe entities and the Fact is used to link all of them together with its measures.
- The size of the Fact table is smaller
- Performance of a dashboard is faster
- Reload time decreases due to the low volumes of the Fact table
- Un-used fields are not loaded into the dashboard
- Duplicates in Dimensions should not be allowed therefore more accurate measures
- Linking to the Fact table must be precise
- Some measures could be incorrect if linking is not done well
We should create a fact table and dimensions that we would like to use in our model. We should decide either to put everything into the Fact table or into one Fact and dimensions that will link to the Fact through a field.
In the below example, we have created a Fact table and left joined some possible measures and fields. In the next instance, we decide whether to concatenate some dimensions in also or make them their own dimensions. We have the following Fact built:
Two possible dimensions are concatenated into the Fact table. This operation will be easier to access the fields from one table, that being the Fact table. This operation however, takes long to load into the Fact and, this means that separation of responsibility is violated. This is not best practice in any data modelling operation.
The above is the same as the first example but in this case, we create dimensions with a link to the Fact table. This operation is usually quick. We just store the table separately from the Fact table. In a real-world example, this operation could decrease reload time by ~20+%.
In hindsight, it should be stated that that it is not best practice to do this. A Fact table should consist of several keys linking to different dimensions with many measures. This takes the burden off the fact table.
The dimensions should be used to describe an entity and the fact should be freed of this. The role of the Fact table should be to contain keys and measures. This measure will then be used to determine KPI’s on the front-end of the dashboard.
Modelling is 80% and creating the dashboard is only 20% of the whole work.