Giant wide denormalized fact OR star schema design (many millions of rows)
I recently joined a project where the data model design is based on a single concatenated fact = GOOD. Personally, I have found that on data models with several million+ rows, a concatenated fact is the best design.
However, the developers are denormalizing all dimensional attributes from all dimension tables into the single concatenated fact table, instead of creating a star schema that keeps only the dimension keys in the fact.
I have always used a star schema, as from what I have seen over the years it is best practice for Qlik data model design. Let's throw out the notion of single Excel spreadsheets here, and focus on enterprise best-practice Qlik data model development.
The developers are claiming that even with many millions of rows, the non-star approach (a single wide table with all dimensional attributes embedded) is better than a star schema. They base that argument solely on the claim that since Qlik stores data in a columnar fashion, the star schema approach adds no performance value over a giant wide single fact table.
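For readers less familiar with the two designs being debated, here is a minimal load-script sketch of each (all table, field, and file names are invented for illustration; Qlik associates tables automatically wherever field names match, so the star case needs no explicit join):

```
// --- Star schema: fact carries only keys and measures ---
Fact:
LOAD OrderID, CustomerID, ProductID, SalesAmount
FROM [lib://Data/Fact.qvd] (qvd);

Customer:                              // associates to Fact via CustomerID
LOAD CustomerID, CustomerName, Region
FROM [lib://Data/Customer.qvd] (qvd);

// --- Wide-table alternative: every dimension joined into the fact ---
WideFact:
LOAD OrderID, CustomerID, ProductID, SalesAmount
FROM [lib://Data/Fact.qvd] (qvd);

LEFT JOIN (WideFact)                   // denormalizes customer attributes onto every fact row
LOAD CustomerID, CustomerName, Region
FROM [lib://Data/Customer.qvd] (qvd);
```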
In my opinion, the benefits of a star schema data model design far outweigh those of a single flat wide giant table, and those benefits are too long to list for this discussion.
A single, wide table is in fact the structure that has the best response time. So on this point, the developers are right.
However, there are other drawbacks with this approach: First of all, a star schema is conceptually better since it better corresponds to the real-life situation. Imagine that you have no prior knowledge about the data model and want to understand how things are related. Then, just looking at a star schema data model will help you understand.
Also, a star schema will help you write the correct expressions. For example, in a star schema situation, the expression “Count(Customer)” could be the right expression to use, but with a single fact table you would have to write “Count(distinct Customer)” to get the right number.
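Henric's point can be illustrated with a small hypothetical sketch: if one customer has three order lines, the wide table repeats that customer's name three times, so a plain count over-counts:

```
// Star schema: Customer table holds one row per customer,
// so a plain count of the field gives the number of customers.
Count(CustomerName)

// Wide table: CustomerName repeats on every fact row,
// so DISTINCT is required to avoid counting one customer per order line.
Count(DISTINCT CustomerName)
```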
A star schema is also easier to maintain: Imagine that you want to tweak your data model by adding a table. Then it is easier to make a mistake in the single table case: The join could potentially cause duplicate rows. But in the star schema case it is just a matter of linking an additional table without changing any of the existing ones.
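The maintenance difference can be sketched the same way (again with invented names). In the star schema, a new attribute table simply associates on its shared key; in the wide-table model, the same data must be joined in, and a non-unique key can silently multiply fact rows:

```
// Star schema: just load the new table; it links on ProductID.
ProductGroup:
LOAD ProductID, GroupName
FROM [lib://Data/ProductGroup.qvd] (qvd);

// Wide table: the same data must be joined into the fact.
// If ProductID is not unique in the source, this join duplicates
// fact rows and silently inflates measures like Sum(SalesAmount).
LEFT JOIN (WideFact)
LOAD ProductID, GroupName
FROM [lib://Data/ProductGroup.qvd] (qvd);
```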
So, I would recommend a star/snowflake schema in most cases. I would only use the single table approach where I really need it: for extremely large datasets.
Thanks Henric for responding. For extremely large data sets (a 100M+ row wide table), I am thinking I would change the design to either (1) use a summary table for performance and then document chain to a detail app, or (2) use a "direct discovery" (or the current Qlik technology for that) approach to send targeted queries directly to the database. I am thinking those approaches would allow me to satisfy high performance while keeping all the design benefits that come with a star schema.
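Option (1) above might look roughly like this in the summary app's script (a hedged sketch only; field and file names are invented, and the grouping grain would depend on the actual reporting requirements). The detail app would then hold the full fact, with selections passed across via document chaining or, in current Qlik Sense terms, On-Demand App Generation:

```
// Summary app: pre-aggregate the 100M+ row fact at reload time
// to a monthly grain per customer and product.
SummaryFact:
LOAD CustomerID,
     ProductID,
     MonthStart(OrderDate) AS OrderMonth,
     Sum(SalesAmount)      AS SalesAmount,
     Count(OrderID)        AS OrderCount
FROM [lib://Data/Fact.qvd] (qvd)
GROUP BY CustomerID, ProductID, MonthStart(OrderDate);
```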