Hello everyone,
I have a question that has been bugging me. What factors decide whether it is acceptable to join one or more dimension tables to the fact table? I think I know some of them, but I would appreciate a more well-founded opinion.
Thanks,
Pedro.
This is an old slide we used to use for QlikView 11, and it mostly still applies to QlikView 12.20 as it stands today.
In QlikView 12, the biggest change has been the introduction of the newer QIX engine (the one Qlik Sense has used since the beginning), which stores values in a more columnar way than the associative engine used in QlikView 11. This affects resource consumption overall, both storing/writing and retrieving/reading.
My main statement prevails: whatever makes sense for your data. I have applications reloading every 8 minutes, and other apps reloading once a day. Obviously, I cannot apply the same script to the former as to the latter, even if all the best practices in the world say so. The scripts that run every 8 minutes I want to be fast, even if that means doing less transformation in the script and more in the front-end expressions.
However, for the task that runs daily, I can afford some more transformations and spare the front-end that workload.
In addition, experience will tell you when to use ApplyMap() instead of JOIN, even if JOIN seems the more straightforward technical choice.
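To see why the choice matters, here is a minimal sketch in plain Python (not Qlik script; the table data is made up) of the semantic difference: a mapping lookup like ApplyMap() takes the first match and never changes the row count of the table being loaded, while a JOIN produces one row per matching key, so duplicate keys silently inflate the fact table.

```python
# ApplyMap()-style lookup vs JOIN, illustrated with plain Python structures.
facts = [
    {"ProductID": 1, "Qty": 5},
    {"ProductID": 2, "Qty": 3},
]
# Dimension table with an accidental duplicate key (a data-quality issue).
dim = [
    {"ProductID": 1, "Product": "QlikView"},
    {"ProductID": 1, "Product": "QlikView (dup)"},
    {"ProductID": 2, "Product": "Qlik Sense"},
]

# ApplyMap()-style: first match wins, fact row count is preserved.
mapping = {}
for row in dim:
    mapping.setdefault(row["ProductID"], row["Product"])
mapped = [{**f, "Product": mapping.get(f["ProductID"], "-missing-")} for f in facts]

# JOIN-style: every matching dimension row produces a fact row.
joined = [{**f, "Product": d["Product"]}
          for f in facts for d in dim if d["ProductID"] == f["ProductID"]]

print(len(mapped))  # 2 -> row count unchanged
print(len(joined))  # 3 -> the duplicate key inflated the fact table
```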
With regard to RAM consumption, which seems to be one of your concerns, do the maths following The Importance Of Being Distinct.
Data is stored in QlikView's internal database (very similar to how QVDs work) in several tables but, oversimplifying, let's say two: the data tables and the symbol tables.
A symbol is each unique value of a field. If the field Product has 3 distinct values 'QlikView', 'Qlik Sense' and 'NPrinting', the (internal) symbol table will contain one column named "Product" with those 3 values. Regardless of how many times those values appear in the source data, they will be stored only once.
Let's say you are reading a 20-million-line server log that records Qlik product usage. Although there are 20 million lines, each value is stored only once, so the internal symbol table will not contain 20 million lines, but only 3.
Each distinct value is assigned a bit-stuffed pointer. Continuing with the example of the field Product and its values, the symbol table will store 3 lines with the pointers 1, 10 and 11 (base 2). These pointers are also stored in the symbol table.
For the sake of the example, then, our symbol table for "Product" looks like this:
Product, Value
QlikView, 1
Qlik Sense, 10
NPrinting, 11
Definitely far smaller than whatever the size of those 20 million source lines might be.
Then there is the data table, which stores each column as pointers to its distinct values. In terms of how QlikView stores values internally, the first 10 lines of the logs could be:
Product
1
10
10
10
1
11
10
1
1
1
Now do the equivalent with the rest of the source data feeding the QlikView app. Even for the 20-million-line data table, the size is considerably smaller than its source counterpart.
Given all the above, how does this affect which schema your data model should follow, and why is the "bad" RAM consumption shown in the slide plainly wrong?
Apply the example to that table and see how it would matter whether Product sits in a separate table (dimension) or in the fact table (logs). Either way, the data table will still hold pointers to the symbols. Why not, then, keep Product in the fact table in the first place instead of separating "Products" into a different table in the script?
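The symbol-table and pointer mechanics described above can be sketched in a few lines of Python. This is a conceptual illustration only, not the engine's actual implementation; the log values reuse the Product example from the text.

```python
from math import ceil, log2

# The 10 example log lines from the text, as raw source values.
log_lines = ["QlikView", "Qlik Sense", "Qlik Sense", "Qlik Sense", "QlikView",
             "NPrinting", "Qlik Sense", "QlikView", "QlikView", "QlikView"]

# Symbol table: each distinct value stored once, assigned a pointer.
symbols = {}
for value in log_lines:
    symbols.setdefault(value, len(symbols) + 1)

# Data table: one small pointer per source row instead of the full string.
pointers = [symbols[v] for v in log_lines]

# With 3 distinct values, each pointer needs only ceil(log2(3)) = 2 bits,
# no matter whether the log has 10 lines or 20 million.
bits_per_pointer = ceil(log2(len(symbols)))

print(symbols)           # {'QlikView': 1, 'Qlik Sense': 2, 'NPrinting': 3}
print(pointers)          # [1, 2, 2, 2, 1, 3, 2, 1, 1, 1]
print(bits_per_pointer)  # 2
```

The pointer column matches the 1/10/11 (base 2) column shown above, just written in decimal.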
Because of how your data looks. Coming from the relational world, making two tables looks like a no-brainer: easier to maintain, cleaner code, faster performance.
However, in the associative world of QlikView, ease of maintenance is something you, as the developer, will have to judge. If you are performing 20 consecutive JOINs in your script, it will definitely be hard for anyone else to understand where each field comes from, what it looked like in the first iteration and what it looks like in the resulting table.
Also, as mentioned, in QlikView some operations are more resource-intensive than others, and JOIN is definitely one of the most expensive (it uses a lot of CPU to calculate the resulting internal data table, and memory for the intermediate steps until the final table is calculated). How big are the tables you want to join?
On any small server, joining 20x20 rows probably takes exactly the same time as 20,000x20,000. But 2,000,000x200,000,000 will take much, much longer in time, CPU and RAM (during the load, not in the front-end!).
If you can code it cleanly, comment your script, and the server resources permit, I don't see any disadvantage in JOINing those tables; in fact, quite the opposite, if you value user experience over loading times.
Last but not least, let's consider again the uniqueness of your data. Data from a transactional system with dozens of columns holding 4 or more decimal places will have much higher uniqueness (meaning it will use more memory and CPU, both during the load and in the UI) than data from another transactional system with even more columns but integer values. How often can the amount "EUR 10.00" appear in your data model, and how often "EUR 112.7896"?
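The uniqueness effect is easy to quantify with a quick sketch (illustrative Python with randomly generated amounts, not real transactional data): amounts kept at 4 decimal places are almost all distinct, so the symbol table grows with the row count, while amounts rounded to whole units repeat heavily and the symbol table stays tiny.

```python
import random

random.seed(42)

# 100,000 made-up transaction amounts between 0 and 1000.
amounts = [random.uniform(0, 1000) for _ in range(100_000)]

# Distinct symbols if stored with 4 decimal places vs rounded to whole units.
precise = {round(a, 4) for a in amounts}  # e.g. "EUR 112.7896"
rounded = {round(a, 0) for a in amounts}  # e.g. "EUR 10"

print(len(precise))  # close to 100,000 -> nearly one symbol per row
print(len(rounded))  # at most 1,001 distinct symbols
```

This is why rounding measures in the script (when the lost precision is acceptable) can shrink an app dramatically.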
While dimension tables may seem worth keeping separate (because our DBA minds tend to think so), understand your data first and your environment second to decide which model you want to use and why; the closer you get to a single "big" table, the better it will be for both the load and the front-end.
And then test, test, test.
EDIT: some formatting added.
End-user performance-wise? Always. The fewer the tables, the better your model will perform.
But...
The key factor is you. As the developer, only you can decide.
And above all, that it makes sense for the application you are building: you are neither losing nor adding any data, and having an extra table in the data model is worse than not having it.
Theory says to join QlikView tables as often as possible because, eventually, the data model of your application, once expanded in the computer's memory, is nothing but one big table with pointers and another table with the values those pointers refer to.
Also, for front-end expressions, one table is better than two: set analysis and aggregation expressions have to create intermediate tables when the values come from different tables, e.g. Sales = Sum(Qty * Price), where Qty comes from the "Orders" table and Price from the "Products" table.
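The cost of that intermediate table can be sketched conceptually in Python (illustrative data; the "Orders"/"Products" names follow the example above): with two tables, every aggregation must first resolve Price through the key, row by row; with the value joined in (or a LineAmount precomputed) during the load, the aggregation is a plain scan of one column.

```python
# Two-table model vs one-table model for Sales = Sum(Qty * Price).
orders = [{"ProductID": 1, "Qty": 5}, {"ProductID": 2, "Qty": 3}]
products = {1: {"Price": 10.0}, 2: {"Price": 20.0}}

# Two tables: a per-row key lookup at aggregation time (front-end cost,
# paid again on every selection).
sales_two_tables = sum(o["Qty"] * products[o["ProductID"]]["Price"]
                       for o in orders)

# One table: Price resolved once during the load, so the front-end only
# scans a single precomputed column.
denormalized = [{**o, "LineAmount": o["Qty"] * products[o["ProductID"]]["Price"]}
                for o in orders]
sales_one_table = sum(r["LineAmount"] for r in denormalized)

print(sales_two_tables)  # 110.0
print(sales_one_table)   # 110.0 -> same answer, cheaper per aggregation
```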
Once again, this is not always possible, and sometimes it does not even make sense, because you want two separate tables for a reason, like a link table. Or simply because the server cannot handle it.
It depends on the data:
It is possible to join tables already in the script. The QlikView logic will then not see the separate tables, but rather the result of the join, which is a single internal table. In some situations this is needed, but there are disadvantages:
The loaded tables often become larger, and QlikView works slower.
Some information may be lost: the frequency (number of records) within the original table may no longer be available.
It usually is better to keep the data model as normalized as possible. A normalized model has many advantages:
It is memory efficient. It is, by definition, the data model that uses least memory.
It is CPU efficient. In most cases, QlikView calculations in a normalized model are as efficient as in a denormalized model, or only marginally slower. In some cases the normalized model is faster.
It is easier to understand and manage. It should be possible for other developers to read your script: A simple script with as few transformations as possible, is a script that is easy for other developers to understand and maintain.
It minimizes the risk for incorrect calculations. Joins potentially change the number of records in the tables, which means that a normal Sum() or Count() function cannot always be used – they would sometimes return an incorrect result. You may counter that there is always a way to write a correct formula, but my point is that it should also be easy. Expressions in server objects will be written by users that do not have special knowledge about the data model in the app.
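The miscalculation risk in the last point can be demonstrated with a small Python sketch (illustrative data: a hypothetical freight charge stored at order-header level): joining a header-level value onto a line-level table duplicates it, so a plain sum over-counts.

```python
# One order with a single 100.0 freight charge, and two order lines.
order_headers = [{"OrderID": 1, "Freight": 100.0}]
order_lines = [{"OrderID": 1, "Line": 1}, {"OrderID": 1, "Line": 2}]

# JOIN the header onto the lines: Freight now appears once per line.
joined = [{**line, "Freight": h["Freight"]}
          for line in order_lines
          for h in order_headers if h["OrderID"] == line["OrderID"]]

naive_freight = sum(r["Freight"] for r in joined)        # 200.0 -> doubled
correct_freight = sum(h["Freight"] for h in order_headers)  # 100.0
print(naive_freight, correct_freight)
```

A user writing a simple Sum(Freight) against the joined table would get the inflated figure without any warning, which is exactly why the record-count argument matters.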
I have to disagree with some of your statements above:
[1] A joined table is the result of two or more tables initially loaded separately, which can be combined using JOIN LOAD, ApplyMap() (I'm usually inclined towards the latter) or other techniques.
Hi mbaeyens
The logic tells me that you are right, but I was using the post Star & snowflake schema as guidance, and in it engishfaque posted a very useful picture of Snowflake vs Star Schema vs Single Table. Please find below the picture I'm referring to.
The star schema seems to be better in RAM consumption, and I wonder whether you are taking this point into account, or whether you think it might not be correct. Could you please give me your opinion on this?
Thank you very much for the help, Miguel Angel. Best regards,
Pedro.