Qlik Community

QlikView Deployment

Discussion Board for collaboration related to QlikView Deployment.

Contributor II

When is it worth joining a dimension to the fact table?

Hello everyone

I have a question that has been bugging my mind. What factors decide whether it is acceptable to join one or more dimension tables to the fact table? I think I know some, but I would appreciate a more well-founded opinion.

Thanks,

Pedro.

1 Solution

Accepted Solutions

This is an old slide we used to use for QlikView 11, and it mostly still applies to QlikView 12.20 as it stands today.

In QlikView 12, the biggest change has been the introduction of the newer QIX engine (the one Qlik Sense has used since the beginning), which stores values in a more columnar way than the associative engine used in QlikView 11. This affects resource consumption overall, both storing/writing and retrieving/reading.

My main statement prevails: do whatever makes sense for your data. I have applications reloading every 8 minutes, and other apps reloading once a day. Obviously, I cannot apply the same script to the former and the latter, even if all the best practices in the world say so. The scripts that run every 8 minutes need to be fast, even if that means doing less transformation in the script and doing more in the front-end expressions.

However, for the task that runs daily, I can afford to do some more transformations and spare the front-end that workload.

In addition, experience will tell you when to use ApplyMap() instead of JOIN, even if JOIN seems the more straightforward technical feature.
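As a minimal sketch of the two approaches (the table, file and field names here are hypothetical, just for illustration):

```qlikview
// Option 1: LEFT JOIN — merges the dimension into the fact table.
// Costs CPU and RAM for the intermediate result during the reload,
// and can multiply fact rows if ProductID is not unique in Products.
Facts:
LOAD OrderID, ProductID, Qty FROM Orders.qvd (qvd);

LEFT JOIN (Facts)
LOAD ProductID, ProductName FROM Products.qvd (qvd);

// Option 2: ApplyMap() — a lightweight lookup while loading the facts;
// no intermediate table, and the fact row count can never change.
ProductMap:
MAPPING LOAD ProductID, ProductName FROM Products.qvd (qvd);

Facts:
LOAD OrderID, ProductID, Qty,
     ApplyMap('ProductMap', ProductID, 'Unknown') AS ProductName
FROM Orders.qvd (qvd);
```

Both end with a single fact table; the difference is the cost and the safety of getting there.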

With regard to RAM consumption, which seems to be one of your concerns, do the math following The Importance Of Being Distinct.

Data is stored in QlikView's internal database (very similar to how QVDs work) in several tables but, oversimplifying, let's say two kinds: data tables and symbol tables.

A symbol is each unique value of a field. If the field Product has 3 distinct values, 'QlikView', 'Qlik Sense' and 'NPrinting', the (internal) symbol table will contain one column named "Product" with those 3 values. Regardless of how many times those values appear in the source data, they will be stored only once.

Let's say you are reading from a 20-million-line server log which stores Qlik product usage. Although there are 20 million lines, each value is stored only once, so the internal symbol table will not contain 20 million lines, but only 3.

Each distinct value is assigned a bit-stuffed pointer. Following the example of the field Product and its values, the symbol table will store 3 lines with the pointer values 1, 10 and 11 (base 2). These pointers are stored in the symbol table alongside the values.

For the sake of the example then, our symbol table for "Product" looks like

Product, Value
QlikView, 1
Qlik Sense, 10
NPrinting, 11

Definitely far smaller than whatever the size of those 20 million source lines might be.

Then there is the data table, which stores each column as pointers to its distinct values. The first 10 lines of the log, in terms of how QlikView stores values internally, could look like:

Product
1
10
10
10
1
11
10
1
1
1

Now do the equivalent with the rest of the source data feeding the QlikView app. Even at 20 million rows, the data table is considerably smaller than its source counterpart.
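Following the numbers above, a quick back-of-the-envelope sketch of that data-table cost (the variable names are made up; the formula just applies the bit-stuffed-pointer idea):

```qlikview
// Bits per pointer: enough to address every distinct value of the field.
LET vDistinctValues = 3;                                        // QlikView, Qlik Sense, NPrinting
LET vBitsPerPointer = Ceil(Log($(vDistinctValues)) / Log(2));   // ceil(log2(3)) = 2 bits

// Data-table cost of the Product column for the whole log:
LET vRows           = 20000000;
LET vColumnBytes    = Ceil($(vRows) * $(vBitsPerPointer) / 8);  // 5,000,000 bytes, i.e. ~5 MB

TRACE Product column needs ~$(vColumnBytes) bytes in the data table;
```

Roughly 5 MB for the Product column of a 20-million-line log, versus whatever the raw log text weighs.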

How does all this affect what schema your data model should have, and why, given all the above, is the "bad" RAM-consumption verdict in the slide plainly wrong?

Apply the example to that table, and see how it would affect things to have Product in a separate table (dimension) versus in the fact table (logs). Either way, you will still have pointers to the symbols in the data table. Why not, then, keep "Product" in the fact table in the first place instead of separating it into a different table in the script?

Because of how your data looks. Coming from the relational world, making two tables looks like a no-brainer: easier to maintain, cleaner code, faster performance.

However, in the associative world of QlikView, ease of maintenance is something you, as the developer, have to weigh. If you are performing 20 consecutive JOINs in your script, it will definitely be hard for anyone else to understand where each field comes from, what it looked like in the first iteration and what it looks like in the resulting table.

Also, as mentioned, in QlikView some operations are more resource intensive than others, and JOIN is definitely one of the most expensive (it uses a lot of CPU to calculate the resulting internal data table, and a lot of memory for the intermediate steps until the final table is calculated). How big are the tables you want to join?

On any small server, joining 20x20 rows probably takes about the same time to compute as 20000x20000. But 2000000x200000000 will take much, much longer in time, CPU and RAM (during the load, not in the front-end!).

If you can code it cleanly, comment your script, and the server resources permit, I don't see a disadvantage to JOINing those tables; in fact, quite the opposite, if you value the user experience over loading times.

Last but not least, let's consider again the uniqueness of your data. Data from a transactional system with dozens of columns holding 4 or more decimal places will have much higher uniqueness (meaning it will use more memory and CPU, both during loading and in the UI) than data from another transactional system with even more columns but integer values. How often can the amount "EUR 10.00" appear in your data model, versus the amount "EUR 112.7896"?
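If the app does not need the full precision, a common way to cut that uniqueness down is rounding in the script (field and file names are assumptions for illustration):

```qlikview
// Rounding 4-decimal amounts to 2 decimals collapses many distinct
// values into one symbol, shrinking the symbol table.
Transactions:
LOAD TransactionID,
     Round(Amount, 0.01) AS Amount    // e.g. 112.7896 -> 112.79
FROM Transactions.qvd (qvd);
```

The trade-off, of course, is losing precision you might later need, so this only makes sense when the extra decimals carry no business meaning.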

While dimension tables may seem worth keeping separate (because our DBA minds tend to think so), understand your data first and your environment second to decide which model you want to use and why. The closer you get to a single "big" table, the better it will be both for

  • your developers, who get to write cleaner and simpler expressions,
  • and your users, who get to consume the data in a performant way.

And then test, test, test.

EDIT: some formatting added.


6 Replies

End-user performance-wise? Always. The fewer the tables, the better your model will perform.

But...

The key factor is you. As the developer, only you can decide:

  • whether it is doable at all (e.g. given the data, and that JOIN is very CPU- and RAM-intensive, does the server have the resources to do that join?)
  • or whether you can do an ApplyMap() instead
  • or whether the extra time it takes still lets the rest of the tasks run fine, and the size of the resulting application is okay,
  • and that the script stays as readable and maintainable as it was before, etc.

And above all, that it makes sense for the application you are building: that you are not losing or adding any data, and that having the extra table in the data model would be worse than not having it.

Theory says to join QlikView tables as often as possible because, eventually, the data model of your application, when expanded in memory, is nothing more than a big table with pointers and other tables with the values those pointers reference.

Also for the front-end expressions, one table is better than two: set analysis and aggregation expressions have to create intermediate tables when values come from different tables, e.g. Sales = Sum(Qty * Price), where Qty comes from the "Orders" table and Price from the "Products" table.
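One way to avoid that intermediate table is to precalculate the line amount in the script, so the chart expression aggregates over a single table (a sketch; the table and field names are hypothetical):

```qlikview
// Look up the unit price while loading the facts, and store the
// precalculated line amount in the fact table itself.
PriceMap:
MAPPING LOAD ProductID, Price FROM Products.qvd (qvd);

Orders:
LOAD OrderID, ProductID, Qty,
     Qty * ApplyMap('PriceMap', ProductID, 0) AS LineAmount
FROM Orders.qvd (qvd);

// The chart expression then becomes simply: Sum(LineAmount)
```

This shifts the multiplication from every chart recalculation to a one-off cost at reload time.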

Once again: this is not always possible, and sometimes it does not even make sense, because you want two separate tables for a reason (like a link table), or because the server simply cannot handle it.


It depends on the data:

It is possible to join tables already in the script. The QlikView logic will then not see the separate tables, but rather the result of the join, which is a single internal table. In some situations this is needed, but there are disadvantages:

    The loaded tables often become larger, and QlikView works slower.

    Some information may be lost: the frequency (number of records) within the original table may no longer be available.

It usually is better to keep the data model as normalized as possible. A normalized model has many advantages:

    It is memory efficient. It is, by definition, the data model that uses the least memory.

    It is CPU efficient. In most cases, QlikView calculations in a normalized model are as efficient as - or only marginally slower than - in a denormalized model. In some cases the normalized model is faster.

    It is easier to understand and manage. It should be possible for other developers to read your script: A simple script with as few transformations as possible, is a script that is easy for other developers to understand and maintain.

    It minimizes the risk of incorrect calculations. Joins potentially change the number of records in the tables, which means that a normal Sum() or Count() function cannot always be used – it would sometimes return an incorrect result. You may counter that there is always a way to write a correct formula, but my point is that it should also be easy. Expressions in server objects will be written by users who do not have special knowledge of the data model in the app.


I have to disagree with some of your statements above:

  • QlikView does not work slower with bigger tables, especially in QlikView 12 and higher with the introduction of the new QIX engine, where it's actually rather the opposite. As usual, it's a trade-off between how big the table is and how far away the table is: one fact table and one dimension table probably makes no big difference, if any. Several fact tables and several levels of dimension tables? Every "hop" between tables costs more than the joined [1] equivalent.
  • The normalized model may be best for most relational database engines, but not for QlikView, which is not relational. The pointers will have exactly the same length whether the data model has 1 table or 20. Splitting fields in a table, however (like date and time from a timestamp, dial code-area code-subscriber code in a phone number, or title-last name-first name in a full name field), is a good idea and is recommended, as far as development and maintenance permit.
  • CPU efficiency depends on how fast the values in the data model can be retrieved, and operating on a single table is always faster than operating on several. As usual, in a very small model with a few hundred lines, the difference would be negligible.

[1] A joined table is the result of two or more tables, initially loaded separately, being combined using JOIN LOAD, ApplyMap() (I usually lean towards the latter) or other techniques.

Contributor II

Hi mbaeyens

The logic tells me that you are right, but I was using the post Star & snowflake schema as guidance, and in it engishfaque posted a very useful picture of Snowflake vs Star Schema vs Single Table. Please find below the picture I'm referring to:

Schema.png

The star schema seems to be better in RAM consumption, and I wonder if you are taking this point into account, or if you think it might not be correct. Could you please give me your opinion on this?

Thank you very much for the help, Miguel Angel. Best regards,

Pedro.


Contributor II

Hello mbaeyens,

I found your answer very interesting, and the post from hic regarding symbol tables and bit-stuffed pointers is also very good and very clear.

You just gave me the knowledge I needed in this regard, so I will mark your answer as correct.

Thank you very much

Pedro.