hic
Former Employee

 

In the QlikCommunity forum I have often seen people claim that you should minimize the number of hops in your Qlik data model in order to get the best performance.

I claim that this recommendation is not (always) correct.

In most cases, you do not need to minimize the number of hops, since it affects performance only marginally. This post will try to explain when an additional table will significantly affect performance and when it will not.

The problem is which data model to choose:

Data Models Hops.png

 

The question is: Should you normalize and have many tables, with several hops between the dimension table and the fact table? Or should you join the tables to remove hops?

So, I ran a test where I measured the calculation time of a pivot table calculating a simple sum over a large fact table, using a low-cardinality dimension, while varying the number of hops between the two. The graph below shows the result. I ran two series of tests: one where the cardinality of the dimension tables changed by a factor of 10 for each table, and one where it changed by a factor of 2.
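
As an illustration of the test setup, the chain of hops could be generated with a load script along the following lines. This is only a simplified sketch, not the exact script used in the test; table names, field names and row counts are chosen purely for illustration.

    // A fact table linking to a chain of dimension tables, one hop per table.
    // Each dimension table is 10x smaller than the previous one (the 10x series).
    Facts:
    Load
        Ceil(Rand() * 100000)         as Dim1Key,    // link to the first dimension table
        Rand() * 100                  as Amount
    AutoGenerate 10000000;                           // large fact table

    Dim1:
    Load
        RecNo()                       as Dim1Key,
        Ceil(RecNo() / 10)            as Dim2Key
    AutoGenerate 100000;

    Dim2:
    Load
        RecNo()                       as Dim2Key,
        Ceil(RecNo() / 10)            as Dim3Key
    AutoGenerate 10000;

    Dim3:
    Load
        RecNo()                       as Dim3Key,
        'Group ' & Ceil(RecNo() / 10) as Category    // low-cardinality field used in the chart
    AutoGenerate 1000;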

CalcTime vs NoOfHops.png

 

You can clearly see that the performance is not affected at all by the number of hops – at least not between 0 and 3 hops.

At 4 hops, however, the calculation time in the 10x series starts to increase slightly, and at 5 hops it has increased considerably. But this is not due to the number of hops. Instead, it is the result of the primary dimension table (the dimension table closest to the fact table) getting large: at 5 hops it has 100.000 records and can no longer be regarded as a small table.

To show this, I made a second test: I measured the calculation time of the same pivot table using a fixed three-table data model, varying the number of records in the intermediate table while keeping the sizes of the other tables unchanged.

Data Model Intermediate table.png

 

In real life, this structure would correspond to a part of a more complex data model, e.g.

  • Facts  -  Products  -  Product Groups
  • Order Lines  -  Order Headers  -  Customers
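
For the first of these, the snowflake variant could be loaded roughly like this (a sketch; file and field names are illustrative only):

    // Snowflake: Products and Product Groups kept as separate dimension tables.
    Facts:
    Load OrderID, ProductID, Amount
    From Facts.qvd (qvd);

    Products:
    Load ProductID, ProductName, ProductGroupID
    From Products.qvd (qvd);

    ProductGroups:
    Load ProductGroupID, ProductGroupName
    From ProductGroups.qvd (qvd);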

The result of my measurement can be seen in the red bars below:

 

CalcTime vs NoOfRecords Join.png

 

The graph confirms that the size of the intermediate table is the sensitive point: if it has 10.000 records or fewer, its existence hardly affects performance. But if it is larger, you get a performance hit.

I also measured the calculation times after joining the intermediate table, first to the left with the fact table and then to the right with the dimension table, to see whether the calculation times decreased (blue and green bars). You can see that joining tables with 10.000 records or fewer does not change the performance. But if you have larger tables, a join with the fact table may be a good idea.
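
In script terms, the join to the left means merging the intermediate table into the fact table, roughly as below (a sketch with illustrative names; the join to the right would instead use Join (Dim) to merge the intermediate table into the dimension table):

    Facts:
    Load FactID, IntermediateKey, Amount
    From Facts.qvd (qvd);

    // Merge the intermediate table into the fact table ("join to the left").
    Join (Facts)
    Load IntermediateKey, DimKey
    From Intermediate.qvd (qvd);

    Dim:
    Load DimKey, DimName
    From Dim.qvd (qvd);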

Conclusions:

  • The number of hops does not always cause significant performance problems in the chart calculation. But a large intermediate table will.
  • If you have both a primary and a secondary dimension (e.g. Products and Product Groups), you should probably not join them. Leave the data model as a snowflake.
  • If you have the facts in two large tables (e.g. Order Lines and Order Headers), you should probably join them into one common transaction table.
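
For the last case, joining the two fact tables into one common transaction table could look like this (a sketch; field and file names are illustrative only):

    // Order Headers joined into Order Lines to form one transaction table.
    Transactions:
    Load OrderID, LineNo, ProductID, Quantity, Amount
    From OrderLines.qvd (qvd);

    Left Join (Transactions)
    Load OrderID, OrderDate, CustomerID
    From OrderHeaders.qvd (qvd);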

HIC

 

PS. A couple of disclaimers:

  1. The above study only concerns the chart calculation time, which is usually the main part of the response time.
  2. If the expression inside your aggregation function contains fields from different tables, none of the above is true.
  3. Your data is different from mine. You may get slightly different results.

 

Further reading related to this topic:

To Join or not to Join

37 Comments
rbecher
MVP

Great post, thanks Henric!

How about having two large independent fact tables loosely coupled by a link table, and some dimensions?

Not applicable

Much appreciated post HIC. Informative and clear as always.

r_wroblewski
Partner - Creator III

Hi Henric,

Thanks for this great post.

afurtado
Partner Ambassador/MVP

Hi Henric.

Thanks.

anderseriksson
Partner - Specialist

My thought too.

Two fact tables with a common key table is a common practice.

With the above findings, it sounds like this would carry a major performance penalty?

Not applicable

Hi Henric,

Very good explanation!

Clear and precise.

Thanks!

hic
Former Employee

Anders and Ralf

A (large) link table does indeed mean a performance penalty. In my experience it is better to use a concatenated fact table and generic keys for the dimensions.
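
For example, the two fact tables could be concatenated like this (a simplified sketch; field and file names are illustrative only, and the generic keys for the dimensions are a separate step not shown here):

    // Concatenate the two fact tables instead of linking them with a link table.
    Facts:
    Load OrderID, CustomerID, OrderDate, Amount, 'Sales' as FactType
    From Orders.qvd (qvd);

    Concatenate (Facts)
    Load BudgetMonth, CustomerID, BudgetAmount, 'Budget' as FactType
    From Budget.qvd (qvd);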

HIC

rbecher
MVP

Thanks, this was also my impression and I followed this approach.

shelvinpatel
Contributor III

It is also important to check the keys between the tables, whether they are loose keys or primary/perfect keys, especially when you have multiple fact tables. The ideal scenario above is very relevant for a perfect data set with a hierarchy within the dimensions and even between the fact tables. Otherwise you will end up with Cartesian joins, which adversely impact the memory footprint.

Clever_Anjos
Employee

hic, your blog posts are always inspiring and elucidative. Thank you for sharing your knowledge.

Your posts always keep me rethinking my work in QV.

“The simple things are also the most extraordinary things, and only the wise can see them.” P. Coelho
