Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Employee

A Myth about the Number of Hops

In the QlikCommunity forum I have often seen people claim that you should minimize the number of hops in your Qlik data model in order to get the best performance.

I claim that this recommendation is not (always) correct.

In most cases, you do not need to minimize the number of hops, since it affects performance only marginally. This post will try to explain when an additional table will significantly affect performance and when it will not.

The problem is which data model to choose:

[Image: Data Models Hops.png]

The question is: Should you normalize and have many tables, with several hops between the dimension table and the fact table? Or should you join the tables to remove hops?
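
To make the two alternatives concrete, here is a minimal load-script sketch of the snowflake variant, with a commented-out join that would remove one hop. All table, field, and file names (Facts.qvd, Products.qvd, ProductGroups.qvd) are hypothetical and chosen only for illustration; they are not the tables used in the test below.

    // Snowflake: Facts - Products - ProductGroups (two hops from ProductGroups to Facts)
    Facts:
    LOAD TransactionID,
         ProductID,
         Amount
    FROM [Facts.qvd] (qvd);            // hypothetical QVD

    Products:
    LOAD ProductID,
         ProductGroupID,
         ProductName
    FROM [Products.qvd] (qvd);         // hypothetical QVD

    ProductGroups:
    LOAD ProductGroupID,
         ProductGroupName
    FROM [ProductGroups.qvd] (qvd);    // hypothetical QVD

    // To remove one hop, skip the ProductGroups load above and instead
    // join the outer dimension into the primary dimension:
    // LEFT JOIN (Products)
    // LOAD ProductGroupID,
    //      ProductGroupName
    // FROM [ProductGroups.qvd] (qvd);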

So, I ran a test where I measured the calculation time of a pivot table calculating a simple sum in a large fact table and using a low-cardinality dimension, while varying the number of hops between the two. The graph below shows the result. I ran two series of tests: one where the cardinality of the dimensional tables changed by a factor of 10 for each table, and one where it changed by a factor of 2.

[Image: CalcTime vs NoOfHops.png]

You can clearly see that the performance is not affected at all by the number of hops – at least not between 0 and 3 hops.

At 4 hops, however, the calculation time in the 10x series starts to increase slightly, and at 5 hops it has increased considerably. But this is not due to the number of hops. Instead, it is the result of the primary dimension table (the dimension table closest to the fact table) getting large: at 5 hops it has 100,000 records and can no longer be regarded as a small table.

To show this, I made a second test: I measured the calculation time of the same pivot table using a fixed three-table data model, varying the number of records in the intermediate table while keeping the sizes of the other tables constant.

[Image: Data Model Intermediate table.png]

In real life, this structure would correspond to a part of a more complex data model, e.g.

  • Facts  -  Products  -  Product Groups
  • Order Lines  -  Order Headers  -  Customers

The result of my measurement can be seen in the red bars below:

[Image: CalcTime vs NoOfRecords Join.png]

The graph confirms that the size of the intermediate table is a sensitive point: If it has 10,000 records or fewer, its existence hardly affects performance. But if it is larger, you get a performance hit.

I also measured the calculation times after joining the intermediate table, first to the left with the fact table, and then to the right with the dimension table, to see whether the calculation times decreased (blue and green bars). You can see that joining tables with 10,000 records or fewer does not change the performance. But if you have larger tables, a join with the fact table may be a good idea.
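
In script terms, the "join to the left" variant corresponds roughly to the sketch below: the intermediate dimension is merged into the fact table, so the chart only has to traverse one link instead of two. The table, field, and file names are hypothetical.

    Facts:
    LOAD TransactionID,
         DimID,
         Amount
    FROM [Facts.qvd] (qvd);

    // Merge the intermediate table into the fact table ("join to the left")
    LEFT JOIN (Facts)
    LOAD DimID,
         OuterDimID                    // key to the remaining low-cardinality dimension
    FROM [IntermediateDim.qvd] (qvd);

    // The low-cardinality dimension stays as a separate table
    OuterDim:
    LOAD OuterDimID,
         OuterDimValue
    FROM [OuterDim.qvd] (qvd);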

Conclusions:

  • The number of hops does not always cause significant performance problems in the chart calculation. But a large intermediate table will.
  • If you have both a primary and a secondary dimension (e.g. Products and Product Groups), you should probably not join them. Leave the data model as a snowflake.
  • If you have the facts in two large tables (e.g. Order Lines and Order Headers), you should probably join them into one common transaction table.


HIC


PS. A couple of disclaimers:

  1. The above study only concerns the chart calculation time, which is usually the main part of the response time.
  2. If the expression inside your aggregation function contains fields from different tables, none of the above is true.
  3. Your data is different from mine. You may get slightly different results.


Further reading related to this topic:

To Join or not to Join

37 Comments
MCampestrini
Valued Contributor

Hi Henric

Great post. Thanks for sharing your thoughts.

Márcio

datanibbler
Esteemed Contributor

Ditto! Great post!

I would only like to add that in a big company, where you are probably not always going to be the only one looking after QlikView, keeping the tables in your data model recognizable, so that one can see what comes from where and what goes into which chart, may be equally important. You may have to give up some performance optimization in order to keep things understandable and manageable.

Best regards,

DataNibbler

kuba_michalik
Contributor III

Very interesting, thanks!

Regarding disclaimer 2, my intuition is that having fields from different tables inside one aggregation function is (as a rule of thumb) a bad idea and a performance killer (as QlikView would have to conjure up a potentially huge "virtual join" before aggregating). Does that sound about right?


Nice Post Henric

Employee

Yes, it can be a performance killer for exactly the reason you mention.

But - I would still not rule it out altogether. The performance is often better than you'd expect, and sometimes it's a bad idea to move more fields to the fact table (which is the only alternative). So, my advice is to try it before you start optimizing.

HIC
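
As an illustration of the alternative mentioned above (moving a field into the fact table so the aggregation only touches one table), here is a minimal sketch using a mapping load. The names (OrderLines, Products, UnitPrice, etc.) are hypothetical, and a join would work just as well.

    // Expression spanning two tables, evaluated across the link at chart time:
    //    Sum(Quantity * UnitPrice)     // Quantity in OrderLines, UnitPrice in Products

    // Alternative: bring UnitPrice into the fact table in the script,
    // so the expression only uses fields from one table.
    PriceMap:
    Mapping LOAD ProductID,
         UnitPrice
    FROM [Products.qvd] (qvd);

    OrderLines:
    LOAD OrderID,
         ProductID,
         Quantity,
         ApplyMap('PriceMap', ProductID, 0) AS UnitPrice
    FROM [OrderLines.qvd] (qvd);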

Not applicable

Hi Henric,

Great Post.

jjordaan
Valued Contributor

Hi Henric,

Thanks for a nice and clear explanation

mikkeltaylor
Contributor II

Hi Henric,

Very informative post as usual. Where there is no impact on metrics, we have seen a good performance boost from merging a large dimension table into a large fact table.

How would this work where you are summing a counter field in the fact table, though? For example, we have a large "accounts" table joined to a large "Customers" table (a customer can have one or more accounts). Most charts show a count of customers, which is at the detail level of the dimension table.

Regards,

Michael

Employee

A good performance boost when merging a large dimension table into a large fact table is expected. Keep doing that.

On the accounts table: I assume that there is a third table also, containing transactions (one account can have several transactions). If so, I would consider merging the account data into the transaction table, and using a Count(distinct AccountID) as measure. But again - it depends on how large the accounts table is.

HIC
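
A sketch of what that merge could look like in the script, with hypothetical table and field names; the chart measure would then be a distinct count on the merged key.

    Transactions:
    LOAD TransactionID,
         AccountID,
         Amount
    FROM [Transactions.qvd] (qvd);

    // Merge the account attributes (including CustomerID) into the transaction table
    LEFT JOIN (Transactions)
    LOAD AccountID,
         CustomerID,
         AccountType
    FROM [Accounts.qvd] (qvd);

    // Possible chart measures after the merge:
    //    Count(distinct AccountID)
    //    Count(distinct CustomerID)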

Not applicable

Great post Hic - thanks again

MVP

Great post, thanks Henric!

How about having two large independent fact tables loosely coupled by a link table, and some dimensions?

Not applicable

Much appreciated post HIC. Informative and clear as always.

r_wroblewski
Contributor III

Hi Henric

thanks for this great post.

MVP

Hi Henric.

Thanks.

anderseriksson
Contributor III

My thought too.

Two fact tables with a common key table is a common practice.

With the above findings, it sounds like this would mean a major performance penalty?

Not applicable

Hi Henric,

Very good explanation!

Clear and precise.

Thanks!

Employee

Anders and Ralf

A (large) link table does indeed mean a performance penalty. In my experience it is better to use a concatenated fact table and generic keys for the dimensions.

HIC
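
A minimal sketch of a concatenated fact table, assuming two hypothetical fact sources (Orders and Invoices); the generic-key technique for dimensions that apply to only one record type is not shown here.

    Facts:
    LOAD OrderID,
         CustomerID,
         OrderDate        AS Date,
         Amount,
         'Order'          AS RecordType
    FROM [Orders.qvd] (qvd);

    Concatenate (Facts)
    LOAD InvoiceID,
         CustomerID,
         InvoiceDate      AS Date,
         InvoicedAmount   AS Amount,
         'Invoice'        AS RecordType
    FROM [Invoices.qvd] (qvd);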

MVP

Thanks, this was also my impression and I followed this approach.

shelvinpatel
New Contributor III

It is also important to check whether the keys between the tables are loose keys or primary/perfect keys, especially when you have multiple fact tables. The ideal scenario above is very relevant for a perfect data set with hierarchies within dimensions and even between the fact tables. Otherwise you will end up with Cartesian joins, which adversely impact the memory footprint.

Employee

hic, your blog posts are always inspiring and elucidative. Thank you for sharing your knowledge.

Your posts always keep me rethinking the way I work in QV.

“The simple things are also the most extraordinary things, and only the wise can see them.” P. Coelho

vgutkovsky
Honored Contributor II

Henric, I wrote an article about this a couple of years ago: http://www.infinityinsight.com/blog/?p=164. I noticed a greater improvement from denormalizing, even with low cardinality. Any thoughts on why you and I got different results? Is it purely a difference in testing methodology, or is there some data difference that's key?

Vlad

cabhijit
Valued Contributor

Thanks Henric Cronström for this valuable post.

Number of hops is no more a myth for me now.

jcarpenter9
Contributor

It isn't just in the Qlik Community forums that one hears this. Students in the QlikView Developer class also hear it, perhaps not in so many words, but in the repeated exercises to collapse and de-normalize the data model, and in the optimization chapter. This post and Henric's test results underscore the need to update the Developer training materials to reflect the new reality (or maybe, the reality that has been there all along). Perhaps QlikView did perform better in earlier years with fewer table hops...smaller, slower servers...earlier versions of QV...but this myth won't die easily unless we Certified Trainers receive updated information to teach.

As it is, I include several references to this blog when teaching QlikView Developer, pointing out where the current recommended practice disagrees with the course materials.

- James

chiragpradhan
Contributor

Brilliant! Just what I was looking for! Thanks for sharing.


Good comments James. As a fellow instructor, I share your frustration at the Dev curriculum not reflecting current understanding of how QV works. I'm hoping for an overhaul soon.

Great post HIC!

chiragpradhan
Contributor

Hi Henric, I would really appreciate it if you could post the method for calculating chart response/calculation times, or perhaps the QVW?

Employee

Vlad

There are many factors that can influence this type of measurement. In my measurement I kept as much as possible constant, but still noticed that response times could vary ±30% (and more for the fast charts) from one series to another, although I made exactly the same sequence of selections on exactly the same data set. The only change was a restart of QlikView. So, yes, the testing methodology matters. As does the cardinality of the data.

Finally, I would expect a denormalized model to always be slightly more performant, so your results are reasonable. But the question is: Is the denormalized model significantly better? And is it worth the trouble and the loss of manageability? In your test, your worst chart (CH05) decreased from 78ms to 60ms. This is not enough to denormalize, as I see it.

HIC

Employee

James, Rob

Our training material has emerged from the empirical experience of a large number of implementations. I admit that we may not always have understood the reasons why things behave the way they do. So when we have formulated our experience in words, we have sometimes oversimplified things. The recommendation for a star schema was probably an oversimplification, since no one made a systematic test (until last week).

The training material will most likely be changed.

However, I think that the recommendation is still a good one. A simplification, yes, but a star schema is performant and reasonably manageable. It is a good compromise. The goal of my blog post was to give a deeper understanding for those who want a deeper understanding, which is far from everyone. So I think that this recommendation has been useful.

HIC

Employee

Chirag

See Recipe for a Memory Statistics analysis. In this app, there is a field, CalcTime, which is the approximate calculation time for the objects; that is what I use.

However, you need to clear the cache before doing it. See http://community.qlik.com/blogs/qlikviewdesignblog/2014/04/14/the-qlikview-cache#comment-16263

HIC

vgutkovsky
Honored Contributor II

Henric, I completely agree. I wrote that article more as an example of what can be accomplished by taking denormalization to extremes. As you correctly point out though, it would be a nightmare to maintain such an application. Definitely not advocating that.

Good post as always, love reading your stuff!

Vlad
