Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Employee

To Join or not to Join

The QlikView internal logic enables a data model with several associated tables. It not only allows you to use several tables when building a data model – it encourages you to.

This is very different from many other BI or query tools where, when several tables are used, they are all joined together into one table. The most obvious example of this difference is a simple SELECT statement. With it, you can use several tables as input and join them, but the output is always one single, denormalized table.

[Image: a single denormalized table (Denormalized2.png)]

With QlikView, in contrast, you can have a multi-table relational data model that is evaluated in real-time. The associations are evaluated as joins at the moment when the user makes a selection in the application. At the same time, all objects, some with complex calculations based on these joins, are recalculated.

[Image: a normalized, multi-table data model (Normalized.png)]

When creating the QlikView data model, you have a choice of loading the tables as several entities or joining some of them together. Joining in the script means that the result of the join is stored in the QlikView data model as one single table.
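
As a minimal sketch (table, field, and file names here are hypothetical), the two alternatives could look like this in the script:

    // Alternative 1: load the tables as separate entities. QlikView
    // associates them automatically through the common field OrderID.
    Orders:
    LOAD OrderID, OrderDate, CustomerID FROM Orders.qvd (qvd);

    OrderLines:
    LOAD OrderID, Product, Quantity FROM OrderLines.qvd (qvd);

    // Alternative 2 (instead of the above): join in the script, so
    // that one single, wider table is stored in the data model.
    // Orders:
    // LOAD OrderID, OrderDate, CustomerID FROM Orders.qvd (qvd);
    // Join (Orders)
    // LOAD OrderID, Product, Quantity FROM OrderLines.qvd (qvd);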

So what should you do? Is it better to keep the data model normalized (many tables) or is it better to de-normalize (fewer tables)?

My view is that it usually is better to keep the data model as normalized as possible. A normalized model has many advantages:

  • It is memory efficient. It is, by definition, the data model that uses the least memory.
  • It is CPU efficient. In most cases, QlikView calculations in a normalized model are as efficient as in a denormalized model, or only marginally slower. In some cases the normalized model is faster.
  • It is easier to understand and manage. It should be possible for other developers to read your script: a simple script with as few transformations as possible is a script that is easy for other developers to understand and maintain.
  • It minimizes the risk of incorrect calculations. Joins potentially change the number of records in the tables, which means that a normal Sum() or Count() function cannot always be used; they would sometimes return an incorrect result. You may counter that there is always a way to write a correct formula, but my point is that it should also be easy. Expressions in server objects will be written by users who do not have special knowledge about the data model in the app. (See the sketch below.)
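
To illustrate the last point with a hypothetical example (the names are made up): if order headers are joined onto their order lines, every header-level number is repeated once per line, and a plain Sum() over-counts.

    // Orders has one record per order; Freight is a header-level number.
    Orders:
    LOAD OrderID, Freight FROM Orders.qvd (qvd);

    // After this join, Orders has one record per order line, and
    // Freight is repeated on every line of the order:
    Join (Orders)
    LOAD OrderID, Product, Quantity FROM OrderLines.qvd (qvd);

    // In a chart, Sum(Freight) now adds Freight once per line instead
    // of once per order. A correct formula exists, for example
    // Sum(Aggr(Only(Freight), OrderID)), but it is easy to get wrong.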

But it is not a clear-cut case.

Often there is a trade-off between memory efficiency and CPU efficiency. In other words, there are cases where you can decrease response time by letting the data model use more memory; where performance will be better if you make the join in the script.

[Image: a denormalized data model (Denormalized.png)]

One such case is if you have two very large fact tables, like Order Headers and Order Details. Another is if you have chart expressions containing fields from different tables. Then QlikView has to perform the join in memory, generating a virtual table over which the summation is made. This can be both memory and CPU demanding, so you might get better performance if you have made the join already in the script. But the difference is sometimes only marginal. You need to test to be sure.

The bottom line is that you'll have to weigh the pros and cons. Don't join unless you have to. If performance is important and you experience a noticeable improvement when you join, then you probably should join. But ask yourself what the implications are. Is the script still manageable? Can a user understand how the formula should be written?

The best join is often the one that never is made. Often – but not always.

HIC

See more about this topic in the Technical Brief about Joins and Lookups

26 Comments
Not applicable

Very interesting article. I have been using this approach, but often the challenge is when you have to handle multiple 'facts' in your data model. So, a related query: is concatenation always recommended when handling multiple facts with different levels of detail? Sometimes building a link table is more intuitive, keeping the facts separated and visible!

Employee

Glad you like it. My view on multiple fact tables is that they should be concatenated. A link table will also work, but it is not as efficient. The different levels of detail can be handled, and I am in fact working on an article on exactly this problem. So stay tuned - you'll hear more about this.

HIC

PS see more on Fact Table with Mixed Granularity

Not applicable

Dear Henric,

very interesting article.

I myself try to avoid joins, because I find them pretty confusing, and dangerous.

Indeed, reading your entry triggered a thought about KEEP and performance.

Let's say I have Table1, Order Headers, and Table2, Order Lines.

Which do you judge to be more effective performance-wise, between these two options:

1. Filtering Table1 with a WHERE clause to just the orders of year 2012, and then LEFT KEEP Table2.

2. Filtering Table1 with a WHERE clause to just the orders of year 2012, and likewise filtering Table2 with a WHERE clause, given that the appropriate field is available.

In the second case, I see a double benefit, as data are filtered right at the source (so there is no need for checking against Table1 when selecting from Table2), and further there won't be any real-time evaluation when a selection is made in QlikView.

The only drawback I see is a little less flexibility: if I later want to include the orders of 2011 as well, I would need to change the script for both tables rather than just for Table1 (and as I type, I realise that a variable set at the beginning of the script could determine the date, so all in all there isn't really any drawback).

What is your thought on this? (sorry for deviating from the original topic)

thank you!

Employee

There are several ways to do this. All three approaches below will work, and which one you should choose depends on how much memory you have, or how fast your DB can perform a join.

1) Using Left Keep:

    SQL SELECT ... FROM OrderHeaders WHERE Year = 2012;

    Left Keep SQL SELECT ... FROM OrderLines;

This is fast but can be very memory-consuming: QlikView will load the entire OrderLines table before reducing it to the proper size.

2) Using Where Exists:

    SQL SELECT ... FROM OrderHeaders WHERE Year = 2012;

    Load * Where Exists(OrderID);
    SQL SELECT ... FROM OrderLines;

This is not as fast, but on the other hand it uses a minimal amount of memory: QlikView will load only the matching records from OrderLines.

3) Using a join:

    SQL SELECT ... FROM OrderHeaders WHERE Year = 2012;

    SQL SELECT ... FROM OrderLines
        INNER JOIN OrderHeaders ON OrderHeaders.OrderID = OrderLines.OrderID
        WHERE OrderHeaders.Year = 2012;

Whether this is fast or not depends on your DB. However, QlikView uses a minimal amount of memory and will load only the matching records from OrderLines.

Your thought about filtering already at the source is totally correct. Less data will be transferred to QlikView, and this is probably good. But if the join is slow, the gain is lost...

I would probably try all three to see which is fastest.

HIC

Not applicable

Hi Henric.

In this case I have a question.

If we keep everything in normalized form, without many joins, there is a possibility that it may increase the number of hops.

I have heard that it is best practice to have a smaller number of hops in QlikView.

So what's your opinion here?

Will hops cause wrong calculations?

Employee

In most cases, QlikView performs very well also when you have many tables, and thus many "hops", in the calculations. So I don't agree that best practice should be as small a number of hops as possible.

For very large data sets there are, however, a couple of cases where you should be cautious. If performance is bad, you should consider changing the data model:

  • If you have several very large tables, like in the picture above with four tables. In such a case you could consider joining the two big tables together, so that you get a star schema instead of a snowflake schema.
  • If you have calculations like Sum(x*y) where x and y are in different tables, then you should consider moving one of the two numbers, so that you get both numbers in one table (see the sketch below).
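
One way to move a number, sketched with hypothetical names:

    // Quantity lives in the fact table; Price lives in Products.
    // Sum(Quantity*Price) in a chart would force an in-memory join.
    OrderLines:
    LOAD OrderID, ProductID, Quantity FROM OrderLines.qvd (qvd);

    // Move Price into the fact table, so that the calculation
    // becomes single-table:
    Left Join (OrderLines)
    LOAD ProductID, Price FROM Products.qvd (qvd);

Since Products has exactly one record per ProductID, this join cannot change the number of records in OrderLines, so the caveat above about joins and Sum() does not apply here.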

Will hops cause wrong calculations? No – it is the other way around. If you join the tables together, then you may get wrong calculations.

HIC

Not applicable

Hi Henric,

I have to make a join with two tables, but I don't know how. I tried several things but it did not work. Can you please give me your email address or send me a personal message, so I can send you the MDX statement.

Regards Lars

Not applicable

Dear Henric,

please let me come back to this topic.

You said

If you have calculations like Sum(x*y) where x and y are in different tables, then you should consider moving one of the two numbers so you get both numbers in one table.

How about if there is a dimension x which is sorted according to the sum of y?

Would it also be better in that case to have x and y in the same table?
Thank you!

Not applicable

Henric,

Thank you for this blog. I often try to keep the tables apart from each other, but you're right, joining is sometimes faster. Another good thing to notice: if you join a table by loading it DISTINCT, the initial table that is joined to will also be reduced to distinct records. Of course there are solutions to this problem.

Patrick

Employee

@ Erik Furlanis : Yes, if you sort by a field y from another table, you should also consider moving y into the main table. (But only if you already have a performance problem. "Premature optimization is the root of all evil." -- Donald Knuth).

@ Patrick Baal : Correct. The "distinct" clause will be evaluated after the transformation, which is counter-intuitive. However, it is exactly what you want in the concatenation case, but maybe not in the join case. (See the sketch below.)
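
A small sketch of the gotcha, with hypothetical inline tables: the "distinct" qualifier is applied to the result of the join, not only to the incoming records.

    // Facts may legitimately contain duplicate records,
    // for example two identical order lines:
    Facts:
    LOAD * INLINE [
    OrderID, Amount
    1, 100
    1, 100
    2, 200
    ];

    // The Distinct below is evaluated after the join, so the two
    // identical Facts records collapse into one, and Sum(Amount)
    // drops from 400 to 300:
    Join (Facts)
    LOAD Distinct * INLINE [
    OrderID, Customer
    1, Acme
    2, Beta
    ];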

HIC

geetaalhan
Contributor

Very informative.

Not applicable

What about the concept of a "link table", in the spirit of dimensional modelling? That is, with two facts, instead of joining them and then having the dimensions around the result, you could link the facts and dimensions through a link table.

Are there any situations where this could be smart?

Employee

Link tables work fine, but can lead to long response times when you have a lot of data. I personally think that it is almost always better to concatenate the two fact tables into one, even in cases where there are few common fields.

But this is also a matter of taste: there are many who prefer link tables. See also what my colleague Arturo Muñoz says about it: Concatenate vs Link Table.

HIC

Not applicable

Thanks, it seems a given that it is concatenate and not join in these cases. As I have posted in another thread about big tables, I don't understand this: why get double rows if you can avoid it? Can you please explain why, or do you have a blog post about that too?

Employee

If you know for certain that you have a one-to-one relationship between the two fact tables, then you probably should join them. As you say: "why get double rows, if you can avoid it".

But usually, you cannot be sure about that. You may have two records in one fact table that should match one record in the other. If you then join, the numbers will not add correctly. Then it is much better to concatenate the two tables.

Concatenation also allows you to have different granularity in the two fact tables. And for that, I have another blog post: Fact Table with Mixed Granularity.
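
A minimal sketch of such a concatenation, with hypothetical names:

    // Two fact tables with partly different fields. Concatenate
    // stacks the records instead of matching them; fields missing
    // in one of the tables are simply left null.
    Facts:
    LOAD OrderID, CustomerID, Amount FROM OrderHeaders.qvd (qvd);

    Concatenate (Facts)
    LOAD OrderID, ProductID, Quantity FROM OrderLines.qvd (qvd);

No records are duplicated, so a plain Sum() or Count() behaves as expected.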

HIC

Not applicable

"And for that, I have another blog post." If you put all these blogs into pdf and merge them, you can have a very good book to sell us - Just don't choose "PACKT Publishing", I made a little blog with my review on them for the book "QlikView 11 for developers".

jansen28
New Contributor III

Very useful article. Thanks!

bestofwest
Contributor II

You say "QlikView uses a minimal amount of memory."

In almost all situations that is not correct...

sibideepak
Contributor II

Helpful words !!

Not applicable

Thanks for the article!

I'm wondering if it is possible to optimize a sum(table1.field1 * table2.field2) expression where there is a many-to-many relationship between the two tables due to normalization. For example, by first dynamically summarizing the two tables on only the dimensions that are required for the chart and the join, so that fewer combinations would have to be calculated to perform the sum. I have not managed to figure out whether this is possible or not.

I posted a question with more details here:

https://community.qlik.com/thread/165421

Not applicable

great & useful article ,thanks.@srinuinqlikview

Not applicable

Hi,

Bit confused as to which way to go.

I have a fact table which I concatenate with another fact table, and then I left join 9 tables:

  • 110,000 rows, bringing through 6 fields
  • 48,000 rows x2, 7 fields each
  • 4,500 rows x2, 6 fields each
  • 19,000 rows x2, 6 fields each
  • 10 rows, 3 fields
  • 25 rows, 3 fields

So, roughly:

    Sales:
    Load * from fact1 where date > value;
    Concatenate
    Load * from fact2 where date > value;
    Left Join (Sales)
    Load * from table1;

What would be the best way to optimise this script/load? Apply maps, joins within the load, or anything else?

Employee

I would not join the tables. In most cases you don't need the join, and the script is more manageable if you don't join the tables. So start with the concatenation of your fact tables, then proceed to loading the dimensional tables, but without the joins. A sketch of that structure follows below.
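
A minimal sketch, with hypothetical names (the date limit is kept in a variable):

    // Concatenate the fact tables into one...
    Sales:
    LOAD * FROM fact1.qvd (qvd) Where Date > '$(vStartDate)';
    Concatenate (Sales)
    LOAD * FROM fact2.qvd (qvd) Where Date > '$(vStartDate)';

    // ...then load each dimension as its own table, associated
    // with Sales through its key field. No joins needed.
    Products:
    LOAD ProductID, ProductName FROM products.qvd (qvd);

    Customers:
    LOAD CustomerID, CustomerName FROM customers.qvd (qvd);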

HIC

Not applicable

Hi,

Thanks for getting back to me.

At the end of the script I do some post-load calculations on SalesTransactions. SalesTransactions would be the table that has the fact and dimension tables joined via left joins; at the end I use it to create fields from the dimensions:

How would I do this without joining please? Is this possible?

    SalesTransactions:
    LOAD *,
        TopDepartureName & '-' & TopArrivalName AS Flow,
        If(TopDepartureName < TopArrivalName,
           TopDepartureName & '-' & TopArrivalName,
           TopArrivalName & '-' & TopDepartureName) AS FlowGrouped,
        If(DepartureName < ArrivalName,
           DepartureName & '-' & ArrivalName,
           ArrivalName & '-' & DepartureName) AS JourneyGrouped,
        DepartureName & '-' & ArrivalName AS Journey,
        If(LegFare = 0, 1, 0) AS ZeroFlag,
        If(SaleType = 0, Company, ProductCompany) AS CompanyOverall
    Resident Sales;

    DROP TABLE Sales;

Employee

I would make these calculations directly in the fact table. I assume that you make the join to "move" fields from the dimensions into the fact table and then use these in the calculations. However, you can also do this with ApplyMap(). See Don't join - use Applymap instead.
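
A minimal, hypothetical sketch of the ApplyMap() pattern:

    // A mapping table always has exactly two fields: key and value.
    CompanyMap:
    MAPPING LOAD ProductID, ProductCompany FROM Products.qvd (qvd);

    // ApplyMap() looks up the value while the fact table is loaded,
    // without changing the number of records. The third parameter
    // is the default for keys that are not found.
    Sales:
    LOAD *,
        ApplyMap('CompanyMap', ProductID, 'Unknown') AS ProductCompany
    FROM SalesData.qvd (qvd);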

HIC

Not applicable

I used ApplyMap for these post-load calculations and it worked great. However, my load still takes over 2 hours. I need to cut it down to an hour at most.
