gmenoutis
Partner - Creator II

QV uses 6NF?!? Many joins?

From general reading, it seems that when chasing performance, one should design a data warehouse to need as few joins as possible (one of the places I learned this is my recent post here).

Then I stumbled on this Qlik design post. In short, it states that each data table we see is actually built from one master/lookup table per field, which assigns a minimum-size integer (called a bit-stuffed pointer) to each possible value; the actual data table is then a link table between all the bit-stuffed index combinations.


In my understanding, this is a 6th normal form (6NF) design, and it leads to a large number of join operations.


So if, in the script, I load a single table with 20 columns and present all of them in a chart, the Qlik engine will effectively run 20 joins! What probably saves the day is that the data types, minimal cardinality, and of course the join algorithm itself are optimized.
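The structure described above can be sketched in a few lines. The following is a minimal, purely illustrative Python model of the symbol-table idea (this is not Qlik's actual implementation, and the names `encode` and `decode_row` are invented for the example): each field gets its own table of distinct values, and the data table stores only small integer indices into those tables.

```python
def encode(rows, fields):
    """Build one symbol table per field and an index-only data table."""
    symbols = {f: [] for f in fields}    # field -> list of distinct values
    positions = {f: {} for f in fields}  # field -> value -> index in symbols[f]
    data = []
    for row in rows:
        encoded = []
        for f in fields:
            v = row[f]
            if v not in positions[f]:
                positions[f][v] = len(symbols[f])
                symbols[f].append(v)
            encoded.append(positions[f][v])  # store the "pointer", not the value
        data.append(encoded)
    return symbols, data

def decode_row(symbols, fields, encoded):
    """Rendering one row means one symbol-table lookup per field."""
    return {f: symbols[f][i] for f, i in zip(fields, encoded)}

rows = [
    {"StoreCode": "S1", "Product": "Apple"},
    {"StoreCode": "S1", "Product": "Pear"},
    {"StoreCode": "S2", "Product": "Apple"},
]
symbols, data = encode(rows, ["StoreCode", "Product"])
print(symbols["Product"])  # ['Apple', 'Pear'] - only distinct values are stored
print(data)                # [[0, 0], [0, 1], [1, 0]]
```

In this toy model, presenting all fields of a row really does require one lookup per field, which is the "20 joins" worry in the question.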


But, in this respect, is it still correct to say that having many tables which need to be joined on demand (in order to evaluate front-end expressions) can have a serious impact on performance?

For example, let's say we have a Transactions table with 20 fields and a Stores table with 10 fields, and both have a field "StoreCode" on which the natural join happens.

The Qlik engine structure will already make 30 joins due to the 6NF structure, plus one join due to the "split" table structure. However optimized the Qlik engine's splits are, is it really the case that this 1 join out of 31 seriously impacts performance?

11 Replies
marcus_sommer

I wouldn't look at it this way; rather, I see two different layers in it. The base layer consists of the various symbol and data tables, and above them sits the data-model layer, which connects all the data tables.

I'm not sure whether a data table and all of its related symbol tables can be classified as 6NF. In my opinion the data table is just a data table: the load order defines the record index, and - in my assumption - no "classical" joins are necessary to match the fields to the symbol tables (which are merely the source for a mapping/lookup approach in parallel and/or later working steps).

Therefore I think the statements in the two posts mentioned are not really contradictory, even if both only provide a simplified view of reality.

- Marcus

gmenoutis
Partner - Creator II
Author

I do not understand. How are "joins" different from a "mapping/lookup approach"? I understand that in the general case a database join operation is not optimized by having either a one-row-per-value or a minimum-length-datatype field. But still, when you have two tables (data and symbol) and combine them, you are running a join algorithm.

Which runs once per needed field, a number *usually* much larger than the number of tables involved.


marcus_sommer

A "classical" join adds fields from one table to another and, depending on the relation of their values and the join prefix used, it can change the number of records by duplicating or removing them - it is usually a rather heavy transformation.

A mapping/lookup is quite different from that approach: no columns are added and the number of records does not change; only the content of the field values is replaced.
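The difference can be illustrated with a small sketch (Python, purely for illustration; this is not Qlik script, and the sample data is invented): a join adds columns and can change the record count, while a mapping leaves the record count untouched and only replaces values.

```python
transactions = [
    {"StoreCode": "S1", "Amount": 10},
    {"StoreCode": "S2", "Amount": 20},
]
stores = [
    {"StoreCode": "S1", "Region": "North"},
    {"StoreCode": "S1", "Region": "North-East"},  # duplicate key on purpose
]

# "Classical" inner join: adds a column and can duplicate or remove records.
joined = [
    {**t, "Region": s["Region"]}
    for t in transactions
    for s in stores
    if t["StoreCode"] == s["StoreCode"]
]

# Mapping/lookup: record count untouched, only field values are replaced.
region_map = {"S1": "North", "S2": "South"}
mapped = [
    {**t, "StoreCode": region_map.get(t["StoreCode"], t["StoreCode"])}
    for t in transactions
]

print(len(joined))  # 2 - but both records are copies of the S1 row; S2 was dropped
print(len(mapped))  # 2 - still exactly one record per original transaction
```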

- Marcus

Or
MVP

I'm not sure, but I think there's little gain in applying relational database concepts to a columnar, in-memory database - it is my understanding that they simply don't apply.

Qlik's engine uses this technique to reduce the amount of information that needs to be kept in-memory, because historically memory availability was the limiting factor for QlikView applications, not the number of actions needed to perform calculations. Normalization simply does not apply in the relational-database sense, as there are no I/O concerns to speak of and no need to worry about the same value existing multiple times in different tables, which would require multiple updates when it changes and/or force you to deal with dependencies.

Peter_Cammaert
Partner - Champion III

Where did you get the idea that columns in a single table need to be joined to get to the data? The bit-stuffed pointer technique doesn't need joins, since comparing bit-stuffed pointers in the table is the same as comparing the original values. It just takes less RAM and less time.

AFAIK the Qlik engine(s) doesn't do much with the actual values unless calculations require their numerical representations and then it's just a matter of indexing into the Symbol table, just like an RDBMS Index does into an actual Relational table.
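The point about comparing pointers instead of values can be sketched as follows (Python, purely illustrative; the data is invented): once each distinct value appears exactly once in the symbol table, equal pointers imply equal values, so selections can be evaluated on the small integers alone, and the actual value is only looked up for display.

```python
symbols = ["Berlin", "London", "Paris"]  # one symbol table for a "City" field
column = [1, 0, 2, 1, 1, 0]              # bit-stuffed pointers in the data table

london = symbols.index("London")         # resolve the value to a pointer once...
matches = [i for i, p in enumerate(column) if p == london]  # ...then compare ints only
print(matches)  # [0, 3, 4]
```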

gmenoutis
Partner - Creator II
Author

AFAIK the Qlik engine(s) doesn't do much with the actual values unless calculations require their numerical representations

OR an object presents the value itself (either the numeric or text part). And this object has to exist in memory.

and then it's just a matter of indexing into the Symbol table, just like an RDBMS Index does into an actual Relational table.

I agree. It seems I may have caused some confusion by using the term "join" to mean the relational one, whereas the join operation in the Qlik engine is to append columns to a data table.

Here, it seems you use "index" in the same way I use "join". Fair enough. The point is that this operation has to run (table count - 1) times to join the tables... AND (field count) times to bring in the values from the pointers. And my question is: since the second number is usually much larger than the first, is it possible that having many tables only minimally affects performance?

gmenoutis
Partner - Creator II
Author

@Marcus Sommer:


You mention two differences:


  1. No columns are added in the lookup: fair enough, this makes a difference on the memory side. However, we are mostly talking about the time taken by the join algorithm, thus the CPU - and here, nothing changes: the CPU still bears the burden of assigning the real data values to each pointer.
  2. The fact that the record count does not change in the lookup is simply a product of the design: each pointer corresponds to exactly one lookup-table row. This does not change the fact that the pairing needs to be done. Also, since the design provides



@Or Shoham:


I used the 6NF term mainly as a quick way to describe QlikView's symbol/lookup table structure. It was not my intention to imply the data-integrity merits of normalization (and even if I did, this IS the highest normal form).

The drawbacks, however, i.e. the multitude of joins, have to do with performance, not integrity, and performance IS what I am worried about.

Even then, I am not judging or even commenting on the Qlik team's data structure design choice. I am simply wondering, given that it is what it is, whether it is indeed valid to say that multiple tables are (considerably) slower, when in fact their join is a very small fraction of the true joins taking place.

Also, you state:



Qlik's engine uses this technique to reduce the amount of information that needs to be kept in-memory



However, when I create an object (e.g. a Table Box), Qlik has to make one join for every table (to bring the data together) and then one join for every field (to replace pointers with data), and this whole thing takes CPU and is stored in memory in order to render the object... correct?

Peter_Cammaert
Partner - Champion III

Well, although we're currently guessing about implementation details that no one in this community knows about, except a few of the people with a star behind their nickname (I personally know of only one member who can nose around in R&D territory and is still active in Qlik Community), maybe this consideration can help: AFAIK the display value of a TableBox cell is not present in memory except in the Symbol table of that particular field. The TableBox object will probably have an instance in RAM, assembled for whenever someone wants to inspect multiple connected listboxes, but why would you need the display values? For every million occurrences of, say, field pointer value 101011, there is only one display value, which can be obtained easily and with minimal effort when the Windows object is redrawn. Storing the display values in every RAM copy of every object presented on screen would effectively undo the savings of using a Symbol table and bit-stuffed pointers.

But as I said, this is only guesswork. I've never had the opportunity to look into the code itself. Unfortunately...

marcus_sommer

Yes, the CPU needs to translate the pointer values into the symbol-table values, on the fly, again and again; however, nearly all matchings and calculations are cached in Qlik, which reduces the number of CPU actions significantly.

Further, many computations in Qlik (and I think this matching too) are executed multi-threaded and are quite optimized. Another important point is that these computations happen only for the possible dataset (with regard to the current selections and the dimensions used in the visible objects). This means that quite often only a very small number of field values needs to be translated, and I think it's one of the fastest things Qlik needs to do to display/calculate any results in any objects.
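The caching mentioned above can be sketched with a tiny memoization example (Python, purely illustrative; `total_for` and the data are invented for the example, and Qlik's real cache is of course far more sophisticated): once a result has been computed for a given selection state, repeated renders can reuse it instead of re-translating and re-aggregating.

```python
from functools import lru_cache

amounts = (10, 20, 30, 40)
stores = ("S1", "S2", "S1", "S2")

@lru_cache(maxsize=None)
def total_for(selection):
    # selection is a frozenset of StoreCodes; only matching rows are summed
    return sum(a for a, s in zip(amounts, stores) if s in selection)

print(total_for(frozenset({"S1"})))  # 40 - computed
print(total_for(frozenset({"S1"})))  # 40 - served from the cache
```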

- Marcus