Solved: Re: String Key vs Integer Key -- any difference in... - Qlik Community

rwunderlich · ‎2014-02-05

Karl Pover makes the assertion in his blog Is AutoNumber() really worth the hassle? | Harvesting Wisdom that links between tables are done using the record pointer and therefore there is no difference in link structure between a string or an autonumbered key. Can anyone confirm this or provide a different opinion?

I understand the RAM savings from autonumber(). I'm asking specifically about differences in link structure and possibly performance.

-Rob

hic · ‎2014-02-06

Whether you have links that are strings, or links that are integers, should make no difference whatsoever in response time. In both cases, the evaluation (when you click) are made through the binary bit-stuffed pointers.

There are however two differences:

The size of the symbol table - of course. With strings, the symbol table will be much larger. And this is really the reason why you sometimes should use Autonumber().
The script execution time. Strings are always slower to read and interpret. But if you have strings in your source data, and use Autonumber() in the script to convert them to numbers, you will no matter what need to read and interpret them.

@ AUNEZ FABRICE If you have different cardinality in two tables, the symbol table will be created from the union of values in the two tables, and this (new) number of values will determine the length of the bit-stuffed pointers for all tables where this key is found. In other words, the symbol table is re-calculated when a field is encountered a second time. And so are the bit-stuffed pointers of the previous tables.

HIC

View solution in original post

Report Inappropriate Content · ‎2014-02-05

Hi Rob, Maybe we can get some input from Henric Cronström since my understanding of this comes from a discussion I had with him 2 years ago in Qonnections.

And while we wait for an official written answer and to keep the discussion active until we get the answer, I've been thinking how I could test a link's performance in a controlled random data model. Removing the impact that less RAM usage has on performance to confirm whether integers have a better link performance than strings seems tricky.

I've created 2 QVWs based on the following 2 scripts that try to even out the RAM size by keeping "dummy" fields.

StringKey.qvw

Load *, left(%String,1)&'X' as Dim;

Load

floor(rand()*26)+65 as Number1,

chr(floor(rand()*26)+65) & chr(floor(rand()*26)+65) as %String

AutoGenerate 50000000;

Load *, Number2 as Expression;

Load

chr(floor(rand()*26)+65) & chr(floor(rand()*26)+65) as %String,

floor(rand()*26)+65 as Number2

AutoGenerate 50000000;

IntegerKey.qvw

Load *, chr(%Number)&'X' as Dim;

Load

chr(floor(rand()*26)+65) & chr(floor(rand()*26)+65) as String1,

floor(rand()*26)+65 as %Number

AutoGenerate 50000000;

Load *, chr(Expression)&'X' as String2;

Load

floor(rand()*26)+65 as %Number,

floor(rand()*26)+65 as Expression

AutoGenerate 50000000;

In both QVWs I've placed all the list boxes and one chart with Dim as the dimension and sum(Expression) as the expression.

The StringKey.qvw is slightly larger in RAM at about 1.09 GB while NumberKey.qvw weighs in at about 1.04 GB, but further manipulation of the data models might case more harm.

After opening and closing the QVWs repeatedly and measuring the calc time of the graph, the faster chart seems to depend more on which QVW was opened first than whether the QVW has a string key or an integer key.

Calctime StringKey.qvw

2719 (1st open)

2750 (1st open)

2859

2781

2703 (1st open)

Calctime IntegerKey.qvw

2750

2781

2766 (1st open)

2718 (1st open)

2750

Where could I have gone wrong?

First, I'm assuming floor() returns a integer because QlikView Help ambiguously says it returns a number.

Second, the models aren't igual and I could have unintentionaly manipulated them in favor of StringKey.qvw

Third, I'm assuming a graph's calc time tells us how fast tables are linked although there can be other variables like the number of unique values in the dimension that determine its calc time.

Fourth, data model has too few tables to show the difference between integer keys and string keys.

Fifth, string values may be to short to show realistic results.

Well, it's a start...

Report Inappropriate Content · ‎2014-02-06

Rob,

My understanding is that integer keys are faster than string keys to perform the link. That is the case in SQL and I believe that it is also the case in QlikView. I do not think that QV uses the pointers to perform the link.

The size of each pointer depends on the cardinality of the dimension. When are these pointers made ? When reading or at the end ?

When loading data ?

If they are built when loading the data, we can face a case when we have different cardinality and orders. if we have 129 different members in the dimension table (8 bytes) and only 120 in the fact table (7 bytes), I assume that QV will allocate the size of each pointer at the end of the reading of the table. If the cardinality is not the same between 2 tables, the pointers will not be the same.

The order of the values may also be different : in that case, the pointers will also be different.

At the end of the load.?

QV may also decide to group the different tables, the different fields and perform the compression at that moment : compute the cardinality, allocate the size for each pointer, attribute a unique pointer for each value across all tables. It will be less efficient at a disk storage point of view, but could be more efficient fr the links.

Henric Cronström writes that "During the QlikView script run, after each load statement, QlikView transforms the data loaded into two table types: one data table and several symbol tables. QlikView creates one symbol table per field. Source: http://community.qlik.com/blogs/qlikviewdesignblog/2012/11/20/symbol-tables-and-bit-stuffed-pointers

So, it it why I think that integer keys are faster to perform the links because they are used to do it. As in any comparison, integer will be faster than string.

Fabrice

hic · ‎2014-02-06

Whether you have links that are strings, or links that are integers, should make no difference whatsoever in response time. In both cases, the evaluation (when you click) are made through the binary bit-stuffed pointers.

There are however two differences:

The size of the symbol table - of course. With strings, the symbol table will be much larger. And this is really the reason why you sometimes should use Autonumber().
The script execution time. Strings are always slower to read and interpret. But if you have strings in your source data, and use Autonumber() in the script to convert them to numbers, you will no matter what need to read and interpret them.

@ AUNEZ FABRICE If you have different cardinality in two tables, the symbol table will be created from the union of values in the two tables, and this (new) number of values will determine the length of the bit-stuffed pointers for all tables where this key is found. In other words, the symbol table is re-calculated when a field is encountered a second time. And so are the bit-stuffed pointers of the previous tables.

HIC

Report Inappropriate Content · ‎2014-02-06

Henric,

Thanks a lot for explanation. I clearly did not understand your post. When reading it again, it is quite clear that QV computes the pointers for each field, and not as I believed, for each field within each table. Therefore, the cardinality is computed for all values whatever the number of tables they can be in. And one value has the same pointer for all tables: the link can be made on it, ensuring performance.

Many "Best Practices" that I have read and the document on LOAD that I have written contain this false assertion about integer keys to join table. See page 35 of this one (I do not know if it is an official one).

You have broken another myth !

Fabrice

simonaubert · ‎2023-07-25

Hello @rwunderlich and @hic

This topic is quite old. Since, there is the second version of QIX since december 2015. Is it still true? I have more than doubt t be honest.

Best regards,

Simon

Bi Consultant (Dataviz & Dataprep) @ Business & Decision

simonaubert · ‎2023-07-25

Here the article https://community.qlik.com/t5/Design/QlikView-12-please-stand-up/bc-p/1475959

Bi Consultant (Dataviz & Dataprep) @ Business & Decision

hic · ‎2023-07-27

It is still true.

The evaluation (when you click) is made through the binary bit-stuffed indices. So, whether the data are strings or integers makes no difference whatsoever in response time.

String Key vs Integer Key -- any difference in link performance?