The Importance Of Being Distinct - Qlik Community

Miguel_Angel_Baeyens · ‎2014-03-11

Part of my job as an Enterprise Architect at Qlik is dealing with hardware sizing. There are several factors to take into account to properly size any environment, and one of the most underrated I have found when discussing this subject with IT teams and developers, is the value of distinctness. But this is exactly one of the things that makes QlikView different. Let me explain this by taking a look at an example.

My app needs a calendar from January 1, 2011 to February 25, 2013. This data is stored in the database in the form of a timestamp:

DD/MM/YYYY hh:mm:ss

That makes 99 531 648 possible values. This field alone, when the QVW is opened, uses 2.3 GB of RAM.

That’s quite a bit of RAM for only one field. Let’s do some math. QlikView pointers are bit stuffed pointers meaning that QlikView needs 27 bits to store 99.5 million values. So, to calculate the RAM usage for the symbol tables, you need to use the following formula:

(Field Length in bytes * Number of Values) + Pointer Size (2^27) = Theoretical Total Size of Model
(18 * 99 531 648) + 134 217 728 = 1 925 787 392 Bytes

So for that field alone, I need to allocate 2 GB RAM, of which 134 MB are to store the pointers. It seems to be quite big for this small part of the model.

Since I cannot afford that much for one field alone, I’m going to split that field into two: date and time.

Now I have a Date field of 1 152 possible values and a Time field of 86 400 possible values. These two fields, when the QVW is open uses 24 MB of RAM.

Let’s do the math with these figures, using the same formula for Date and Time:

Date: 13 568 Bytes

Time: 707 576 Bytes

Because I’m always thinking of optimizing the models, I have decided to discard seconds, as they are not going to be used in my app, but I do need hour and minute.

The Date field has 1 152 possible values but the Time field is now reduced to 1 439 possible values. These two fields, when the QVW is open use 3.9 MB of RAM.

Let’s do the math with these figures, using the same formula

Date: 13 568 Bytes

Time: 9 243 Bytes

These file sizes may vary slightly depending on the file system, size of hard disk, amount of RAM in the system, etc.

We have seen some figures handling dates, but use this same technique with full names, addresses, composite fields, phone numbers… This is when the associative magic happens!

As you can see, distinctness is everything but trivial in QlikView, and taking proper care of fields when developing from the start will make the difference between a viable, well performing model and app and a huge, slow app.

Do you have any examples or tips and tricks of the value of distinctness?

Appendix I: Table with file sizes and RAM footprint for each example

Appendix II: Script code used for each scenario, that you can download here.

valerio_fatatis · ‎2014-03-11

great great great...!

Wonderfull doc!

Thanks

Magnific!

Report Inappropriate Content · ‎2014-03-11

Excellent doc.

I have two questions:

1) the field length is 18. Is is just for dates ? for all fields (because there is no data type in QV) ? (I thought it was the number of bytes, here 27 but it seems that it is not the case).

2) why a day is composed of 86 399 seconds and not 86 400 (24*60*60) ? Is there a second that should not be there like 00:00:00 ?

I have learnt a lot, thank you

Fabrice

ThornOfCrowns · ‎2014-03-11

It looks like today's job is to trim all of the date/time values from my loads. Thanks!

Miguel_Angel_Baeyens · ‎2014-03-11

Hi Fabrice,

To your first question, it is for this case as returned by the Len() function, because the representation is a literal, not a numeric value.

To your second, it's just a question of scripting. If 00:00:00 is the second 0, then 23:59:59 is the second 86 399. IterNo() will always return 1 as first value, however, for me the first second of each day is 00:00:00. That's why I have to substract 1 in the WHILE part.

If you are familiar with MS programming languages, think of the "Option Base" statement in Visual Basic to set arrays start their indexes in 1 and not in 0 (or viceversa).

You could use a different script using:

SET vMinDate = Date('31/12/2010');

instead of

SET vMinDate = Date('01/01/2011');

Discarding the "-1" in the rest of the script and using 86400 as the number of iterations in the loop.

Hope that makes sense.

Miguel

Report Inappropriate Content · ‎2014-03-11

Miguel,

Thanks for answering so fast.

1) 18 = length of the field. OK. I did not catch you used the Len() function to find that number. Now it is clear. And logic also.

2) You write "Now I have a Date field of 1 152 possible values and a Time field of 86 399 possible values." I was not speaking about scripting and iterations, but just number of possibilities. I think there are 86 400 possibilities (perhaps from 0 to 86 399) because there are 86 400 seconds in the day, and we count every second. No ? I perhaps did not understand correctly your sentences.

Fabrice

Miguel_Angel_Baeyens · ‎2014-03-11

Hi Fabrice,

You are totally correct, it's a mistake. I'm reviewing it right away.

Thanks!

Report Inappropriate Content · ‎2014-03-11

Miguel,

Thaks to you for this doc. As already said, I learnt a lot

Fabrice

Marcio_Campestrini · ‎2014-03-11

Miguel, thanks the explanation.

Anonymous · ‎2014-03-11

Miguel:

Great post. We developers need more information like this.

Thanks.

CB.

michael_solomon · ‎2014-03-11

Hi Miguel,

Firstly, good post, thanks. But what's to stop you splitting the date/time even further? ie. Separate day (31 possible values) from month (12 possible values), from year (3 possible values in the above example), and even hour (24 possible values), and minute (60 possible values). Separating these would result in even smaller storage and memory usage. - your QVD files size would reduce from 48kb to something like 4kb.

In a practical sense I don't imagine too many documents where you would split it in the way I've suggested above, but as you say with other data such as addresses/phone numbers... it's important to know when to stop splitting.

I expect knowing how these fields will be used in the document will help you decide where to land in the trade off between size and calculation speed. That is depending on the users' needs, storing the data as separated fields may need a calculation to reconstruct it for the user.

Thanks.

Michael