Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

The Importance Of Being Distinct

Part of my job as an Enterprise Architect at Qlik is dealing with hardware sizing. There are several factors to take into account to properly size any environment, and one of the most underrated I have found when discussing this subject with IT teams and developers, is the value of distinctness. But this is exactly one of the things that makes QlikView different. Let me explain this by taking a look at an example.

My app needs a calendar from January 1, 2011 to February 25, 2013. This data is stored in the database in the form of a timestamp:

DD/MM/YYYY hh:mm:ss

That makes 99 531 648 possible values. This field alone, when the QVW is opened, uses 2.3 GB of RAM.

That’s quite a bit of RAM for only one field. Let’s do some math. QlikView pointers are bit stuffed pointers meaning that QlikView needs 27 bits to store 99.5 million values. So, to calculate the RAM usage for the symbol tables, you need to use the following formula:

(Field Length in bytes * Number of Values) + Pointer Size (2^27) = Theoretical Total Size of Model

(18 * 99 531 648) + 134 217 728 = 1 925 787 392 Bytes

So for that field alone, I need to allocate 2 GB RAM, of which 134 MB are to store the pointers. It seems to be quite big for this small part of the model.

Since I cannot afford that much for one field alone, I’m going to split that field into two: date and time.

Now I have a Date field of 1 152 possible values and a Time field of 86 400 possible values. These two fields, when the QVW is open uses 24 MB of RAM.

Let’s do the math with these figures, using the same formula for Date and Time:

Date: 13 568 Bytes

Time: 707 576 Bytes

Because I’m always thinking of optimizing the models, I have decided to discard seconds, as they are not going to be used in my app, but I do need hour and minute.

The Date field has 1 152 possible values but the Time field is now reduced to 1 439 possible values. These two fields, when the QVW is open use 3.9 MB of RAM.

Let’s do the math with these figures, using the same formula

Date: 13 568 Bytes

Time: 9 243 Bytes

These file sizes may vary slightly depending on the file system, size of hard disk, amount of RAM in the system, etc.

We have seen some figures handling dates, but use this same technique with full names, addresses, composite fields, phone numbers… This is when the associative magic happens!

As you can see, distinctness is everything but trivial in QlikView, and taking proper care of fields when developing from the start will make the difference between a viable, well performing model and app and a huge, slow app.

Do you have any examples or tips and tricks of the value of distinctness?

Appendix I: Table with file sizes and RAM footprint for each example

qcomDBlogUsageTable.PNG.png

Appendix II: Script code used for each scenario, that you can download here.

24 Comments
Partner
Partner

great great great...!

Wonderfull doc!

Thanks

Magnific!

1,451 Views
Not applicable

Excellent doc.

I have two questions:

1) the field length is 18. Is is just for dates ? for all fields (because there is no data type in QV) ? (I thought it was the number of bytes, here 27 but it seems that it is not the case).

2) why a day is composed of 86 399 seconds and not 86 400 (24*60*60) ? Is there a second that should not be there like 00:00:00 ?

I have learnt a lot, thank you

Fabrice


1,451 Views
thornofcrowns
Valued Contributor II

It looks like today's job is to trim all of the date/time values from my loads. Thanks!

1,451 Views

Hi Fabrice,

To your first question, it is for this case as returned by the Len() function, because the representation is a literal, not a numeric value.

To your second, it's just a question of scripting. If 00:00:00 is the second 0, then 23:59:59 is the second 86 399. IterNo() will always return 1 as first value, however, for me the first second of each day is 00:00:00. That's why I have to substract 1 in the WHILE part.

If you are familiar with MS programming languages, think of the "Option Base" statement in Visual Basic to set arrays start their indexes in 1 and not in 0 (or viceversa).

You could use a different script using:


SET vMinDate = Date('31/12/2010');

instead of

SET vMinDate = Date('01/01/2011');


Discarding the "-1" in the rest of the script and using 86400 as the number of iterations in the loop.

Hope that makes sense.

Miguel

1,451 Views
Not applicable

Miguel,

Thanks for answering so fast.

1) 18 = length of the field. OK. I did not catch you used the Len() function to find that number. Now it is clear. And logic also.

2) You write "Now I have a Date field of 1 152 possible values and a Time field of 86 399 possible values." I was not speaking about scripting and iterations, but just number of possibilities. I think there are 86 400 possibilities (perhaps from 0 to 86 399) because there are 86 400 seconds in the day, and we count every second. No ? I perhaps did not understand correctly your sentences.

Fabrice

1,451 Views

Hi Fabrice,

You are totally correct, it's a mistake. I'm reviewing it right away.

Thanks!

1,451 Views
Not applicable

Miguel,

Thaks to you for this doc. As already said, I learnt a lot

Fabrice

0 Likes
1,451 Views
Marcio_Campestrini
Valued Contributor

Miguel, thanks the explanation.

0 Likes
1,451 Views
carbal1952
Contributor II

Miguel:

Great post. We developers need more information like this.

Thanks.

CB.


1,451 Views
Partner
Partner

Hi Miguel,

Firstly, good post, thanks.  But what's to stop you splitting the date/time even further? ie. Separate day (31 possible values) from month (12 possible values), from year (3 possible values in the above example), and even hour (24 possible values), and minute (60 possible values).  Separating these would result in even smaller storage and memory usage. - your QVD files size would reduce from 48kb to something like 4kb.

In a practical sense I don't imagine too many documents where you would split it in the way I've suggested above, but as you say with other data such as addresses/phone numbers... it's important to know when to stop splitting.

I expect knowing how these fields will be used in the document will help you decide where to land in the trade off between size and calculation speed. That is depending on the users' needs, storing the data as separated fields may need a calculation to reconstruct it for the user.

Thanks.

Michael

1,451 Views
Partner
Partner

Hi friend,

Great post, really needed having those aspects in account while designing your Qlik solution.

Awaiting for new posts talking about hardware sizing.

Best Regards

1,451 Views

Hi Michael,

You are absolutely correct. Where to stop? I did in Date and Time (without seconds) because for my analyses that is enough. Yes, I could have keep separating into several other fields, but that, for my purpose, would be very cumbersome to deal with.

Where is the balance? I think the point of the example is clear, and in my experience you start optimizing when you already have built something that needs to be fixed or, at least, needs to be improved. If your triangle Environment/Hardware-App/Data Model Complexity-Usage Patterns/Number of Users is balanced, there is no need for further improve.

Best,

Miguel

0 Likes
1,451 Views
datanibbler
Esteemed Contributor

Hi,

great post!

That is also in the excellent "QlikView 11 for Developers", it's called high-cardinality fields (which does not make this post any less excellent).

=> When you have to administer the QlikView infrastructure and thus take care of server_load and performance, that definitely does matter - also rgd. the size of the apps themselves and their performance - a problem that gets more accentuated when you think of opening apps via VPN or on a mobile device of any kind.

=> In that line of thought, you can also consider making some use of the "autonumber" function. I have no specific proof for the effect of this, but, as the book says,

   => QlikView has an easier time compressing numeric data as opposed to compressing text

=> Also, you can consider your keys (between tables): Having numeric values as keys is preferable to having some text as key (here again the autonumber() fct. might prove handy).

1,451 Views

Thanks for your annotations.

You are correct (Miguel Garcia and Barry Harmsen are too wise people).

The impact of AutoNumber() is that it does not allow incremental loads because it has to be executed at the same level of the script, and this will impact in the reload times.

Miguel

0 Likes
1,451 Views
Not applicable

Great post, Miguel.

1,451 Views
Luminary
Luminary

I must confess, i always wondered why some QlikView apps ground to a halt when the calendar was being generated, but i guess this explains it.  I've seen copious amounts of RAM being consumed at the time of calendar gen, and the application pretty much freezes whilst it appends the calendar.

Needless to say, i'm looking forward to trying this out on a few applications tomorrow to see what the impact on the RAM usage and load script execution times are.

Great stuff, Miguel!

1,451 Views

It's a great post, very easy to understand. Thanks!

1,451 Views
Not applicable

Great post!

1,451 Views
Not applicable

Hi,

I have one doubt, is this size formula correct for key fields?

I've heard that the size of a key field is bigger than a normal field, am I right?

cheers!

David

1,451 Views

It's OK. As you can imagine, these formulas cannot be used for a sensible sizing of the app or measuring, since there are many other factors in memory that are happening at the same time. But it's a good starting point.

In QlikView there are no such thing as proper "key" fields, each and every field has its distinct values and their indexes in the model. If they are named alike, then QlikView knows that both tables are associated.

1,451 Views
Not applicable

Thanks for this post Miguel

1,451 Views
Partner
Partner

Espectacular aporte Miguel Ángel.

Es apasionante lo que el estudio de la optimización puede mejorar nuestro proyecto!!

Mil gracias por compartir tus conocimientos en Qlik.

Luis García.

1,451 Views
jerifortune
Contributor III

Thanks. very helpful.

0 Likes
1,451 Views
vvira1316
Valued Contributor II
1,090 Views