Part of my job as an Enterprise Architect at Qlik is dealing with hardware sizing. There are several factors to take into account to properly size any environment, and one of the most underrated I have found when discussing this subject with IT teams and developers, is the value of distinctness. But this is exactly one of the things that makes QlikView different. Let me explain this by taking a look at an example.
My app needs a calendar from January 1, 2011 to February 25, 2013. This data is stored in the database in the form of a timestamp:
DD/MM/YYYY hh:mm:ss
That makes 99 531 648 possible values. This field alone, when the QVW is opened, uses 2.3 GB of RAM.
That’s quite a bit of RAM for only one field. Let’s do some math. QlikView pointers are bit stuffed pointers meaning that QlikView needs 27 bits to store 99.5 million values. So, to calculate the RAM usage for the symbol tables, you need to use the following formula:
(Field Length in bytes * Number of Values) + Pointer Size (2^27) = Theoretical Total Size of Model
So for that field alone, I need to allocate 2 GB RAM, of which 134 MB are to store the pointers. It seems to be quite big for this small part of the model.
Since I cannot afford that much for one field alone, I’m going to split that field into two: date and time.
Now I have a Date field of 1 152 possible values and a Time field of 86 400 possible values. These two fields, when the QVW is open uses 24 MB of RAM.
Let’s do the math with these figures, using the same formula for Date and Time:
Date: 13 568 Bytes
Time: 707 576 Bytes
Because I’m always thinking of optimizing the models, I have decided to discard seconds, as they are not going to be used in my app, but I do need hour and minute.
The Date field has 1 152 possible values but the Time field is now reduced to 1 439 possible values. These two fields, when the QVW is open use 3.9 MB of RAM.
Let’s do the math with these figures, using the same formula
Date: 13 568 Bytes
Time: 9 243 Bytes
These file sizes may vary slightly depending on the file system, size of hard disk, amount of RAM in the system, etc.
We have seen some figures handling dates, but use this same technique with full names, addresses, composite fields, phone numbers… This is when the associative magic happens!
As you can see, distinctness is everything but trivial in QlikView, and taking proper care of fields when developing from the start will make the difference between a viable, well performing model and app and a huge, slow app.
Do you have any examples or tips and tricks of the value of distinctness?
Appendix I: Table with file sizes and RAM footprint for each example
Appendix II: Script code used for each scenario, that you can download here.
You are absolutely correct. Where to stop? I did in Date and Time (without seconds) because for my analyses that is enough. Yes, I could have keep separating into several other fields, but that, for my purpose, would be very cumbersome to deal with.
Where is the balance? I think the point of the example is clear, and in my experience you start optimizing when you already have built something that needs to be fixed or, at least, needs to be improved. If your triangle Environment/Hardware-App/Data Model Complexity-Usage Patterns/Number of Users is balanced, there is no need for further improve.
That is also in the excellent "QlikView 11 for Developers", it's called high-cardinality fields (which does not make this post any less excellent).
=> When you have to administer the QlikView infrastructure and thus take care of server_load and performance, that definitely does matter - also rgd. the size of the apps themselves and their performance - a problem that gets more accentuated when you think of opening apps via VPN or on a mobile device of any kind.
=> In that line of thought, you can also consider making some use of the "autonumber" function. I have no specific proof for the effect of this, but, as the book says,
=> QlikView has an easier time compressing numeric data as opposed to compressing text
=> Also, you can consider your keys (between tables): Having numeric values as keys is preferable to having some text as key (here again the autonumber() fct. might prove handy).
The impact of AutoNumber() is that it does not allow incremental loads because it has to be executed at the same level of the script, and this will impact in the reload times.
I must confess, i always wondered why some QlikView apps ground to a halt when the calendar was being generated, but i guess this explains it. I've seen copious amounts of RAM being consumed at the time of calendar gen, and the application pretty much freezes whilst it appends the calendar.
Needless to say, i'm looking forward to trying this out on a few applications tomorrow to see what the impact on the RAM usage and load script execution times are.
It's OK. As you can imagine, these formulas cannot be used for a sensible sizing of the app or measuring, since there are many other factors in memory that are happening at the same time. But it's a good starting point.
In QlikView there are no such thing as proper "key" fields, each and every field has its distinct values and their indexes in the model. If they are named alike, then QlikView knows that both tables are associated.