Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Symbol Tables and Bit-Stuffed Pointers

 

Today I have a blog post for the Geeks¹. For the hard-core techies who love bits and bytes. The rest of you can stop reading now. For you, there are other interesting posts in the Business Discovery Blog and in this blog, the QlikView Design blog.

 

Now to the bit-stuffed pointers:

 

During the QlikView script run, after each load statement, the Qlik engine transforms the data loaded into two table types: one data table and several symbol tables. The engine creates one symbol table per field:

 

Symbol tables.png

 

The symbol tables contain one row per distinct value of the field. Each row contains a pointer and the value of the field, both the numeric value and the textual component. Basically, the symbol tables are look-up tables for the field values.

 

The data tables are the same tables as you can see in the QlikView internal table viewer (<CTRL>-T) when you have chosen the “Internal table view”  – the same number of rows, the same number of columns. However, the tables do not contain the data itself – they contain the pointers only. But since the pointers can be used to look up the real value in the symbol tables, no information has been lost.

 

Data table.png

 

These pointers are no ordinary pointers. They are bit-stuffed indices, meaning – they only have as many bits that it takes to represent the field, never more. So if a field contains four distinct values, the index is only two bits long, because that is the number of bits it takes to represent four values. Hence, the data table becomes much smaller than it would have been otherwise.

 

The bit-stuffed pointers and the symbol tables are the reasons why the Qlik engine can compress data the way it can.

 

Understanding this will help you optimize your document. It’s obvious that the number of records and number of columns in a table will affect the amount of memory used, but there are also other factors:

 

  • The length of the symbols will affect the size of the symbol table.
  • The number of distinct values in a field will affect the number of rows in the symbol table as well as the length of the pointers.

 

When creating QlikView scripts, always ask yourself if there is any way to reduce these numbers, to minimize the memory usage. Here are a couple of common cases:

 

  • You have a long, concatenated, composite key that you don’t need to display. Use Autonumber() and the symbols will take no space in the symbol table. The integer values will instead be calculated implicitly.
  • You have a field with many unique timestamps. Then you are sometimes better off if you first split it into two fields – Date and Time – and round the Time downwards to closest 15-seconds interval or to nearest full minute, e.g.:
       Date(Floor(Timestamp)) as Date,
       Time(Floor(Frac(Timestamp),1/24/60)) as Time,
    These expressions will give you at most 24*60=1440 distinct time values (11 bits) and typically 365 distinct dates (9 bits). In other words, as soon as you have a timestamp field with more than 1 million (20 bits) distinct values, the pointer for the timestamp field takes more space than the pointers for the two individual fields. And for the number of rows in the symbol table(s) you hit the break-even much sooner. So you should consider splitting it into two fields sooner, maybe when you have around 100k distinct values.

 

If you found this post interesting, I greet you welcome to the QlikGeeks.

 

HIC

 

PS. All of the above is of course true for both QlikView and Qlik Sense. Both use the same engine.

 

If you want to read more about QlikView internals, see also

Logical Inference and Aggregations

Colors, States and State vectors

The Calculation Engine

 

¹ Geeks, see picture:

Geeks.png

71 Comments
Or
Valued Contributor II

Interesting.

I would also be interested in a follow-up column on how QlikView sets up the Possible / Excluded lists, and whether we can impact that in any way with our data model. I'm somewhat aware of the process (I read through the patent records when I wrote a paper on QlikView for a course a few years ago) but would be nice to see it explained by someone who really understands it.

7,132 Views

I'll take a look at that to see if I can produce a blog post on that too. Perhaps the title willl be "Selection States and State Spaces"...

HIC

7,132 Views
Partner
Partner

It is very interesting insight… and the state space would be good to understand as we use the MEMory Stats when analysing big QVW documents.

Hoping the competition products don’t learn too much from this detail!

Adam Barrie-Smith

Expert Services Manager

Phone: +27 (0)87 8205521

Mobile: +27 (0)82 8511026

Support: +27 (0)861 754589

QlikView South Africa

7,132 Views
MVP & Luminary
MVP & Luminary

Thanks for an interesting post Henric. If you get to blogging about State Spaces, I would like to hear about fields of sequential integers and the impact on symbol tables.

-Rob

7,132 Views
Not applicable

Oh no. I think I am a dork.

7,132 Views

No, Erica. You're much too social for that... http://bit.ly/9gNDXV

HIC

7,132 Views

Rob

I just sat down with Håkan, and learned that the internal optimization was more extensive than I knew: If you use Autonumber() which produces a sequential integer, its symbols in the symbol table will disappear altogether since the values can be implied directly from the bit-stuffed pointer.

I have changed the blog post accordingly.

HIC

7,132 Views
Not applicable

Ha!

Sent from my iPhone

0 Likes
7,132 Views
hariharasudan_p
Contributor

Henric.. i was searching for right person to ask this question.. you are the one..

My question is why QLIKIVIEW cannot handle more than 2 billion unique values in a single field.. even though direct data discovery came.. qliktech havent solved the issue.. tat means from your blog.. it is 2^32.. above this it fails... why is it so ??

0 Likes
7,132 Views

I don't think that QlikView has such a limit any longer. The 32-bit version used to have it, a couple of versions back, but I think it has been removed.

Let me investigate and come back to you.

HIC

0 Likes
7,132 Views
hariharasudan_p
Contributor

Yup. Waiting for your confirmation. Thanks Henric. Your blogs are great. Learning from you a lot ☺

0 Likes
7,132 Views
danielrozental
Honored Contributor II

The limit was removed in Qlikview 9, it's in the faq document for that version.

7,132 Views

The people I've spoken to here in Lund all say the same: This limit has been removed.

And thank you Daniel for filling in about the FAQ for QV9.

HIC

0 Likes
7,132 Views
MVP
MVP

Hi Henric, thank you for a lot of well written (informative and entertainig) blog posts.

I really enjoy your series.

Two requests from my side:

a) In the current 11.2 release notes, chapter 2.2,  I still read:

- If data with over 2,147,483,647 unique values is loaded, QV does not load correctly. Also, after 2,147,483,647, the value turns negative as the data is displayed in record number in a statistic box.

Is this related to the 32bit limit or do I misunderstand this point? (I must admit that I haven't tried to load this amount of unique values myself and I won't be able to do so for  the next days).

b) If you are planning to elaborate a more on this bit-stuffed-pointers stuff, I would like to read a bit more on

- if and how QV is compressing the data table (e.g. by RLE). If so, is the compression depending on sorting the fields by the user?

- how QV is handling several selection states (like set analysis expressions, alternate states)

Thanks again,

Stefan

0 Likes
7,132 Views
hariharasudan_p
Contributor

Yup. Topic is open again. Stefan came up with proof now.

Henric its your turn.. pls share your comments

0 Likes
7,132 Views

I have now checked this again and found that my previous answer was not entirely correct. Here is a summary of the changes we made to QlikVIew 9:

  • We used to have different limits for the 32 bit and the 64 bit versions. They now have the same limits. (Which for the 32 bit version is mostly of academic interest. You will hit the 2/3 GB ceiling before you hit any other limit.)
  • We used to have a limit on number of rows in the data tables (number of records in a loaded table). This limit was removed.
  • We used to have a limit on number of rows in the symbol tables (number of distinct values in a loaded field). This limit was raised to 2 Billion.

This means that QlikView cannot load more than 2 billion different values of a field "Amount". But with Direct Discovery QlikView can sum more than 2 billion values of the same field, since the summation takes place on the source DB.

In theory, we could change the code so that we remove this limit. But currently we do not think that this would be practical. If you have 2 billion unique values, the symbol table itself (for one single field) would most likely use more than 24 GB. (If each record has a four byte pointer and an eight byte data part.) And the data table and additional symbol tables would make the app use a lot more...

So, we do not think that the limit has any practical consequences.

hariharasudan_31 was right all along.

HIC

7,132 Views
hariharasudan_p
Contributor

Thanks a lot.. I am just a beginner to this Qlikview. Just 3.5 yrs exp..

You are god father of qlikview.. I am happy to hear that you are saying i was right all along.

Only my passion towards learning qlikview has brought to this extent.. anyways thanks HIC...

0 Likes
7,132 Views
Not applicable

Good Post Henric....

I wondering, Can I see the Symbol Table structure and it data fro Qlikview Desktop version?

Thanks,

Siri

0 Likes
7,132 Views

There really isn't very much to see...

A simple list box will display the values of a field, which is equivalent of the data part of a symbol table. The binary pointers cannot be displayed.

The table viewer (using Internal Table View) will display the structure of the data tables, and that is the structure viewer that is useful.

HIC

0 Likes
7,132 Views
phc
New Contributor III

Hi Henric,

Thank you so much for your thoughtful post.  I am not smart enough to be a geek so I guess that makes me a nerd with a collection of knowledge.  My current collection of knowledge is around the topic of how QlikView supports "big data". 

MY QUESTION is about the 2.1B limit and more specifically about the 24G symbol table mentioned in your comment above (the size of even the smallest symbol table given 2.1B unique values).  Is the 24G a limit of some kind?  Or is it simply a size which is considered "too big" for practical use. 

As we all know, history is replete with examples of today's "big" becoming tomorrow's small and uninteresting.  If I am following along properly, using the full 64 bits would give us room for 9.2 Quintillion unique values.

0 Likes
7,132 Views

The 24GB symbol table is just an estimate and an example that already 2 billion unique values would use a lot of memory. Hence, "too big" for practical use.

But more importantly - if you have such data amounts, you will most likely also have charts that need to aggregate over fact tables that are much larger than 2 billion records - and this is where you will have your real bottleneck: The CPU-power will be insufficient; you will get too long response times.

But you are still right: today's big data will not be so big tomorrow... Which means that we some day will need to address the symbol table with a 64-bit pointer instead of a 32-bit.

HIC

0 Likes
7,132 Views
phc
New Contributor III

Thanks so much!  It soulds like we are just waiting for the technology to catch up with us

0 Likes
7,132 Views
Employee
Employee

You heard it here first!   with a 64-bit pointer we will be able to store 18+ septillion pointer values!   Woohoo!!!   I like saying "Septillion".  

2^64 = 18,446,744,073,709,550,000

So that's a very large number and well beyond anything reasonable for human analysis.   Like Hic said there are *effective* limits that are reached today in the 1-2 BN row range that will likely stop extremely large in-memory apps in their tracks before the 32-bit pointer stops them.  

However, with Windows 2012 memory capacity reaching 4TB now the theories about extremely large apps in memory are still interesting. 

Here's the reality....with best practice application and data architecture you can effectively USE several billion rows in QlikView in a seemless manner that facilitates analysis by humans.   Using a combination of pre-aggregation, doc chaining, drill-to-detail, loop & reduce and some navigation actions you can not only address several billion rows of data, but you can do it quicker than you could if you jammed it all into one QVW.   Now your manager is happy AND your users are happy!   Win-win.   Bring it on!!!

Hic, thanks for the great post.

-brad

7,132 Views
Employee
Employee

Before I get flamed here....it's 18+Quintillion, not Septillion.   Still a very large number and still fun to say, but more acurate this way.   thanks PeggySue! 

0 Likes
7,132 Views
MVP & Luminary
MVP & Luminary

Henric, could you post an example how to create a sequential integer? I don't think that it symbols gets optimized in a QVD..

0 Likes
7,132 Views
Not applicable

Hi Henric, if the same field value exist in two fields, then this value have only one pointer or each field having one symbol table so 2 pointers for same value.

Ex: CutomerID : 100 , 200, 300

      OrderID : 100, 101, 105, 200

0 Likes
7,132 Views

@ Ralf Becher A sequential integer is created when you use Autonumber(). And I think you are right that these are not optimized in a QVD.

@ dathu.qv If a value exists in two fields, you will get different representations (the bit-stuffed pointer) of this value in the two fields. Remember that the representation is bit-stuffed, i.e. you will most likely get representations of different lengths in the two different fields.

HIC

0 Likes
7,132 Views
Not applicable

Henric Cronström - Thank you for this great post. I am little confused the way qv assigns pointer values.

1. if i have two fileds (custID, CustomerID) with same values, will the values of the pointer same for each field value (custID = 2405, CustomerID = 2405) ?

2. if i store a table into qvd, will my qvd maintain data table and symbol tables?

3. if No.2 is right, fact table key field (ex: CustomerID) and dimensiontable key field (CustomerID) will hold the similar pointers for each distinct value ?

Thank you in advance.

0 Likes
7,132 Views

1. Probably not. It depends on the Load order. But, since it is two different fields, it really doesn't matter.

2. Both the data table and the symbol table will be stored into the qvd.

3. If you load two different qvd:s with a common key, the bit-stuffed pointers of the key in the second file will be re-mapped during the load - so that the files match.

HIC

7,132 Views
anantmaxx
Valued Contributor

Today I have a blog post for the Geeks¹. For the hard-core techies that love bits and bytes. The rest of you can stop reading now. For you, there are other interesting posts in the Business Discovery Blog and in this blog, the QlikView Design blog.

when you wrote aboove lines I thought of Leaving at once BCZ I know I am no Geek !! but once I realised I am following The HIC In Qlikview Community who has the power to make Rocket Science a Kid Play

I copy your Signature that is why....

Anant

0 Likes
7,132 Views