hic
Former Employee

 

Today I have a blog post for the Geeks¹. For the hard-core techies who love bits and bytes. The rest of you can stop reading now. For you, there are other interesting posts in the Business Discovery Blog and here in the QlikView Design Blog.

 

Now to the bit-stuffed pointers:

 

During the QlikView script run, after each load statement, the Qlik engine transforms the loaded data into two types of tables: one data table and several symbol tables. The engine creates one symbol table per field:

 

[Figure: Symbol tables]

 

The symbol tables contain one row per distinct value of the field. Each row contains a pointer and the field value, both its numeric and its textual representation. Basically, the symbol tables are look-up tables for the field values.

 

The data tables are the same tables that you can see in the QlikView internal table viewer (<CTRL>-T) when you have chosen the “Internal table view” – the same number of rows, the same number of columns. However, the tables do not contain the data itself – they contain only the pointers. But since the pointers can be used to look up the real value in the symbol tables, no information has been lost.

 

[Figure: Data table]
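
To make this concrete, here is a minimal sketch with an inline table invented for illustration (QlikView builds the symbol and data tables automatically; you never create them in script):

    Colors:
    LOAD * INLINE [
    Color
    Red
    Blue
    Red
    Red
    Blue
    ];

Loading this table yields a symbol table with two rows (Red and Blue) and a five-row data table that contains only pointers into it.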

 

These pointers are no ordinary pointers. They are bit-stuffed indices, meaning they have only as many bits as it takes to represent the distinct values of the field, never more. So if a field contains four distinct values, the index is only two bits long, because that is the number of bits it takes to represent four values. Hence, the data table becomes much smaller than it would otherwise have been.
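
As a quick check of that arithmetic: the pointer width is the number of bits needed to enumerate the distinct values, i.e. the ceiling of the base-2 logarithm. A minimal script sketch, with an assumed cardinality:

    // Sketch only: pointer width for a field with 4 distinct values.
    LET vDistinctValues = 4;                                   // assumed cardinality
    LET vPointerBits = Ceil(Log($(vDistinctValues)) / Log(2)); // = 2 bits
    TRACE Pointer width: $(vPointerBits) bits;

QlikView's Log() is the natural logarithm, so dividing by Log(2) gives the base-2 logarithm.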

 

The bit-stuffed pointers and the symbol tables are the reasons why the Qlik engine can compress data the way it can.

 

Understanding this will help you optimize your document. It’s obvious that the number of records and number of columns in a table will affect the amount of memory used, but there are also other factors:

 

  • The length of the symbols will affect the size of the symbol table.
  • The number of distinct values in a field will affect the number of rows in the symbol table as well as the length of the pointers.
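
To see how these two factors combine, here is a rough back-of-envelope sketch; all numbers are assumptions for illustration, and it ignores per-row overhead and the numeric representation of the symbols:

    // Sketch only: a field with 10,000 distinct ~20-character values in a 1M-row table.
    LET vDistinct  = 10000;                               // distinct values
    LET vSymbolLen = 20;                                  // average characters per symbol
    LET vRows      = 1000000;                             // rows in the data table
    LET vPtrBits   = Ceil(Log($(vDistinct)) / Log(2));    // 14 bits per pointer
    LET vSymbolKB  = $(vDistinct) * $(vSymbolLen) / 1024; // ~195 KB of symbols
    LET vPointerKB = $(vRows) * $(vPtrBits) / 8 / 1024;   // ~1709 KB of pointers
    TRACE Symbols ~$(vSymbolKB) KB, pointers ~$(vPointerKB) KB;

Halving the number of distinct values shaves roughly one bit off every one of the million pointers, which is why cardinality matters so much.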

 

When creating QlikView scripts, always ask yourself if there is any way to reduce these numbers, to minimize the memory usage. Here are a couple of common cases:

 

  • You have a long, concatenated, composite key that you don’t need to display. Use Autonumber() and the symbols will take no space in the symbol table. The integer values will instead be calculated implicitly.
  • You have a field with many unique timestamps. Then you are sometimes better off if you first split it into two fields – Date and Time – and round the Time downwards to the closest 15-second interval or to the nearest full minute, e.g.:
       Date(Floor(Timestamp)) as Date,
       Time(Floor(Frac(Timestamp), 1/24/60)) as Time,
     These expressions will give you at most 24*60 = 1440 distinct time values (11 bits) and typically 365 distinct dates (9 bits). In other words, as soon as a timestamp field has more than 1 million (20 bits) distinct values, its pointer takes more space than the pointers of the two individual fields combined. For the number of rows in the symbol tables, you hit break-even much sooner, so consider splitting already at around 100k distinct values. A script sketch combining both techniques follows below.
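
The table, field, and file names in the following sketch are invented for illustration:

    // Sketch only: Orders.qvd and its field names are assumed.
    Orders:
    LOAD
        Autonumber(OrderID & '|' & LineNo) as %OrderLineKey,    // composite key as implicit integer
        Date(Floor(OrderTimestamp)) as OrderDate,               // date part only
        Time(Floor(Frac(OrderTimestamp), 1/24/60)) as OrderTime // time rounded down to the minute
    FROM Orders.qvd (qvd);

The Autonumber() key occupies no space in the symbol table, and the two split fields need far smaller symbol tables than the raw timestamp would.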

 

If you found this post interesting, I bid you welcome to the QlikGeeks.

 

HIC

 

PS. All of the above is of course true for both QlikView and Qlik Sense. Both use the same engine.

 

If you want to read more about QlikView internals, see also

Logical Inference and Aggregations

Colors, States and State vectors

The Calculation Engine

 

¹ Geeks, see picture:

[Image: Geeks]

71 Comments
anderseriksson
Partner - Specialist

Steven, in your second part you are making a fundamental error in thinking that the bit-stuffed pointer needs to be able to hold all possible values!

A date (without year) could at most have 366 distinct values (February 29th included), but the symbol table only contains those values that actually exist in your data model, and the pointer field only needs to accommodate those values.
Say the data model only contains the 1st of each month; then you only need twelve pointer values, not 366.

Same thing with the time-of-day field: there are many possible values, but the symbol table only contains the used values, and the pointer only needs to accommodate those.

Splitting fields can save space in the symbol tables but does not reduce the space needed for pointers. Timestamps are a good example; names, split into first name and last name, are another.
Example: the following timestamps are nine distinct values, needing a 4-bit pointer and a 9-row symbol table:

1/1/2016 06:00
1/1/2016 09:00
1/1/2016 12:00
1/2/2016 06:00
1/2/2016 09:00
1/2/2016 12:00
1/3/2016 06:00
1/3/2016 09:00
1/3/2016 12:00

Splitting into date and time yields two fields, each needing a 2-bit pointer and a 3-row symbol table:

1/1/2016
1/2/2016
1/3/2016

and

06:00
09:00
12:00

Together they need 4 bits of pointer value, thus not saving any pointer space.

But the symbol tables have gone from one symbol table of 9 rows to two symbol tables of 3 rows each.

A saving of 3 rows, and each of the remaining rows also consumes less space for its value than the rows in the original symbol table.

I'd say the space needed for the values is about a third of that of the original symbol table.

Scale this up to millions of rows and the savings can become substantial.
But the savings depend on the frequency of the split parts.
You gain the most if every value in the first part is combined with every value in the second part.
If there is a one-to-one relationship between the two parts you will probably gain nothing.
More likely you will then lose space and get a more complicated data model, but it also depends on how you are going to use those values.
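
To make the arithmetic in this example concrete, here is a small sketch with the values hard-coded from above:

    // Sketch only: pointer widths and symbol rows before and after the split.
    LET vBitsBefore = Ceil(Log(9) / Log(2));                          // 9 timestamps -> 4 bits
    LET vBitsAfter  = Ceil(Log(3) / Log(2)) + Ceil(Log(3) / Log(2)); // 2 + 2 = 4 bits
    LET vRowsBefore = 9;                                              // one 9-row symbol table
    LET vRowsAfter  = 3 + 3;                                          // two 3-row symbol tables
    TRACE Pointers $(vBitsBefore) vs $(vBitsAfter) bits, symbol rows $(vRowsBefore) vs $(vRowsAfter);

The pointer width is unchanged while the symbol tables shrink, exactly as argued above.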

swuehl
MVP

Steven,

without going into details, thank you for your comments; it's great to question what the best data model would be and to bring in new thoughts.

I don't think that Henric will disagree with most of your points; note that he has written "...then you are sometimes better off..." and was talking about a break-even, which implies that one solution is not always superior to the other.

I guess you disregarded these statements for the benefit of a cleaner opposition.

I think his two main statements are written in bold:

  • The length of the symbols will affect the size of the symbol table.
  • The number of distinct values in a field will affect the number of rows in the symbol table as well as the length of the pointers.

Regards,

Stefan

JonasValleskog
Partner - Creator

Tangential note on optimization:

When speaking of optimization, it is worthwhile clarifying what we are actually trying to optimize. From what I gather, you are concerned exclusively with the memory consumption side of how QlikView stores data, and by optimization you mean techniques to reduce the memory footprint. What is more relevant for most users is the speed (CPU cost) of query execution. Note that the cost of processing data is not linearly proportional to the memory footprint of the data. I would argue that the length of the symbol tables, the volume of correlated records in the data tables, and the cardinality of the fields participating in computations are more closely correlated with query performance.

As anecdotal evidence of the performance benefit of shorter symbol tables: I recently split a timestamp into a date and a time field on a fact table of about 150M records. The date and time were reference data required for presentation in a transaction-level tabular output (a straight table). By splitting the timestamp and displaying the date and time separately, the table took approximately 2 seconds to recalculate, versus more than 8 seconds previously.

hic
Former Employee

It is reassuring to see so many intelligent questions and so many good answers...

My additional comments:

"splitting apart a single date/time field [...] into its constituent date and time parts [...] does not save memory."

It does save memory: not in the bit-stuffed index, but in the length of the symbol tables.

  • One symbol table: 366 days * 86400 seconds = 32 M records.
  • Two symbol tables: 366 days in one table and 86400 seconds in the other, 366 + 86400 = 86766 records in total

(In real life the effect may be marginal, though.) But the main point is the one that Jonas points out: it is not just the memory footprint. It is also the CPU time. An example: if a field is shown sorted in a list box, an index is created to speed up sorting. This indexing will take a lot longer for a full timestamp than if you have split it in two.

"What about a field that will be split but that field is shared across tables?"

No, you shouldn't split keys. And in most cases you should not use timestamps as keys.

HIC

Anonymous

Hi Anders,

"symbol table only contains those values that actually exist in your data model" that is correct and my example shows that. my example data model has the example fields existing only on one table. and those fields contain all days for a 365 year and all 1440 minutes in  day. therefore the symbol tables will only be storing those values.

You and Henric make the good point that memory is saved in the symbol tables, and I should have been clearer in my questioning that I was talking purely about the length of a row in a table made up of those pointers (i.e. Memory Statistics\DATABASE\TABLES\RECORDS).

I appreciate you (and everyone else) taking the time to respond

Steven

Anonymous

Johan, I have been using QVW Desktop's Memory Statistics feature (available via DESKTOP\SETTINGS\DOCUMENT\GENERAL\MEMORY STATISTICS), either directly or via Rob Wunderlich's DocumentAnalyzer tool, to measure the RAM used by the data model.

The problem with using QVW file size as a means of measurement is:

A) the compression setting can be altered (off/medium/high), and

B) you don't know whether the compression algorithm it uses works identically to the way compression works in memory.

Have a look at that feature; it is very interesting.

Anonymous

Yes, Jonas, I should have clarified which optimization I was aiming for 🙂

I had never even considered the impact on the UI, as I was so focused on the data model's RAM usage at a single moment in time, and also on how RAM moves around as tables are processed in LOAD scripts.

Based on your and Henric's feedback I will edit my original reply so that no one makes assumptions about my area of focus and optimization.

Anonymous

Thanks, swuehl. I was focused on the memory usage of the tables we see that are made up of pointers back to those symbol tables (measurable via DESKTOP\SETTINGS\DOCUMENT\GENERAL\MEMORY STATISTICS, and then the following rows in the output file: Database\Tables\Records).

I should have made that clearer

Anonymous

Thanks for the reply, Henric. Nothing like a posting with the word "binary" in it to get us geeks excited.

I will now add the qualifier, in my brain and in future presentations, that the RAM used by symbol tables may be reduced (just not the DATABASE\TABLES\RECORDS usage in the Memory Statistics feature).

Thanks for the interesting posts. Keep them coming

Steven

marcus_sommer

Here: Symbol Tables and Bit-Stuffed Pointers - a deeper look behind the scenes, you can find some extensions to the original posting.

- Marcus
