Symbol Tables and Bit-Stuffed Pointers - a deeper look behind the scenes


    The intention of this posting is to extend the insights from this one: Symbol Tables and Bit-Stuffed Pointers (and you should read the comments there, too). Unfortunately QlikView hides the symbol- and pointer-tables, and therefore it's difficult for most users to get a deeper understanding of the Qlik datamodel. Somewhat easier sources on this topic can be found here: QlikView Data Storage and here: The data scalability of Qlik's in-memory associative data model.

     

    The overview of the datamodel within the table-viewer is very useful, but it shows only an abstracted view of the datamodel and not how the data are really handled within QlikView (I think there is room for some improvements from the Qlik side to avoid workarounds like the following).

     

    Therefore I decided to develop an application to create this view. To illustrate it I have chosen a very simple datamodel from the inbuilt sample load-script of QlikView - you can get it within the script-editor via menu --> insert --> Test-Script - which is a generic script and returns chars and numbers from the ascii-table as dimensions plus a fact-table built with some rand-functions. This means each run of this script will return slightly different field-values, but this isn't very important for demonstrating the logic (and there is an optional branch included which can read the results from qvd's instead).

     

    In my example I have separated the normal datamodel from the symbol- and pointer-tables, which means that this new view is created as an additional datamodel within the original application. There is no connection between them; of course they could be connected, although I'm not sure if this would have much benefit.

     

    Although my intention was to explore this topic a bit deeper than the initial posting from HIC, it is, like his, a simplification of the truth. There are some details of the internal data-handling which aren't documented yet, and some of the mechanisms used couldn't be easily emulated.

     

    It should further be mentioned that this was developed with QlikView 11. The new QIX engine of QlikView 12 and Qlik Sense is a bit different (there were some optimizations), although the main logic with the symbol- and pointer-tables remained. I hope Henric Cronström can give further insights and an update for QlikView 12 and Qlik Sense.

     

    Nevertheless I think this posting will extend the understanding of how Qlik works under the hood and will make your future datamodels better.

     

    Here is the script used:

     

    /******************************** [ qlikview test-script ] ***************************************

    - can be found within the script-editor via menu --> insert --> Test-Script
    - is a simple generic script which generates 3 tables around the alphabet combined with some
      random results
    - the if-branch to choose the load-mode and the store-statements are added here for the case
      that you need exactly the same data/tables in this application or between several applications

    ***************************************************************************************************/


    let vLoadMode = 'old data'; // the alternative value would be 'new data'

    if '$(vLoadMode)' = 'new data' then

        Characters:
        Load Chr(RecNo()+Ord('A')-1) as Alpha, RecNo() as Num autogenerate 26;

        store Characters into Characters.qvd (qvd);

        ASCII:
        Load
            if(RecNo()>=65 and RecNo()<=90, RecNo()-64) as Num,
            Chr(RecNo()) as AsciiAlpha,
            RecNo() as AsciiNum
        autogenerate 255
        Where (RecNo()>=32 and RecNo()<=126) or RecNo()>=160;

        store ASCII into ASCII.qvd (qvd);

        Transactions:
        Load
            TransLineID,
            TransID,
            mod(TransID,26)+1 as Num,
            Pick(Ceil(3*Rand1),'A','B','C') as Dim1,
            Pick(Ceil(6*Rand1),'a','b','c','d','e','f') as Dim2,
            Pick(Ceil(3*Rand()),'X','Y','Z') as Dim3,
            Round(1000*Rand()*Rand()*Rand1) as Expression1,
            Round(10*Rand()*Rand()*Rand1) as Expression2,
            Round(Rand()*Rand1,0.00001) as Expression3;
        Load
            Rand() as Rand1,
            IterNo() as TransLineID,
            RecNo() as TransID
        Autogenerate 1000
        While Rand()<=0.5 or IterNo()=1;

        store Transactions into Transactions.qvd (qvd);

    else

        Load * From Characters.qvd (qvd);
        Load * From ASCII.qvd (qvd);
        Load * From Transactions.qvd (qvd);

    end if

    Comment Field Dim1 With "This is a field comment";

    /******************************** [ creating of the symbol-tables ] ******************************

    - the general logic is to use the qlikview table- and field-functions to loop through all tables
      and there through all fields and there through all field-values
    - all distinct field-values and their pointers will be loaded into symbol-tables, each with a
      unique fieldname to avoid synthetic keys between them and to keep the symbol- and data-tables
      separated from the original datamodel
    - in addition to this approach two further fields are generated here - FieldValueSize +
      FieldValueFormat - with the purpose of calculating from them the bytes per row and per field
    - the way of implementing those is quite simplified - the checking for autonumber-fields
      respectively similarly stored integer-values (which means fieldvalue = pointer) is done directly
      within the load and would without further checks lead to errors (quite probably this would need
      a separate load-logic with some loops and joins to these tables)
    - also dual-fields, especially the various date-fields, aren't included in this logic, the same
      with NULL, and very probably there are further details respectively exceptions which aren't
      covered here
    - unfortunately the results are a bit different from the results of the mem-statistics (meaning
      further adjustments to this logic are needed) but within our very simple example-case we are
      quite near to the truth - some more information about the field-formats is covered here:

      ByteSizes from Values in QlikView

    ***************************************************************************************************/


    for i = 0 to nooftables() - 1 // looping through all tables (0-based index)
        let vTableName = tablename($(i)); // assigning the current table to a variable
        /* an optional filter for reading only certain tables could be applied, maybe:
           if match('$(vTableName)', 'YourTable1', 'YourTable2') then ... */

        for ii = 1 to nooffields('$(vTableName)') // loop through all fields of the current table
            // assigning the current fieldname and the number of field-values to variables
            let vFieldName = fieldname($(ii), '$(vTableName)');
            let vFieldValueCount = fieldvaluecount('$(vFieldName)');

            // concat all fieldnames of the current table to a string
            let vFieldList = '$(vFieldList)' & if(len('$(vFieldList)') = 0, '', ',') & '$(vFieldName)';

            // check if the current field is already included within the vFieldList
            let vFieldCount = substringcount('$(vFieldList)', '$(vFieldName)');
            let vSpecialField = wildmatch('$(vFieldName)',
                'FieldValue@*', 'Pointer@*', 'FieldValueSize@*', 'FieldValueFormat@*');

            /* checking that the current field isn't an already loaded key-field and isn't a
               FieldValue/Pointer/Size/Format field */
            if $(vFieldCount) = 1 and $(vSpecialField) = 0 then

                [Symbol Table $(vFieldName)]: // creating the tablename
                /* a load with autogenerate $(vFieldValueCount) means a loop through all
                   (distinct) field-values of the current field */
                noconcatenate load
                    // loading the field-values as dual-values
                    dual(fieldvalue('$(vFieldName)', recno()),
                         fieldvalue('$(vFieldName)', recno())) as [FieldValue@$(vFieldName)],
                    // loading the pointer-value to the current fieldvalue = key to the data-tables
                    num(recno(), '(bin)') as [Pointer@$(vFieldName)],
                    // optionally loading the size and format of the current fieldvalue
                    if(fieldvalue('$(vFieldName)', recno()) = recno(), 0, // detect autonumber-fields
                       if(isnum(fieldvalue('$(vFieldName)', recno())), 8,
                          7 + len(fieldvalue('$(vFieldName)', recno())) - 1))
                        as [FieldValueSize@$(vFieldName)],
                    if(isnum(fieldvalue('$(vFieldName)', recno())), dual('numeric', 1),
                       if(isnum(num#(fieldvalue('$(vFieldName)', recno()), '#')),
                          dual('number as text', 2),
                          dual('string', 3))) as [FieldValueFormat@$(vFieldName)]
                autogenerate $(vFieldValueCount);

                store [Symbol Table $(vFieldName)] into [QVD Output\Symbol Table $(vFieldName)].qvd (qvd);

            else
                // removing the last fieldname from the fieldlist again
                let vFieldList = mid('$(vFieldList)', 1, index('$(vFieldList)', ',', -1) - 1);
            end if
        next
    next

    let vFieldList = null(); let vFieldCount = null();
    let vFieldValueCount = null(); let vSpecialField = null(); // removing the variables


    /******************************** [ creating of the data-tables ] ********************************

    - the general logic is to use the qlikview table- and field-functions to loop through all tables
      and there through all fields to generate a list of fields
    - this list of fields will be used to create a generic load-statement which loads the original
      data-tables within a resident-load
    - whereby the field-values will be used to read their position within the fieldvalue-list
      (respectively the newly created symbol-tables) to get their pointer-value

    ***************************************************************************************************/


    for i = 0 to nooftables() - 1 // looping through all tables (0-based index)
        let vTableName = tablename($(i)); // assigning the current table to a variable
        // checking that the current table isn't an artificial symbol- or data-table
        if wildmatch('$(vTableName)', 'Symbol Table*', 'Data Table*') = 0 then

            for ii = 1 to nooffields('$(vTableName)') // loop through all fields of the current table
                // assigning the current fieldname to a variable
                let vFieldName = fieldname($(ii), '$(vTableName)');

                /* first step of the generic load-statement - get the pointer-value through the
                   position of the fieldvalue within the distinct fieldvalue-list */
                set vFieldExpression = "num(fieldvalue('Pointer@$(vFieldName)',
                    fieldindex('$(vFieldName)', [$(vFieldName)]))) as [Pointer@$(vFieldName)]";

                /* second step of the generic load-statement - concat all the pointer-expressions
                   from above into a single statement */
                let vLoadStatement = '$(vLoadStatement)' & '$(vFieldExpression)' &
                    if($(ii) = nooffields('$(vTableName)'), '', ', ');
            next

            [Data Table $(vTableName)]: // loading the pointer-values within the data-tables
            Load $(vLoadStatement) Resident [$(vTableName)];

            store [Data Table $(vTableName)] into [QVD Output\Data Table $(vTableName)].qvd (qvd);

            let vLoadStatement = null(); let vFieldExpression = null(); // removing the variables
        end if
    next

    let i = null(); let ii = null();
    let vTableName = null(); let vFieldName = null(); // removing the variables

    /*************************************************************************************************/

     

    As you can see, the script is independent of the application used, and you only need some adjustments if you have used some of my KEY words (FieldValue@, Pointer@, FieldValueFormat@ and FieldValueSize@) within the table- and fieldnames of your normal datamodel. I assume this case will be rather seldom. But applying some filters to restrict the logic to only a certain part of your datamodel could be handy more often. In this case you could add some if-conditions or pick(match()) constructs to pick only the wanted tables and fields - within the script for the symbol-tables an example of them is included.
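
    For example, restricted to a hard-coded white-list of tables, such a filter could look like the following (a minimal sketch - the table-names are of course placeholders for your own ones):

    for i = 0 to nooftables() - 1
        let vTableName = tablename($(i));
        // process only the listed tables - all others are simply skipped
        if match('$(vTableName)', 'Transactions', 'Characters') then
            // ... the symbol-table logic from above ...
        end if
    next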

     

    What does the script do / How does the script work?

     

    The main logic is quite simple. With the inbuilt table- and field-functions the script loops through all tables and fields. The number of fields and field-values is counted, a list of fieldnames is created and all of it is assigned to variables.

     

    To emulate the symbol-tables a loop (autogenerate in combination with recno) runs through the unique values of each field, and for the data-tables a lookup of the current fieldvalue within the newly created symbol-tables returns its position respectively pointer-value.
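
    Reduced to a single hard-coded field, the core pattern looks like this (a minimal sketch for the field Dim1 of the test-script - the generic script above does the same for all fields):

    // symbol-table emulation: one row per distinct value of Dim1
    let vCount = fieldvaluecount('Dim1');
    [Symbol Table Dim1]:
    load
        fieldvalue('Dim1', recno()) as [FieldValue@Dim1], // the distinct value itself
        recno() as [Pointer@Dim1] // its position within the value-list = the emulated pointer
    autogenerate $(vCount);

    // data-table emulation: each row keeps only the position of its value
    [Data Table Transactions]:
    load fieldindex('Dim1', Dim1) as [Pointer@Dim1] resident Transactions;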

     

    A more detailed description of what is happening can be found directly within the script.

     

    And this is how the datamodel looks after the load:

     

     

    But this is only the beginning. Now you can use these tables to create your own objects and insights. Here are some examples of my gui-objects, which I have grouped within containers into data-tables, symbol-tables, normal tables and some charts:

     

    This means that beside the normal table-view you can look at how the data are stored within the symbol-tables and how the data-tables look with all the pointer-fields.

     

    But this is only the first step. Now you will surely want to see which effects certain data have on your datamodel, for example how much RAM will be needed to store them. Until now the only way to get any information about the RAM consumption of the data were the mem-files:

     

    Recipe for a Memory Statistics analysis

    Obtaining Memory Statistics for a QlikView Document

     

    However, the handling of the mem-files is not very convenient, the workflow will quite often be broken to get updated mem-files again, and you only get a consolidated view on a table- and field-level and not detailed per fieldvalue. Therefore I wanted to add some extra insights by calculating the RAM consumption within this application.

     

    Calculating the sizes of the pointers is relatively easy and can be done with the following expression:

     

    floor(log10(fieldvaluecount([$Field]))/log10(2))+1

     

    which can be aggregated with the aggr-function to get the size per row, and from this of course the bytes per row and table, which are then equal to the measures within the mem-files (there is another method by evaluating the field-lengths which could be used, too - you can find it within the chart "Data Tables").
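
    A small worked example: the field Num within the table Characters has 26 distinct values, so each pointer needs floor(log10(26)/log10(2)) + 1 = floor(4.7) + 1 = 5 bits. In a data-table with, say, 1000 rows this field alone would then occupy 1000 * 5 = 5000 bits, i.e. roughly 625 bytes.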

     

    However, calculating the needed bytes for each single value of the symbol-tables is more complicated. For this you need to know for each value whether it is a numeric- or a string-value, whereby numeric values within a mixed field (numerics + strings) are handled differently, which means you need to consider all values of a field.

    Further important for string-values are their lengths and from which area of the ascii-table the chars come; for example chars with an index higher than 127 need an extra byte (related to this there are probably some more things to consider if other, more extended charsets are in use).
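
    Reduced to a single field, such a simplified estimate could look like the following sketch (the 8 bytes for pure numerics and the overhead of 6 bytes plus one byte per char for strings are taken from the linked postings; the extra byte for extended chars and the dual-handling are deliberately not covered here):

    [Symbol Size Dim1]:
    load
        [FieldValue@Dim1],
        // assumption: pure numerics take a fixed 8 bytes, strings an overhead
        // of 6 bytes plus one byte per char
        if(isnum([FieldValue@Dim1]), 8, 6 + len([FieldValue@Dim1])) as [ByteEstimate@Dim1]
    resident [Symbol Table Dim1];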

     

    Beside that you need to consider whether the values are dual-values, which means the value has a numeric- and a string-part (for example many of the classical period-values like month(DateField) are dual-values). Another point are autonumber-fields (directly created respectively automatically detected consecutive integer fields) which need no extra bytes within the symbol-tables because their value is identical with their pointer-value.
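
    Both cases can be reproduced with a few lines of script (a minimal sketch):

    Test:
    load
        // month() returns a dual-value: the number 1..12 plus the string 'Jan'..'Dec'
        month(makedate(2016, recno())) as MonthDual,
        // autonumber() creates consecutive integers whose value equals their pointer
        autonumber('Key' & recno()) as AutoKey
    autogenerate 12;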

     

    Surely some additional information could be collected from the xml meta-data of a qvw, but not as detailed as they are internally interpreted and handled, and it would make things more complicated and would, like the mem-files, break the workflow.

     

    More details about field-values, their formatting and the resulting byte-sizes can be found here: Memory sizes for data types and here: ByteSizes from Values in QlikView.

     

    Nevertheless I have implemented a very simplified version of this here and for that extended the routine which creates the symbol-tables, and in this simple case we are quite near to the measures of the mem-files. This is how the table-viewer looked afterwards:

     

    And here are the corresponding charts:

     

     

     

    For an easy comparison of these charts, the mem-file and an additional small application which evaluates the data are attached.

     

    And this leads us straight to the question of how to optimize the RAM consumption. Although there is no general approach which could be applied to every kind of datamodel, there are some methods which are usually successful when optimizing an application, whereby RAM consumption is an important part but not the only one.

     

    Quite often the gui-calculations are the biggest bottleneck of an application, and very likely you will notice that merging tables, creating some flag-fields and pre-calculating some measures speed up the gui-performance significantly while increasing the RAM consumption. Therefore focussing only on RAM isn't enough.
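
    A typical example of such a flag-field could look like this (a sketch - Orders, OrderDate and Sales are assumed names):

    // pre-calculating a flag within the script ...
    Flagged:
    noconcatenate load *,
        if(year(OrderDate) = year(today()), 1, 0) as CurrentYearFlag
    resident Orders;
    drop table Orders;

    // ... which replaces an expensive if-condition within the gui, for example:
    // sum(Sales * CurrentYearFlag) instead of sum(if(year(OrderDate) = year(today()), Sales))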

     

    Let's start with some tips to optimize an application. At first keep an eye on the data which you load. Do you really need all years/categories ... and everything on an atomic level (you should really ask these questions) and with all those fields? For the last mentioned topic there is a very fine tool available: Document Analyzer V3.0 Update.

     

    The next point is to look at the cardinality of your fields, since this has a big effect on the RAM consumption, see: The Importance Of Being Distinct and Reducing the Size of a Data Model.
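
    The classic example is a timestamp-field: split into a date- and a time-part the cardinality drops from potentially millions of distinct values to a few thousand dates plus at most 86400 second-values (a sketch - Facts and Timestamp are assumed names):

    Facts2:
    noconcatenate load
        date(floor(Timestamp)) as Date, // one distinct value per day
        time(round(frac(Timestamp), 1/86400)) as Time // at most 86400 distinct values
    resident Facts;
    drop table Facts;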

     

    A further consideration should go to the data-types respectively the applied formatting of your data: Is it usefull to format data in a Load script for a QVD? and When less data means more RAM.
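
    A small illustration of the effect (a sketch - Addresses and ZipCode are assumed names): keeping a value numeric needs a fixed 8 bytes within the symbol-tables, while forcing it into a text-representation stores a string with an overhead plus one byte per char:

    load
        ZipCode * 1 as ZipNum, // an arithmetic operation returns a pure number - fixed 8 bytes
        text(ZipCode) as ZipText // forced into a string - overhead + 1 byte per char
    resident Addresses;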

     

    Furthermore the sensible use of autonumber-fields could have unexpected effects: String Key vs Integer Key -- any difference in link performance? as could the way a fact-table is designed: Data Model Question - Fat or Thin Fact table? And here is a further collection of tips for handling larger datasets: Few tips for dealing with large QlikView applications.
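
    A sketch of such a key-transformation (OrderID and OrderLineID are assumed fieldnames):

    // replacing a long composite string-key with a consecutive integer -
    // the integer needs no space within the symbol-tables (value = pointer)
    Orders2:
    noconcatenate load
        autonumber(OrderID & '|' & OrderLineID) as %KeyOrderLine,
        *
    resident Orders;
    drop table Orders;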

     

    And there is another important part - the data itself. Are they really as they should be, respectively as you expect them? Of course you could simply assume that they are ok, but it would be better to check them. A very simple version of a data-profiling approach is included within the second sheet of the example application - which is a hidden standard-sheet in each of my applications. It doesn't cost much and really saves time, but for more comprehensive tasks in this area you need some bigger tools like the following (a small sketch of such a profiling is shown below the list):

     

    QlikView Data Profiler

    QVFieldObserver

    Advanced Data Discovery Module
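
    Such a minimal profiling could for example collect the number of distinct values per field within one overview-table (a sketch, here hard-coded for the table Transactions):

    for ii = 1 to nooffields('Transactions')
        let vFieldName = fieldname($(ii), 'Transactions');
        Profile: // the identical structure auto-concatenates one row per field
        load
            '$(vFieldName)' as Field,
            fieldvaluecount('$(vFieldName)') as DistinctValues,
            noofrows('Transactions') as TableRows
        autogenerate 1;
    next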

     

    Conclusions:

     

    Analysing a datamodel in this way isn't the bread-and-butter business of a developer - the effort of doing it would be too expensive, with larger datasets the run-times become quite significant, and with complex datamodels it will be difficult to keep an overview (whereby nobody said you must include all records and/or apply it to all tables - it could be restricted to a particular part of the datamodel).

     

    Quite often it will be enough to be more pragmatic and simply build a datamodel (assigning the right fields and avoiding elementary failures like synthetic keys and circular loops) and to validate the main KPI's. But if there are any issues with the performance or the data-quality - you know now that you can do more with your datamodels and the data within them.

     

    I hope you will give me plenty of feedback and maybe some improvements, especially for the calculation of the symbol-tables, and I hope to start a very lively discussion.

     

    Update: store-statements are now added which store the symbol- and data-tables into a directory below the application, to enable loading the qvd's into an independent qvw respectively datamodel, which is also added as attachment.

     

    Happy Qliking!

    Marcus Sommer