ByteSizes from Values in QlikView

    ByteSizes from Values in QlikView

     

    I think you know that QlikView had a very effective way to handle and store data. The general approach is described here: Symbol Tables and Bit-Stuffed Pointers and here: Symbol Tables and Bit-Stuffed Pointers - a deeper look behind the scenes which is beside the special logic how data are associated to each other is the main-reason why QlikView is so fast calculating even huge amounts of data.

     

    But even with this knowledge it's not easy to comprehend how much RAM an application consumed and which parts are consuming a reasonable amount of RAM and which parts might need an optimization. Even more complicated is the attempt to predict which changes within the datamodel would have which effect on the RAM consumption.

     

    The calculating of the RAM consumption from the pointers is relative easy and correlated directly with the number of distinct values within one field: The Importance Of Being Distinct.

     

    The calculating from the values of the symbol-tables is much more difficult then it's depending how the data type from a value is interpreted. Compared with the data-handling from many other programs and databases there is no data type within QlikView: Data Types in QlikView but from a RAM point of view this is too much simplified.

     

    For example, a date, created with date(DateNumField), will need 8 Bytes but only if there are any gaps of dates within the field - are the fieldvalues continuously without a gap the RAM consumption is 0 Bytes for this field. This meant the fieldvalue won't be stored directly else it will be stored with further meta-data to the symbol-tables (I could imagine that the $$SysTables contain these information).

     

    I assume that for this a start- and an end-value or an appropriate offset to the pointer (which then worked like an autonumber-field) will be stored together with the format-information and will be later calculated on the fly.

     

    But there are further complications. If the date-field is automatically detected from the script through the default-variables on the script-start the date-field will be handled like a dual-field and will by dual(date(Date, 'DD.MM.YYYY'), DateNumField) consume 20 Bytes per value. It will be even more complicated if the field contained several different formatting's and/or is it mixed with some string-values (one wrong value because of a poor data-quality might be enough to change the interpretation).

     

    I think this makes it clear how complicated it is to detect how the data are interpreted and stored and even more how it could be controlled and where is the best place to transform the data. By small datasets you don't need to worry much about but already by mid-sizes datasets it might be noticeable - depending then from the kind of datamodel.

    By a (ideally) star-scheme with relative less dimension-values compared to the fact-table most of the formatted values will be within the dimension-tables and it might not make much difference to the whole datamodel. Here you will find a similar example: Is it usefull to format data in a Load script for a QVD?

     

    But are there any issues with the performance of an application (and you could exclude most of the common one like unneeded fields, large link-tables …) it might be worth to take the efforts to optimize the application by the needed byte size of their values.

     

    Looking on this (old) posting: Memory sizes for data types I decided to develop a small tool to make it handier to detect how much RAM a field needs - in total and as average per value.

     

    The application is quite simple and contained a small generic load with some rand- and mod-functions to get a few different results within the fieldvalues and also a choosable number of records. The fields itself are a variety of different period-, number- and string-values to demonstrate that the same respectively similar values could be handled quite differently.

     

    You could adjust these fields to be equally or near your real-data or replace this table with one of yours. To make it more practically there is a small macro-routine included which reload these application without mem-file, stored the mem-file and reloaded then again with the mem-file (you might want to adjust these logic or do it manually if you are dealing with larger tables). To see what is meant with mem-file take a look here: Recipe for a Memory Statistics analysis and Obtaining Memory Statistics for a QlikView Document.

     

    Here the used script:

     

    /******************************** [ creating various data types ] *********************************

    - creating a variety of random period-, numeric- and string-values per autogenerate-loop
    - different fieldvalues will be created through the use of rand(), mod() and recno()

    **************************************************************************************************/


    TableForFieldValueSizes:
    Load
       
    recno() as RecNo,
       
    autonumber(date(today() + recno())) as Autonumber,
       
    date(today() + recno(), 'DD.MM.YYYY') as DateContinuous,
       
    date(today() + mod(recno(), 2), 'DD.MM.YYYY') as DateWithGaps,
        '01.01.2016'
    as DateAutomaticInterpreted,
       
    month(today() + recno()) as Month,
       
    monthname(today() + recno()) as MonthName,
       
    year(today() + recno()) as YearSingleValue,
       
    year(today()) + mod(recno(),2) as YearMultipleValue,
       
    time(frac(now()) + recno() / rand() * 86400) as Time,
       
    timestamp(now() + recno()) as TimestampSingle,
       
    dual(timestamp(now() + recno()), now()  + recno()) as TimestampDual,
       
    timestamp(now() + mod(recno(),2) + rand(),
        'WWWW DD. MMMM YYYY hh:mm:ss')
    as TimestampLongSingle,
       
    dual(timestamp(now() + mod(recno(),2) + rand(), 'WWWW DD. MMMM YYYY hh:mm:ss'),
       
    now() + mod(recno(),2) + rand()) as TimestampLongDual,
       
    chr(recno() + 65) as LetterAscii,
       
    chr(recno() + 130) as LetterAsciiExtended,
       
    floor(recno()) as IntegerFloor,
       
    ceil(rand() * 100) as Integer,
       
    text(chr(recno() + 48)) as NumberAsString,
       
    if(mod(recno(), 2) = 0, text(chr(recno() + 48)), chr(recno() + 48)) as NumberAndString,
       
    dual(ceil(rand() * 100), ceil(rand() * 100)) as DualInteger,
       
    ceil(rand() * 1000000000000) as BigInteger,
       
    rand() * 100000 as SmallFloat,
       
    ceil(rand() * 1000000000000) + rand() * 100000 as BigFloat
    autogenerate $(i); // variable i could be controlled within the gui per inputbox
         

    /******************************** [ loading the mem-statistics ] **********************************

    - creating a variety of random period-, numeric- and string-values per autogenerate-loop
    - different fieldvalues will be created through the use of rand(), mod() and recno()

    **************************************************************************************************/


    if '$(vLoadMode)' = 'including Mem-File' then
        MemStatistics:
       
    LOAD

           Class,
          
    Type,
          
    SubType,
          
    Id,
          
    Bytes,
          
    Count,
          
    num(num#(Size, '#.##0', '.', ','), '#.##0,00', ',', '.') as Size,
          
    CalcTime,
         
    AvgCalcTime
       FROM [ByteSizesFromValues.mem]
       (
    txt, codepage is 1252, embedded labels, delimiter is '\t');
    end if

    /*************************************************************************************************/

     

    Here the used macro-routine:

     

    option explicit

    sub StoreMemFile

    dim doc, path
    set doc = ActiveDocument

    path = doc.evaluate("=subfield(DocumentPath(), '.', 1)") & ".mem"

    doc.Variables("vLoadMode").SetContent "excluding Mem-File", true 'set the load-variable to exclude
    doc.Reload 'reload the application without the mem-file
    doc.SaveMemoryStatistics path 'store the mem-file
    doc.Variables("vLoadMode").SetContent "including Mem-File", true ' set the load-variable to include
    doc.Reload 'reload again and this time with mem-file

    end sub

     

    And this is the result of the application:

     

     

    with a short overview about the commonly data-types and their formattings:

     

    Type

    Size

    Example

    Remarks

    Number as String

    11 Bytes + 1 byte for each char

    '7' = 11 Byte and '2222' = 14 Byte

     

    AlphaNumeric

    7 Bytes + 1 byte for each char

    'a' = 7 Byte and 'abc' = 9 Byte

     

    AlphaNumeric Extended

    8 Bytes + 1 byte for each char

    'ò' = 8 Byte and 'òõ' = 9 Byte

    1 Byte more for chars with a larger index as 127, for example: chr(130)

    Number

    8 Bytes

     

    integer/float, positive/negative or size are not important but only if the whole field is numeric

    Mixed Number + String

    combination of the used values

    'a' = 7 Byte  + 1 = 8 Byte = Ø 7,5 Byte

     

    Autonumber

    0 Bytes

     

    fieldvalue = pointer

    consecutive numbers

    0 Bytes

    recno() will return a consecutive number

    will be handled like an autonumber - probably with an offset to the field-value which then could be used as pointer

    Date (consecutive + without dual)

    0 Bytes

     

     

    Date (non consecutive + without dual)

    8 Bytes

     

    will be handled numeric with a separate storing of the format-information

    Date (automatically detected)

    20 Bytes

    'DD.MM.YYYY'

     

    Date (dual + formatted)

    20 Bytes

    'DD.MM.YYYY'

     

    Month (dual)

    13 Bytes

    =month(today())

     

    MonthName (dual)

    18 Bytes

    =monthname(today())

     

    Year

    8 Bytes

    =year(today())

     

    Time (not dual)

    8 Bytes

     

     

    Time (dual)

    22 Bytes

    hh:mm:ss'

     

    Timestamp (dual)

    33 Bytes

    'DD.MM.YYYY hh:mm:ss'

     

    Timestamp (dual)

    47 Bytes

    'WWWW DD. MMMM YYYY hh:mm:ss'

    the number of chars which the formatting returned determined the byte size

     

     

    I hope playing with this tool will help you to understand how QlikView handled the data from the symbol-tables and you could adapt some points to improve the performance from your applications and don't hesitate with feedback and ideas to improve the tool.

     

    Happy Qliking!

    Marcus Sommer