Reducing the Size of a Data Model

    One thing we often need to consider is the size of the QlikView data model in memory. In this post, I’ve included a few ideas for reducing the size. Consider the following simple script:

     

    LOAD
      [Student Id],
      [Unique Identifier],
     
    [Email Address]
    FROM ...


    This example script loads approx 199k records identified by a [Unique Identifier]; there are approx. 169k [Email Address]es and 175k [Student Id] numbers. As we can see these fields take up about 13MB in memory:

     

     

    Class

    Type

    Id

    Count

    Size

    Bytes

     

     

     

     

     

             13,371,456

    Database

    Field

    Unique Identifier

    199218

    26

    5,179,381

    Database

    Field

    Email Address

    169780

    29.72

    5,046,275

    Database

    Field

    Student Id

    174774

    18

    3,145,800

     

    [Unique Identifier] is a 12 character primary key and is made up of the [Student Id] concatenated with 2 sequence numbers e.g. ‘1234567’8 & ‘01’ & ’02’ to give’ 123456780102’. [Student Id] should always be 8 characters and the [Email Address] is of the standard form <user name> @ <domain name> and is variable in length.

     

    Unique Identifier

    This is a key field used solely to associated tables in the data mode. It is never presented to the user so there is no need for it to consume so much space in the data model as long as we maintain its uniqueness. This can be done by applying one of the AutoNumber functions that QlikView provides. In this case we can use AutoNumber Hash128:

         AutoNumberHash128([Unique Identifier]) AS [Unique Identifier],

     

    Student ID

     

    This is not a key field and is not unique, but is presented to the user. Therefore we need to maintain the original value of it, so we can’t use one of the AutoNumber functions to save space. The id consists of 8 digits, which means that we need (in theory) to be able to store a number up to 99,999,999. However, we can take advantage of the fact that QlikView stores each occurrence of a value only once and internally uses pointers to link records to particular values. We do this by splitting the [Student Id] into two 4 digit components ID_1 and ID_2, which now each only have to store a maximum value of 9,999 as well as rationalising the space used. For example, consider two id numbers: 01024567 and 01034567. If we stored [Student ID] in its original form we would need to store two fields capable of holding the two 8 digit id numbers. If we split the id into two parts, ID_1 and ID_2, QlikView would need to store 0102, and 0103 for ID_1, but 4567 for ID_2 only once as it is the same in both cases.  Roughly speaking, this means that QlikView now has to store only 12 digits rather than 16. Over thousands of records this starts to make a difference.

     

    For presentation, we can reconstruct the [Student ID] by concatenating:

    ID_1 & ID_2

     

    Email Address

     

    Again this is not a key field and is presented to the user. We can use the same principle as with [Student ID] to save space by splitting the address into two components: [Email User] and [Email Domain]. Because many records will share email domains e.g. gmail.com, we save space in the data model. For presentation, we can reconstruct the email address by concatenating:

    [Email User] & ‘@’ & [Email Domain]

     

    Our modified script for this example now becomes:


    LOAD
    Left([Student Id],4) AS ID_1,
    Right([Student Id],4) AS ID_2,
    AutoNumberHash128([Unique Identifier]) AS [Unique Identifier],
    SubField([Email Address], '@', 1) AS [Email User],
    SubField ([Email Address], '@', 2) AS [Email Domain]
    FROM ...

     

    So, although we now have more fields in our data model, the results show that the size of the model has been reduced by around a third.

     

    Class

    Type

    Id

    Count

    Size

    Bytes

     

     

     

     

     

    4,743,745

    Database

    Field

    Email User

    168516

    17.21

    2,900,475

    Database

    Field

    Unique Identifier

    199218

    8.00

    1,593,744

    Database

    Field

    ID_2

    10001

    14.00

    140,010

    Database

    Field

    Email Domain

    5033

    21.29

    107,150

    Database

    Field

    ID_1

    173

    13.68

    2,366

     

    Credits:

      Memory stats from DocumentAnalyzer_V1.14 by Rob Wunderlich