5 Replies Latest reply: Feb 1, 2013 7:48 AM by Deepak Vadithala

    Multiple Versions of Similar Data for All Tables in Data Model - Best Practices

    Kevin McGovern

  I'm currently developing a data model for a new dashboard I'm building and wanted to pick the community's brain on an approach for modeling multiple versions of similar data.  The issue I'm running into is that these multiple versions of the same type of data will need to be compared within the dashboard, so I can't really create a separate dashboard for each version.  Each month we take a snapshot from our transactional DB and store it in a data warehouse with a maximum retention of 12 months.  What we end up with is a DB containing one version of all sales that were booked in October but extracted from the transactional DB in November, a second version of sales that were booked in October but extracted in December, and so on.  The reasoning for keeping these versions is that past months' data is revisable, so multiple copies of a particular month are kept for comparison.  The dashboard is supposed to give the user the option to select a specific month and see the data that was extracted in that month.

       

  What I've come up with so far is to concatenate the data extracted each month into corresponding tables, then use a field within each table to distinguish the version (so Sales would hold multiple versions of October's data).  However, to do this, I have to create unique fields on each table (e.g. the sales table would have a field ExtractedIn containing the month name, with similar unique fields on all other tables).  There has to be a better way to do this, though.  Does anyone have any suggestions?
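A generic sketch of that concatenate-and-tag idea (plain Python here as a stand-in for the load script; the snapshot data and the `tag_and_concat` helper are made up for illustration — only the ExtractedIn field name comes from the post):

```python
# Conceptual sketch: each monthly extract has the same table shape, so
# concatenating them with a single version field ("ExtractedIn") keeps
# one Sales table instead of one table per snapshot.
# (Illustrative only; in QlikView this would be CONCATENATE loads.)

october_snapshot_nov = [  # October sales as extracted in November
    {"SaleID": 1, "BookedMonth": "Oct", "Amount": 100},
    {"SaleID": 2, "BookedMonth": "Oct", "Amount": 250},
]
october_snapshot_dec = [  # October sales as revised and re-extracted in December
    {"SaleID": 1, "BookedMonth": "Oct", "Amount": 110},  # figure was revised
    {"SaleID": 2, "BookedMonth": "Oct", "Amount": 250},
]

def tag_and_concat(snapshots):
    """Concatenate snapshots, tagging each row with its extraction month."""
    rows = []
    for extracted_in, snapshot in snapshots:
        for row in snapshot:
            rows.append({**row, "ExtractedIn": extracted_in})
    return rows

sales = tag_and_concat([("Nov", october_snapshot_nov),
                        ("Dec", october_snapshot_dec)])

# Filtering on ExtractedIn recovers any single version:
nov_view = [r for r in sales if r["ExtractedIn"] == "Nov"]
```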

        • Re: Multiple Versions of Similar Data for All Tables in Data Model - Best Practices
          Deepak Vadithala

          Kevin,

           

          My thoughts...

           

          I think your approach is correct, but I didn't follow the last bit. I'd concatenate all the tables into one long table, with a single field acting as a flag that marks the version. If you have large volumes of data then I'd recommend combining Set Analysis with the flag approach on one large fact table, because in QlikView the cost is driven more by the number of columns/fields than by the number of tables or rows. If you store the same data together as one field, each repeated value is stored only once instead of multiple times in different fields. QlikView's storage works based on the cardinality of the data. Here is my understanding of the internal storage structure...

           

          The data records are read into memory, so all processing of the data can be done in memory. I am sure you know this bit. QlikView treats all data as data element types (columns/fields) and data element values (values/records). Each distinct data element value of each data element type is assigned a binary code, and the data records are stored in binary-coded, sorted form. Using the binary coding, very quick searches can be done on the tables. QlikView also removes redundant information and reduces the amount of data: the redundant information is stored separately, with frequencies for each unique data element value within each data element type. When the user makes a selection on data element values, the implied selection (possible values) is tracked separately so it can be presented to the user. Through this process QlikView can perform rapid linear searches.
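A rough way to picture that symbol-table storage (a conceptual sketch of dictionary encoding, not QlikView's actual implementation; the `encode_column` helper is invented for illustration):

```python
# Conceptual sketch of cardinality-based columnar storage: each distinct
# value of a field is stored exactly once in a symbol table, and the rows
# hold only small integer codes pointing into it.  Low-cardinality fields
# therefore compress very well.

def encode_column(values):
    """Dictionary-encode a column: unique values once, plus per-row codes."""
    symbols = []       # each distinct value stored exactly once
    code_of = {}
    codes = []
    for v in values:
        if v not in code_of:
            code_of[v] = len(symbols)
            symbols.append(v)
        codes.append(code_of[v])
    return symbols, codes

region = ["North", "South", "North", "North", "South", "East"]
symbols, codes = encode_column(region)
# symbols holds 3 entries instead of 6 strings; codes are cheap integers,
# and decoding is a direct lookup: symbols[codes[i]].
```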

           

          Good luck!

           

          Cheers,

          DV

           

          www.QlikShare.com

            • Re: Multiple Versions of Similar Data for All Tables in Data Model - Best Practices
              Kevin McGovern

              Deepak, I really appreciate the response.  Is there a reason that concatenating into one table would be more beneficial here?  Would I lose some of the associative power I'd typically have under a normal format?  I'm worried that if I try to put everything into one table, either I'll be concatenating data in a way that causes problems with how the data is linked up, or I'll kill my load-script performance by trying to join data that wouldn't typically be contained within the same table.  Is there a downside to using identifiers on each table that are qualified with the table name to distinguish them from other identifiers?  I realize that when performing set analysis it would force me to get the set from multiple places, but I don't see this as a big enough issue to have to go to a one-table structure.

                • Re: Multiple Versions of Similar Data for All Tables in Data Model - Best Practices
                  Deepak Vadithala

                  You're welcome Kevin.

                   

                  My understanding is that you have multiple versions of similar data, and each version needs to be compared with the previous version. Here are my thoughts...

                   

                  • You're not going to create just ONE BIG table; rather, you need ONE BIG FACT table. I'd keep the necessary dimensions as lookup tables. We are aiming for a Star Schema here.
                  • In your fact table, you need a flag which marks the month values as {201301, 201302, 201303...201312}. You will use this field to identify the versions as subsets of one table. This way you can use standard expressions on your FACTS which are written ONLY once. However, if you split your model into multiple tables for the sake of maintaining associative logic, then you need to maintain multiple expressions for each table because the field names are different (massive development effort, and it looks clunky; my opinion only!)

                   

                                    Example Expressions:

                                      Current Version  = SUM({1<MyFlag = {$(=Max(MyFlag))}>} Sales)

                                      Previous Version = SUM({1<MyFlag = {$(=Max(MyFlag)-1)}>} Sales)

                   

                  • If you split them into multiple tables then you are wasting storage, because QlikView creates information vectors for each additional field. This increases the RAM footprint and the CPU cycles.
                  • You are not losing the associative experience, because you always make your selections on dimensions, which are separate tables anyway. So I don't think you will miss out on QlikView's associative experience.
                  • If you have the same fields in multiple tables then it becomes confusing for the end user when making selections. And managing the real estate (UI) becomes difficult with many flavours of similar fields.
                  • Use flags and Set Analysis together in your expressions to achieve a faster calculation layer. Define your expressions in variables in the script (if possible) and re-use them wherever required. This way you get easy code maintenance and also take advantage of caching.
                  • I don't think you need complicated QlikView joins; rather, use implicit and explicit concatenation, which are not expensive operations.
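The current-vs-previous comparison in the example expressions above can be simulated like this (illustrative Python mirroring the set-analysis logic; the sample fact rows are invented, and only MyFlag and Sales come from the expressions in this thread):

```python
# Sketch of the flag-based version comparison on one fact table.
# MyFlag carries the extraction version as YYYYMM, so max(MyFlag) is the
# current version and max(MyFlag) - 1 the previous one.
# Caveat: the "- 1" trick assumes consecutive flag values; a year boundary
# (201401 -> 201312) would need extra handling, in QlikView as well.

fact = [
    {"MyFlag": 201311, "Sales": 100},
    {"MyFlag": 201311, "Sales": 250},
    {"MyFlag": 201312, "Sales": 110},  # revised figures in the newer version
    {"MyFlag": 201312, "Sales": 250},
]

def version_total(rows, flag):
    """Sum Sales restricted to one version flag (the set-analysis filter)."""
    return sum(r["Sales"] for r in rows if r["MyFlag"] == flag)

current = max(r["MyFlag"] for r in fact)
current_total = version_total(fact, current)       # newest snapshot
previous_total = version_total(fact, current - 1)  # the one before it
```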

                   

                  I hope this all makes sense! Please feel free to add anything and we can discuss this further.

                   

                  Cheers,

                  DV

                   

                  www.QlikShare.com

                   

                    • Re: Multiple Versions of Similar Data for All Tables in Data Model - Best Practices
                      Kevin McGovern

                      Deepak, sorry, that was a misunderstanding on my part.  I thought you were saying to create one big table, but now I understand that you were saying to store anything that could be versioned in a centralized table.  I'm already working towards a centralized fact table solution.  The issue I'm running into is that I essentially have two levels of detail; see the simplified example below.  The problem is that if I want to create a central fact table where I select a ProjectID and it gives me the associated SalesAmts for that project, I would have to join the two tables (both are big tables, so it would eat a ton of script time).  If I concatenate here, then I lose that association, because selecting a ProjectID will only show associated EntryIDs, not the next level of SalesAmt.  So what I've ended up doing is creating two flags, which I agree with you is pretty clunky.  Do you have any ideas for an alternative?

                       

                      Project:
                      Load
                          ProjectID,
                          EntryID,
                          OtherProjectLevelInfo,
                          ProjectFlag
                      From ProjectExtract;

                      LineItems:
                      Load
                          EntryID,
                          SalesAmt,
                          LineItemFlag
                      From LineItemsExtract;

                        • Re: Multiple Versions of Similar Data for All Tables in Data Model - Best Practices
                          Deepak Vadithala

                          Kevin - No problem at all. Thinking about your problem, can you post some sample data? It's easier to work with actual data when doing data modelling. How about creating a hash or AutoNumber key that is a combination of ProjectID and the unique identifier in the LineItems table? This way you can still have one table, and if you select a ProjectID you will see all the values for SalesAmt in the same table, because the combination of the two unique keys represents the data at the most granular level. So you meet both requirements.
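In outline, the composite-key idea might look like this (plain Python as a stand-in; the sample rows are invented, and only ProjectID, EntryID, OtherProjectLevelInfo, and SalesAmt come from Kevin's script — in QlikView the key would be something like AutoNumber(ProjectID & '|' & EntryID)):

```python
# Sketch of the suggested composite key: combine ProjectID and EntryID
# into one surrogate key so project-level and line-item-level rows can
# live in a single table at the most granular level.

projects = [
    {"ProjectID": "P1", "EntryID": "E1", "OtherProjectLevelInfo": "info-a"},
    {"ProjectID": "P1", "EntryID": "E2", "OtherProjectLevelInfo": "info-b"},
]
line_items = [
    {"EntryID": "E1", "SalesAmt": 100},
    {"EntryID": "E1", "SalesAmt": 50},
    {"EntryID": "E2", "SalesAmt": 75},
]

# Map each EntryID back to its project so every line item can carry the key.
project_of = {p["EntryID"]: p["ProjectID"] for p in projects}

fact = [
    {"Key": f'{project_of[li["EntryID"]]}|{li["EntryID"]}', **li}
    for li in line_items
]

# Selecting a ProjectID now reaches SalesAmt directly through the key,
# without a load-time join of the two big tables:
p1_sales = sum(r["SalesAmt"] for r in fact if r["Key"].startswith("P1|"))
```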

                           

                          I hope this makes sense. Please let me know your thoughts or post some sample data.

                           

                          Cheers,

                          DV

                           

                          www.QlikShare.com