4 Replies Latest reply: Jan 2, 2011 5:05 PM by Russell Christopher RSS

    (Long) Newb questions about Loose Tables & Synthetic Fields

      Hey All -

      I'm Russell Christopher, a long time BI practitioner. While trapped by snow I decided to give myself a holiday treat -- Learning Qlikview! I'm having a blast and have tons of questions. I don't have internet where I'm staying right now, so I (literally!) hiked down a mountain, got in my car and drove into town so I could post. I'll be back in a couple of days to see what sort of responses I get!

      Questions:

      Best Learning Resources?

      Right now I have the QlickView Reference pdf that came with 10, a 9.0 Tutorial document, and the examples that came with 10. Are there other (free and deep) resources out there I can grab?

      Loose Tables:

      I'm a Microsoft guy by trade, and am therefore using the Contoso Retail DW as a sample data source. The DW is basically a snowflake schema in which some of the dimensions share lookup tables. Qlikview did not like those shared lookup tables as they created "loops". I ended up having tons of loose tables and my lists, tableboxes, and charts pretty much stopped working.

      After reading up on the loose table statement I was able to selectively "target" the problem tables and keep the rest of my schema from getting too screwy.

      Question 1: I understand why QV does what it does in this regard, but it is kind of a pain. Is there a best practice to work around this behavior? For example, I might have two dimensions (Customer and Sales Territory) which point to the same fact table and both have a foreign key pointing at a "lookup" table called Geography - LOOP! Is the best practice to simply denormalize things during the data load and actually APPEND the Geography field coming out of the Geography table to both Customer and Sales Territory? That way I can get rid of the (problem) Geography table altogether and simplify things for my user? If there a QV function to help me do this? If not, what is the best approach?

      Question 2: As I experimented with "tuning" loose tables, I noticed an odd behavior. Was I doing something wrong, or am I just dealing with the sort of "odd behavior" that all software exhibits sometime?:

      Repro:

      1. Load only tables which won't cause loops. (A fact table and a couple dimensions)

      2. Launch Table Viewer. As expected, no "dotted" lines (representing loose relationships, I guess). I see solid "relationship lines" (what do you call these in QV-speak?) between all tables.

      3. Modify script to add a table which WILL cause looping, but do NOT use LOOSE TABLE statement - just let QV figure things out on its own. Reload data.

      4. QV complains about loops, as expected.

      5. Launch Table Viewer - TONS of dotted lines, often between a fact table and a dimension (like TIME) which has NO relationship whatsoever to the new "problem" table which caused the loop. QV seems to have "tapped" multiple, unrelated tables as loose, instead of just the "problem" table and tables related to it. Why?

      6. Edit script, tag the new "problem" table with LOOSE TABLE. Reload data.

      7. In the Table Viewer, none of the "incorrectly loose" tables from step 5 have been restored to "non-loose" status - they still have dotted lines. Weird.

      8. Edit script, commenting out the section where I load the "problem table". Reload data so that things like they did in steps 1-2 - No dotted lines.

      9. Un-comment "problem table" script section, but make sure that I'm tagging the "problem table" as LOOSE TABLE. Reload data.

       

      Result: Only "problem table" is tagged as loose in Table Viewer. Phew!

       

      It seems that once QV has decided (correctly or not) a table should be loose, it doesn't change its mind, even after the developer specifies which table(s) should be loose in the script. One must "manually" do something to get them into a "non-loose" state ala step 8. Is this normal?

       

      Synthetic fields:

       

      I noticed that QV created a synthetic field on my fact table. In this case, the field was made up of the table's primary key, and two measures in the table. How does QV determine when it should create these? It appears to have something to do with the cardinality of the values involved, but I'm not quite clear on things. Why one synthetic field which contains the PK and two measures instead of two synthetic fields, each with the PK and a single measure?

       

      I assume the synthetic field is a performance enhancer? Why?

       

        • (Long) Newb questions about Loose Tables & Synthetic Fields
          John Witherspoon

          Best Learning Resources:

          Those are probably the best learning resources other than the forum itself.

          Loose Tables - Question 1:

          Some common solutions when your source data has a single shared lookup table for multiple fields in the model:

          1) denormalize the data onto your main table with different names

          2) duplicate the shared lookup table for every field

          3) create a new table to connect to the shared lookup table

          Let's look at these options by example. You're writing an orders application, and each order has an ordered date, accepted date, and acknowledged date. You have lookup table in your system that stores whether or not a date is a holiday. You're very interested to know, for whatever reason, what is ordered, accepted or acknowledged on a holiday. Your source system has this data structure:

          Orders:
          Order, Customer, Ordered Date, Accepted Date, Acknowledged Date
          OI1, Bob, 12/20/2010, 12/23/2010, 12/25/2010

          Dates:
          Date, Date is Holiday?
          12/20/2010, No
          12/23/2010, No
          12/25/2010, Yes

          The three options will give you the following table structures:

          1)

          Orders:
          Order, Customer, Ordered Date, Ordered on Holiday?, Accepted Date, Accepted on Holiday?, Acknowledged Date, Acknowledged on Holiday?
          OI1, Bob, 12/20/2010, No, 12/23/2010, No, 12/25/2010, Yes

          2)

          Orders:
          Order, Customer, Ordered Date, Acepted Date, Acknowledged Date
          OI1, Bob, 12/20/2010, 12/23/2010, 12/25/2010

          Ordered Dates:
          Ordered Date, Ordered on Holiday?
          12/20/2010, No

          Accepted Dates:
          Accepted Date, Accepted on Holiday?
          12/23/2010, No

          Acknowledged Dates:
          Acknowledged Date, Acknowleded on Holiday?
          12/25/2010, Yes

          3)

          Orders:
          Order, Customer
          OI1, Bob

          Date Linkage:
          Order, Date Type, Date
          OI1, Ordered, 12/20/2010
          OI1, Accepted, 12/23/2010
          OI1, Acknowledged, 12/25/2010

          Dates:
          Date, Date is Holiday?
          12/20/2010, No
          12/23/2010, No
          12/25/2010, Yes

          Options 1 and 2 will behave the same as each other, with slight differences in performance on the load and in charts, probably favoring putting everything on one table for chart performance, though it may be slower to load. Option 3 has a somewhat different behavior, which may be either better or worse depending on what you're trying to do with the data. I've used and continue to use all three approaches. There's no clear winner. It's all about what you're after.

          And no, I don't think there are any tools in QlikView to automatically do this sort of data transformation for you. Though perhaps one of the wizards would do it - I've never been a fan of wizards, so I stay away from them and do everything manually.

          Loose Tables - Question 2:

          Idono. I almost never have loops and loose tables. If I do, it's a mistake, and I fix my mistake. I never loosen tables. I don't have any idea what QlikView's default table loosening behavior is.

          Synthetic Fields:

          If your tables share two or more fields in common, a synthetic key will be created. So if you load in these two tables:

          A:
          Field 1, Field 2, Field 3

          B:
          Field 1, Field 2, Field 4

          What you'll actually get is this:

          A:
          Synthetic Key Field, Field 3

          B:
          Synthetic Key Field, Field 4

          Synthetic Key Table:
          Synthetic Key Field, Field 1, Field 2

          There's nothing specifically wrong with synthetic keys. However, people new to QlikView usually encounter them as a result of mistakes, and not as a result of design. And even when it isn't a mistake, manually doing what QlikView does for you automatically often aids in understanding. So the standard advice is to always remove them, at least until you have enough experience to know when to ignore this advice.

          As for why it does it, you're probably on target with "performance enhancer". QlikView always wants every table connected to other tables by only a single field and a single path. If you connect tables by more than one field, it creates a synthetic key to change the data model back to how it likes it. I suspect that typically saves some memory, and probably having the single-field connections between tables allows the processing to be optimized in some way.

          If you're an experienced data modeler (and it sounds like you are), and you know you want those two tables connected by two key fields, just ignore what QlikView is doing with the synthetic key. Everything is fine. In fact, if you REALLY want to ignore it, when you bring up the table viewer, switch it to "Source Table View" instead of "Internal Table View". Poof, the synthetic table and synthetic key goes away, and you simply see your two tables connected by two fields, just like you loaded them in and intended. QlikView knows what you want, as demonstrated by this view - it's just giving you a slightly different version of it internally for its own purposes.

          For more information on synthetic keys, I recommend the following thread:

          http://community.qlik.com/forums/t/31028.aspx

            • (Long) Newb questions about Loose Tables & Synthetic Fields

              John -

              Many thanks for taking so much time to respond. I've learned a few important things:

              • Field Names are very important to QlikView - In my world, field names generally don't impact how the metadata layer is built during data load - not so in QlikView-land.
              • If QlikView does something to modify the data model (loose tables, synthetic fields), it's likely your data model is not what it should be and you need to go back to the drawing board

              Those two maxims should help keep me out of trouble!

              I have a few more questions based on your words of wisdom:

              In passing, you mention performance monitoring (and you show some numbers in the thread you pointed me to, as well). Is there a feature/mechansim/log built into QlikView to measure things like RAM consumed, time to complete certain operations, etc? Or, were you monitoring the qv.exe process with ProcessMon, TaskManager, etc. in Windows? Does QlickView Server add anything in this regard? I'm curious what, if anything I'm "misisng out on" by ramping with Personal Edition.

              (Warning: terrible, loaded question ahead). All things being equal (and things never are), what would you say is a good test dataset to work with to simulate what average users (there is no average user) do with QlikeView. To save on loading time, I'm only bringing in about 2M fact rows...I could bring in as many as 12M with this dataset...or switch to others which contain between 60M-120M transactions. I know perfectly wel that the answer is "it depends", but do you have any thoughts on what the appropriate size of a "close-to-real-world" dataset might be?

              Does QlickView offer a "developer" edition of it's server? (similar to what Microsoft does - you can use all features of "best" edition of the product in a development enviornment, but may not go into production with it). Again, hitting on the "what am I not learning by using only Personal Edition" theme.

              Thanks again!

               

               

               

               

                • (Long) Newb questions about Loose Tables & Synthetic Fields
                  John Witherspoon

                  For performance monitoring, I use three basic tools - task manager (to watch memory consumption during and on completion of the load), the size of the resulting files, and the object calculation times as displayed by QlikView. You can get the calc times by bringing up the sheet properties and looking on the objects tab. Note that this calc time isn't particularly reliable - I believe it is wall time instead of CPU time, objects and data may be pulled from the cache, and when there are multiple objects on the page, they are all rendering at once, making the calc time pointless. To get around these limitations to some extent, I typically put one object on a page and set my cache size to 0. QlikTech suggested that I use a more reliable approach of, I believe, writing code to check the CPU time, render an object, and check the CPU time again. I've never bothered, and I'd have to look up the old email to have any clue what I'm saying, which I kind of don't. There is probably some way to use the QlikView server logs for performance monitoring. I haven't done so.

                  I think the only thing you're missing out on by learning on personal edition is how to publish your applications to users using QlikView Server and QlikView Publisher. I believe QlikView Personal edition is full featured as far as what a typical developer can expect to be dealing with.

                  The "average" user of course wants EVERYTHING. :) So I think the most realistic test data set is all of the data that is available and potentially relevant, because that's what the users will ask for. So if you have a data set that users might want to see with 120 million transactions, that's probably the best set to use for performance testing. But it's also impractical to load that much data most of the time during development, so you might save that for a final test just to make sure everything is OK. I do most of my performance testing on my desktop PC, which is pretty old, uses only 32 bits, and has 2 GB of RAM, so I'm pretty limited in what I can test. I do also have access to a 64 bit machine with I think 6 GB of RAM, but I try to limit my usage since that's a shared machine. Many of my test data sets, at least for gaining general purpose knowledge, are autogenerated with random data. But as is probably obvious, if I'm trying to optimize performance for a particular application, I'll be using that application's actual data set.

                  I'm not aware of any free or developer edition of the server. This has sometimes been a bone of contention with a few forumites, particularly when it comes time to upgrade the server - we'd LIKE to run the new server in parallel in a test environment for a little while, but have to pay for an additional license to do so. This makes upgrading a little frightening when budgets are tight - push the button and hope for the best. But I'm not involved in purchasing and don't actually know what all the licensing arrangements are and what things cost. You'd need to talk to QlikTech or a partner for more reliable information.