johnw
Champion III

Should We Stop Worrying and Love the Synthetic Key?

Synthetic keys have a bad reputation.  The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure.  If you load tables like this:

TableA:  FieldA, FieldB, FieldC

TableB:  FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA:             SyntheticKey, FieldC

TableB:             SyntheticKey, FieldD

SyntheticKeyTable:  SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
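(If you want to see this for yourself, a throwaway script like the one below will do it.  The values are made up, but after reload the table viewer shows TableA, TableB and a $Syn table carrying FieldA and FieldB.)

// Two inline tables that share FieldA and FieldB.  On reload, QlikView
// links them through an automatically generated synthetic key table
// rather than through two separate single-field connections.
TableA:
LOAD * INLINE [
FieldA, FieldB, FieldC
1, x, 100
2, y, 200
];

TableB:
LOAD * INLINE [
FieldA, FieldB, FieldD
1, x, foo
2, y, bar
];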

Well, that doesn’t sound like a bad thing.  If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually.  As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field.  That shows up as a connection.  Two tables can be connected by two or more fields.  That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem.  Either can cause problems when they result from a bad data model, when there are too many or the wrong connections.  But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables.  If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

·        Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.

·        Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."

Yikes!  Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”.  No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written.  I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys.  Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys.  But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem.  I’ve never seen a good data model that resulted in a lot of synthetic keys.  Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include:  Forming your own non-composite keys”.  While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA.  And THAT is what I’m particularly interested in discussing.

My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage.  I’m suggesting that if you have a synthetic key, and you do a direct one to one replacement with your own composite key table, you will not improve performance or memory usage.  In fact, I believe performance and memory usage will typically get marginally worse.
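To be concrete about what I mean by a direct replacement, here is a rough sketch (the QVD file names are just placeholders) that hand-builds the same three-table structure QlikView would have built for us: the composite key replaces FieldA and FieldB on the data tables, and a key table holds the actual field values.  This is the "trick" whose value I'm questioning.

TableA:
LOAD autonumber(FieldA & '|' & FieldB) as %Key,
     FieldC
FROM TableA.qvd (qvd);

TableB:
LOAD autonumber(FieldA & '|' & FieldB) as %Key,
     FieldD
FROM TableB.qvd (qvd);

// Key table holding every occurring FieldA/FieldB combination.
// %KeyCheck only exists to avoid duplicates across the two sources
// and is dropped at the end.
KeyTable:
LOAD DISTINCT
     FieldA & '|' & FieldB as %KeyCheck,
     autonumber(FieldA & '|' & FieldB) as %Key,
     FieldA,
     FieldB
FROM TableA.qvd (qvd);

Concatenate (KeyTable)
LOAD DISTINCT
     FieldA & '|' & FieldB as %KeyCheck,
     autonumber(FieldA & '|' & FieldB) as %Key,
     FieldA,
     FieldB
FROM TableB.qvd (qvd)
WHERE not exists(%KeyCheck, FieldA & '|' & FieldB);

DROP FIELD %KeyCheck;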

I only have one test of my own to provide, from this thread:

http://community.qlik.com/forums/t/23510.aspx

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved.  I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key.  I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

                            load time   file KB   RAM KB    CalcTime

synthetic key removed           14:01    49,507   77,248   46,000 ms 

synthetic key left in place      3:27    49,401   77,160   44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting.  I’m happy to learn something here.  I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking.  The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves.  Maybe I just have a bad memory, and failed to find good examples when I searched.  But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage?  Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

96 Replies
johnw
Champion III
Author

orsh_ wrote:

...you'll get a whole lot of ugly synthetic keys as QV tries to hash out every possible combination of fields...

Earlier in the thread, I couldn't think how to set up the situation properly, even though I'd certainly seen it happen before.  The first attached file demonstrates it pretty clearly.  We have five tables that share some key fields, but they each share different key fields.  The result in this case is 40 separate synthetic keys.

Now, this is a bad data model.  In this case, the good data model would just have a single table.  There's nothing here to change my advice to not replace synthetic keys with composite keys in a good data model.  But it does demonstrate the synthetic key explosion we sometimes encounter.

I believe most of the synthetic keys here are useless.  I believe only five of them end up mattering, the five that link our five tables back to the key table that holds the actual values of the key fields.  We can replace just those five keys with composite keys as a way of improving the data model.  That's shown in the second attached file.  But the data model still isn't right, so again, this isn't a case where we have a good data model and we're replacing synthetic keys 1:1 with composite keys.
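For anyone reading without the attachments, a structure of roughly this shape is the kind of thing that triggers the cascade.  The field names and values below are invented, not taken from the attached file, and the exact number of $Syn tables you end up with depends on which subsets of fields overlap.

Keys:
LOAD * INLINE [
Key1, Key2, Key3, Key4, Key5
a, b, c, d, e
];

T1:
LOAD * INLINE [
Key1, Key2, Key3, Data1
a, b, c, 1
];

T2:
LOAD * INLINE [
Key2, Key3, Key4, Data2
b, c, d, 2
];

T3:
LOAD * INLINE [
Key3, Key4, Key5, Data3
c, d, e, 3
];

T4:
LOAD * INLINE [
Key1, Key4, Key5, Data4
a, d, e, 4
];

T5:
LOAD * INLINE [
Key1, Key2, Key5, Data5
a, b, e, 5
];

// Each table shares a different multi-field combination with the others,
// so QlikView builds a separate $Syn key for each combination, and the
// $Syn tables themselves then overlap - that's where the count multiplies.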

johnw
Champion III
Author

Since Anthony seems to have confirmed that synthetic keys behave like composite keys internally, I think the more open question now has to do with script performance.

So far, the examples I've created or seen have loaded synthetic keys faster than creating the equivalent composite keys.  However, it's possible that with some specific data model, it would take QlikView longer to figure out how to construct the synthetic keys than it would take us to just construct composite keys manually.  After all, as Oleg said, "When we build the composite key, we give QlikView very specific instructions about HOW to build it..."  In a sense, that should make QlikView's job easier, not harder.

The two attached files were my attempt to load several composite keys as efficiently as possible, thinking that perhaps the synthetic key equivalent would be slower.  But the synthetic key load was still significantly faster - for a million rows, 66 seconds vs. 206 seconds.

Maybe someone knows a more efficient way to load composite keys?  Or some situation that will make it much harder for QlikView to figure out and build the right synthetic keys?

johnw
Champion III
Author

Magnus Åvitsland and I got to talking about synthetic and composite keys in another thread:

http://community.qlik.com/message/132927

I need to look into the results more closely, but we set up a simulation of a semi-realistic case:

  • invoice header and invoice line items associated by 4-part key
  • data is stored in QVDs
  • QVDs are loaded incrementally, with only a small fraction loaded per day

With those assumptions in place, I tested two alternatives.  First, just load in the data and let the synthetic key form.  Second, add a hash128() key to both tables during the incremental load (negligible overhead), and skip the 4 key fields when pulling the invoice line items from the QVD.  (Yes, technically using a hash128() on separate incrementally-loaded files can result in data problems, but the chance is as close to 0 as makes no difference, so in practice, it should be fine.)  The results appear to be a counterexample to what I've been stating in this thread.

key type    peak RAM   final RAM   load time   chart time   QVW size
synthetic    896 MB     568 MB      0:30        797 ms      115 MB
synthetic    893 MB     576 MB      0:31        828 ms
synthetic    868 MB     569 MB      0:31        844 ms
synthetic    887 MB     569 MB      0:32        797 ms
synthetic    864 MB     569 MB      0:31        859 ms
hash128     1104 MB     740 MB      0:11        797 ms      146 MB
hash128     1082 MB     733 MB      0:37        781 ms
hash128     1065 MB     733 MB      0:16        765 ms
hash128     1063 MB     730 MB      0:17        813 ms
hash128     1096 MB     733 MB      0:17        782 ms

Load time is definitely faster and chart time is likely faster with the composite key.  The RAM and file size differences are overstated since in the test case, my row size is small enough that adding a 16-byte incompressible key makes a large difference.  In a typical application, row sizes would be much larger, and the composite key a much smaller percentage of the data.

Now, this composite key is NOT a one for one replacement for the synthetic key, which is what I've mostly been talking about in this thread.  It is not structurally identical.  But it is FUNCTIONALLY identical, and a practical solution for real world cases.  I think that's good enough to call it a counterexample to my general assertion.

And that makes things more interesting.  Sometimes a synthetic key is better.  Sometimes a composite key is better.  It depends on the situation and on what you're trying to optimize (RAM vs. load time, for instance).
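For anyone who wants to try something similar, the hash128() approach looks roughly like this in script form.  The invoice field names below are invented (the real test used the simulated data from Magnus's thread), so treat it as a sketch of the technique rather than the actual test script.

// Script 1 - the incremental QVD build.  Both the header and the line
// item loads get the precomputed key (negligible overhead); only the
// header version is shown here.
InvoiceHeader:
LOAD *,
     hash128(CompanyCode, InvoiceYear, InvoiceNo, InvoiceVersion) as %InvoiceKey
FROM InvoiceHeader_Delta.qvd (qvd);
// ...merge with the existing rows as usual, then:
STORE InvoiceHeader INTO InvoiceHeader.qvd (qvd);

// Script 2 - the final application.  The header keeps the 4 key fields,
// but the line items load only the precomputed key, so no synthetic key
// can form.
InvoiceHeader:
LOAD CompanyCode, InvoiceYear, InvoiceNo, InvoiceVersion,
     %InvoiceKey,
     HeaderAmount
FROM InvoiceHeader.qvd (qvd);

InvoiceLineItems:
LOAD %InvoiceKey,
     LineNo,
     ItemAmount
FROM InvoiceLineItems.qvd (qvd);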

Not applicable

Great post John.

I've used synthetic keys in several scenarios and from my experience across multiple documents, I have never seen a performance impact outside of reload time.  I have several situations where we have a legitimate need to have synthetic keys in place because the index data may or may not exist in several tables and each has slightly varying context.  Short of loading a temporarily concatenated set of all data points and reducing to a distinct set, my only option is to load and let the synthetic keys take care of that for me.  We also have some calculation automation that utilizes the joined field data from those tables at varying points in the script, so it is helpful to not replace them with a key.

The reality is that most relational models have multiple loops, and have those for a reason.  It is difficult to reduce them to a completely non-looping set without utilizing synthetic keys, especially in situations where the data is coming from 4 or 5 sources that were not designed to work in a coherent data model.

In all honesty, my biggest frustration with them has been their use as a scapegoat in the support environment.  We have been experiencing a bug for quite some time that causes large reports to disconnect an AJAX session.  The report runs fully but is not handed back to the user.  When we contacted support, the response was that synthetic keys were causing the issue.  I went through a two-month process of removing the synthetic keys (not that it took that long, but we've been incredibly busy and changes require thorough testing) only to find out that after removal, we had the same exact issue.  In addition to that, we have had server performance issues (not reload related) blamed on the synthetic keys as well that were not resolved by the removal of the keys.

In total, we have spent a great deal of time removing them only to find out it had no effect on our problems.  I honestly don't know why they would have created a problem in the first place, since our replacement keys simply do what the synthetic keys were doing anyway.

quiquehm
Contributor III

Excellent thread on synthetic/composite/autonumber keys, thank you all, I am learning a lot here. I read the whole thread from the start, but I haven't yet seen a word from QlikTech R&D. It would be nice to see what their input is here.

John, if I understand correctly, you mention several times that performance issues are normally caused by poor data models, and that "Only YOU can decide if your data model is correct or not"..., "Only experience with data modeling will tell you if your data model is correct. No computer can tell you. Only experience can tell you". From a beginner position (this is my case), do you have any best practices/recommendations (or reference books) to gain that knowledge of data modelling, to be able to see whether your data model is correct/optimal before loading everything into QlikView (to avoid problems later on)?

Appreciate any advice here

Thanks for the good stuff as usual

Enrique

johnw
Champion III
Author

quiquehm wrote:

From a beginner position (this is my case), do you have any best practices/recommendations (or reference books) to gain that knowledge of data modelling, to be able to see whether your data model is correct/optimal before loading everything into QlikView (to avoid problems later on)?

While not strictly necessary, the data model that most seem to gravitate towards for QlikView is either star schema or snowflake schema.  You would have one central fact table surrounded by dimension tables.  With snowflake schema, these dimension tables may themselves have dimension tables (in star schema, these tables would be denormalized onto the main dimension tables).

http://en.wikipedia.org/wiki/Star_schema
http://en.wikipedia.org/wiki/Snowflake_schema

I don't think these data models prohibit synthetic (multi-field) keys, but I wouldn't think that synthetic keys would be very typical with these structures either.  More typically, the central fact table would have an ID, and this ID would map you to your dimension table.  But if multi-field keys make sense, they make sense.  You CAN convert them to a single ID, and store the other fields on the dimension table, but this often involves additional script complexity and load time for little benefit (though it may be good experience for a new developer; that's a bit of a separate argument).
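As a sketch of what that conversion looks like (the file and field names here are invented), it is just an autonumber() of the multi-field key on the fact side, plus a distinct load of the key fields onto the dimension:

// Fact table: the two-field product key is replaced by a single ID.
Facts:
LOAD autonumber(ProductGroup & '|' & ProductCode) as %ProductID,
     OrderID,
     Quantity
FROM OrderItems.qvd (qvd);

// Dimension table: keeps the original key fields plus the same ID.
Products:
LOAD DISTINCT
     autonumber(ProductGroup & '|' & ProductCode) as %ProductID,
     ProductGroup,
     ProductCode,
     ProductName
FROM Products.qvd (qvd);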

Both star and snowflake schema tend to involve some denormalization,  more so for star schema.  In my background, denormalization is a dirty word, but it's not a problem in QlikView.  We're not talking about operational data here.  We're talking about reporting data, data warehouse data.  There are no updates back to the operational databases.  Everything generally flows in one direction.  QlikView's internal compression algorithms keep you from paying a memory or performance penalty, and the denormalized data often performs better.

In the interest of full disclosure, I personally make no effort to transform my data into a star or snowflake schema.  You just need to understand what tables make sense, how they relate to each other, when to denormalize and when to keep a separate table and so on.  A data model designed that way will often resemble a star or snowflake, but sometimes it won't.  So sometimes mine do, and sometimes mine don't.  But it's one of those cases where if you lack the data modeling experience to guide you, then very explicitly aiming for star schema may make sense and be good practice.

My data models also grow organically rather than by top-down plan, though I may refactor if I become unhappy with them.  "OK, I'm reporting on order items, so that's my base data.  Looks like I need a little bit of product data.  Let me join that.  Oh, I need a bunch of product data.  Let me make that its own table.  I need some descriptions for these codes, so I'll join those in.  Looks like I need a calendar.  Hmmm, they want to know what ordered material is now available in our warehouse for pick up, so I'm adding a warehoused items table."  I threw that example together randomly, but what I have now resembles a star schema.  The order items are the central fact table.  The calendar is a dimension connected by date.  The product data is a dimension connected by product ID.  The descriptions have been joined to the main data, so they don't turn it into a snowflake.  Warehoused material isn't a dimension, so maybe this isn't technically a star schema, but it IS connected to our main fact table by just order item ID, so it's at least structurally the same.

One thing I did when I first started QlikView was to normalize the data even more than it was normalized in the business systems.  I was so excited by how well QlikView could understand all the connections that I assumed it was the best thing to do, and that finally I could correct all those horrible denormalizations that still existed in our main systems!  This heavy normalization didn't really cause any problems, but I don't think it was optimal.  While QlikView handles highly-normalized data models just fine, these days I tend to fairly significantly denormalize the data.  The consensus is that this improves performance, though I've paid little attention.  Somehow, the denormalizations just "make sense" now.  I know that's hardly a strong justification, but honestly, a lot of what I do is by feel, not by explicit rationale.  Some things just feel right.  Others don't.  I hope that it's 30 years of programming experience talking to me subliminally rather than random neuron firings, but who knows.  And certainly "do what feels right" isn't a very useful instruction to someone trying to figure out how to model their data.

As a very rough rule of thumb on when to denormalize and when not to, if you only have one or two or a few fields on a dimension table, go ahead and join it onto your main table.  If you have a bunch of fields, keep it as its own object.  Also, regardless of number of fields, I almost always keep my calendar table separate.
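In script form, that rule of thumb might look something like this (all names are placeholders): a two-field description lookup gets joined onto the main table, the wide product table stays a separate dimension, and the calendar stays separate regardless.

OrderItems:
LOAD OrderID, OrderDate, ProductID, StatusCode, Quantity
FROM OrderItems.qvd (qvd);

// Small lookup - denormalize it onto the main table.
LEFT JOIN (OrderItems)
LOAD StatusCode, StatusDescription
FROM StatusCodes.qvd (qvd);

// Wide dimension - keep it as its own table, linked by ProductID.
Products:
LOAD ProductID, ProductName, ProductGroup, Supplier, UnitCost
FROM Products.qvd (qvd);

// Calendar - always kept separate, linked by OrderDate.
Calendar:
LOAD OrderDate,
     year(OrderDate) as Year,
     month(OrderDate) as Month
FROM Calendar.qvd (qvd);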

quiquehm
Contributor III

Thanks a lot for all the details John. Having checked the Wikipedia articles on data model schemas, and another interesting whitepaper I found from a data modeling tool vendor (I attach it here for other beginners like me), which explains normalization of data models, I believe in my case I don't even have a star or snowflake schema. Probably my model is much simpler, with the only complexity (if it is really a complex thing) of having several common fields across all 3 tables (see tables in the attached Excel sheet).

These 3 tables are loaded from 3 QVD files (I concatenate periodic extractions from 3 SAP transactions into Excel sheets, to be able to do time-series analysis), so I believe I will have to try different scenarios with them and see which is the best option in terms of script load speed and validation of expected data associations. I am thinking of:

Option 1: Let QlikView generate the needed synthetic keys and see how it works

Option 2: Try to reduce the number of common fields. As I don't have simple unique primary keys in any of the 3 tables, I think I need to generate (composite) primary keys, and convert them to integer primary keys using the Autonumber function, as leaving string-generated primary keys could cause the out-of-virtual-memory error when loading the script. And I need to do this when loading the data from the QVD files (and not when generating the QVD files).

Regarding the Autonumber function, I have seen another thread (see the link below), where you were debating with other guys about the Autonumber() and Hash*() functions. Do you know what the difference is between the various Autonumber "flavours" (Autonumber(), Autonumberhash128(), Autonumberhash256()...), and which one would be adequate for my tests?

http://community.qlik.com/message/73694

Appreciate any suggestions on whether I am taking the right approach to loading my tables.

Thanks a lot

Enrique

johnw
Champion III
Author

Well, for your data, I expect that just letting synthetic keys form will NOT work very well.  There are a number of fields in common, and they're inconsistently in common, so you'll probably get at least a handful of interacting synthetic keys.  I wouldn't expect replacing them with composite keys to work any better.  It wouldn't surprise me if this is a case where concatenation is the best answer - just concatenate (not join) all of your tables into one table, and let the fields be null where they don't exist for that particular data.  It's really hard to know as an outsider to this data whether or not that would do what you need, though.
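If it helps, the concatenation approach is just this (table names are placeholders for your three QVDs).  The explicit Concatenate forces everything into one table even though the field lists differ, and fields a row doesn't have are simply left null.

Data:
LOAD *, 'Extract1' as Source FROM Extract1.qvd (qvd);
Concatenate (Data)
LOAD *, 'Extract2' as Source FROM Extract2.qvd (qvd);
Concatenate (Data)
LOAD *, 'Extract3' as Source FROM Extract3.qvd (qvd);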

If you're creating composite keys, you might do this while generating the QVD files, or while loading from the QVD files.  Both have advantages and disadvantages, and some of these advantages and disadvantages are tied to how you create the composite key, and how you're generating the QVDs (in one application or multiple applications).

If all of your autonumbering is occurring in the same load in the same application, autonumber() is the function to use.  It produces integer key values, taking the least space and having presumably the highest performance in charts.

If your autonumbering is occurring on the same keys in different loads or different applications, autonumber() should not be used.  It will generate different numbers for the same keys, and the same numbers for different keys.  AutonumberhashXXX() technically has this problem as well.  The probability of mismatched keys is very small but non-zero.  Use at your own risk. (Edit: Hash 128 vs. 256 refers to the number of bits in the generated hash value, in the resulting key field.  So you're generating a 16 or 32 byte key.  The longer the key, the lower the chance for key problems.  But in some cases, the resulting key may be longer than the data you were hashing, in which case you're just wasting time.)

The safest but slowest function is to use no function at all - simply concatenate your field values with a separator of some sort.  These large keys would be expected to slow down chart performance a little, and of course take more space.
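Side by side, with made-up key fields, the explicit options look like this (in a real script you would pick just one of them):

Orders:
LOAD autonumber(CompanyCode & '|' & OrderNo) as %Key_Int,      // integer key - smallest, same load/app only
     autonumberhash128(CompanyCode, OrderNo) as %Key_IntHash,  // same idea, takes the fields directly
     CompanyCode & '|' & OrderNo             as %Key_String,   // no function - safest, but biggest and slowest
     OrderAmount
FROM Orders.qvd (qvd);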

And then of course there are synthetic keys.  With synthetic keys, all autonumbering is internal to the product and occurs in the same load.  It is therefore safe, and should be at least as fast as an explicit autonumber().  But didn't we just see an example where synthetic keys loaded much more slowly than autonumbered keys?  Yes.  The reason is that "all autonumbering... occurs in the same load".  In the example, we used autonumberhash128() in an incremental load, saving us almost all of the time required to assign the key values.  The synthetic key approach, on the other hand, had to generate keys for the entire data set every time, and I believe this is what slowed it so much in comparison.

klausskalts
Partner - Creator

These are some very long stories .... however, as the synthetic key table will contain all combinations of all possible values of all data fields, it's just a matter of doing the math.

  1. Two fields, each with two possible values => 2x2 = 4 records in the $Syn-table
    Not really a problem ....
  2. Three fields, each with 120 possible values => 120x120x120 = 1,728,000 records in the $Syn-table (hmmmmmm)
    Should be handled by every computer
  3. Five fields with 120, 10,000, 25, 1,500,000 and 800 possible values => 36,000,000,000,000,000 records
    Start to worry

So my conclusion:

If you plan to deploy your document and you see a $Syn-table - remove it ASAP .... it will only cause trouble later on.

klausskalts
Partner - Creator

.... in my younger days I created documents with 20-30 internal tables and 10-15 $Syn-tables ..... creating the $Syn-tables often required as much reload time as the rest of the document.

And when you're in that deep .... there's really no way out - QlikView will work, but there is no reason to be proud!