johnw
Champion III

Should We Stop Worrying and Love the Synthetic Key?

Synthetic keys have a bad reputation.  The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure.  If you load tables like this:

TableA:  FieldA, FieldB, FieldC

TableB:  FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA:             SyntheticKey, FieldC

TableB:             SyntheticKey, FieldD

SyntheticKeyTable:  SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.

Well, that doesn’t sound like a bad thing.  If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually.  As far as I can tell, QlikView is simply saving me the trouble.
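To make that concrete, here is a minimal script sketch (the inline data is invented purely for illustration) that produces exactly this structure. Because the two tables share both FieldA and FieldB, the table viewer will show a synthetic key table holding the FieldA/FieldB combinations:

// TableA and TableB share FieldA and FieldB, so QlikView builds the
// synthetic key and synthetic key table automatically at the end of the load.
TableA:
LOAD * INLINE [
FieldA,FieldB,FieldC
1,X,100
2,Y,200
];

TableB:
LOAD * INLINE [
FieldA,FieldB,FieldD
1,X,red
2,Y,blue
];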

Two tables can be connected by one field.  That shows up as a connection.  Two tables can be connected by two or more fields.  That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem.  Either can cause problems when they result from a bad data model, when there are too many or the wrong connections.  But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables.  If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

· Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.

· Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."

Yikes!  Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”.  No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written.  I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys.  Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys.  But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem.  I’ve never seen a good data model that resulted in a lot of synthetic keys.  Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include:  Forming your own non-composite keys”.  While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA.  And THAT is what I’m particularly interested in discussing.

My belief is that there is no (or almost no) GOOD data model where this trick will actually improve performance and memory usage.  I’m suggesting that if you have a synthetic key, and you do a direct one-to-one replacement with your own composite key table, you will not improve performance or memory usage.  In fact, I believe performance and memory usage will typically get marginally worse.
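For reference, that one-to-one replacement looks roughly like the sketch below (table, field and QVD names are purely illustrative). Note that it is structurally the same thing QlikView would have built for you, just spelled out by hand:

// Manual composite-key replacement, sketched with made-up names.
// Each data table gets an AutoNumber key built from FieldA and FieldB,
// and a separate key table holds the FieldA/FieldB combinations -
// in other words, a hand-built synthetic key table.
TableA:
LOAD AutoNumber(FieldA & '|' & FieldB) AS %KeyAB,
     FieldC
FROM SourceA.qvd (qvd);

TableB:
LOAD AutoNumber(FieldA & '|' & FieldB) AS %KeyAB,
     FieldD
FROM SourceB.qvd (qvd);

// In a real model the key table would need the combinations from both
// sources; only one source is shown here to keep the sketch short.
KeyTable:
LOAD DISTINCT
     AutoNumber(FieldA & '|' & FieldB) AS %KeyAB,
     FieldA,
     FieldB
FROM SourceA.qvd (qvd);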

I only have one test of my own to provide, from this thread:

http://community.qlik.com/forums/t/23510.aspx

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved.  I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key.  I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

                            load time   file KB   RAM KB    CalcTime

synthetic key removed           14:01    49,507   77,248   46,000 ms 

synthetic key left in place      3:27    49,401   77,160   44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting.  I’m happy to learn something here.  I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking.  The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves.  Maybe I just have a bad memory, and failed to find good examples when I searched.  But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage?  Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

96 Replies
mellerbeck
Creator II

Well I like this post by Henric

http://community.qlik.com/blogs/qlikviewdesignblog/2013/04/04/intervalmatch

Further, the data model contains a composite key (the FromDate and ToDate fields) which will manifest itself as a QlikView synthetic key. But have no fear. This synthetic key should be there; not only is it correct, but it is also optimal given the data model. You do not need to remove it.

johnw
Champion III
Author

mellerbeck wrote:

Well I like this post by Henric

http://community.qlik.com/blogs/qlikviewdesignblog/2013/04/04/intervalmatch

Further, the data model contains a composite key (the FromDate and ToDate fields) which will manifest itself as a QlikView synthetic key. But have no fear. This synthetic key should be there; not only is it correct, but it is also optimal given the data model. You do not need to remove it.

Agreed, but there may still be legitimate reasons for removing the synthetic key, even in the given example. Let's say that Events is a huge table that we incrementally load every hour into a QVD, and that our application loads from there. Testing might show that every hour, it takes half a minute to do the intervalmatch, and another half a minute for QlikView to build the synthetic key.

As an alternative, add an IntervalDates field to the Intervals QVD, defined like FromDate*100000 + ToDate or something along those lines. When incrementally loading events, intervalmatch to ONLY the new rows (this assumes our historical intervals don't change), and add the IntervalDates field to the Events QVD. Now our intervalmatch happens in a flash since it's on so few records, and no time is taken building a synthetic key since we now match directly on a single numeric field. With what we said about performance, that should save us a minute per hour, which adds up.
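Something like the following sketch is what I have in mind (table and QVD names are invented; the real script would be part of the existing incremental load):

// The Intervals QVD carries a single numeric key derived from the two dates.
Intervals:
LOAD FromDate,
     ToDate,
     FromDate * 100000 + ToDate AS IntervalDates
FROM Intervals.qvd (qvd);

// Just this hour's new event rows.
NewEvents:
LOAD EventID,
     EventDate
FROM NewEvents.qvd (qvd);

// IntervalMatch only the new rows against the intervals...
Bridge:
IntervalMatch (EventDate)
LOAD FromDate, ToDate
RESIDENT Intervals;

// ...then stamp each new event with the single-field key, so Events and
// Intervals end up linked on IntervalDates alone and no synthetic key is built.
LEFT JOIN (NewEvents)
LOAD EventDate,
     FromDate * 100000 + ToDate AS IntervalDates
RESIDENT Bridge;

DROP TABLE Bridge;

// NewEvents (now carrying IntervalDates) is then appended to the
// historical Events QVD as usual in the incremental load.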

Of course testing might also show that the intervalmatch and the synthetic key are not a performance bottleneck at all, and that we're complicating the process for no particular reason, including putting some interval information on an Event table where it doesn't really belong in a data sense. That's one reason why testing is important. Know what you're doing, and know why you're doing it.

hic
Former Employee

I agree with John: There are often good reasons to remove the synthetic key. But the reason should never be that the synthetic key is bad in itself. Because that is an incorrect assumption.

See more on http://community.qlik.com/blogs/qlikviewdesignblog/2013/04/16/synthetic-keys

HIC

christian77
Partner - Specialist

Thanks, John W., for bringing up this issue. Here are my ideas, briefly.

Questioning the established view is intelligent, interesting, educational and of course fun. Without that, there wouldn’t be evolution.

You get the ‘Jamming the System Award’; almost 70,000 views. It is a very interesting point for many, because it's about the foundation of an application: the data model.

Users can get to know an application better than those who built it.

Every project is different. Every human is different. We speak the same language in a different way, but it is still the same language.

I get very scared when somebody prohibits or vetoes a subject being raised.

We believe what is easier and more convenient for us to believe.

In my case, I allow ‘easy’ synthetic keys.

hugo_andrade
Partner - Creator II

I just realized you might be right, John.

I have a version with a three-field synthetic key, and it works better than the version without it. I support this topic.

Regards,

Hugo.

Not applicable

For one document I used to combine two keys into a single composite key and create a link table. After reading this topic, I realized I could remove this operation from my script and just let QlikView do all the dirty work by creating a synthetic key.


The old link table and the new synthetic key table contain exactly the same number of rows (roughly 300k), and the resulting data shown on screen is identical. Document size is only slightly reduced (<1%), but the most notable difference is the load time, which is now half of what it used to be. I can't measure front-end performance, but it doesn't "feel" any different.


Seems my data model is correct so the synthetic key does not cause any difficulties for me.

hic
Former Employee

I am not surprised. In fact, it is what I would expect. See more on my blog post about Synthetic Keys.

HIC

Not applicable

John,

Very interesting. I am also a bit confused by the whole literature on this subject. Henric speaks about a consensus, and everybody repeats what the consensus says. I tried to understand that specific point, but I fear that I failed. In 10 or 20 years, I will be an expert on that curious point.

We learn that we should avoid these keys.

1) Because of their poor performance. But you stated, and Henric stated as well, that the performance is not actually so poor.

And the literature encourages us to replace them with composite keys, AutoNumber and so forth: I do not really see the point of all that.

2) Because of a poor data model. But most of the time, a data model will be considered poor if one can see a single synthetic key. And the model is changed to avoid these keys, even if the new model is far more difficult to understand and handle than the previous one.

Anyway, very, very interesting post.

I will not avoid these keys anymore.

Fabrice


Not applicable

That's quite a revealing study about synthetic keys, John!

You sound like you have explored the breadth of synthetic keys! I think you would be the best person to approach regarding the problem I am facing with synthetic keys.

I am trying to binary load a QVW document into my Dashboard. The source document I am loading contains 26 synthetic keys, and the Dashboard doesn't pull in any information other than the binary load. Still, my Dashboard doesn't give the same results as the source document.

I noticed that the synthetic table, which has 38 million records in the source document, shows 284 million records after being binary loaded into the Dashboard. I am not sure if the problem is a combination of the binary load and the heavy use of synthetic keys together, but could it be that synthetic tables exhibit different behavior and cause the inconsistent results?

Thanks & Regards,
Raj

johnw
Champion III
Author

If you're loading in a data model with 26 synthetic keys, I can all but guarantee that the data is modeled incorrectly, that the data model itself has bugs. Of course I can't say that conclusively from afar, without knowing your data and how it's been modeled, but that's what I would bet on.

As I understand binary loads, you should have EXACTLY the same data and data structures in memory after you do one. If you're seeing a different number of records, I have to suspect you're either counting wrong, or are doing more than just a binary load in the script. But I've done little more than a few routine maintenance tasks in QlikView over the past couple of years, so perhaps something has changed. Bugs in QlikView itself are also a possibility.
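If you want to check that from the script side, a sketch like this works (the document and table names are hypothetical); BINARY has to be the very first statement, and the table functions let you log row counts to compare against the source document:

// BINARY must be the first statement in the script; it loads the entire
// data model of the source document unchanged, synthetic keys included.
BINARY SourceDocument.qvw;

// Hypothetical sanity check: log one table's row count so it can be
// compared with the same table in the source document.
LET vRows = NoOfRows('Events');
TRACE Rows in Events after binary load: $(vRows);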