johnw
Champion III

Should We Stop Worrying and Love the Synthetic Key?

Synthetic keys have a bad reputation.  The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure.  If you load tables like this:

TableA:  FieldA, FieldB, FieldC

TableB:  FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA:             SyntheticKey, FieldC

TableB:             SyntheticKey, FieldD

SyntheticKeyTable:  SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
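In fact, you can see this for yourself with a trivial script (the inline data here is made up purely for illustration):

TableA:
LOAD * INLINE [
FieldA, FieldB, FieldC
1, 10, red
2, 20, blue
];

TableB:
LOAD * INLINE [
FieldA, FieldB, FieldD
1, 10, north
2, 20, south
];

// After reload, the table viewer shows TableA and TableB linked through
// a "$Syn 1 Table" containing FieldA and FieldB, exactly as sketched above.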

Well, that doesn’t sound like a bad thing.  If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually.  As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field.  That shows up as a connection.  Two tables can be connected by two or more fields.  That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields.  If so, that’s a data model problem that should be fixed.  And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem.  Either can cause problems when they result from a bad data model, when there are too many or the wrong connections.  But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables.  If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

• Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.

• Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."
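(To make that second trick concrete, here is a sketch in script form.  The table names, dates, and amounts are mine; the point is simply that Year and Month are derived in one table only:)

Calendar:
LOAD Date,
     year(Date) as Year,     // derived calendar fields live here and only here
     month(Date) as Month;
LOAD date#(DateText, 'YYYY-MM-DD') as Date INLINE [
DateText
2011-01-15
2011-02-20
];

Facts:
LOAD date#(DateText, 'YYYY-MM-DD') as Date,     // just the key, no Year or Month
     Amount
INLINE [
DateText, Amount
2011-01-15, 100
2011-02-20, 200
];

// The tables now connect on Date alone.  Loading Year or Month from both
// tables would widen the connection and produce an unintended synthetic key.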

Yikes!  Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”.  No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written.  I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys.  Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys.  But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem.  I’ve never seen a good data model that resulted in a lot of synthetic keys.  Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include:  Forming your own non-composite keys”.  While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA.  And THAT is what I’m particularly interested in discussing.

My belief is that there are few if any GOOD data models where this trick will actually improve performance and memory usage.  I'm suggesting that if you have a synthetic key, and you do a direct one-to-one replacement with your own composite key table, you will not improve performance or memory usage.  In fact, I believe performance and memory usage will typically get marginally worse.
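To be concrete about what I mean by a one-to-one replacement, here is the composite key version of the TableA/TableB example from the top of this post (a sketch; the ABKey name and the inline data are mine):

TableA:
LOAD autonumber(FieldA & '|' & FieldB) as ABKey,
     FieldC
INLINE [
FieldA, FieldB, FieldC
1, 10, red
2, 20, blue
];

TableB:
LOAD autonumber(FieldA & '|' & FieldB) as ABKey,
     FieldD
INLINE [
FieldA, FieldB, FieldD
1, 10, north
2, 20, south
];

ABKeyTable:
LOAD DISTINCT
     autonumber(FieldA & '|' & FieldB) as ABKey,
     FieldA,
     FieldB
INLINE [
FieldA, FieldB
1, 10
2, 20
];

// Structurally this is the same thing QlikView builds automatically: two data
// tables linked through a key table carrying FieldA and FieldB.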

I only have one test of my own to provide, from this thread:

http://community.qlik.com/forums/t/23510.aspx

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved.  I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key.  I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

                            load time   file KB   RAM KB    CalcTime

synthetic key removed           14:01    49,507   77,248   46,000 ms 

synthetic key left in place      3:27    49,401   77,160   44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting.  I’m happy to learn something here.  I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking.  The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves.  Maybe I just have a bad memory, and failed to find good examples when I searched.  But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage?  Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

96 Replies
Not applicable

John,

I see it, although I couldn't begin to offer a lucid thought on this topic! I'll see if I can get a couple of guys in R&D to take a look and share their thoughts.

johnw
Champion III
Author

Thanks, Jason!

Or
MVP

Just a quick note regarding my earlier post - I double checked, and the complexity in synthetic keys was related to the fact that key columns appeared in multiple tables, each time within a different key context. I'm sure if I had gone through and only eliminated the columns that resulted in data model issues / circular references, I'd have been left with a reasonable number of synthetic keys - but of course, when you're eliminating them anyway, it's often easier to just eliminate the whole bunch instead of trying to figure out which ones you can keep.

Looking forward to seeing what QV's technical folks have to say about this.. 🙂

Not applicable

Interesting as always, John.

My perspective on the whole thing stems from the fact that QlikView rationalizes synthetic keys this way in the first place. The simplest question here is: if it's so terrible, why does QV do it at all?

The fact that QV rationalizes correctly modeled data in a way that's consistent with good design says to me that this is a totally legit method of creating these data models.

I think the trepidation with synthetic keys stems from two places:

1. Bad data modeling design
2. General lack of understanding of what a synthetic key is

As you say, bad data design results in 'bad' synthetic keys, which result in bad performance, and this usually stems from lack of experience in data design. From there, people assume that synthetic keys themselves are bad, when they're not.

If you have a data model that uses synthetic keys, and that's what you intended, then by all means, use a synthetic key. I can't imagine why you would go through the trouble of rationalizing the key manually, and creating your own composite key, when QV is prepared to do the heavy lifting for you.

Had I encountered a situation where I expected to see a synthetic key in my data model, and I did see such a synthetic key, I'd leave it right where it was. I do, however, get slightly more concerned when I've had a total blonde moment, and created multiple compound synthetic keys, and my model goes all nutty.

As you pointed out in the QV documentation you quoted, it MAY do this, and it MAY do that. It's like that warning label on your coffee that has to tell you that it's hot… because some genius put it between their legs once, and got 3rd degree crotch burns. Yup, that coffee MAY give you 3rd degree crotch burns. It may not, however, if you use it properly.

I suspect that a little bad press has done synthetic keys in for the general user community.

nathanfurby
Specialist

Thanks John - as a beginner it's important to hear these kinds of discussions. During my training I was definitely taught that synthetic keys should never exist and should indeed be replaced by user created keys. That never made huge sense to me.

Colin-Albert

There is one area where synthetic keys do make life more difficult - when using the table viewer, you can only see the data that makes up the synthetic key in the synthetic key table itself; other tables that include SynX data do not display the synthetic key value, just the other fields in the table.

This can make it a lot more difficult to determine what is going on in the associations between the data when a report does something unexpected.

johnw
Champion III
Author


Colin_Albert wrote: There is one area where synthetic keys do make life more difficult - when using the table viewer, you can only see the data that makes up the synthetic key in the synthetic key table itself; other tables that include SynX data do not display the synthetic key value, just the other fields in the table.
This can make it a lot more difficult to determine what is going on in the associations between the data when a report does something unexpected.


The view IS different, but it's the synthetic key version that shows all the fields in each table, not the concatenated key version. Or have I misunderstood what you're saying?

With synthetic key: [table viewer screenshot]

With concatenated key: [table viewer screenshot]

Both views make equal sense to me, though I do prefer the less cluttered second view. It sounds like you prefer seeing all the fields. I think this is going to be a personal preference thing, not a case where one way is better than the other. After all, if QlikTech felt that the second view was more informative, they could easily display synthetic keys this way. They don't, so it appears that QlikTech feels it is more informative to display all of the fields in all three "tables".

Oleg_Troyansky
Partner Ambassador/MVP

John,

with all due respect, this idea sounds like an interesting hypothesis, but since it goes against everything we've known and been taught before, I would suggest being careful about propagating and recommending it until we can hear an opinion from R&D about it. There is a lot of interesting stuff happening behind the scenes in QlikView, and all we can do is speculate and theorize.

Working with large data volumes, I've seen synthetic keys crashing servers, getting QlikView "hung", raising RAM requirements beyond anything imaginable - the whole nine yards of fun. With all this experience, I would never recommend that anyone "love the synthetic keys".

On a similar note, I've seen an "analysis" performed in a lab with thousands of 1 and 0 values, allegedly proving that IF formulas perform as well as or better than calculations with flags or Set Analysis conditions. Later on, I've met several beginners confused by that analysis. The fact remains the same - IF formulas kill performance on large data sets, and synthetic keys can't be good on large data sets.

It's good to challenge basics and "rock the boat" once in a while, but we need to be careful when promoting a "maverick" hypothesis to the status of an expert recommendation.

This is, of course, just my personal opinion. It would be extremely interesting to hear from R&D on this issue.

Not applicable

I just get the feeling that this post owes its headline to the movie "Dr Strangelove or how I stopped worrying and love the bomb"!



johnw
Champion III
Author


sharond31 wrote: I just get the feeling that this post owes its headline to the movie "Dr Strangelove or how I stopped worrying and love the bomb"!


Well of course. "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"

Which, if you think about it, is hardly a ringing endorsement for using synthetic keys. *chuckle*