Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.
I believe that the consensus is wrong.
My understanding of a synthetic key is that it’s a pretty basic data structure. If you load tables like this:
TableA: FieldA, FieldB, FieldC
TableB: FieldA, FieldB, FieldD
What you’ll actually get is this:
TableA: SyntheticKey, FieldC
TableB: SyntheticKey, FieldD
SyntheticKeyTable: SyntheticKey, FieldA, FieldB
Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
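For example, a minimal script like this (the table names, field names and sample values are just placeholders) is enough to produce that structure:

TableA:
LOAD * INLINE [
FieldA, FieldB, FieldC
1, x, 100
2, y, 200
];

TableB:
LOAD * INLINE [
FieldA, FieldB, FieldD
1, x, red
2, y, blue
];

After the reload, the table viewer shows TableA and TableB linked through a synthetic key table on FieldA and FieldB, exactly as sketched above.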
Well, that doesn’t sound like a bad thing. If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually. As far as I can tell, QlikView is simply saving me the trouble.
Two tables can be connected by one field. That shows up as a connection. Two tables can be connected by two or more fields. That shows up as a synthetic key.
Now, maybe you should NOT connect two particular tables by one field. If so, that’s a data model problem that should be fixed. And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.
Similarly, maybe you should NOT connect two particular tables by two or more fields. If so, that’s a data model problem that should be fixed. And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.
Synthetic keys should be no more a problem than automatic connections between tables are a problem. Either can cause problems when they result from a bad data model, when there are too many or the wrong connections. But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables. If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.
What does the reference manual have to say on the subject?
"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.
Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:
· Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.
· Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."
Yikes! Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”. No WONDER everyone tries to remove them!
But I suspect that the reference manual is just poorly written. I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys. Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys. But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem. I’ve never seen a good data model that resulted in a lot of synthetic keys. Doesn’t mean they don’t exist, of course, but I’ve never seen one.
I’d also like to focus on this particular part, “Typical tricks include: Forming your own non-composite keys”. While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA. And THAT is what I’m particularly interested in discussing.
My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage. I’m suggesting that if you have a synthetic key, and you do a direct one to one replacement with your own composite key table, you will not improve performance or memory usage. In fact, I believe performance and memory usage will typically get marginally worse.
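To be concrete about what that one-to-one replacement looks like, here is a sketch of the usual approach (table, field and file names are only placeholders):

// Fact tables keep only the composite key, not the original key fields.
TableA:
LOAD
    AutoNumber(FieldA & '|' & FieldB) AS %CompositeKey,
    FieldC
FROM TableA.qvd (qvd);

TableB:
LOAD
    AutoNumber(FieldA & '|' & FieldB) AS %CompositeKey,
    FieldD
FROM TableB.qvd (qvd);

// Hand-built equivalent of the synthetic key table: every occurring
// combination of FieldA and FieldB from both sources.
KeyTemp:
LOAD FieldA, FieldB FROM TableA.qvd (qvd);
LOAD FieldA, FieldB FROM TableB.qvd (qvd);   // auto-concatenates (same fields)

KeyTable:
LOAD DISTINCT
    AutoNumber(FieldA & '|' & FieldB) AS %CompositeKey,
    FieldA,
    FieldB
RESIDENT KeyTemp;

DROP TABLE KeyTemp;

The KeyTable is a hand-built stand-in for the synthetic key table QlikView would otherwise have created on its own, which is exactly why I doubt it buys anything.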
I only have one test of my own to provide, from this thread:
http://community.qlik.com/forums/t/23510.aspx
In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved. I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key. I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.
                              load time    file KB    RAM KB    CalcTime
synthetic key removed            14:01     49,507     77,248    46,000 ms
synthetic key left in place       3:27     49,401     77,160    44,797 ms
What I would love to see are COUNTEREXAMPLES to what I’m suggesting. I’m happy to learn something here. I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking. The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves. Maybe I just have a bad memory, and failed to find good examples when I searched. But I just don’t remember seeing them.
So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage? Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?
Oleg Troyansky wrote: we need to be careful when promoting a "maverick" hypothesis into a status of an expert recommendation.
It's true. And I almost certainly came off too strong in my initial post.
For now at least, I'm mostly asking a question and making a hypothesis. If I'm making any "expert recommendation" at present, it would be to CONSIDER leaving synthetic keys in as ONE of your options that you test. I feel like I've seen sufficient evidence to at least consider that option, even if I've not seen sufficient evidence to support a much broader hypothesis like "synthetic keys are never the cause of a problem, even if they are coincident with a problem".
Thinking further, and I suppose this just echoes much of what I've said above, I have two main reasons for believing what I believe.
I could certainly be wrong. Just because I can't think of a reason for a difference doesn't mean there isn't a perfectly good reason. The testing I've done has been very simplistic. There is strong anecdotal evidence suggesting that synthetic keys ARE a problem. Even the reference manual urges caution. But if I am wrong, it would be fascinating to me to know WHY I'm wrong.
You say "I've seen synthetic keys crashing servers, getting QlikView "hung", raising RAM requirements beyond anything imaginable". Can you provide a simple example of this? Or perhaps the synthetic key problem only arises when the examples are NOT simple?
Definitely waiting to hear from R&D. I'm hoping we get a much more thorough answer than "yes" or "no", or I'm afraid it won't be particularly informative.
John,
I don't pretend to know all the answers, and I don't have time to perform a full-blown performance test. I'm mostly going by my prior experience of seeing QlikView use substantially more resources when it's forced to build a synthetic key, compared to when we build composite keys ourselves as part of the script.
The speculation I can offer about the possible cause of the difference is based on how I intuitively feel about it:
When we build the composite key, we give QlikView very specific instructions about HOW to build it - what tables, fields and values to use, etc... In contrast, when we leave multiple keys (sometimes in more than one pair of tables) for QlikView to "figure it out", QlikView has to determine - what synthetic keys need to be built? What tables are affected? What are all the possible permutations of the involved fields and their values? QlikView has to build the "game plan" before actually building the key, and I suspect that algorithm is not a simple one. I can totally believe that, in an attempt to cover all possible situations, the algorithm has to go through a lot of unnecessary iterations before settling on a certain solution. On the other hand, none of that is needed when we build the keys on our own - we tell QlikView what to load, and QlikView simply loads a few fields from a few tables. I can easily understand why it would be faster...
In a completely metaphoric way, the difference is similar to the difference between performing "parallel parking" and writing a program that performs "parallel parking". Everyone can parallel-park, but very few people can embark on covering all possible situations in a computer program that could do it for you.
Obviously, all those speculations could be full of ... nothing, but the empirical evidence of many years of delivering QlikView applications tells me that we are much better off without synthetic keys than with them...
just my two cents...
Oleg Troyansky wrote: The speculation I can offer about the possible cause of the difference is based on how I intuitively feel about it:
When we build the composite key, we give QlikView very specific instructions about HOW to build it - what tables, fields and values to use, etc... In contrast, when we leave multiple keys (sometimes in more than one pair of tables) for QlikView to "figure it out", QlikView has to determine - what synthetic keys need to be built? What tables are affected? What are all the possible permutations of the involved fields and their values? QlikView has to build the "game plan" before actually building the key, and I suspect that algorithm is not a simple one. I can totally believe that, in an attempt to cover all possible situations, the algorithm has to go through a lot of unnecessary iterations before settling on a certain solution. On the other hand, none of that is needed when we build the keys on our own - we tell QlikView what to load, and QlikView simply loads a few fields from a few tables. I can easily understand why it would be faster...
OK, what you say makes a lot of sense, and I haven't been thinking of it that way. It's also very consistent with my interpretation of the warnings in the reference manual:
"When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them."
And my interpretation:
"I don't think these warnings are about having synthetic keys; I think they're about having a LOT of synthetic keys."
I haven't encountered this situation in practice, except when I've had data model mistakes. But there are certainly good data models where you want multiple tables connected by multiple fields. If a lot of these fields are the same, perhaps it is ambiguous how the tables should be connected. QlikView does its best to try to figure out what you wanted, but it could get it wrong, and the synthetic keys that result could be all wrong and cause a serious problem. That seems VERY plausible from a theoretical standpoint, and again, is very consistent with the reference manual. In such a situation, it makes perfect sense to build your own concatenated keys, just like the reference manual recommended, so now the recommendation is making much more sense to me. Sometimes it is easier to just tell a computer what to do than to hope it figures it out.
So I'll now admit that the general form of my hypothesis, "synthetic keys aren't a problem if you have a good data model", isn't really true. Even if you only have two tables in your data model that share two fields, in a sense, your data model isn't good, because you haven't indicated how those two tables are supposed to connect. Now in a simple case like this, it's easy for QlikView to figure out, and I'll still hypothesize that the resulting synthetic key won't cause any problems. But in a fundamental sense, QlikView is completing your data model for you, so one of the basic assumptions in my hypothesis, "you have a good data model", is simply incorrect. In a fairly fundamental sense, if you haven't specified how the tables are to be connected, you do NOT have a good data model yet.
So I now see two risks associated with synthetic keys that weren't clear to me before. First, if the data model is ambiguous enough, there could be performance and memory problems as QlikView tries to guess what you probably wanted. Second, it might guess incorrectly, and the resulting data model could be simply wrong, and/or cause performance and memory problems.
I still suspect that in most practical cases, we're doing something simple like connecting two tables by two fields. In such a case, there is no ambiguity. QlikView should be able to get it correct without wasting time or memory, and the resulting data structure should be at least as efficient as if you'd modeled the connection manually with a concatenated key. I can't prove those "should"s, though. They're just educated guesses combined with overly-simplistic testing.
But regardless, any strong form of my hypothesis is likely simply incorrect. Synthetic keys probably CAN cause a problem when the data model is ambiguous. And to remove such ambiguity, you'll need to model the connections between tables yourself, manually. And that right there is the traditional advice. It at least makes more sense to me now, even if I still suspect you'd be better off leaving things alone in most cases.
Assuming we don't hear back from R&D, and when I can get some time, I should try to figure out how to make an ambiguous data model to try to create an actual synthetic key problem. People do it all the time by accident, including me, so in a sense it should be easy. The hard part, I think, is to do it with what is otherwise a correct data model, and where only the connections are ambiguous.
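Something along these lines might be a starting point (untested, and the table and field names are made up), since the shared fields overlap in more than one way:

Orders:
LOAD OrderID, CustomerID, ProductID, OrderDate, Amount FROM Orders.qvd (qvd);

Shipments:
LOAD OrderID, CustomerID, ProductID, ShipDate FROM Shipments.qvd (qvd);

Returns:
LOAD OrderID, ProductID, ShipDate, Reason FROM Returns.qvd (qvd);

Orders and Shipments share OrderID, CustomerID and ProductID, while Shipments and Returns share OrderID, ProductID and ShipDate, so QlikView has to untangle several overlapping multi-field links at once rather than one clean composite key.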
Thank you all for a very interesting thread. I thought I would quickly weigh in with my perspective... as much as that can be considered the "official" QlikView perspective. In short, John has it right. Synthetic keys are neither bad nor good, they just are QlikView's attempt to resolve a data model linkage. If that linkage makes sense, from a data model perspective, then a synthetic key is neither better nor worse than resolving the linkage manually in the load script.
So, why the bad reputation? Largely this is because synthetic keys are automatic. So, the most typical case is a new user who creates links between tables that were not expected, and this results in large, and meaningless, synthetic keys. This case is not likely to perform well, and worse yet, is likely to have unexpected/meaningless results in the UI. Also, synthetic keys may create unexpected linkages with unexpected meanings. When you explicitly link tables together, you have to think through the meaning of linking those tables/fields. When QlikView does it automatically, it may have a meaning you didn't intend or expect.
My practical suggestion would be to use synthetic keys as a warning. If you see these pop up, especially if you didn't expect them, you should investigate what tables you've loaded, what fields are in those tables, and what data model you *intended*. If QlikView automagically did the right thing, then you're fine. If something unintended happened, then you may need to rename some columns to resolve an unexpected link.
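For example (field names made up): if a generic field like Date appears in both your calendar table and a budget table but was never meant to be a key, renaming it in one of the loads removes the unintended link:

Budget:
LOAD
    Region,
    Date AS BudgetDate,        // renamed so it no longer links to the calendar's Date field
    Amount AS BudgetAmount
FROM Budget.qvd (qvd);

A QUALIFY statement on the offending fields accomplishes the same thing.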
My two cents (no time for much more...)
If you are a beginner, the recommendation stays true - avoid synthetic keys. As a rule, they are the result of poor design.
If you're an expert, it's a different story. You know the risks and understand what you're doing.
(Compare it to driving a car: you might, by chance, manage to drive on two left wheels, but better to leave that to the experts :-))
Anthony,
thank you very much for your explanation, it's wonderful to shed some light on this issue. One clarifying question for you:
Let's suppose that we have a situation with an "expected" synthetic key (or a number of synthetic keys) in a large data set. Would you expect any substantial difference in performance between building the composite keys manually and letting them be built automatically? Can QlikView run out of resources trying to build all the synthetic keys automatically, when the same load could have succeeded with the keys explicitly specified?
The reason for my question - in my practice, I've seen many instances where we had to "kill" load processes that were taking too much time and resources because of an overlooked synthetic key. Admittedly, most of the time those were not intentional - in most cases, it happened when we forgot to drop a table or rename a field. Intuitively, I always thought that it's more "cost-efficient" to build the composite keys explicitly than to let QlikView figure it out automatically.
What would be your recommendation here?
Oleg Troyansky wrote:
Anthony, thank you very much for your explanation, it's wonderful to shed some light on this issue. One clarifying question for you:
Let's suppose that we have a situation with an "expected" synthetic key (or a number of synthetic keys) in a large data set. Would you expect any substantial difference in performance between building the composite keys manually and letting them be built automatically? Can QlikView run out of resources trying to build all the synthetic keys automatically, when the same load could have succeeded with the keys explicitly specified?
The reason for my question - in my practice, I've seen many instances where we had to "kill" load processes that were taking too much time and resources because of an overlooked synthetic key. Admittedly, most of the time those were not intentional - in most cases, it happened when we forgot to drop a table or rename a field. Intuitively, I always thought that it's more "cost-efficient" to build the composite keys explicitly than to let QlikView figure it out automatically.
What would be your recommendation here?
Oleg-- In your question you say, "I've had to kill loads... admittedly when the keys were not intentional". This is exactly my point... if you want a join between two+ tables on two+ fields, and that is "expected", then QV should handle it "as well as" if you resolved the key manually. If you end up with an unexpected join... then all bets are off.
Why? Because when you're joining two+ tables on two+ fields that weren't expected to join, then there are likely to be MANY unique combinations of the two+ fields. In other words, it is quite likely that you will build a massive synthetic key table... in fact far more massive than ever makes sense. That is why QV starts consuming memory, etc... it's trying to figure out how to union all these unique values between these fields.
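To put rough, purely illustrative numbers on it: the synthetic key table holds one row for every distinct combination of the shared fields that actually occurs in the linked tables. If two 5-million-row tables accidentally share a couple of near-unique fields, almost every row contributes its own combination, so the key table alone can approach 10 million rows; and when several such accidental keys overlap, QlikView also has to build synthetic keys on top of the synthetic keys, multiplying the time and memory.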
Hope that helps...
-A
Anthony Deighton wrote: if you want a join between two+ tables on two+ fields, and that is "expected", then QV should handle it "as well as" if you resolved the key manually.
So to confirm, you're saying that as long as the synthetic keys that QlikView builds are correct, then QlikView will handle those keys as well as if we resolved the same keys manually in the load?
And by "handle it 'as well as'" do you mean in the final data model, so this is mostly talking about memory use and chart performance?
What about script performance and memory usage? I would guess that as long as the synthetic keys are intended, straightforward and correct, then it should be no slower, and may sometimes be significantly faster to allow QlikView to figure them out automatically. For example, in my test of a 10-field key, the load took 3:36 when I built the concatenated key manually, and only 1:21 when QlikView built it for me. There was a similar difference in another example I tested in another thread, but I'm too lazy to track that one down. (Edit: 14:01 vs. 3:27)
That said, what about the possible exception that Oleg and I discussed earlier? When you have a lot of legitimate multi-field keys that you're letting QlikView build for you, can it become difficult (but not impossible) for QlikView to figure out the right key structure? So in some cases, might the script execute faster instead of slower if you built the same keys manually, simply because of the difficulty of building them automatically? Or might we expect the creation of synthetic keys to ALWAYS be as fast or faster, no matter what the key structure, than building the exact same key structure manually?
Very nice. Now I understand what these tables are.
Thanks