Should We Stop Worrying and Love the Synthetic Key?
Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.
I believe that the consensus is wrong.
My understanding of a synthetic key is that it’s a pretty basic data structure. If you load tables like this:
TableA: FieldA, FieldB, FieldC
TableB: FieldA, FieldB, FieldD
What you’ll actually get is this:
TableA: SyntheticKey, FieldC
TableB: SyntheticKey, FieldD
SyntheticKeyTable: SyntheticKey, FieldA, FieldB
Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.
Well, that doesn’t sound like a bad thing. If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually. As far as I can tell, QlikView is simply saving me the trouble.
Two tables can be connected by one field. That shows up as a connection. Two tables can be connected by two or more fields. That shows up as a synthetic key.
Now, maybe you should NOT connect two particular tables by one field. If so, that’s a data model problem that should be fixed. And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.
Similarly, maybe you should NOT connect two particular tables by two or more fields. If so, that’s a data model problem that should be fixed. And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.
Synthetic keys should be no more a problem than automatic connections between tables are a problem. Either can cause problems when they result from a bad data model, when there are too many or the wrong connections. But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables. If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.
What does the reference manual have to say on the subject?
"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.
Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:
· Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.
· Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."
Yikes! Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”. No WONDER everyone tries to remove them!
But I suspect that the reference manual is just poorly written. I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys. Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys. But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem. I’ve never seen a good data model that resulted in a lot of synthetic keys. Doesn’t mean they don’t exist, of course, but I’ve never seen one.
I’d also like to focus on this particular part, “Typical tricks include: Forming your own non-composite keys”. While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA. And THAT is what I’m particularly interested in discussing.
My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage. I’m suggesting that if you have a synthetic key, and you do a direct one to one replacement with your own composite key table, you will not improve performance or memory usage. In fact, I believe performance and memory usage will typically get marginally worse.
I only have one test of my own to provide, from this thread:
In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved. I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key. I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.
load time file KB RAM KB CalcTime
synthetic key removed 14:01 49,507 77,248 46,000 ms
synthetic key left in place 3:27 49,401 77,160 44,797 ms
What I would love to see are COUNTEREXAMPLES to what I’m suggesting. I’m happy to learn something here. I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking. The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves. Maybe I just have a bad memory, and failed to find good examples when I searched. But I just don’t remember seeing them.
So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage? Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?
I just ran a test on a couple of files, and ended up with nearly-identical results for both the synthetic key and the manually-created key. The files I tested on included both one with a synthetic key (which I removed for testing) and one with manually-created keys.
Personally, I have no problem whatsoever with synthetic keys - in my own work, I don't remove them unless I feel there's a good reason to. I think a lot of their bad press stems the fact that synthetic keys are often created (often excessively) when you make a mistake in the data model. This makes people shy away from allowing them to exist. This can get particularly bad in some situations - for example, when you run 'Load *, 1 as myField resident myTable;', you'll get a whole lot of ugly synthetic keys as QV tries to hash out every possible combination of fields.
As a rule of thumb, I teach other developers to create manual keys when a synthetic key would have four or more columns in it. I've never tested if this actually improves performance, but anything over three columns (four synthetic keys, iirc - AB, AC, BC, ABC) tends to make the data model look messy, so I try to avoid those.
I would certainly be interested in seeing results - primarily for application performance, which I think is far more interesting than load times and file size - that would give me a better idea of when (if ever) to eliminate synthetic keys.
You know, I don't think I've ever had a synthetic key on more than two fields, and I haven't tested it either. Just fiddling with it, three matching fields did what I expected - created a single synthetic key, not a synthetic key for each possible combination of the three fields. Same for ten matching fields. I'm going to set up a test with a large number of matching fields and see if it performs like I would expect (about the same but marginally better with the synthetic key).
I agree that script speed tends to be much less important than chart speed, though I try to optimize my script speed too. Many applications are more useful the fresher the data is, and if all your scripts run twice as fast, you may be able to load them twice as often. But no one script is that critical, and I suspect that most synthetic keys can be replaced for much less script overhead than the one case I tested.
I agree that much of the bad press originates from data model mistakes. You make a mistake, it generates one or more synthetic keys, takes forever to load, memory usage shoots through the roof, your charts stop working, and you have a complete mess. You fix the mistake, the synthetic key goes away, and so do all the problems. It's easy to conclude (incorrectly, I believe) that the synthetic key was the cause. But the synthetic key in this case was probably just another symptom of the underlying data model mistake. And sometimes, the mere process of removing the synthetic key can fix your data model mistake. If your experience is then that removing synthetic keys fixes your performance problems, it's easy to believe the synthetic keys are to blame. But I think this is usually or always mere coincidence - while removing the synthetic key, you ALSO fixed your data model.
John Witherspoon wrote:But the synthetic key in this case was probably just another symptom of the underlying data model mistake. And sometimes, the mere process of removing the synthetic key can fix your data model mistake.
Agree. I understand that when a syn key appears that something is going wrong with the model as I didn't make the syn key (nor anything realated) on purpose. I take them more as a warning rather than a problem.
Thanks for sharing your experience on qlikview. But I personally believe that people may not understand if they read this, they will only understood only with experience when they work around the synthetic keys. What I can see is that you have amazing knowledge in qlikview but that is not just because of reading the books it is just because of your experience.
I faced serious problem because of synthetic keys. So I always prefer to avoid the synthetic in my data model.
I also feel that qlikview should be in a position to give the scores on efficiency of the script that we are using in the load script then only we will be in a position to understand whether these synthetic keys is really a problem or not. If qlikview can't say about this then there is not other way to find the same other than your experience.
Rikab Kothari wrote: ...I personally believe that people may not understand if they read this, they will only understood only with experience when they work around the synthetic keys.... I faced serious problem because of synthetic keys. So I always prefer to avoid the synthetic in my data model.... What do you say?
What I suspect is that you have not actually faced serious problems because of synthetic keys. Instead, you have probably faced serious problems because of poor data modeling. Synthetic keys often show up when you model your data poorly. But the synthetic keys do not CAUSE your poor data model, any more than coughing makes you sick. And removing the synthetic keys may not fix your data model, any more than taking cough syrup will make you healthy. I believe you are confusing correlation with causation, or confusing cause and effect.
To fix your data model, you have to understand what is wrong with your data model. Synthetic keys, when you didn't expect them, are certainly a clue that may help you to understand what is wrong. But they are not the actual problem.
When less experienced people encounter synthetic keys, they should NOT ignore them. I wasn't trying to suggest that. I would suggest that they should try to UNDERSTAND them. Why does your data model end up with a synthetic key? Why DO you have the same fields on more than one table? Does it make sense to do that? Does the synthetic key make sense? Does your data model make sense? If so, I'm not aware of any reason to remove the synthetic key. If not, then fix the data model itself. Don't think of the synthetic key as the problem to be solved, or you may simply remove the synthetic key in a way that doesn't fix your data model, such as doing an exact replacement with a concatenated key. That probably wouldn't help.
John Witherspoon wrote: I'm going to set up a test with a large number of matching fields and see if it performs like I would expect (about the same but marginally better with the synthetic key).
OK, I made a trivial 10-field synthetic key. I created two million rows on two tables linked by this synthetic key. I put an X value in one table and a Y value in the other. I then did a scatter chart of sum(X) and sum(Y) by one of the fields in one of the tables. Then I created a concatenated key instead of the synthetic key for comparison.
LoadTime FileSize RAM CalcTime Synthetic Key 1:23 30985 K 167264 K 1344 ms Synthetic Key 1:21 30989 K 167960 K 1391 ms Synthetic Key 1:20 30987 K 167252 K 1375 ms Concatenated Key 3:36 30988 K 167732 K 1375 ms Concatenated Key 3:36 30990 K 168500 K 1437 ms Concatenated Key 3:36 30988 K 169732 K 1359 ms
I'm not sure we can draw any conclusion about memory use and calculation time. I'd need a lot more data points given the closeness and the variability. But the extremely preliminary conclusion would be that the synthetic key version is better for memory use and calculation time, but insignificantly so. The main difference is load time, and while load time isn't nearly so important, why waste it for no improvement?
I'll go ahead and post my test file, though with only 2000 instead of 2000000 rows, in case anyone wants to repeat the testing or tell me how to concatenate my key more efficiently.
Edit: I should mention that this is a very poor data model, and is only being used to test synthetic key performance. If this were a real application with real fields, you should join everything into a single table. The synthetic key would go away, and the application would likely peform somewhat better. However, the synthetic key wasn't the problem (as the concatenated key performance demonstrates); the data model was the problem.
Edit2: I've gone ahead and fixed the data model for comparison purposes, simply loading everything into a single table. It is loading much, much faster, taking much less file size, less RAM, and calculating over twice as fast. What I'm hoping to show here is the difference between simply removing the synthetic key and actually fixing the data model. Either way, the synthetic key goes away. But if all you do is remove it, things get worse. Things only get better when you fix the underlying data model problem. In this example, anyway. And again, this isn't to say that synthetic keys MEAN you have a data model problem. Good data models can contain synthetic keys. This one, however, should not.
LoadTime FileSize RAM CalcTime Synthetic Key 1:23 30985 K 167264 K 1344 ms Synthetic Key 1:21 30989 K 167960 K 1391 ms Synthetic Key 1:20 30987 K 167252 K 1375 ms Concatenated Key 3:36 30988 K 167732 K 1375 ms Concatenated Key 3:36 30990 K 168500 K 1437 ms Concatenated Key 3:36 30988 K 169732 K 1359 ms Data Model Fixed 0:12 19143 K 150408 K 625 ms Data Model Fixed 0:12 19145 K 149112 K 641 ms Data Model Fixed 0:12 19146 K 149228 K 625 ms
Edit3: I was asked to post the script for the "Data Model Fixed" version. It's pretty trivial:
LOAD ceil(rand()*10) as A ,ceil(rand()*10) as B ,ceil(rand()*10) as C ,ceil(rand()*10) as D ,ceil(rand()*10) as E ,ceil(rand()*10) as F ,ceil(rand()*10) as G ,ceil(rand()*10) as H ,ceil(rand()*10) as I ,ceil(rand()*10) as J ,ceil(rand()*50000) as XID ,ceil(rand()*1000) as X ,ceil(rand()*1000) as Y AUTOGENERATE 2000000 ;
Rikab Kothari wrote:I also feel that qlikview should be in a position to give the scores on efficiency of the script that we are using in the load script then only we will be in a position to understand whether these synthetic keys is really a problem or not. If qlikview can't say about this then there is not other way to find the same other than your experience.
QlikView DOES give you a score - the run time. If you change the script, and it runs twice as fast, your script is now twice as efficient. I think that's about the only score QlikView COULD give you. It sounds like you want QlikView to read your script, somehow consider how else you might have written it more efficiently, and then score your code based on a comparison to what it thinks would be the most efficient way to do it. If that program existed, nobody would need you as a programmer. QlikView would write itself. But perhaps I'm misunderstanding your request.
And understanding whether synthetic keys are a problem or not isn't a matter of understanding your script anyway. It's a matter of understanding your data model. MORE than that, it's a matter of understanding what your data model SHOULD BE. Just as with evaluating script, how is a computer program supposed to understand what your data model should be? All it can possibly understand is what it is. And if what it is has a synthetic key in it, well, then that's just how it should be from the computer's standpoint. Only YOU can decide if your data model is correct or not.
So I'll agree with your conclusion, "then there is no other way to find [whether these synthetic keys is really a problem or not] other than your experience." Correct. Only experience with data modeling will tell you if your data model is correct. No computer can tell you. Only experience can tell you.
As John pointed out, the creating of simple synthetic keys with a synthetic key table is very much the traditional response to the problem of multiple keys between tables.
But Qlikview is not a traditional database (and Mr Codd would have certainly objected to hearing it referred to in the same sentence). The way we should think and design with Qlikview is very different and it takes a mind shift to, for example, not to get hung up about duplication of values.
The truth is we have to take a leap of faith with Qliktech and specifically the documentation. We can try to perform 'tests' that produce often inconclusive results and may be invalid under different circumstances.
My pet gripe about the product is Qliktechs absolute reluctance to discuss anything related to performance and optimisation and here is another example where they are the only ones who can resolve it.
So, I would say if you are happy with the time it takes for the script to run and the perfomance of the document in use then let simple synthetic keys be, move on and create something new that adds real value to your business.
Have you found QlikTech to be reluctant to discuss performance and optimization?
Other than simple bug reports, I've only really dealt with QlikTech on two technical issues, an in both cases they were very forthcoming.
The first was the issue of mathematical imprecision, where they confirmed for me that they were using a binary floating point to store numbers, which is why some of the numeric functions have "unpredictable" and incorrect results. (Edit: I should probably add "occasionally" to that. Most of the time, numbers behave as you would expect in QlikView. I do accounting reports in QlikView, and have not implemented known workarounds. To me, it isn't a big enough problem to worry about. See http://community.qlik.com/wikis/qlikview-wiki/explanation-of-mathematical-errors-in-qlikview.aspx for information and workarounds.)
The second was the issue of set analysis performance vs. other approaches, which resulted in an open (if very high level) discussion of how set analysis actually behaves internally, and what we should be able to conclude about performance from that behavior. That also pointed out that set analysis in an early release of version 9 was not performing as expected, leading them to do some additional fixes and optimization.
This case seems similar in a way. I THINK I understand how synthetic keys behave and how they are stored. Well, I think I'm pretty solid on one level, and a little sketchy on another. On one level, where we can actually talk about tables and the like, I think a synthetic key is just another table, like the one I showed near the top of the original post. This is strongly suggested by the "Internal Table View" in the table viewer. For example, for a 10-field synthetic key, I see this:
I believe the only significant differences between this and a real table is that it is built automatically, and that you cannot refer to the table by name or to the synthetic key by name. The table name and synthetic key name are purely internal values, used only by the product itself.
The level where I'm a little or even a lot sketchy is that I'm just generally sketchy on QlikView's internal storage format for tables. I don't think it's storing them in anything approaching the relational model that it displays when you look at the "Internal Table View". It appears to directly store lists of (or at least a lists of pointers to) the distinct values of each field, the evidence for which is how much faster LOAD fieldvalue('Field',iterno()) can generate a list of values than LOAD DISTINCT Field on large tables.
"QlikView data is stored in a straightforward tabular format.
QlikView data is compressed via what QlikTech calls a "symbol table," but I generally call "dictionary" or "token" compression.
QlikView typically gets at its data via scans. There is very little in the way of precomputed aggregates, indexes, and the like. Of course, if the selection happens to be in line with the order in which the records are sorted, you can get great selectivity in a scan.
One advantage of doing token compression is that all the fields in a column wind up being the same length. Thus, QlikView holds its data in nice arrays, so the addresses of individual rows can often be easily calculated.
To get its UI flexibility, QlikView implicitly assumes a star/snowflake schema. That is, there should be no more and no less than one possible join path between any pair of tables. In some cases, this means one will want to rename fields as part of QlikView load scripts."
While I'm not sure how much that tells us about synthetic keys, I see nothing there that would lead me to believe that synthetic keys and the synthetic key table are stored any differently than a regular key and a regular table. There are obviously minor differences, such as perhaps the synthetic key being flagged as such. But it certainly seems like it is mostly the same, both from thinking theoretically about how they might store the data, and from checking the actual performance differences (very minor once built) between a concatenated key table and a synthetic key table. I'm guessing that they aren't just nearly identical in the "Internal Table View", but that they are also nearly identical as internally stored and compressed by QlikView.
That IS something I'd love to have confirmed by a techie from QlikTech. If so, that would be fairly solid confirmation that MERELY replacing a synthetic key with a concatenated key is useless, because it would be stored almost exactly the same internally. You would have accomplished nothing, essentially.
Is there any techie from QlikTech reading this and able to confirm how synthetic keys and synthetic key tables are stored internally?