Re: Should We Stop Worrying and Love the Synthetic... - Page 9 - Qlik Community

johnw · ‎2010-06-30

Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed.

I believe that the consensus is wrong.

My understanding of a synthetic key is that it’s a pretty basic data structure. If you load tables like this:

TableA: FieldA, FieldB, FieldC

TableB: FieldA, FieldB, FieldD

What you’ll actually get is this:

TableA: SyntheticKey, FieldC

TableB: SyntheticKey, FieldD

SyntheticKeyTable: SyntheticKey, FieldA, FieldB

Where neither the synthetic key nor the synthetic key table can be directly referenced, only the original fields.

Well, that doesn’t sound like a bad thing. If I had two tables that I legitimately wanted to connect by two fields, that’s pretty much what I would do manually. As far as I can tell, QlikView is simply saving me the trouble.

Two tables can be connected by one field. That shows up as a connection. Two tables can be connected by two or more fields. That shows up as a synthetic key.

Now, maybe you should NOT connect two particular tables by one field. If so, that’s a data model problem that should be fixed. And while you could say that THAT connection is a problem, you should never say that CONNECTIONS are a problem.

Similarly, maybe you should NOT connect two particular tables by two or more fields. If so, that’s a data model problem that should be fixed. And while you could say that THAT synthetic key is a problem, perhaps you should never say that SYNTHETIC KEYS are a problem.

Synthetic keys should be no more a problem than automatic connections between tables are a problem. Either can cause problems when they result from a bad data model, when there are too many or the wrong connections. But neither should cause problems when they result from a good data model. You should not remove all synthetic keys any more than you should remove all connections between tables. If it is appropriate to connect two tables on two or more fields, I believe it is appropriate to use a synthetic key.

What does the reference manual have to say on the subject?

"When two or more input tables have two or more fields in common, this implies a composite key relationship. QlikView handles this through synthetic keys. These keys are anonymous fields that represent all occurring combinations of the composite key. When the number of composite keys increases, depending on data amounts, table structure and other factors, QlikView may or may not handle them gracefully. QlikView may end up using excessive amount of time and/or memory. Unfortunately the actual limitations are virtually impossible to predict, which leaves only trial and error as a practical method to determine them.

Therefore we recommend an overall analysis of the intended table structure by the application designer. Typical tricks include:

· Forming your own non-composite keys, typically using string concatenation inside an AutoNumber script function.

· Making sure only the necessary fields connect. If you for example use a date as a key, make sure you do not load e.g. year, month or day_of_month from more than one input table."

Yikes! Dire warnings like “may not handle them gracefully” and “may end up using excessive amount of time and/or memory” and “impossible to predict”. No WONDER everyone tries to remove them!

But I suspect that the reference manual is just poorly written. I don’t think these warnings are about having synthetic keys; I think they’re about having a LOT of synthetic keys. Like many of you, I’ve gotten that nasty virtual memory error at the end of a data load as QlikView builds large numbers of synthetic keys. But the only time I’ve ever seen this happen is when I’ve introduced a serious data model problem. I’ve never seen a good data model that resulted in a lot of synthetic keys. Doesn’t mean they don’t exist, of course, but I’ve never seen one.

I’d also like to focus on this particular part, “Typical tricks include: Forming your own non-composite keys”. While I agree that this is a typical trick, I also believe it is useless at best, and typically A BAD IDEA. And THAT is what I’m particularly interested in discussing.

My belief is that there is no or almost no GOOD data model where this trick will actually improve performance and memory usage. I’m suggesting that if you have a synthetic key, and you do a direct one to one replacement with your own composite key table, you will not improve performance or memory usage. In fact, I believe performance and memory usage will typically get marginally worse.

I only have one test of my own to provide, from this thread:

http://community.qlik.com/forums/t/23510.aspx

In the thread, a synthetic key was blamed for some performance and memory problems, and it was stated that when the synthetic key was removed, these problems were solved. I explained that the problem was actually a data modeling problem, where the new version had actually corrected the data model itself in addition to removing the synthetic key. I then demonstrated that if the synthetic key was reintroduced to the corrected data model, script performance was significantly improved, while application performance and memory usage were marginally improved.

load time file KB RAM KB CalcTime

synthetic key removed 14:01 49,507 77,248 46,000 ms

synthetic key left in place 3:27 49,401 77,160 44,797 ms

What I would love to see are COUNTEREXAMPLES to what I’m suggesting. I’m happy to learn something here. I’ve seen plenty of vague suggestions that the synthetic keys have hurt performance and wasted memory, but actual examples seem to be lacking. The few I ran across all looked like they were caused by data model problems rather than by the synthetic keys themselves. Maybe I just have a bad memory, and failed to find good examples when I searched. But I just don’t remember seeing them.

So who has a script that produces a good data model with a composite key table, where removing the composite key table and allowing QlikView to build a synthetic key instead decreases performance or increases memory usage? Who has an actual problem CAUSED by synthetic keys, rather than by an underlying data model problem?

johnw · ‎2018-10-29

While talking in general terms about the user-friendliness of QlikView would be off topic, I think your concern with synthetic key behavior is fairly central to the topic. To borrow from the thread title, I find it difficult to stop worrying when I have no explanation for the seemingly-inconsistent behavior in your examples. I'm sure there is an explanation. But I'm not sure if we'll find one, or that if we do, the explanation will satisfy me.

That said, I consider the example data model to be flawed from the very start, before we even get to what synthetic keys QlikView will or won't build, or how the binary load behaves. To me, tables A and B are the same table. Same key = same table. Where I work, we actually have all sorts of examples like A and B, where we have two or more tables with the exact same key. But in these cases, it's still really one abstract "table", where we've broken it into pieces for technical reasons. But the technical reasons that apply in our database system do not apply in QlikView, so when reading in tables A and B, I would always join them back together into one table. We also don't allow nulls in key fields where I work, so I wouldn't allow the null value for KEY1 in your "x" record on table A. I would force it to be blank instead. Technically, our database system would force it to be blank instead, but in any case, I wouldn't let it be null in QlikView. So even if I weren't setting out to remove synthetic keys, I would load the data in a way that would be equivalent to this:

C:

LOAD * INLINE [

KEY1, KEY2, C

1, 2, c

];

AB:

LOAD * INLINE [

KEY1, KEY2, KEY3, A

1, 2, 3, a

y, y, y, y

, x, x, x

];

LEFT JOIN (AB)

LOAD * INLINE [

KEY1, KEY2, KEY3, B

1, 2, 3, b

];

And that gives me the results I expect, a data model that shows the relationships between the records that I would expect. It still has a synthetic key, but only one. It's easily removed, and I might do so in practice, but this is a case where I'd expect to be able to leave it there with no ill effects, no strange behavior. And indeed, the strange behavior we saw from the combination of multiple synthetic keys and a binary load does go away.

That's not to say I'm comfortable with the behavior you found. I don't care how messed up the data model is, I would expect a binary load with no further script, and a binary load with some unrelated SET statements to result in the same data and data model. It seems like there's something funny going on when it tries to rebuild the synthetic key after the sets. That makes me uncomfortable. There's probably a logical explanation, but I don't know it and that makes me uncomfortable.

swuehl · ‎2018-10-29

Hi John, it's always a pleasure to read your posts!

I am also kind of worried about the issue Thomas pointed to. I've put my 0.02€ in Thomas' main thread

https://community.qlik.com/message/1564595?tstart=0#1564595

I've tried to get Henric's attention to that topic and he answered, but I am not sure if I made myself clear enough or maybe just haven't understood what Henric tried to explain to me.

Regardless of the 'correctness' of the model, I see it critical that the relation between field values on a single resident data table seems to be broken by the creation of the syn key table.

It seems to me that the key table contains the correct key field value combination, but misses a correct linkage to the data table. Hence, the key table contains records that are unlinked to any other table, in other words, orphaned.

I just can't see any logical reason for that given the demo sample data set. It really looks like a bug to me.

And this, in the context of the OP, makes me worry....

[knowing that we see only symptoms of the bug, today in a situation that we call 'incorrect' modelling, but tomorrow maybe in production models]

Regards

Stefan

johnw · ‎2018-10-29

It's perfectly fine to completely abandon the use of synthetic keys. And if you wish to do so, you surely consider yourself fully capable of explicitly modeling all of the connections between your tables, right? And if you're capable of doing that, you should also be capable of handing the task off to a coworker, and when she's done, recognizing whether or not she's modeled the connections between tables correctly, right?

QlikView is that coworker. If you wish, you can model only the basic tables, and then hand off modeling the connections between those tables to your "coworker". Run your script. Look at your data model. If your "coworker" got it right, then there's no problem. If your "coworker" got it wrong, then you'll need to be more explicit about how you meant for the tables to connect.

So stepping on a mine in this minefield is simply, "Oh, that's not how I wanted QlikView to connect those tables. I'll have to model the connection explicitly."

And that's all we mean by "trial and error" here - that in complicated cases of multiple tables connected by multiple fields, it is simply much faster and easier to run the script and observe the results than it would be to develop a 100% complete understanding of the (secret and proprietary) algorithms that QlikView uses and predict with 100% accuracy how it will interpret your intentions.

But it's also fine to completely abandon the use of synthetic keys. I'm not trying to talk you out of it. It is at the very least good practice for you to understand and explicitly model the connections between tables. It will help make clear to people maintaining the script later exactly what you intended. In going through the effort of modeling the connections, you may well figure out a better way to model the data. Largely through the responses in this thread, I've come to not love the synthetic key. I still have a few in my data models, but I'm not proud of them. They feel lazy to me. They feel like me failing to figure out a better way to model the data that wouldn't need composite keys.

thomas_wang · ‎2018-10-29

Hi Stefan,

I am very grateful to you for retelling my questions in real English. I have been worried that my description may cause misunderstanding because my mother tongue is not English.

Best Regards,

Thomas Wang (T&W)

thomas_wang · ‎2018-10-30

Hi John,

JOIN is a traditional solution that already exists in relational databases, and its logic can stand the test of time. but it also has the inherent problem of data explosion. This is the main reason why we use QlikView (in fact its analytical engine). Even if the simplest two structurally closely related tables are joining, it's possible to get near Cartesian product data when their data correlation is weak.

When we use QlikView, I usually do not recommend that users go to Join, because “it is not QlikView”.

In addition, I think the database's non-null restrictions on keys are reasonable, and QlikView which does not make any restrictions or warnings is relatively unfriendly.

My humble opinion.

thomas_wang · ‎2018-10-30

Hi John,

“QlikView is that coworker.” This analogy is quite appropriate. But the situation is not what you describe.

Imagine you are a new boss. You know only a little about technology in this field. You have received a resume. Resume says job seekers can be skilled in completing related tasks. You were moved by his resume and hired him. But what about the result? He made a mess of things. You need not only re-hire a new employee, but also pay for him.

When I asked QlikView to perform some tasks, it did not report any errors or warnings, which meant that everything was going well. It's like a false and brilliant resume. When my product is online, users say that these data are not what they want. Or even more frightening, users don't know that the data isn't what they want, leading to more serious consequences. This is like a masterpiece of incompetent coworker.

We data practitioners should respect data and facts very much. It is our responsibility to make our products lack of ambiguity and deception.

Looking forward to a bright future.

johnw · ‎2018-10-30

Avoiding joins because of the possibility of accidental Cartesian products is to me the same as avoiding loops because of the possibility of accidental endless loops. Code your join condition properly. Code your loop termination properly. There are much harder things in programming. That's not to say I don't make mistakes. Of course I do, and I've had both Cartesian products and endless loops. But both are easy to recognize and fix when they happen.

I also don't agree that joins are not QlikView. I'll accept that there's a nebulous concept we might call "The QlikView Way", but I don't think avoiding joins is part of that. When I first started, I was so excited by QlikView's ability to make sense of complicated data models that I was making special efforts to normalize data as much as possible, even if the original data wasn't normalized. But with experience, I've come to take the opposite approach. I think lots of joins are very typical of QlikView script, and that that's fine. I'm not saying that your data model should be one gigantic table of every single bit of your data joined together, but don't shy away from a few joins, even ones that denormalize your data, and definitely not ones that make sense, like joining two tables that have the exact same unique key structure.

johnw · ‎2018-10-30

You have two main choices - model all of the data connections yourself (such as with composite keys or a more substantial data model change) or allow QlikView to attempt to do it for you (such as with synthetic keys). You are saying "well, I'll just model it all myself then". Good choice! But in saying that, you also admit that you know how to do so. You admit that you understand exactly how all the data should be connected and how to do it. And if you understand all that, then it should also be trivial for you to recognize whether or not QlikView did the right thing.

If you're truly claiming that you're more like a boss than a programmer, and that you have no ability to tell whether or not QlikView tied your tables together the way you need, then you also have no ability to do it yourself manually. And if you can't do it yourself, you certainly can't conclude "well, I'll just model it myself then."

I have never, in over a decade of use of QlikView, looked at how QlikView connected any of my actual tables (not hypothetical intentionally-ambiguous examples), and spent more than a few seconds thinking, "does that look right or wrong to me?" Before I even start writing script, I figure out how I want my tables to look and how I want them to be connected. Data modeling is fundamental.

thomas_wang · ‎2018-10-30

“Data modeling is fundamental.” I agree now, but I didn't think so before. Because I believe in a rumor that it claims that QV is easy to use and doesn't need ETL.

I might want to make it clear that what I mean by ETL is a concept, not a specific approach including the use of tools. I think the ETL process can be done by QV, though it may not be official.

thomas_wang · ‎2018-10-30

I am sorry that some of my earlier descriptions are not rigorous enough to cause some misunderstanding.

The true Cartesian product caused by the incorrect setting of the connection condition is what we must correct. This is almost our common sense, because the behavior of various database products is similar.

But the problem I described is that when the connection condition is set correctly. There are some extremely exotic data whose derivative table capacity is increased significantly. It may be exaggerated to say that it is infinitely close to Cartesian product, but there is no clear dividing line between it and Cartesian product.

“JOIN is not QlikVIew.” This sentence is really not rigorous. It is certain that at least this function exists in QlikView. Actually, what I want to say is that in the final report products, we should not overuse JOIN. As I made another reply above, I once thought QlikView did not need ETL. Until now, I still think that JOIN is what ETL wants to do. But when you decide to use QlikView to perform ETL, you may need to use JOIN in QlikView.

I emphasize that "don't use JOIN on the final report product" because when a data explosion occurs in the ETL process, we can detect it (such as a dramatic increase in the capacity of intermediate storage media) without immediately affecting our final product.

Should We Stop Worrying and Love the Synthetic Key?