While talking in general terms about the user-friendliness of QlikView would be off topic, I think your concern with synthetic key behavior is fairly central to the topic. To borrow from the thread title, I find it difficult to stop worrying when I have no explanation for the seemingly-inconsistent behavior in your examples. I'm sure there is an explanation. But I'm not sure if we'll find one, or that if we do, the explanation will satisfy me.
That said, I consider the example data model to be flawed from the very start, before we even get to what synthetic keys QlikView will or won't build, or how the binary load behaves. To me, tables A and B are the same table. Same key = same table.

Where I work, we actually have all sorts of examples like A and B, where we have two or more tables with the exact same key. But in these cases, it's still really one abstract "table", where we've broken it into pieces for technical reasons. But the technical reasons that apply in our database system do not apply in QlikView, so when reading in tables A and B, I would always join them back together into one table.

We also don't allow nulls in key fields where I work, so I wouldn't allow the null value for KEY1 in your "x" record on table A. I would force it to be blank instead. Technically, our database system would force it to be blank instead, but in any case, I wouldn't let it be null in QlikView. So even if I weren't setting out to remove synthetic keys, I would load the data in a way that would be equivalent to this:
C:
LOAD * INLINE [
KEY1, KEY2, C
1, 2, c
];

AB:
LOAD * INLINE [
KEY1, KEY2, KEY3, A
1, 2, 3, a
y, y, y, y
, x, x, x
];

LEFT JOIN (AB)
LOAD * INLINE [
KEY1, KEY2, KEY3, B
1, 2, 3, b
];
And that gives me the results I expect, a data model that shows the relationships between the records that I would expect. It still has a synthetic key, but only one. It's easily removed, and I might do so in practice, but this is a case where I'd expect to be able to leave it there with no ill effects, no strange behavior. And indeed, the strange behavior we saw from the combination of multiple synthetic keys and a binary load does go away.
That's not to say I'm comfortable with the behavior you found. I don't care how messed up the data model is, I would expect a binary load with no further script, and a binary load with some unrelated SET statements to result in the same data and data model. It seems like there's something funny going on when it tries to rebuild the synthetic key after the sets. That makes me uncomfortable. There's probably a logical explanation, but I don't know it and that makes me uncomfortable.
Hi John, it's always a pleasure to read your posts!
I am also kind of worried about the issue Thomas pointed to. I've put in my 0.02€ in Thomas's main thread.
I've tried to get Henric's attention on that topic and he answered, but I am not sure whether I made myself clear enough, or maybe I just haven't understood what Henric tried to explain to me.
Regardless of the 'correctness' of the model, I see it as critical that the relation between field values within a single resident data table seems to be broken by the creation of the syn key table.
It seems to me that the key table contains the correct key field value combinations but lacks a correct linkage to the data table. Hence, the key table contains records that are not linked to any other table; in other words, they are orphaned.
I just can't see any logical reason for that given the demo sample data set. It really looks like a bug to me.
And this, in the context of the OP, makes me worry....
[knowing that we see only symptoms of the bug, today in a situation that we call 'incorrect' modelling, but tomorrow maybe in production models]
It's perfectly fine to completely abandon the use of synthetic keys. And if you wish to do so, you surely consider yourself fully capable of explicitly modeling all of the connections between your tables, right? And if you're capable of doing that, you should also be capable of handing the task off to a coworker, and when she's done, recognizing whether or not she's modeled the connections between tables correctly, right?
QlikView is that coworker. If you wish, you can model only the basic tables, and then hand off modeling the connections between those tables to your "coworker". Run your script. Look at your data model. If your "coworker" got it right, then there's no problem. If your "coworker" got it wrong, then you'll need to be more explicit about how you meant for the tables to connect.
So stepping on a mine in this minefield is simply, "Oh, that's not how I wanted QlikView to connect those tables. I'll have to model the connection explicitly."
And that's all we mean by "trial and error" here - that in complicated cases of multiple tables connected by multiple fields, it is simply much faster and easier to run the script and observe the results than it would be to develop a 100% complete understanding of the (secret and proprietary) algorithms that QlikView uses and predict with 100% accuracy how it will interpret your intentions.
But it's also fine to completely abandon the use of synthetic keys. I'm not trying to talk you out of it. It is at the very least good practice for you to understand and explicitly model the connections between tables. It will help make clear to people maintaining the script later exactly what you intended. In going through the effort of modeling the connections, you may well figure out a better way to model the data. Largely through the responses in this thread, I've come to not love the synthetic key. I still have a few in my data models, but I'm not proud of them. They feel lazy to me. They feel like me failing to figure out a better way to model the data that wouldn't need composite keys.
I am very grateful to you for restating my questions in proper English. I have been worried that my descriptions might cause misunderstanding, because English is not my mother tongue.
Thomas Wang (T&W)
JOIN is a traditional solution that already exists in relational databases, and its logic has stood the test of time. But it also has the inherent problem of data explosion, and that is the main reason why we use QlikView (in fact, its analytical engine). Even when joining the two simplest, most closely related tables, it is possible to get near-Cartesian-product data when their data correlation is weak.
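This point can be quantified with a small sketch. The following is Python rather than QlikView script, and the tables and sizes are invented purely for illustration: even with a correct equi-join condition, a key with few distinct values relative to the row counts drives the output toward the full Cartesian product |A| × |B|.

```python
from collections import defaultdict

def inner_join(left, right, key):
    """Naive hash join: pair every left row with every matching right row."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**lrow, **rrow} for lrow in left for rrow in index[lrow[key]]]

# Two 1,000-row tables whose join key has only 2 distinct values:
# the join condition is correct, but the data correlation is weak.
a = [{"k": i % 2, "a": i} for i in range(1000)]
b = [{"k": i % 2, "b": i} for i in range(1000)]

result = inner_join(a, b, "k")
print(len(result))  # 500 * 500 * 2 = 500,000 rows from two 1,000-row inputs
```

Each of the two key values matches 500 rows on each side, so the output is 2 × 500 × 500 rows, three orders of magnitude larger than either input.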
When we use QlikView, I usually do not recommend that users resort to JOIN, because "it is not QlikView".
In addition, I think the database's non-null restrictions on keys are reasonable, and QlikView, which imposes no such restriction and gives no warning, is relatively unfriendly in this respect.
My humble opinion.
“QlikView is that coworker.” This analogy is quite apt. But the situation is not as you describe.
Imagine you are a new boss who knows only a little about the technology in this field. You receive a resume saying the applicant is skilled at the relevant tasks. Moved by the resume, you hire him. But what is the result? He makes a mess of things. You not only have to hire a new employee, you also have to pay for the first one's mistakes.
When I asked QlikView to perform some tasks, it reported no errors or warnings, which suggested that everything was going well. That is like a false but brilliant resume. When my product goes live, users say the data is not what they wanted. Or, even more frightening, users don't realize the data isn't what they want, leading to more serious consequences. That is like the masterpiece of an incompetent coworker.
We data practitioners should have great respect for data and facts. It is our responsibility to make our products free of ambiguity and deception.
Looking forward to a bright future.
Avoiding joins because of the possibility of accidental Cartesian products is to me the same as avoiding loops because of the possibility of accidental endless loops. Code your join condition properly. Code your loop termination properly. There are much harder things in programming. That's not to say I don't make mistakes. Of course I do, and I've had both Cartesian products and endless loops. But both are easy to recognize and fix when they happen.
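To make "code your join condition properly" concrete, here is a hedged sketch in Python rather than QlikView script (the tables and field names are invented): joining on the full composite key returns one row per true match, while accidentally joining on only part of the key silently inflates the result.

```python
from collections import defaultdict

def join_on(left, right, keys):
    """Inner join on the named key fields."""
    index = defaultdict(list)
    for row in right:
        index[tuple(row[k] for k in keys)].append(row)
    return [{**lrow, **rrow}
            for lrow in left
            for rrow in index[tuple(lrow[k] for k in keys)]]

# KEY1 and KEY2 together are unique; KEY1 alone is not.
t1 = [{"KEY1": 1, "KEY2": 1, "A": "a"}, {"KEY1": 1, "KEY2": 2, "A": "b"}]
t2 = [{"KEY1": 1, "KEY2": 1, "B": "x"}, {"KEY1": 1, "KEY2": 2, "B": "y"}]

full    = join_on(t1, t2, ["KEY1", "KEY2"])  # 2 rows, as intended
partial = join_on(t1, t2, ["KEY1"])          # 4 rows: accidental blow-up
print(len(full), len(partial))  # 2 4
```

The blow-up is easy to recognize for exactly the reason described above: a join on a proper unique key can never produce more rows than the larger input, so any count beyond that is a red flag.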
I also don't agree that joins are not QlikView. I'll accept that there's a nebulous concept we might call "The QlikView Way", but I don't think avoiding joins is part of that. When I first started, I was so excited by QlikView's ability to make sense of complicated data models that I was making special efforts to normalize data as much as possible, even if the original data wasn't normalized. But with experience, I've come to take the opposite approach. I think lots of joins are very typical of QlikView script, and that that's fine. I'm not saying that your data model should be one gigantic table of every single bit of your data joined together, but don't shy away from a few joins, even ones that denormalize your data, and definitely not ones that make sense, like joining two tables that have the exact same unique key structure.
You have two main choices - model all of the data connections yourself (such as with composite keys or a more substantial data model change) or allow QlikView to attempt to do it for you (such as with synthetic keys). You are saying "well, I'll just model it all myself then". Good choice! But in saying that, you also admit that you know how to do so. You admit that you understand exactly how all the data should be connected and how to do it. And if you understand all that, then it should also be trivial for you to recognize whether or not QlikView did the right thing.
If you're truly claiming that you're more like a boss than a programmer, and that you have no ability to tell whether or not QlikView tied your tables together the way you need, then you also have no ability to do it yourself manually. And if you can't do it yourself, you certainly can't conclude "well, I'll just model it myself then."
I have never, in over a decade of use of QlikView, looked at how QlikView connected any of my actual tables (not hypothetical intentionally-ambiguous examples), and spent more than a few seconds thinking, "does that look right or wrong to me?" Before I even start writing script, I figure out how I want my tables to look and how I want them to be connected. Data modeling is fundamental.
“Data modeling is fundamental.” I agree now, but I didn't think so before, because I believed a rumor claiming that QV is easy to use and doesn't need ETL.
I should make clear that what I mean by ETL is a concept, not a specific approach or set of tools. I think the ETL process can be done with QV, though that may not be official.
I am sorry that some of my earlier descriptions were not rigorous enough and caused some misunderstanding.
A true Cartesian product caused by an incorrect join condition is something we must correct. That is almost common sense, because the various database products all behave similarly in this respect.
But the problem I described occurs when the join condition is set correctly. With some extremely exotic data, the size of the derived table still increases dramatically. It may be an exaggeration to say it comes infinitely close to a Cartesian product, but there is no clear dividing line between the two.
“JOIN is not QlikView.” That sentence was really not rigorous; at the very least, the function certainly exists in QlikView. What I actually wanted to say is that we should not overuse JOIN in the final report product. As I said in another reply above, I once thought QlikView did not need ETL. Even now, I still think that JOIN is what ETL is meant to do. But when you decide to use QlikView to perform ETL, you may need to use JOIN in QlikView.
I emphasize "don't use JOIN in the final report product" because when a data explosion occurs in the ETL process, we can detect it there (for example, as a dramatic increase in the size of intermediate storage) before it affects our final product.
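That detection step can be made explicit rather than left to noticing disk usage. A sketch in Python, with an invented helper name: when the key structure implies an upper bound on a join's output, an ETL step can fail loudly as soon as the bound is exceeded, instead of shipping the inflated table downstream.

```python
def assert_join_growth(left_rows, out_rows, max_factor=1.0):
    """Guard for a left join whose right-hand key is unique: the output can
    have at most max_factor * left_rows rows; anything beyond that signals
    an unintended (near-)Cartesian explosion in the ETL step."""
    limit = left_rows * max_factor
    if out_rows > limit:
        raise ValueError(
            f"join produced {out_rows} rows, expected at most {limit:g}")

assert_join_growth(left_rows=1000, out_rows=1000)        # OK: a clean 1:1 join
try:
    assert_join_growth(left_rows=1000, out_rows=500000)  # explosion detected
except ValueError as err:
    print(err)  # join produced 500000 rows, expected at most 1000
```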