
Synthetic keys vs. AutoNumber() (- concatenation)

robin_hausdoerfer
Valued Contributor III


I’ve read a lot of articles about synthetic keys. Most people recommend not using them, because QlikView builds internal mapping tables. Concatenating keys (=composite keys) in combination with the AutoNumber() function will result in better performance without any doubt.

But in my opinion there is one big drawback to concatenated keys, and it is very easy to demonstrate. For my example I need an Excel input file with the following tables:

LEFT:

RIGHT:

(The empty cells will result in null() values.)

Example 1: synthetic keys:


TABLE1:
LOAD KEY1,
     KEY2,
     LEFT
FROM [input.xlsx]
(ooxml, embedded labels, table is LEFT);

TABLE2:
LOAD KEY1,
     KEY2,
     RIGHT
FROM [input.xlsx]
(ooxml, embedded labels, table is RIGHT);

Result:

 

You can see that the keys only match if both KEY1 and KEY2 match. Null() isn't a value, so it will never match anything.
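To make that matching rule concrete, here is a small Python sketch. This is not QlikView's internal implementation, and the rows are made up; the point is only that a null key never equals anything, not even another null:

```python
# Illustrative simulation of synthetic-key matching semantics.
# NULL (None here) is not a value, so it disqualifies a match.

left = [
    {"KEY1": "A", "KEY2": None, "LEFT": "L1"},
    {"KEY1": "B", "KEY2": "X",  "LEFT": "L2"},
]
right = [
    {"KEY1": "A", "KEY2": None, "RIGHT": "R1"},
    {"KEY1": "B", "KEY2": "X",  "RIGHT": "R2"},
]

def keys_match(a, b):
    # All key fields must hold real, equal values; None never matches.
    return all(
        a[f] is not None and b[f] is not None and a[f] == b[f]
        for f in ("KEY1", "KEY2")
    )

matches = [(l["LEFT"], r["RIGHT"]) for l in left for r in right if keys_match(l, r)]
print(matches)  # [('L2', 'R2')] -- the A/None rows do not associate
```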

Example 2: concatenation + AutoNumber()


TABLE3:
LOAD AutoNumber(KEY1 & KEY2) as AUTONUMBER_KEY,
     LEFT as LEFT2
FROM [input.xlsx]
(ooxml, embedded labels, table is LEFT);

TABLE4:
LOAD AutoNumber(KEY1 & KEY2) as AUTONUMBER_KEY,
     RIGHT as RIGHT2
FROM [input.xlsx]
(ooxml, embedded labels, table is RIGHT);

Result:

Now the drawback I mentioned is clearly visible. For example, LEFT2 = 4 is matched with RIGHT2 = 14. This can cause big problems, because KEY1 normally has nothing to do with KEY2: without a separator, different (KEY1, KEY2) pairs can concatenate to the same string.
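The collision mechanism can be shown in a few lines of Python (the key values here are hypothetical, not taken from the Excel file):

```python
# Two different key pairs that concatenate -- without a separator --
# to the very same string, which is exactly the drawback above.
pair_a = ("A", "B1")   # KEY1 = 'A',  KEY2 = 'B1'
pair_b = ("AB", "1")   # KEY1 = 'AB', KEY2 = '1'

no_sep   = ["".join(p) for p in (pair_a, pair_b)]
with_sep = ["|".join(p) for p in (pair_a, pair_b)]

print(no_sep)    # ['AB1', 'AB1']   -> distinct pairs collide into one key
print(with_sep)  # ['A|B1', 'AB|1'] -> a separator keeps them apart
```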

Conclusion


On the one hand, concatenating keys can cause problems because there is no clear separation between them; the presence of null() values is especially risky. A separator like '*#*' combined with replacing null() values reduces the risk significantly.


On the other hand, the behaviour in example 2 could be exactly what you want. There's no right or wrong here; I just want you to think about your data before using composite keys.


Example 3: an approach for fixing the problem


TABLE5:
LOAD AutoNumber(if(IsNull(KEY1), '*1*', KEY1) & '*#*' & if(IsNull(KEY2), '*1*', KEY2)) as AUTONUMBER_KEY2,
     LEFT as LEFT3
FROM [input.xlsx]
(ooxml, embedded labels, table is LEFT);

TABLE6:
LOAD AutoNumber(if(IsNull(KEY1), '*2*', KEY1) & '*#*' & if(IsNull(KEY2), '*2*', KEY2)) as AUTONUMBER_KEY2,
     RIGHT as RIGHT3
FROM [input.xlsx]
(ooxml, embedded labels, table is RIGHT);

Result:
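The idea behind example 3 can be simulated outside QlikView. This Python sketch (with hypothetical keys) shows why a different null sentinel per table reproduces the synthetic-key behaviour of example 1:

```python
# Simulation of example 3: replace NULL with a *table-specific* sentinel
# before concatenating, so a left-side NULL can never equal a right-side NULL.
SEP = "*#*"

def composite(key1, key2, null_sentinel):
    k1 = null_sentinel if key1 is None else key1
    k2 = null_sentinel if key2 is None else key2
    return k1 + SEP + k2

# A left NULL ('*1*') never equals a right NULL ('*2*'):
print(composite("A", None, "*1*") == composite("A", None, "*2*"))  # False

# Rows with real values on both sides still line up:
print(composite("A", "B", "*1*") == composite("A", "B", "*2*"))    # True
```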

Comments
MVP & Luminary

I can't quite agree, because the article implies that synthetic keys are a good and easy choice when building a data model, and even preferable to composite keys.


From a technical point of view a synthetic key and a composite key are quite the same, but the synthetic key has disadvantages in CPU and RAM consumption. This isn't obvious with small datasets, but in large and/or complex applications, which without a proper data model might contain a lot of synthetic keys (maybe even synthetic keys built on other synthetic keys), it will make a difference.


This is really a complex topic, and I suggest that anyone interested read this (long) posting: Should We Stop Worrying and Love the Synthetic Key? to decide which way they want to go. For everyone else, I suggest avoiding synthetic keys after reading this one: Synthetic Keys.


Besides this, each combination of fields used to generate a key needs a delimiter character between them to ensure that a given combination is really unique and can't be mismatched in any way. This means:


AutoNumber(KEY1 & '|' & KEY2) as AUTONUMBER_KEY


would work well, and something like your third approach wouldn't be needed (although it might be helpful in some cases to replace NULLs or other invalid values with default values).
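For readers unfamiliar with AutoNumber(): it maps each distinct input string to a sequential integer, so the tables link on a compact number instead of the long composite string. A rough Python model of that behaviour (a simplification for illustration, not Qlik's actual implementation):

```python
class AutoNumber:
    """Rough stand-in for QlikView's AutoNumber(): each distinct value
    gets the next sequential integer; repeated values reuse theirs."""

    def __init__(self):
        self._ids = {}

    def __call__(self, value):
        # First occurrence gets the next integer; repeats return the same one.
        return self._ids.setdefault(value, len(self._ids) + 1)

autonumber = AutoNumber()
print(autonumber("A|B"))  # 1
print(autonumber("C|D"))  # 2
print(autonumber("A|B"))  # 1 -- same composite string, same number
```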


- Marcus


robin_hausdoerfer
Valued Contributor III

Hello Marcus,

I expected a comment like that.


My article is not meant as a general recommendation to use synthetic keys.

Synthetic keys are just a very convenient way for me to deal with composite keys, even though I know their performance and RAM usage are far from optimal.

I just tried to point out that composite keys can lead to different results than synthetic keys do.

I want readers not to make the same mistakes I did.

You suggested the following:

AutoNumber(KEY1 & '|' & KEY2) as AUTONUMBER_KEY


This leads to a totally "new" result:

This is also not the same as the result of example 1, because LEFT = 2 is matched with RIGHT = 12, although KEY2 is null() on both sides.

I wished there would be a "qlikview synthetic key autonumber" function which calculated autonumbers that create the same results as example 1 does.

MVP & Luminary

Hi Robin,

I think I now understand what you meant. The autonumber key generation and the matching of the following records, respectively table associations:

LEFT = 2     RIGHT = 12     AUTONUMBER_KEY = 2

by the original keys of both tables:

KEY1 = A     KEY2 = NULL

is correct, and the synthetic table shown in the table viewer even contains only 6 key values after loading all 5 records of both tables.

You are right that with both approaches (a synthetic key, or a composite key built by association or by joining), which from an association point of view are the same, you get a different view within a table box. What I'm not sure about is whether one view is more helpful than the other; I must admit I had never looked at it this way. Maybe johnw could add some valuable insights.

- Marcus

MVP

"composite keys could lead to different results than synthetic keys."

Good point!

I've not tested it, but it makes sense to me that with a synthetic key, our left A/null wouldn't match our right A/null. Null isn't a value, so null doesn't equal null.

With example 2, yes of course if you fail to add an unused character between your keys, and your keys can overlap in some way without it, you'll get funny results.

For what I'll call 2a, we add the character. This is how I think most of us build composite keys. But as you point out, this is not the equivalent of the synthetic keys when we're dealing with nulls. Which behavior we want is a matter of requirements.

And your example 3 does seem like it would produce the same results as a synthetic key since a left null will be given a different value than a right null. Good work.

"I wished there would be a "qlikview synthetic key autonumber" function which calculated autonumbers that create the same results as example 1 does."

There is! If you let QlikView build the synthetic key for you, it will build a composite key with an autonumbered ID that takes nulls into account the way you want them taken into account!

Should you use a synthetic key here? No. So I agree you should build the composite key, right? No. Both of these tables are keyed exactly the same. They are the same table. Join them together so that they can be the same table in your data model. No synthetic keys. No composite keys.

Yes, I know it's just an oversimplified example like we all use. And my solution is also just an example. But the point I'm making is that if you find yourself with synthetic keys, or find yourself wanting to join tables by composite keys, think harder. What are you doing? Why are you doing it? Is there some other approach that will give you the results you want without using either synthetic or composite keys? Is that approach better? And if you do ultimately want to just link your tables together on multiple fields, synthetic keys and composite keys have advantages and disadvantages that you may want to weigh before deciding what to do.

"Concatenating keys (=composite keys) in combination with the AutoNumber() function will result in better performance without any doubt."

Testing did not bear out this confident assertion. Synthetic keys seem to typically be faster to load, with marginally better performance and memory usage.

Think of it this way: for synthetic keys to perform worse than concatenated keys, they would have to be stored significantly differently internally. And if that difference were causing problems, why would QlikView store synthetic keys that way, when it could instead store them the way we store composite keys and get better performance? Be dubious of claims which, to be true, would require Qlik to have coded QlikView in a less efficient way than most QlikView programmers know how to code.

I do know of one case where composite keys are faster to load - when that key is part of an incremental load. Don't use autonumber() in that case. Consider using hash128() or just the raw composite key. Either way you only need to build composite keys for the new rows, not for all rows, and that can pay off. If you have 100 million rows in a table, and are only adding 100,000 per day, it is faster to build 100,000 composite keys yourself than to let QlikView build 100 million synthetic keys.
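The order-dependence John describes can be made concrete with a small Python sketch. Here hashlib.md5 merely stands in for QlikView's hash128(), and the key values are invented; the principle is what matters:

```python
import hashlib

# Why a hash suits incremental loads while AutoNumber() doesn't:
# an autonumber depends on the order in which values were first seen,
# so new rows can't be numbered without replaying every old row.
# A hash depends only on the key itself, so each new row can be
# keyed independently of the 100 million rows already loaded.

def hash_key(key1, key2):
    return hashlib.md5(f"{key1}|{key2}".encode()).hexdigest()

# The same key pair yields the same hash in any load, in any order:
print(hash_key("EUR", "2017-01") == hash_key("EUR", "2017-01"))  # True
print(hash_key("EUR", "2017-01") == hash_key("EUR", "2017-02"))  # False
```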

There are reasons to not use synthetic keys. Many experts refuse to use them ever. But performance is rarely the reason.

robin_hausdoerfer
Valued Contributor III

Thank you all for your remarks.

There's a lot of very interesting information here.

MVP & Luminary

Oh, don't even let me get started here... 🙂

Marcus already covered the issue with accidental linking when the keys are concatenated without a delimiter, so I won't get into that...

The fact that you get different results even after adding the delimiter can be interpreted both ways. I'd say that for most business purposes you'd like the keys to match when one of the component keys is "null", but synthetic keys don't offer that possibility. For example, in a combo key Customer_Product you'd like identical Products to link even if Customer is NULL for some of the transactions. Moreover, in many business cases you'd like the Product with no Customer to link to the same Product and *any* Customer (or, for example, to any Customers that purchased the same Product...)

For that reason, you would typically transform NULL values into strings and build your links manually, in order to implement your business logic correctly. No synthetic key can do that for you.

Most people recommend avoiding synthetic keys (I'm a big proponent of that view) not necessarily for performance reasons, and not because Composite Keys are better, but BECAUSE THERE IS ALWAYS A BETTER WAY. In your case, as John rightfully suggested, the tables should be simply joined with no linking necessary. In other cases, a concatenated table would be a better solution. 9 times out of 10, there are better ways to express your business case than by keeping a Synthetic key.

I'm explaining all of that with a lot of detail and many hands-on exercises in my book QlikView Your Business - I encourage everyone to check it out.

cheers,

Oleg Troyansky

Upgrade your Qlik skills at the Masters Summit for Qlik - coming soon to Munich, Germany!

robert99
Valued Contributor II

"Concatenating keys (=composite keys) in combination with the AutoNumber() function will result in better performance without any doubt."

That's not what I found when I tested and compared the two. Have you done comparative tests to support this statement?

Thanks

robin_hausdoerfer
Valued Contributor III

I admit, I was wrong. I read it somewhere and just believed it.

I crossed out my "confident assertion".

robert99
Valued Contributor II

"Most people recommend avoiding synthetic keys (I'm a big proponent of that view) not necessarily for performance reasons, and not because Composite Keys are better, but BECAUSE THERE IS ALWAYS A BETTER WAY."

"9 times out of 10"

So not performance reasons. Composite keys are not better. But etc.

I know this has been discussed before. But at times I prefer to use a synthetic key. Not often; mostly I don't, as I think there might be some valid reason not to. But for this example it's as easy to explain as a composite key, and it works well without the effort of creating a new field.

[Attached image: SyntheticJoin2.JPG]

MVP & Luminary

Robert,

when I say "not necessarily for performance reasons", the precise interpretation is "performance may or may not be the only reason to avoid synthetic keys". It certainly doesn't mean "So not performance reasons. Composite keys are not better. But etc."

In my view, composite keys are still better: the data structure is clearer and easier to understand, explain, and extend, and there are fewer hops to resolve at run time. However, other developers have reported benchmarking results showing that synthetic keys performed faster than concatenated keys in specific situations.

The example that you bring is very simple. In some of the simpler data models, loaded with a small amount of data, you almost want to say "whatever". No matter what your preferred data model is, it will fly. Who cares if the chart takes 1.2 or 1.4 seconds to calculate?

However, as data models become more complex and data sizes grow, the importance of good data modelling becomes more and more apparent. Try some of these techniques on 2-3 billion rows and see for yourself.

Let's take your data model, for example. I see Sales Actuals and Currency Rates, and the relation between them is defined by two fields: Currency and Year/Month. Your claim is that you are better off saving the effort of creating a combo field and letting Qlik build the synthetic key for you. Perhaps. With a few hundred thousand rows, maybe. Even with a few million rows, it would fly. With hundreds of millions of rows, I honestly doubt it.

See, Qlik aggregation is fast when all the components for that aggregation are coming from a single table. So, your calculation:

sum(Sales*CurrencyRate)

will be reasonably fast if both fields reside in one table. The more links you introduce between the two fields, the longer it takes to collect them for a given chart. So, if you took the time to build a composite key, you'd have one link between the two tables. If you kept the synthetic key, you'd have two links (Actuals - Synthetic Table - Currency Rates), and your aggregation will be considerably slower. Can you see how synthetic keys may add problems in a large dataset?

So, in conclusion, am I saying that Concatenated Key is the best solution here? NO! It's a better solution, but not the best. In a large dataset, when performance matters, I'd populate Currency Rates into the same Fact table (Actuals) to ensure that all components of my aggregation reside in a single table. Which brings us to my original message - THERE IS ALWAYS A BETTER WAY!
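Oleg's "better way" here, resolving the rate lookup at load time so the aggregation reads a single table, can be sketched in Python (the table contents are invented for illustration):

```python
# Hypothetical sketch: store the currency rate on each fact row during
# the load, so sum(Sales * CurrencyRate) needs no hops between tables.
rates = {("EUR", "2017-01"): 1.05, ("GBP", "2017-01"): 1.22}

facts = [
    {"Currency": "EUR", "YearMonth": "2017-01", "Sales": 100.0},
    {"Currency": "GBP", "YearMonth": "2017-01", "Sales": 50.0},
]

# "Join" the rate into the fact table once, at load time:
for row in facts:
    row["CurrencyRate"] = rates[(row["Currency"], row["YearMonth"])]

# The aggregation now reads everything from a single table:
total = sum(row["Sales"] * row["CurrencyRate"] for row in facts)
print(total)  # 100*1.05 + 50*1.22 = 166.0
```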

To add to the possible list of complications: what if your data is more complex than a single fact table and a single table with currency rates? What if you also have returns, allowances, and inventory records, and all of them have different sets of multiple keys? You will end up with several synthetic keys built on top of each other, and that is a well-known problem nobody disagrees about. To prevent future complications, it's better to keep your data model clean from the beginning and not worry about it. It's not that hard to do...

Now, does any of that matter if your data is always small? No, nothing matters at all. Build any data model and watch it fly. But in that case, you shouldn't bother with these questions altogether...

cheers,

Oleg Troyansky

Upgrade your Qlik skills at the Masters Summit for Qlik - coming soon to Munich, Germany.

Version history
Revision #:
1 of 1
Last update:
‎2017-01-15 01:29 PM
Updated by: