Qlik Community

Qlik Design Blog

All about product and Qlik solutions: scripting, data modeling, visual design, extensions, best practices, etc.

Synthetic Keys

In a well-visited post on the community forum, John Witherspoon some time ago asked “Should We Stop Worrying and Love the Synthetic Key?” John’s post begins: “Synthetic keys have a bad reputation. The consensus seems to be that they cause performance and memory problems, and should usually or even always be removed. I believe that the consensus is wrong.” Here’s my view on this topic.

The creation of synthetic keys is simply QlikView’s way of managing composite keys. There is nothing strange or magic around it.

A single key is easy to manage: Just list all unique values in a symbol table (see Symbol Tables and Bit-Stuffed Pointers), and then link the data tables using a natural join.

[Image: Source table view]

But a composite key is slightly different – there is no single symbol table that contains the relevant combinations of the multiple key fields. So QlikView needs to create such a table for all combinations: the $Syn table. In a way, you can say that the $Syn table is a symbol table for composite keys. In the data tables, the multiple keys are replaced by an identifier that uniquely identifies the combination of the values of the original keys: the $Syn key.
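As a minimal, hypothetical sketch (the table and field names are invented), the following script produces a synthetic key, because the two tables share the fields OrderID and LineNo:

```qlikview
// Both tables share OrderID and LineNo, so QlikView will build a
// $Syn table holding the distinct combinations of the two fields,
// and replace the two keys in the data tables with a single $Syn key.
Orders:
Load * Inline [
OrderID, LineNo, Amount
1, 1, 100
1, 2, 200
2, 1, 150
];

Shipments:
Load * Inline [
OrderID, LineNo, ShipDate
1, 1, 2014-01-10
2, 1, 2014-01-12
];
```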

Hence, if you have the same set of multiple keys in two or more tables, the QlikView synthetic keys create a general, correct, compact and efficient solution. Synthetic keys do not per se cause performance and memory problems. They do not use a lot more memory than if you autonumber your own concatenated key. And they treat NULLs correctly, as opposed to an explicit concatenated key.

Hence: The synthetic key is in itself good and we should all love it.

[Image: Internal table view]

However… I still avoid synthetic keys. Why?

  1. A synthetic key is in my experience often a sign of a poorly designed data model. I say that, given the number of times I have found a synthetic key in the table viewer only to realize that I made a mistake in the script. If you get a synthetic key and didn’t expect it, I can only say: Back to the drawing board! You should most likely change your data model.
  2. QlikView creates an additional table (the $Syn table) that in many cases is superfluous: An additional table is the best solution if none of the data tables by itself completely spans the set of composite keys. But in real life, there is usually one table that contains all relevant combinations of the keys, and then this table can be used to store the clear text of the individual keys.
  3. For clarity, I like to create my own concatenated keys. It forces me to think and create a data model that I believe in. Removing the synthetic keys becomes a method to ensure a good data model, rather than a goal in itself.
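One common way to create such a key – sketched here with invented field names – is to concatenate the key fields explicitly and keep the original fields in only one of the tables, so that the composite key alone links them:

```qlikview
// Explicit composite key instead of a synthetic key. AutoNumber()
// converts the concatenated string to a compact integer, so the key
// uses roughly the same memory as the $Syn key it replaces.
Orders:
Load *,
     AutoNumber(OrderID & '|' & LineNo) as %OrderLine
Inline [
OrderID, LineNo, Amount
1, 1, 100
2, 1, 150
];

Shipments:
Load ShipDate,
     AutoNumber(OrderID & '|' & LineNo) as %OrderLine
Inline [
OrderID, LineNo, ShipDate
1, 1, 2014-01-10
2, 1, 2014-01-12
];
```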

But in principle, I totally agree with John’s initial conclusion: Any problem around synthetic keys is really a data modeling problem and not a problem with the synthetic key itself.

The short answer to John’s question is Yes and No. Yes, we should love the synthetic key. But, No, we should not stop worrying. We should always be alert and ask ourselves: “Do I want this synthetic key? Is the data model OK?”

And so, because of the automated and irrevocable data-modeling process which rules out human meddling, synthetic keys are scary. But they are simple to understand. And completely credible and convincing.

Dr HIC

Further reading on Qlik data modelling:

Circular References

Fan traps and Chasm traps

34 Comments
Not applicable

Hi,

I have always trained my colleagues not to care about it unless it produces strange results.

QlikTech normally taught that synthetic keys should be resolved at the beginning of the data-modelling process. However, since I started using Rob Wunderlich's Qlikview Components library, I can resolve the synthetic keys at the end of the data-modelling process instead.

Instrumental in this is the function Qvc.LinkTable, which can resolve the synthetic keys at the end, instead of at the beginning, of the data-modelling process.

I hope this post brings it to a bigger audience.

Loops, however, are a bigger concern of mine. I hope you can elaborate more on that in a future blog post.

Kind Regards,


Dion

4,187 Views
MVP & Luminary

Hi Henric,

I think a concatenated key will consume more memory because of its higher cardinality. One recommended best practice is to split fields to save memory (e.g. timestamps). This would lead to a syn key if it's a key field. How to decide?

Another question: Is the calculation effort of a link table comparable to that of a syn key table?

- Ralf

0 Likes
Partner

HI HIC,

This is very rich knowledge. Thanks for sharing your knowledge, as always.

One thing that springs to mind: could there be a hidden, undocumented purpose for the synthetic key table, or a useful way of using it that QlikTech has yet to make known?

Best Regards,

Gabriel



Ralf

I don't think that an autonumbered concatenated key will consume more memory: They should have identical cardinality. So I would not split fields that are keys.

A $Syn table is like a link table with only one key. The inference engine would treat these two cases identically. But a link table usually has several keys and with that usually many more records, and this will affect the calculation effort.

Gabriel

No, there is nothing hidden with the synthetic key. It is merely a necessity for the QlikView inference engine in order to treat composite keys correctly.

However, one aspect, which isn't obvious (but it is not hidden), is NULL value handling: Should records of different tables be linked if one of several linking keys have a NULL value and the other keys have matching values? My answer is: "No. NULL should never link." And this is also how the synthetic keys work. A manually created, concatenated key will usually link the records, though.
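The difference can be seen in a small, hypothetical script (field names are invented). The & operator treats NULL as an empty string, so a manually concatenated key still produces a matching value where a synthetic key would refuse to link:

```qlikview
// KeyA is NULL in both records, yet the concatenated key evaluates
// to '|b' in both tables, so the two records link. With the raw
// fields KeyA and KeyB as a composite (synthetic) key instead,
// these records would not be linked.
T1:
Load Null() as KeyA,
     'b' as KeyB,
     Null() & '|' & 'b' as CompositeKey,   // becomes '|b'
     1 as Value1
AutoGenerate 1;

T2:
Load Null() & '|' & 'b' as CompositeKey,   // also '|b' – the rows link
     2 as Value2
AutoGenerate 1;
```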

HIC

Not applicable

Sir,

     You enlightened me about $Syn tables. Thank You.

Not applicable

Very well explained. I agree – someone who needs a syn key indeed has a badly designed data model; we can always find a better solution.

Thanks Henric.

walidhussein
Contributor

Good write up. Thanks Henric.

robert99
Valued Contributor II

Thanks for this post

I used one (and only one) synthetic key to join the three main files at work (using the personal edition). When we purchased the SB server edition, the consultant we employed to check my work (I was self-taught) and set up the server replaced the synthetic key with a concatenated key (to improve performance). I preferred the synthetic key (personal preference) but both worked. But the file size increased noticeably – over 10%, from 26,000 KB to 30,000 KB.

Unsure of the memory impact though

Not applicable

Are there performance statistics that support this?  I'd be curious to see if a large data model with a synthetic key table will perform as efficiently as one with a concatenated (and autonumbered) key.  Has this been tested?  If so, what were the findings?


My experience is the same as RJ's above. If the data model is OK (e.g. a single synthetic key between two tables) then the synthetic key performs as well as (or better than) a manually created key. And often at slightly less memory usage.

But as soon as you have a questionable data model (e.g. synthetic keys created from synthetic keys) then the performance deteriorates drastically.

However, I have no formal tests to support it...

HIC

twebber
Contributor II

I will argue that QV is very limited in its use with a dimensionally designed enterprise DW without synthetic keys. QV is a BI tool, and most BI solutions go against a DW (forget the QV claims that you don't need one – this is rarely the case for real "enterprises"). Most DWs are dimensionally designed, with conformed dimensions being the integration points for the various fact tables (the things you are measuring).

Consider the (very, very simple) example below. There are two dimensions (date and LHA) that are conformed (linked/related) to two fact tables. Each dimension has a primary key. That key is a foreign key in each fact table. The primary key of a fact table is the composite of all the dimensional foreign keys within the table. A regular relational or multidimensional model implements this as a series of relationships and uses stitch queries when the fact grain varies, etc. There is no requirement for a composite key. QV, though, needs one key, so it creates a $Syn table that has a composite key of all foreign keys in the fact. What a pain to create this in your ETL just for QV. I am glad they have a structure that handles it. However, it looks goofy and feels like a hack, but you need it unless you want to build something "just for QV".

Here's another thing I noticed. Say you filter your date table. QV will then go through the fact and eliminate any LHAs (the other related dim) that don't have facts for those dates. QV then also goes one step further: it will filter out the LHAs associated with facts that don't have values in the fact table based on your date query. Now things seem "great"… at first, because you have filters actively filtering other filters. But guess what? If you have another fact table that does have facts in the LHAs that were just filtered, they will go missing! This is bad and deceiving.

In a sentence: if you filter on a conformed dimension, you may eliminate facts from one or more related fact tables even though they exist. I love QV for simple data marts. It seems to fall apart on enterprise DWs.

[Image: QV_DM – a simple dimensional model with date and LHA dimensions conformed to two fact tables]

Partner

Tim Webber wrote:

  I love QV for simple data marts.   It seems to fall apart on enterprise DW's.

Tim,

You appear to be forgetting the extremely powerful scripting capabilities within QlikView. While everything you say is technically true "out of the box", it is extremely rare that you cannot reshape data as desired within the script to overcome the "limitations", as you put it (although I'd argue these are design features that make QV unique rather than limitations that differentiate it from other tools). Yes, this script will generally end up creating a simple(ish) data mart, but it is almost always faster to do this in QV than in the DWH/Mart environment. I'd almost always rather see a mart and a simpler QV script if possible, but:

  • you may not have a DWH/Mart infrastructure
  • it is often a good tactical solution to use the QV script to define what the mart should eventually do - gets you to market quicker and with less resource, plus drives the requirements through actual use-case rather than guesswork

So, QV doesn't "fall apart" on enterprise DWHs - you just have to use more of the provided functionality to leverage the power of the tool.

Jason

twebber
Contributor II

Thanks for your comments Jason. I'll think about them. The main point of my post was the requirement to either create new keys to satisfy a modelling constraint in QV or accept $Syn tables. There is no other way around it: you accept QV $Syn keys or you create the keys yourself. Either way, it's more work, and I suspect quite a bit of work in key management for large DWs.

With regard to filters filtering other filters (like cascading filters), the workaround is to limit each QV application to one fact table. Otherwise, the issue I describe will occur, and in my mind it's incorrect. Just because no steering wheel was produced on any given "production date" doesn't mean that my application should filter out that production date for Cars produced on that date. It doesn't make sense. I suspect the heart of the issue is that QV should only remove a dimensional member when all facts lack a relationship to that member.

If you only have one fact table, you don't have a problem. However, if you only have one fact table, you have a data mart application that deals with one business process in the organization. One business process is not enterprise by definition. Don't get me wrong, I love QV for data mart specific applications.

MVP & Luminary

Tim,

a good practice to bridge the EDW gap is to chain several QVWs, each containing a single data mart and the conformed dimensions. That way, a user can hop with the selection from one mart to the other.

- Ralf

twebber
Contributor II

Thanks Ralf – I did exactly as you mentioned in a few applications. It definitely helps, but you cannot use the measures together in one report object or use them together in a calculation. So then this "EDW gap" is a known issue? I have not found much about it in my travels through the internet. At a minimum you would think that Gartner would have something to say about this...

Partner

Hi,

I think you're missing something about how QlikView works. When dealing with multiple fact tables/business areas, the most recommended approach, I think, is to use concatenated fact tables. This way you can concatenate as many fact tables as you need. You can also identify every fact with a text flag, such as 'Sales' as _FactType, so you can distinguish each record's corresponding fact in the front-end. You can even use the system variable SET HidePrefix to keep end users from seeing these "developer fields".

This way you'll have a single fact table that includes all the facts you need for your QlikView app, as well as all the surrounding dimensions (this is how QlikView implements the star schema). Of course, not every dimension has to be related to every fact.

You can also unify the date fields by giving the main date field in every fact the same name. Besides, you can create a master calendar (or more than one if you need different calendars). Moreover, if you combine this approach with QVW chaining and different levels of QVDs/QVWs, then you have an outstanding set of possibilities.
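A minimal sketch of the concatenated fact table pattern described above (all table and field names are invented for illustration):

```qlikview
SET HidePrefix = '_';   // hide helper fields such as _FactType from end users

Facts:
Load OrderDate as Date, Amount, 'Sales' as _FactType
Inline [
OrderDate, Amount
2014-01-10, 100
2014-01-11, 250
];

// Concatenate a second business area into the same fact table.
// Fields that exist in only one of the loads simply stay NULL
// in the records coming from the other load.
Concatenate (Facts)
Load BudgetDate as Date, BudgetAmount, 'Budget' as _FactType
Inline [
BudgetDate, BudgetAmount
2014-01-01, 120
];
```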

Another thing is the misconception around the claim that you DON'T need a DW in order to use QlikView. I think the answer to this is YES and NO. Why? Because if you don't even have a DB and work just with .txt, .xls or maybe one or two Access db files, then you can still work with QV and create a DW-like structure, combining different data tiers with different QVWs and QVDs. However, as your data analysis needs increase, you'll eventually need to move to a more robust structure.

My point is that if you don't have a DW then you can create a similar data structure with QV, but if you already have a DW,  take advantage of it and leave all the heavy transformation, creation of custom views, etc to the DW and focus QV in visualization and analysis (while QV can also be in charge of some transformations, like creation of calculated fields that do not need to exist in original DB). The better the quality of the input data the better analysis you'll be able to do with QlikView or any other BI solution.

I don't think the argument that QlikView is not enterprise-ready is valid. I've seen QlikView running in really big companies and it does really well; what makes the difference is how you model your data.

kirankkk
Contributor II

@hic Just one query: is there any way to see what the synthetic key actually holds?

As you say, they do not use a lot more memory than if you autonumber your own concatenated key.

So does it also store a sequence key, like an autonumbered or some other integer value?

Thanks & Regards,

Kiran Kokade

MVP & Luminary
MVP & Luminary

I think having a whole EDW stored in one concatenated fact table is very questionable.

Employee

Hi Tim Webber

What I'm reading here seems to be a modelling issue.

If you don't want to link the production dates of the Cars and the Wheels, then simply don't include that date in the linking key.

You probably should have different calendars for parts and for cars, and a master calendar to use when you want to link everything by date.

Henric Cronström explained the principles of this approach in the following blog post:

http://community.qlik.com/blogs/qlikviewdesignblog/2012/10/16/the-master-calendar

We have a lot of customers that use several fact tables to cover several business processes of their company and I already made several applications based on datawarehouses. But it is true that it may take a lot of thinking to get the "right" model, i.e. the model that answers intuitively to the end user's questions (your example is a very good illustration of this).

Regards,

Brice

bestofwest
Contributor II

A bad sample of the Edit Script.

Avoid using it.

joaquinlr
Valued Contributor II

Hi Henric:

I'm in a little mess, how can I resolve a synthetic key generated by an interval match ?

The interval match has been created following your IntervalMatch instructions.

Thank you,

Joaquín


If it is a simple IntervalMatch with no additional keys, you do not need to remove it. It causes no problem, and it is in fact the most efficient data model.

If it is an extended IntervalMatch – with additional keys – then you should remove it. See Slowly Changing Dimensions.

In either case, you can remove it by first creating a temporary table, and then a composite key from the temporary table. E.g.

Intervals:
Load *,
     Key & '|' & FromDate & '|' & ToDate as [Key+Interval]
From Intervals ;

Events:
Load *,
     Key & '|' & Date as [Key+Date]
From Events ;

TmpIntervalMatch:
IntervalMatch (Date, Key)
Load distinct FromDate, ToDate, Key
     Resident Intervals ;

IntervalMatch:
Load
     Key & '|' & Date as [Key+Date],
     Key & '|' & FromDate & '|' & ToDate as [Key+Interval]
     Resident TmpIntervalMatch ;

Drop Field Key From Events ;
Drop Table TmpIntervalMatch ;

HIC

joaquinlr
Valued Contributor II

Thank you Henric.

As usual, it has been a very important lesson.

qlikviewwizard
Honored Contributor II

Hi hic ,

This is very nice post. Thank you.

Not applicable

Excellent post!

Not applicable

So I will re-evaluate this article and appreciate it in its entirety.

But for now, the main helpful part is this:


But a composite key is slightly different – there is no single symbol table that contains the relevant combinations of the multiple key fields.


So the difference between a Synthetic Key table, and just "n" lines between two tables, is whether one of the two tables contains all the compound keys?


In other words, these tables will do "n" lines, because the first table contains all the compound keys (the key fields are Company and Year):

[Company, Year, Employee Count
Widget, 2014, 10
Foobar, 2015, 20]

[Company, Year, Location
Widget, 2014, Swidgherzland]


Whereas these two tables need a synthetic key, because the second table has a compound key (Acme, 2013) which the first table does not:

[Company, Year, Employee Count
Widget, 2014, 10
Foobar, 2015, 20]

[Company, Year, Location
Widget, 2014, Swidgherzland
Acme, 2013, Acmerica]

Not applicable

Oh, the comment I made was confused – the "difference" between "n" lines joining tables and a synthetic key is in fact not a real difference in data storage, just a difference in how it is represented to the user. The user can switch between "Internal Table View" and "Source View" in the Table Viewer dropdown:

[Image: same data, different representation]

Not applicable

Tim

You raised a very important topic; I had a feeling about it, but no clear picture, before I saw your post.

I agree with you; it seems QlikView's in-memory association can produce unexpected results in the situation you mentioned.

I also come from a traditional DW background and tend to think about things in the old way, and to see if we can resolve the issue in the old way.

My understanding is that:

In case there are multiple data marts and their conformed dimensions differ – for example, the mart you mentioned in the diagram uses two dimensions, while another fact linked to LHA may use only one dimension.

In this case the FILTER_IN_MART_A_IS_AFFECTING_FILTERS_IN_MART_B happens.

Really, I don't see any perfect solution in QlikView; several workarounds I can think of are:

1. Try to make the different data marts use the same number of conformed dimensions – in this case, make the other fact also link to the date dimension;

2. We all know this date dimension is a fake dimension here, but we have no choice;

3. In the date dimension, add one inclusive item such as '<ALL>';

4. Create a bridge table for the date dimension so that all the dates in the other fact map to '<ALL>';

5. Now whenever you select a specific date in mart A, the LHA in mart B won't be affected.

A simple prototype is as below:

FACT1:
LOAD * INLINE [
DATE_KEY, COUNTRY, VALUE1
2014_ALL, CHINA, 100
2015_ALL, CANADA, 200
2016_ALL, USA, 300
];

FACT2:
LOAD * INLINE [
DATE_KEY, COUNTRY, VALUE2
ALL_ALL, CHINA, 300
ALL_ALL, CANADA, 4000
ALL_ALL, USA, 555
];

DATE_LINK:
LOAD DATE_KEY, IF(YEAR='ALL', NULL(), YEAR) AS YEAR, YEAR_GROUP INLINE [
DATE_KEY, YEAR, YEAR_GROUP
2014_ALL, 2014, ALL
2015_ALL, 2015, ALL
2016_ALL, 2016, ALL
ALL_ALL, ALL, ALL
ALL_ALL, 2014, ALL
ALL_ALL, 2015, ALL
ALL_ALL, 2016, ALL
];

zippo_btl
New Contributor II

If I understand right, syn keys make all combinations of the involved fields, so an app with a syn key should consume more memory?

As I've seen in someone's example – a synthetic key over 1000 men, 1000 products and 1000 regions – the length of the syn table would be 1000*1000*1000 = 1 billion unique values. However, in real life the length of the link table would be much less, because not every man buys every product every day. So creating a manual composite key is a must in this hypothetical situation?

Can someone confirm this?

Thx


No, the Synthetic key does not create the Cartesian product of the involved fields.

If you have two tables - A and B - and they are linked by two fields - X and Y, then the synthetic key will contain all combinations of X and Y that exist in A, plus all combinations that exist in B. If you create a solution manually, then this will most likely contain exactly the same number of combinations as the synthetic key.
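A hypothetical illustration: with the script below, the $Syn table gets only the three combinations that actually occur in the data – (x1,y1), (x2,y2) and (x2,y1) – not the 2 × 2 = 4 rows of a Cartesian product of X and Y.

```qlikview
// A contributes the combinations (x1,y1) and (x2,y2);
// B contributes (x1,y1) and (x2,y1). The $Syn table holds
// the union of these: three distinct combinations.
A:
Load * Inline [
X, Y, ValueA
x1, y1, 10
x2, y2, 20
];

B:
Load * Inline [
X, Y, ValueB
x1, y1, 30
x2, y1, 40
];
```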

HIC
