Qlik_William
Contributor III

Concatenate vs link table app size

Hello!

I wanted to see the difference in app size between the concatenate method and the link table approach for 3 fact tables with few fields in common. In my example I load 20M rows for each fact table, and only 3 of the 11 dimensions in total are shared between the tables.

So I made some test data. I'm not sure what conclusions I can draw from this, if any, which is why I made this post. Maybe my example does not make sense. But in my example the link table approach has an app size of 159MB and the concatenate approach an app size of 184MB.

So the concatenate approach created a larger app, which goes against the general idea that the link table approach creates a larger app due to the extra link table. I guess the reason is that I only use 10 distinct values for Dim1-Dim10, so my link table is small: only 110 rows (the 10 x 10 Dim1/Dim2 combinations from Tables 1 and 3, plus 10 rows with a null Dim2 from Table 2). Maybe this is not realistic.

But I'm still curious why the concatenate approach created a larger app. Is it because of the null values? I thought the null values had an insignificant impact on app size. Is it because the table is wider?

The code:

---CONCATENATE---

Table1: // fact 1: year + Dim1-Dim10 + Value1, 20M rows
LOAD
'2023' as year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 10) + 1 AS Dim2,
Floor(Rand() * 10) + 1 AS Dim3,
Floor(Rand() * 10) + 1 AS Dim4,
Floor(Rand() * 10) + 1 AS Dim5,
Floor(Rand() * 10) + 1 AS Dim6,
Floor(Rand() * 10) + 1 AS Dim7,
Floor(Rand() * 10) + 1 AS Dim8,
Floor(Rand() * 10) + 1 AS Dim9,
Floor(Rand() * 10) + 1 AS Dim10,
Floor(Rand() * 100) + 1 AS Value1,
'Table1' as Source
AutoGenerate(20000000);

Concatenate(Table1) // fact 2: shares only year and Dim1 with fact 1
LOAD
'2023' as year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 100) + 1 AS Value2,
'Table2' as Source
AutoGenerate(20000000);

Concatenate(Table1) // fact 3: shares year, Dim1 and Dim2
LOAD
'2023' as year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 10) + 1 AS Dim2,
Floor(Rand() * 100) + 1 AS Value3,
'Table3' as Source
AutoGenerate(20000000);

---LINK TABLE---

_Table1: // staging loads; the link table and final facts are built from these
LOAD
'2023' as Year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 10) + 1 AS Dim2,
Floor(Rand() * 10) + 1 AS Dim3,
Floor(Rand() * 10) + 1 AS Dim4,
Floor(Rand() * 10) + 1 AS Dim5,
Floor(Rand() * 10) + 1 AS Dim6,
Floor(Rand() * 10) + 1 AS Dim7,
Floor(Rand() * 10) + 1 AS Dim8,
Floor(Rand() * 10) + 1 AS Dim9,
Floor(Rand() * 10) + 1 AS Dim10,
Floor(Rand() * 100) + 1 AS Value1
AutoGenerate(20000000);

_Table2:
Noconcatenate LOAD
'2023' as Year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 100) + 1 AS Value2
AutoGenerate(20000000);

_Table3:
Noconcatenate LOAD
'2023' as Year,
Floor(Rand() * 10) + 1 AS Dim1,
Floor(Rand() * 10) + 1 AS Dim2,
Floor(Rand() * 100) + 1 AS Value3
AutoGenerate(20000000);


LinkTable: // distinct combinations of the shared dimensions, plus composite keys
LOAD Distinct
Year & '|' & Dim1 & '|' & Dim2 as Table1CompositeKey,
Year & '|' & Dim1 as Table2CompositeKey,
Year & '|' & Dim1 & '|' & Dim2 as Table3CompositeKey,
Year,
Dim1,
Dim2
Resident _Table1;

Concatenate(LinkTable)
LOAD Distinct
Year & '|' & Dim1 & '|' & null() as Table1CompositeKey,
Year & '|' & Dim1 as Table2CompositeKey,
Year & '|' & Dim1 & '|' & null() as Table3CompositeKey,
Year,
Dim1,
null() as Dim2
Resident _Table2;

Concatenate(LinkTable)
LOAD Distinct
Year & '|' & Dim1 & '|' & Dim2 as Table1CompositeKey,
Year & '|' & Dim1 as Table2CompositeKey,
Year & '|' & Dim1 & '|' & Dim2 as Table3CompositeKey,
Year,
Dim1,
Dim2
Resident _Table3;

Table1:
LOAD
Year & '|' & Dim1 & '|' & Dim2 as Table1CompositeKey,
Dim3,
Dim4,
Dim5,
Dim6,
Dim7,
Dim8,
Dim9,
Dim10,
Value1
Resident _Table1;

Drop table _Table1;

Table2:
LOAD
Year & '|' & Dim1 as Table2CompositeKey,
Value2
Resident _Table2;

Drop table _Table2;

Table3:
LOAD
Year & '|' & Dim1 & '|' & Dim2 as Table3CompositeKey,
Value3
Resident _Table3;

Drop table _Table3;

Link table data model:

[screenshot: link table data model]

 

18 Replies
Oleg_Troyansky
Partner Ambassador/MVP

Hi @Vegar ,

Just to clarify your point about symbol tables and how they help reduce the size... The concept of Symbols - the lists of distinct values that are not stored repeatedly, the way they are in relational databases - helps avoid most data duplication concerns. You can have the same long 40-character string repeated 100 million times in the data set, and it will only take 40 characters in Symbols and 100 million very small 1-bit pointers in the Index table. However, a smaller 8-byte number could be stored next to it with 1 million distinct values, and all of a sudden the Symbols will take 8 MB, and the Index will require 100 million rows multiplied by 20 bits (that's the size of the pointer that covers 1 million distinct values) - do the math yourself, and you will see how cardinality matters.

Cardinality gets you on both ends - your Symbols table is longer, and your pointer size in the Index table is wider, and it gets stored millions of times in large data sets.
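Writing that out (my arithmetic, assuming a 1-bit pointer for the near-constant field and a 20-bit pointer for 1 million distinct values, since 2^20 > 1,000,000):

40-char string, ~2 distinct values, 100M rows:
Symbols: ~40 bytes per distinct string
Index: 100,000,000 rows * 1 bit = 12.5 MB

8-byte number, 1,000,000 distinct values, 100M rows:
Symbols: 1,000,000 * 8 bytes = 8 MB
Index: 100,000,000 rows * 20 bits = 250 MB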

Cheers,

Oleg Troyansky

 

Oleg_Troyansky
Partner Ambassador/MVP

Hi @Qlik_William ,

This is an interesting question! I'm going to test it in detail and will likely write a blog post about it.

Unrelated to the performance difference, allow me to make a comment about the way you build the Link Table. You can further improve the size and the performance of the Link Table by avoiding separate combo keys with identical combinations of fields. You don't have to have a unique key field for each fact table. If multiple tables share the same key fields, you can generate a single combo key and use it in all of these tables. In your example, tables 1 and 3 share the same combination of key fields. You can get by with a single combo key for them, cutting the memory consumption on these fields in half.
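
A minimal sketch of that change, reusing the field names from your script (untested):

LinkTable:
LOAD Distinct
// one key shared by Tables 1 and 3, since they join on the same fields
Year & '|' & Dim1 & '|' & Dim2 as Table13CompositeKey,
Year & '|' & Dim1 as Table2CompositeKey,
Year,
Dim1,
Dim2
Resident _Table1;

// the concatenated loads from _Table2/_Table3 follow the same pattern,
// and Table1 and Table3 then both load Table13CompositeKey instead of
// two separate (but identical) keys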

Cheers,

Oleg Troyansky

Oleg_Troyansky
Partner Ambassador/MVP

I tested this script with 10, 1000 and 10000 distinct values.

With 1,000 distinct values, the Link Table still looked smaller than the Concatenated Fact - I believe it's because, out of the 3 key fields, only two contribute to the performance/memory question - Dim1 and Dim2. The combination of two keys with 1,000 distinct values each produces 1 million distinct values - that's bad, but I guess not bad enough.

However, when I took the number of possible values in the two keys up to 10,000 (making the possible size of the combo key up to 100 million values), the Link Table solution became much heavier than the Concatenated Fact - 1.8 GiB vs. 665 MiB.

So, clearly the cardinality of your key fields, and of your concatenated combo keys, determines which data model will require more memory. With relatively low cardinality, Link Tables perform better and offer other benefits, like built-in associations between the facts. With high cardinality, the size of the combo key may become prohibitive, and in that case Concatenated Facts will require less memory and operate faster in RAM.
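
For anyone who wants to repeat the test, one way to parameterize the cardinality (a sketch, not my exact script):

LET vCardinality = 10000; // try 10, 1000 and 10000

Test:
LOAD
Floor(Rand() * $(vCardinality)) + 1 AS Dim1,
Floor(Rand() * $(vCardinality)) + 1 AS Dim2,
Floor(Rand() * 100) + 1 AS Value1
AutoGenerate(20000000);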

Cheers,

Oleg 

rwunderlich
Partner Ambassador/MVP

The size numbers you posted for your data models are the size on disk, which is not meaningful to this discussion. What you are interested in is the size in memory, or the "base memory size" in the QMC.

[screenshot: base memory size in the QMC]

Using the memory size number we can see that the Concatenate model is not just still larger, but much larger.

In your example, the additional index (row pointer) space is the reason for the greater size, coupled with the fact that your link table has fewer values than a real-world example is likely to have.

Oleg gave a good overview and I'd also refer you to these posts:
https://qlikviewcookbook.com/2024/11/reducing-qlik-field-memory/
https://community.qlik.com/t5/Design/Symbol-Tables-and-Bit-Stuffed-Pointers/ba-p/1475369

I analyzed your models with QSDA Pro to measure the differences. In the Link Table model the Value1 field requires 17MB of index (row pointer) space. 
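
That figure matches the pointer math (my arithmetic, not QSDA output): Value1 has 100 distinct values, so each row pointer needs ceil(log2(100)) = 7 bits, and 20,000,000 rows * 7 bits is roughly 17.5 MB.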

[QSDA Pro screenshot: Value1 index space in the Link Table model]

The Concatenated model, having three times the rows for Value1, requires three times the index space for this same field.

[QSDA Pro screenshot: Value1 index space in the Concatenated model]

I don't think there is a general rule that says a concatenated or a link table model will be larger. It really depends on the data. 

I can think of a few rules that will generally be true:

1. Concatenated vs Link Table approach may present different behavior (results) in the front end when selections are made. 

2. Concatenated models will typically perform faster, although you can get quite acceptable performance from a Link Table. 

3. Sparse (lots of nulls) tables are memory "expensive" - expensive in the sense that the null values don't represent any data, but they still take up index space (see the sketch below).
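
If you want to see rule 3 for yourself, here's a minimal experiment (a sketch - load each table into its own app and compare the base memory sizes in the QMC):

// App A: dense - every row carries a value
Dense:
LOAD
Floor(Rand() * 100) + 1 as Value
AutoGenerate(20000000);

// App B: sparse - two of every three rows are null; the nulls carry
// no data, but each row still needs a pointer in the index
Sparse:
LOAD
If(Mod(RecNo(), 3) = 0, Floor(Rand() * 100) + 1) as Value
AutoGenerate(20000000);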

-Rob
http://www.easyqlik.com
http://masterssummit.com
http://qlikviewcookbook.com

Qlik_William
Contributor III
Author

Good point, and thanks for your feedback. I guess I wanted to test with 3 distinct combo keys but should have actually made them distinct, my bad.

 

Qlik_William
Contributor III
Author

Thanks for your efforts!

I'm curious - when you tested, did you apply AutoNumber on the key fields for the link method? As I understand it, the keys will then take no memory in the symbol tables, but maybe I am missing something. And shouldn't that improve the memory performance of the link method vs the concatenate?

I think a takeaway for me here is that null values actually can have a significant impact on memory performance, which is not the impression I got when I browsed the forums.

 

Qlik_William
Contributor III
Author

Interesting, thanks for the reply.

So the null values take space according to the number of distinct values in the same field? That's something I did not think about. In other words, having null values in high-cardinality fields is worse than having nulls in low-cardinality fields.

I would assume that Value1 would take up slightly more space with concatenate, but not three times as much.

Oleg_Troyansky
Partner Ambassador/MVP

Hi William,

At first, I tested the models without using AutoNumber. Then I applied AutoNumber, and the size of the composite keys was reduced substantially.
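
For reference, a sketch of how that could look applied to the composite keys (using the AutoNumber() function; an AUTONUMBER statement would work as well):

LinkTable:
LOAD Distinct
// AutoNumber replaces each string key with a sequential integer,
// which takes no space in the symbol tables
AutoNumber(Year & '|' & Dim1 & '|' & Dim2, 'Key13') as Table1CompositeKey,
AutoNumber(Year & '|' & Dim1, 'Key2') as Table2CompositeKey,
Year,
Dim1,
Dim2
Resident _Table1;

// the fact tables must build their keys with the same AutoID
// ('Key13' / 'Key2') so the generated integers match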

As promised, I posted a blog article on this example with a detailed description of the issues. Happy reading!

https://www.naturalsynergies.com/q-tip-27-link-table-or-concatenated-fact/

Cheers,

Oleg

marcus_sommer

The storage and RAM size of a data model is only one part of the performance. Depending on the requirements and the biggest bottleneck in the environment, the script runtime and the UI calculation times also need to be considered, and they might be more important than the sizing.

Besides this, there are more aspects which usually have an impact on a data model decision, like the effort to develop and maintain a data model and its complexity - especially in regard to the majority of data models and applications. This leads to the question of how to balance between a few general approaches and a lot of highly optimized ones.

This may lead to the next question of how to start a new project in general. The simplest approach is to load a single fact table and add one or two dimensions. Maybe the same is done with the next fact table, qualifying both data sets against each other. The next step could then be to concatenate both facts - maybe just the essential parts, and not necessarily with a full harmonization of field names and data structures - to evaluate the associations between the fact parts and the dimensions. And then switching the dimensions, adding more, and/or going further with the next facts.
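
For readers who haven't used that qualify step, a minimal sketch (hypothetical table and field names):

// prefix every field with its table name, so both facts can be
// loaded side by side without creating any associations
QUALIFY *;

Orders:
LOAD * INLINE [
OrderId, Amount
1, 100
2, 250
];

Shipments:
LOAD * INLINE [
ShipmentId, Amount
1, 90
2, 240
];

UNQUALIFY *;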

Even at this point it's quite hard to estimate which kind of data model may have more benefits for the sizing with the real data set. But the concatenated facts within a star schema are already there and usually work quite well. Deriving any link table always means adding extra effort and complexity - at least compared to the star schema, even if the incremental approach hinted at above is skipped and one starts directly with a link table data model.

In my experience the star schema fits most scenarios, is the simplest approach, and is usually the starting point - without any special requirement I wouldn't go with another one.

IMO there is also a logical point to consider, and that is the kind of relationship between the facts. Not all fact tables really contain data that is different from each other. Quite common are scenarios with current + budget + forecast data, or orders + payments + shipments, which are essentially the same data - only their direction or point in the journey is different. That they are kept separate within the ERP systems is appropriate or even mandatory, but this doesn't mean it's sensible to keep them separate within a BI tool, and especially not within Qlik.