Hello all,
We have an issue with QVD size: some of our QVDs reach 20 GB even though they are built from much smaller QVDs. Here is something to reproduce what happens.
We used the following script as an example:
set v_taille=10000000;
test:
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1.qvd](qvd);
Concatenate (test) Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2.qvd](qvd);
Concatenate (test) Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3.qvd](qvd);
Concatenate (test) Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4.qvd](qvd);
Concatenate (test) Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5.qvd](qvd);
Concatenate (test) Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6.qvd](qvd);
Concatenate (test) Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7.qvd](qvd);
Concatenate (test) Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8.qvd](qvd);
Concatenate (test) Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9.qvd](qvd);
Log output:
test << AUTOGENERATE(10000000)
Lines fetched: 10 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 20 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 30 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 40 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 50 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 60 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 70 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 80 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 90 000 000
Creating search index
Search index created successfully
App saved
Execution finished successfully
0 forced error(s)
0 synthetic key(s)
QVD result:
The final QVD is 3.5 GB, while about 78 MB × 9 = 702 MB would be expected.
3.5 GB = 5 × the 702 MB expected after 9 iterations.
Per iteration:

| Nb of iterations | Actual QVD size (KB) | Expected QVD size (KB) | Ratio |
|---|---|---|---|
| 1 | 78 127 | 78 127 | 1.000 |
| 2 | 214 846 | 156 254 | 1.375 |
| 3 | 468 753 | 234 381 | 2.000 |
| 4 | 742 191 | 312 508 | 2.375 |
| 5 | 1 171 879 | 390 635 | 3.000 |
| 6 | 1 582 036 | 468 762 | 3.375 |
| 7 | 2 187 505 | 546 889 | 4.000 |
| 8 | 2 734 381 | 625 016 | 4.375 |
| 9 | 3 515 632 | 703 143 | 5.000 |
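One way to see where the extra bytes go is to read each test QVD's XML header, which exposes NoOfRecords and RecordByteSize (the number of bytes used per row for the bit-stuffed field indexes). This is only a sketch reusing the paths from the script above; adapt it to your own environment.
// Read the table-level QVD header of each test file and compare
// the per-record byte size across iterations.
FOR i = 1 TO 9
    QvdHeaders:
    LOAD
        $(i) AS Iteration,
        TableName,
        NoOfRecords,
        RecordByteSize,      // bytes per row of bit-stuffed field indexes
        NoOfRecords * RecordByteSize AS IndexBytes
    FROM [lib://DATA_02_RAW/TX/FLAT/test$(i).qvd]
    (XmlSimple, table is QvdTableHeader);
NEXT i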
On the other hand, the application size with one iteration is about 20 MB, and with 9 iterations about 242 MB.
1 iteration -> 20.42
9 iterations -> 185.56 = 9.08 × 20.42
12 iterations -> 248.19 = 12.15 × 20.42
The application size increases linearly.
Of course, we have deactivated data lineage on our server.
Any ideas? Poor handling of null values, or something else? And why is only the QVD affected, but not the application?
Best regards,
Simon
Hello,
Same exercise with fewer rows: 1 000.
The final QVD opened with Q-Eye shows something strange in the Bit Width:
Opened with EasyQlik QViewer for the table metadata: correct values (as expected), but the physical size on disk isn't linear, as explained by @simonaubert.
Thanks for any ideas!
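If you want to check those bit widths without Q-Eye or QViewer, the QVD's XML header can also be read with a plain load. A minimal sketch, assuming the file path from the earlier script (adjust it to your own QVD):
// Per-field storage metadata from the QVD header.
QvdFieldInfo:
LOAD
    FieldName,
    BitWidth,        // bits used per record for this field's index
    BitOffset,
    Bias,
    NoOfSymbols      // number of distinct values in the field's symbol table
FROM [lib://DATA_02_RAW/TX/FLAT/test9.qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader]);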
Hi @simonaubert
I ran a similar test to yours, but my memory resources are not that impressive, so it took almost an hour to generate the 9 QVDs. I changed the sample to just 1 000 records and extended the scope as I describe below; Qlik has a very peculiar way of optimizing memory and storage resources.
My Tests:
| Test Group | Scope |
|---|---|
| 1 | Your script; 9 QVDs having between 1 and 9 columns with different values |
| 2 | 9 QVDs having between 1 and 9 columns, each column storing the same number |
| 3 | 9 QVDs having one column, each row with the same number |
| 4 | 9 QVDs having one column, each row with a RowNo() value; the 9th QVD in this test has the same RowNo() values 9 times |
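For illustration, a minimal sketch of what Test Group 3 could look like as a script (paths and the variable name are just placeholders, not the actual test code):
// Test Group 3 sketch: one column, every row holding the same number,
// concatenated 9 times, storing a QVD after each iteration.
SET v_rows = 1000;
test_g3:
LOAD 42 AS A1 AUTOGENERATE $(v_rows);
STORE test_g3 INTO [lib://DATA_02_RAW/TX/FLAT/test_group3_1.qvd] (qvd);
FOR i = 2 TO 9
    Concatenate (test_g3)
    LOAD 42 AS A1 AUTOGENERATE $(v_rows);
    STORE test_g3 INTO [lib://DATA_02_RAW/TX/FLAT/test_group3_$(i).qvd] (qvd);
NEXT i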
The way Qlik stores the data is very interesting. I read about this in a book, and found a Qlik Community thread covering (pretty closely) what I read.
I attached the Excel file with my results.
Regards,
I'm not really surprised that the QVDs don't grow linearly, because they aren't created in a linear way; instead you force the table pointer to grow up to 90 M records.
I don't know the technical details of how Qlik handles all kinds of loads and how they are stored, so the following is just deduced from what I know and what I believe is happening here.
If you think only in terms of the familiar data logic of the data table and the connected system tables, this shouldn't happen, but those are only simplified images of the internal handling. In this case no field has a relation to any other field, because everything else is NULL (which isn't stored in any way), yet all parts of the table must have a valid relationship to every other part of the data model. Because there is no classical connection, Qlik creates at least one separate one, maybe several, and this doesn't grow linearly but rather, in some way, exponentially.
It's not quite clear to me what the aim behind your test is. I assume it may be an attempt to store completely different and independent tables within a single QVD, which is technically possible but has rather more disadvantages than benefits. I remember having seen postings about it, even with a routine for it, but I couldn't find them yet (I thought it was from Rob Wunderlich, but maybe I was wrong or didn't use the right search words).
What happens if you change your approach to something like this:
test:
load recno() as A, 'first load' as Source autogenerate $(v_taille);
store test into test1.qvd (qvd);
concatenate(test)
load recno() as A, 'second load' as Source autogenerate $(v_taille);
store test into test2.qvd (qvd);
concatenate(test)
load recno() as A, 'third load' as Source autogenerate $(v_taille);
...
I assume the result will be quite different ...
- Marcus
Hello all,
Thanks for the answers. Here is some food for thought:
1/ Our real model is a fact table with several fact types (such as sales/forecast), with some fields in common but not all, and some differences in granularity.
It works well: easy to develop, easy to understand, and really fast. The only downside is, sometimes, the QVD size.
2/ I just did the same test with a field in common:
set v_taille=10000000;
test:
Load *, A1 as A;
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A2 as A;
Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A3 as A;
Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A4 as A;
Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A5 as A;
Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A6 as A;
Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A7 as A;
Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A8 as A;
Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A9 as A;
Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9_avec_champ_commun.qvd](qvd);
exit script;
And here is the result:
| Nb of iterations | Actual QVD size (KB) | Expected QVD size (KB) | Ratio | Nb of different fields | Nb of fields |
|---|---|---|---|---|---|
| 1 | 156 252 | 156 252 | 1.00000 | 0 | 2 |
| 2 | 361 331 | 312 504 | 1.15624 | 2 | 3 |
| 3 | 605 472 | 468 756 | 1.29166 | 3 | 4 |
| 4 | 986 332 | 625 008 | 1.57811 | 4 | 5 |
| 5 | 1 367 392 | 781 260 | 1.75024 | 5 | 6 |
| 6 | 1 923 834 | 937 512 | 2.05206 | 4 | 7 |
| 7 | 2 441 412 | 1 093 764 | 2.23212 | 7 | 8 |
| 8 | 3 173 835 | 1 250 016 | 2.53904 | 5 | 9 |
| 9 | 3 828 132 | 1 406 268 | 2.72219 | 9 | 10 |
Quite interesting, isn't it?
Hello all,
Thanks for the answer and the level of technical detail here. I tried to reply with a further investigation, but the message was marked as spam for no reason; I am trying to recover it.
I think your second trial didn't change much. Yes, there is now one field in common, but in general it's quite the same approach, and therefore the growth of the file sizes remains more or less exponential rather than linear. I would also argue it should never be linear: if you try my suggestion you will also see growth that isn't linear, but with a factor smaller than 1. Linear growth in size is what you would expect from traditional databases or plain text files, but in Qlik, with its logic of storing only the distinct values in the system tables and connecting them to the data table with bit-stuffed pointers, the size-per-record should get smaller the more records are included.
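A small experiment along these lines (paths and names here are just examples, not from the original post): store the same number of rows once with only a few distinct values and once with all-distinct values. The first field needs only a tiny symbol table plus narrow bit-stuffed indexes, so its QVD should come out much smaller than the second one.
SET v_rows = 1000000;
// Only 100 distinct values: small symbol table, roughly 7 bits per record index.
few_symbols:
LOAD Mod(RecNo(), 100) AS Value AUTOGENERATE $(v_rows);
STORE few_symbols INTO [lib://DATA_02_RAW/TX/FLAT/few_symbols.qvd] (qvd);
DROP TABLE few_symbols;
// Every value distinct: one symbol per record, so the QVD grows with the raw data volume.
all_distinct:
LOAD RecNo() AS Value AUTOGENERATE $(v_rows);
STORE all_distinct INTO [lib://DATA_02_RAW/TX/FLAT/all_distinct.qvd] (qvd);
DROP TABLE all_distinct;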
The above holds for the "classical" recommended data model, using a star or snowflake schema with a single fact table. As soon as you deviate from it, you start making more or less obvious compromises with regard to sizes, load times, development effort, UI performance, readability and maintainability; and especially the last four points will increase the demands on the first two, and vice versa. This means that if a data model is a bit more complex, it quite often ends up with larger file sizes, because storage is usually the cheapest part of the environment and not the biggest bottleneck.
I see now that my suggestion didn't really fit your use case, but I nevertheless think it shows the direction in which your solution should be developed. Personally, I also use asynchronously concatenated fact tables quite often, and yes, I can admit they usually work very well; because of that I haven't really checked whether they could be optimized further, as that is something I usually only do when I have to. My data are usually a bit different from your example, though, because the different fact-table parts share most of their key values. This means, for example, that keys like date, store ID, product ID and so on are probably more than 90% identical across all parts, and only the identifiers (source) and the measures/KPIs (sales, budget, forecast, ...) differ. In the end most of the records/values are "natively" well connected to each other, so there isn't much extra overhead needed to connect them.
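A rough sketch of that kind of asynchronously concatenated fact table (field and file names are illustrative only): both parts share the key fields, and only the identifier and the measure differ, so the symbol tables are largely reused.
// Sales part: shares the key fields with the forecast part.
Facts:
LOAD
    'Sales' AS Source,
    Date,
    StoreId,
    ProductId,
    Amount AS Sales
FROM [lib://DATA_02_RAW/sales.qvd] (qvd);
// Forecast part: same key fields, only a different measure and identifier.
Concatenate (Facts)
LOAD
    'Forecast' AS Source,
    Date,
    StoreId,
    ProductId,
    Amount AS Forecast
FROM [lib://DATA_02_RAW/forecast.qvd] (qvd);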
I assume your real case is a bit different from what I describe, because your example, although quite generic here, points in another direction. A bit more elaboration would probably help to understand it better. Also, most of my statements aren't directly about a QVD but rather about a QVW, and even though both use the same main logic for handling data, it isn't necessarily exactly the same in the end.
Beyond all of the above, I don't create such QVDs at all; instead, all data parts are loaded and transformed separately within a QVD raw-data layer and a QVD transform layer (the biggest parts with incremental logic), and they are only merged within the data-model layer (the report layer just loads that binary). I think something like this would in any case be applicable in your environment to reduce the QVD file sizes. If none of this helps, you will probably need a different approach within your data models, joining/mapping all, or at least the essential, parts instead of concatenating them.
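As a sketch of that last suggestion, one could map a measure from one part into the other instead of concatenating a second fact part (names and paths are placeholders, and how the granularities line up would of course depend on the real model):
// Mapping table: one measure keyed by the fields both parts share.
ForecastMap:
MAPPING LOAD
    ProductId & '|' & MonthId AS Key,
    Forecast
FROM [lib://DATA_03_TRANSFORM/forecast_transformed.qvd] (qvd);
// Single fact table: the forecast is looked up per row instead of
// being carried as a separate concatenated part.
Facts:
LOAD
    ProductId,
    StoreId,
    Date,
    MonthId,
    Sales,
    ApplyMap('ForecastMap', ProductId & '|' & MonthId, Null()) AS Forecast
FROM [lib://DATA_03_TRANSFORM/sales_transformed.qvd] (qvd);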
- Marcus