simonaubert
Partner - Specialist II

QVD size increases exponentially (and data lineage is not active): are null values badly managed?

Hello all,

We have some issues with QVD size. Some of our QVDs are up to 20 GB even though they are built from much smaller QVDs. Here is a way to reproduce what happens:

We used the following script as an example:

set v_taille=10000000;

test:
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1.qvd] (qvd);

Concatenate (test) Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2.qvd] (qvd);

Concatenate (test) Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3.qvd] (qvd);

Concatenate (test) Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4.qvd] (qvd);

Concatenate (test) Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5.qvd] (qvd);

Concatenate (test) Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6.qvd] (qvd);

Concatenate (test) Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7.qvd] (qvd);

Concatenate (test) Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8.qvd] (qvd);

Concatenate (test) Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9.qvd] (qvd);

Logs result:

test << AUTOGENERATE(10000000)
Lines fetched: 10 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 20 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 30 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 40 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 50 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 60 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 70 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 80 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 90 000 000
Creating search index
Search index created successfully
App saved
Execution finished successfully
0 forced error(s)
0 synthetic key(s)

 

QVD result:

[screenshot: QVD file sizes after each iteration]

The final QVD is 3.5 GB, while we would expect about 78 MB * 9 = 702 MB.

3.5 GB = 5 * the 702 MB expected with 9 iterations.

 

Iteration | Result QVD size (KB) | Expected QVD size (KB) | Ratio
1 | 78 127 | 78 127 | 1.000
2 | 214 846 | 156 254 | 1.375
3 | 468 753 | 234 381 | 2.000
4 | 742 191 | 312 508 | 2.375
5 | 1 171 879 | 390 635 | 3.000
6 | 1 582 036 | 468 762 | 3.375
7 | 2 187 505 | 546 889 | 4.000
8 | 2 734 381 | 625 016 | 4.375
9 | 3 515 632 | 703 143 | 5.000

 

  • The ratio increases by 0.375, then by 1 - 0.375 = 0.625, and so on, alternating: 1.000 + 0.375 = 1.375, 1.375 + 0.625 = 2.000, 2.000 + 0.375 = 2.375, etc.

On the other hand, the application size with one iteration is 20 MB:

[screenshot: application size with 1 iteration]

And with 9 iterations: 242 MB.

[screenshot: application size with 9 iterations]

1 iteration -> 20.42 MB
9 iterations -> 185.56 MB = 9.08 * 20.42
12 iterations -> 248.19 MB = 12.15 * 20.42

The application size increases linearly.


Of course, we have deactivated data lineage on our server.


Any ideas, anyone? Is this bad handling of null values, or something else? And why is only the QVD affected, but not the application?

Best regards,

Simon

Bi Consultant (Dataviz & Dataprep) @ Business & Decision
6 Replies
benoithervochon
Partner - Contributor III

Hello,

Same exercise with fewer rows: 1,000.

Opening the final QVD with Q-Eye shows something strange in the Bit Width column:

[screenshot: Q-Eye showing the Bit Width values of the final QVD]

 

Opened with EasyQlik QViewer to check the table metadata: the values are correct (as expected), but the physical size on disk isn't linear, as explained by @simonaubert.

[screenshot: EasyQlik QViewer table metadata and QVD sizes on disk]
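As a side note, the QVD's XML header can also be read from a load script to inspect the bit width per field; here is only a rough sketch (the path is hypothetical), based on the usual QvdTableHeader structure:

// Read the field metadata from the QVD's XML header (hypothetical path).
// BitWidth is the number of bits used per row in the bit-stuffed pointer,
// NoOfSymbols is the number of distinct values in the field's symbol table.
QvdFields:
LOAD
    FieldName,
    BitWidth,
    NoOfSymbols
FROM [lib://DATA_02_RAW/TX/FLAT/test9.qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader]);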

 

Thanks for any ideas!

¯\_(ツ)_/¯
ArnadoSandoval
Specialist II

Hi @simonaubert 

I ran a similar test to yours, but my memory resources are not that impressive, so it took almost an hour to generate the 9 QVDs; I changed the sample to just 1,000 records and extended the scope as described below. Qlik has a very peculiar way of optimizing memory and storage resources.

My Tests:

Test group | Scope
1 | Your script: 9 QVDs having between 1 and 9 columns with different values
2 | 9 QVDs having between 1 and 9 columns, each column storing the same number
3 | 9 QVDs having one column, each row with the same number
4 | 9 QVDs having one column, each row with a RowNo() value; the 9th QVD in this test will have 9 times the same RowNo()

 

  • Test 1: the cardinality of these QVDs is very high; actually all the numbers in each column are different, and for Qlik the same number appearing in column A and in column B counts as two different values (each column is a dimension). In this scenario Qlik has a hard time optimizing the size of the QVD (I disregard the NULLs: since they repeat many times, Qlik does a good job optimizing the space they require).
  • Test 2: the cardinality of these QVDs is very low; the cardinality of each dimension is 1 in the first QVD and 2 for the rest, so the storage space for these QVDs is very small compared with the QVDs of Test 1. This test produced two unexpected jumps in size, when the QVD has 5 and 9 columns; otherwise the growth of these QVDs is pretty stable.
  • Test 3: all the QVDs have one column (one dimension), each row containing the same number, e.g. 1; this dimension has a low cardinality of just 1, and these are the smallest QVDs generated.
  • Test 4: all the QVDs have one dimension (one column), each row containing RowNo(); the cardinality of the QVD's dimension is always the number of generated rows (1,000 in my tests), and the QVD size reflects Qlik's optimization: it stays pretty constant (a rough sketch of Tests 3 and 4 follows below).
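For reference, the Test 3 and Test 4 scripts could look roughly like this; it is only a sketch of the idea, not the exact script used (the file names are invented, and RecNo() is used so that the generated values 1..1000 repeat in every concatenated block):

set v_rows=1000;

// Test 3: one column, the same value in every row (cardinality 1).
test3:
Load 1 as SameValue AutoGenerate $(v_rows);
store test3 into [lib://DATA_02_RAW/TX/FLAT/test3_iter1.qvd] (qvd);

// Test 4: one column holding RecNo(); each concatenated block repeats
// the values 1..1000, so after 9 iterations every value appears 9 times.
test4:
Load RecNo() as RowValue AutoGenerate $(v_rows);
store test4 into [lib://DATA_02_RAW/TX/FLAT/test4_iter1.qvd] (qvd);
Concatenate (test4) Load RecNo() as RowValue AutoGenerate $(v_rows);
store test4 into [lib://DATA_02_RAW/TX/FLAT/test4_iter2.qvd] (qvd);
// ... and so on up to the 9th iteration.

In Test 3 the symbol table holds a single value, so only the bit-stuffed pointer grows; in Test 4 the symbol table is the same 1,000 values for every QVD, which would explain why the sizes stay so stable.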

The way Qlik stores the data is very interesting; I read about this in a book and found a Qlik Community thread covering what I read (pretty closely).

I attached the Excel file with my results.

Regards,

Arnaldo Sandoval
A journey of a thousand miles begins with a single step.
marcus_sommer

I'm not really surprised that the QVDs don't grow linearly, because they aren't created that way; instead you force the table pointer to grow up to 90 M rows.

I don't know the technical details of how Qlik handles the various kinds of loads and how they are stored, so the following is just deduced from what I know and what I believe happens here.

If you just think in terms of the familiar logic of the data table and the connected system tables, this shouldn't happen, but those are only simplified images of the internal handling. In this case no field has a relation to any other field, because everything else is NULL (which isn't stored in any way), yet all parts of the table must still have a valid relationship to the rest of the data model. Because there is no classical connection, Qlik creates at least one separate one, maybe several, and this doesn't grow linearly but in some way exponentially.

For me it's not quite clear what the aim behind your test is. I assume it may be an attempt to store completely different and independent tables within a single QVD, which is technically possible but has rather more disadvantages than benefits. I remember having seen postings about it, even with a routine, but couldn't find them yet (I thought it was from Rob Wunderlich, but maybe I was wrong or didn't use the right search words).

What happens if you change your approach to something like this:

test:
load recno() as A, 'first load' as Source autogenerate $(v_taille);
    store test into test1.qvd (qvd); concatenate(test)
load recno() as A, 'second load' as Source autogenerate $(v_taille);
    store test into test2.qvd (qvd); concatenate(test)
load recno() as A, 'third load' as Source autogenerate $(v_taille);
...

I assume the result will be quite different ...

- Marcus

simonaubert
Partner - Specialist II
Author

Hello all,

Thanks for the answers. Here is some food for thought:

1/ Our real model is a fact table with several fact types (such as sales/forecast), with some fields in common but not all, and some differences in granularity.
It works well: easy to develop, easy to understand, and really fast. The only downside is sometimes the QVD size.

2/ I just did the same test with a field in common:

set v_taille=10000000;

test:
Load *, A1 as A;
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A2 as A;
Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A3 as A;
Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A4 as A;
Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A5 as A;
Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A6 as A;
Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A7 as A;
Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A8 as A;
Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8_avec_champ_commun.qvd] (qvd);

Concatenate (test)
Load *, A9 as A;
Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9_avec_champ_commun.qvd] (qvd);

exit script;


And here is the result:

Nb iterations | Result QVD size (KB) | Expected QVD size (KB) | Ratio | Nb of fields
1 | 156 252 | 156 252 | 1.000000 | 2
2 | 361 331 | 312 504 | 1.156242 | 3
3 | 605 472 | 468 756 | 1.291663 | 4
4 | 986 332 | 625 008 | 1.578114 | 5
5 | 1 367 392 | 781 260 | 1.750245 | 6
6 | 1 923 834 | 937 512 | 2.052064 | 7
7 | 2 441 412 | 1 093 764 | 2.232127 | 8
8 | 3 173 835 | 1 250 016 | 2.539045 | 9
9 | 3 828 132 | 1 406 268 | 2.722199 | 10

 

Quite interesting, isn't it?

Bi Consultant (Dataviz & Dataprep) @ Business & Decision
simonaubert
Partner - Specialist II
Author

Hello all,

 

Thanks for the answers and the level of technical depth here. I tried to reply with a further investigation, but the message was marked as spam for no reason. I will try to recover it.

Bi Consultant (Dataviz & Dataprep) @ Business & Decision
marcus_sommer

I think your second trial didn't change much. Yes, now there is one field in common, but in general it's the same approach, and therefore the growth of the file sizes remains in some way exponential and not linear. I would also think it should never be linear: if you try my suggestion you will also see an exponential growth, but with a factor smaller than 1. A linear growth in size could be expected from traditional databases or plain text files, but in Qlik, with its logic of storing only distinct values within the system tables and connecting them to the data table with a bit-stuffed pointer, the size-to-number-of-records ratio should get smaller the more records are included.

The above is valid for the "classical" recommended data model using a star or snowflake schema with a single fact table. The further you deviate from it, the more you make more or less obvious compromises in regard to sizes, load times, development effort, UI performance, readability and maintainability - and especially the last four points will increase the needs of the first two, and vice versa. This means that if a data model is a bit more complex, it quite often comes with larger file sizes, because storage is usually the cheapest part of the environment and not the biggest bottleneck.

I see now that my suggestion didn't really fit your use case, but nevertheless I think it shows the direction in which your solution should rather be developed. Personally, I also quite often use asynchronously concatenated fact tables, and yes, I admit they usually work very well; because of that I haven't really checked whether they could be optimized further - it's something I usually only do if I need to. My data are usually a bit different from those in your example, though, because the different fact-table parts share most of the key values. This means, for example, that keys like date, store ID, product ID and so on are probably more than 90% the same across all parts, and only the identifiers (source) and measures/KPIs (sales, budget, forecast, ...) differ. In the end most of the records/values are "natively" well connected to each other, so there isn't much extra overhead to connect them.
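A rough sketch of that pattern (all field names and paths are invented for illustration): the two fact parts share the key fields, and only the Source identifier and the measures differ.

// Both fact parts share Date, StoreID and ProductID; only Source and the
// measure columns (Sales vs. Budget) differ, so the concatenated table
// stays well connected and the sparse columns remain limited.
Facts:
Load Date, StoreID, ProductID, 'Sales' as Source, Sales
From [lib://DATA_02_RAW/sales.qvd] (qvd);

Concatenate (Facts)
Load Date, StoreID, ProductID, 'Budget' as Source, Budget
From [lib://DATA_02_RAW/budget.qvd] (qvd);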

I assume your real case is a bit different from my remarks, because your example, although quite generic here, points in another direction. Maybe a bit more elaboration would help to understand it better. Also, most of my statements aren't directly related to a QVD but to a QVW - and even if both use the same main logic for data handling, it isn't necessarily exactly the same in the end.

Besides all of the above, I don't create such QVDs myself; instead, all data parts are loaded and transformed separately within a QVD raw-data layer and a QVD transform layer (the biggest parts with incremental logic), and only within the data-model layer are they merged (the report layer then just loads the result binary). I think something like this would in any case be applicable in your environment to reduce the QVD file sizes. If none of this helps, you will probably need a different approach within your data models, joining/mapping all, or at least the essential, parts instead of concatenating them.
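As an illustration of that joining/mapping alternative (again only a sketch, with invented names and paths), a sparse part such as a forecast could be mapped onto the main fact table instead of being concatenated:

// Map the forecast values onto the sales facts via a composite key instead
// of concatenating a separate, mostly empty forecast part.
ForecastMap:
Mapping Load Date & '|' & ProductID as %Key, Forecast
From [lib://DATA_02_RAW/forecast.qvd] (qvd);

Facts:
Load *, ApplyMap('ForecastMap', Date & '|' & ProductID, null()) as Forecast
From [lib://DATA_02_RAW/sales.qvd] (qvd);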

- Marcus