Hello all,
We have an issue with QVD size: some of our QVDs reach 20 GB even though they are built from much smaller QVDs. Here is something to reproduce what happens.
We used the following script as an example:
set v_taille=10000000;
test:
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1.qvd](qvd);
Concatenate (test) Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2.qvd](qvd);
Concatenate (test) Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3.qvd](qvd);
Concatenate (test) Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4.qvd](qvd);
Concatenate (test) Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5.qvd](qvd);
Concatenate (test) Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6.qvd](qvd);
Concatenate (test) Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7.qvd](qvd);
Concatenate (test) Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8.qvd](qvd);
Concatenate (test) Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9.qvd](qvd);
Log output:
test << AUTOGENERATE(10000000)
Lines fetched: 10 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 20 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 30 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 40 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 50 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 60 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 70 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 80 000 000
test << AUTOGENERATE(10000000)
Lines fetched: 90 000 000
Creating search index
Search index created successfully
App saved
Execution finished successfully
0 forced error(s)
0 synthetic key(s)
QVD result:
The final QVD is 3.5 GB, while about 78 MB × 9 = 702 MB would be expected.
3.5 GB = 5 × the 702 MB expected after 9 iterations.
Per iteration:

| Nb of iterations | Actual QVD size (KB) | Expected QVD size (KB) | Ratio |
|---|---|---|---|
| 1 | 78 127 | 78 127 | 1.000 |
| 2 | 214 846 | 156 254 | 1.375 |
| 3 | 468 753 | 234 381 | 2.000 |
| 4 | 742 191 | 312 508 | 2.375 |
| 5 | 1 171 879 | 390 635 | 3.000 |
| 6 | 1 582 036 | 468 762 | 3.375 |
| 7 | 2 187 505 | 546 889 | 4.000 |
| 8 | 2 734 381 | 625 016 | 4.375 |
| 9 | 3 515 632 | 703 143 | 5.000 |
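One way to see where the extra bytes go is to read each test QVD's XML header, which exposes NoOfRecords and RecordByteSize (the number of bytes used per row for the bit-stuffed field indexes). This is only a sketch reusing the paths from the script above; adapt it to your own environment.
// Read the table-level QVD header of each test file and compare
// the per-record byte size across iterations.
FOR i = 1 TO 9
    QvdHeaders:
    LOAD
        $(i) AS Iteration,
        TableName,
        NoOfRecords,
        RecordByteSize,      // bytes per row of bit-stuffed field indexes
        NoOfRecords * RecordByteSize AS IndexBytes
    FROM [lib://DATA_02_RAW/TX/FLAT/test$(i).qvd]
    (XmlSimple, table is QvdTableHeader);
NEXT i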
On the other hand, the application size with one iteration is about 20 MB, and with 9 iterations about 242 MB.
1 iteration -> 20.42
9 iterations -> 185.56 = 9.08 × 20.42
12 iterations -> 248.19 = 12.15 × 20.42
The application size increases linearly.
Of course, we have deactivated data lineage on our server.
Any ideas? Poor handling of null values, or something else? And why is only the QVD affected, but not the application?
Best regards,
Simon
Hello,
Same exercise with fewer rows: 1 000.
The final QVD opened with Q-Eye shows something strange in the Bit Width:
Opened with EasyQlik QViewer for the table metadata: correct values (as expected), but the physical size on disk isn't linear, as explained by @simonaubert.
Thanks for any ideas!
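If you want to check those bit widths without Q-Eye or QViewer, the QVD's XML header can also be read with a plain load. A minimal sketch, assuming the file path from the earlier script (adjust it to your own QVD):
// Per-field storage metadata from the QVD header.
QvdFieldInfo:
LOAD
    FieldName,
    BitWidth,        // bits used per record for this field's index
    BitOffset,
    Bias,
    NoOfSymbols      // number of distinct values in the field's symbol table
FROM [lib://DATA_02_RAW/TX/FLAT/test9.qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader]);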
Hi @simonaubert
I ran a similar test to yours, but my memory resources are not that impressive, so it took almost an hour to generate the 9 QVDs. I changed the sample to just 1 000 records and extended the scope as I describe below; Qlik has a very peculiar way of optimizing memory and storage resources.
My Tests:
| Test Group | Scope |
|---|---|
| 1 | Your script; 9 QVDs having between 1 and 9 columns with different values |
| 2 | 9 QVDs having between 1 and 9 columns, each column storing the same number |
| 3 | 9 QVDs having one column, each row with the same number |
| 4 | 9 QVDs having one column, each row with a RowNo() value; the 9th QVD in this test has the same RowNo() values 9 times |
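For illustration, a minimal sketch of what Test Group 3 could look like as a script (paths and the variable name are just placeholders, not the actual test code):
// Test Group 3 sketch: one column, every row holding the same number,
// concatenated 9 times, storing a QVD after each iteration.
SET v_rows = 1000;
test_g3:
LOAD 42 AS A1 AUTOGENERATE $(v_rows);
STORE test_g3 INTO [lib://DATA_02_RAW/TX/FLAT/test_group3_1.qvd] (qvd);
FOR i = 2 TO 9
    Concatenate (test_g3)
    LOAD 42 AS A1 AUTOGENERATE $(v_rows);
    STORE test_g3 INTO [lib://DATA_02_RAW/TX/FLAT/test_group3_$(i).qvd] (qvd);
NEXT i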
The way Qlik stores the data is very interesting. I read about this in a book, and found a Qlik Community thread covering (pretty closely) what I read.
I attached the Excel file with my results.
Regards,
I'm not really surprised that the QVDs don't grow linearly, because they aren't created in a linear way; instead you force the table pointer to grow up to 90 M records.
I don't know the technical details of how Qlik handles all kinds of loads and how they are stored, so the following is just deduced from what I know and what I believe is happening here.
If you think only in terms of the familiar data logic of the data table and the connected system tables, this shouldn't happen, but those are only simplified images of the internal handling. In this case no field has a relation to any other field, because everything else is NULL (which isn't stored in any way), yet all parts of the table must have a valid relationship to every other part of the data model. Because there is no classical connection, Qlik creates at least one separate one, maybe several, and this doesn't grow linearly but rather, in some way, exponentially.
It's not quite clear to me what the aim behind your test is. I assume it may be an attempt to store completely different and independent tables within a single QVD, which is technically possible but has rather more disadvantages than benefits. I remember having seen postings about it, even with a routine for it, but I couldn't find them yet (I thought it was from Rob Wunderlich, but maybe I was wrong or didn't use the right search words).
What happens if you change your approach to something like this:
test:
load recno() as A, 'first load' as Source autogenerate $(v_taille);
store test into test1.qvd (qvd);
concatenate(test)
load recno() as A, 'second load' as Source autogenerate $(v_taille);
store test into test2.qvd (qvd);
concatenate(test)
load recno() as A, 'third load' as Source autogenerate $(v_taille);
...
I assume the result will be quite different ...
- Marcus
Hello all,
Thanks for the answers. Here is some food for thought:
1/ Our real model is a fact table with several fact types (such as sales/forecast), with some fields in common but not all, and some differences in granularity.
It works well: easy to develop, easy to understand, and really fast. The only downside is, sometimes, the QVD size.
2/ I just did the same test with a field in common:
set v_taille=10000000;
test:
Load *, A1 as A;
Load recNo() as A1 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test1_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A2 as A;
Load recNo() as A2 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test2_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A3 as A;
Load recNo() as A3 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test3_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A4 as A;
Load recNo() as A4 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test4_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A5 as A;
Load recNo() as A5 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test5_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A6 as A;
Load recNo() as A6 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test6_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A7 as A;
Load recNo() as A7 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test7_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A8 as A;
Load recNo() as A8 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test8_avec_champ_commun.qvd](qvd);
Concatenate (test)
Load *, A9 as A;
Load recNo() as A9 AutoGenerate $(v_taille);
store test into [lib://DATA_02_RAW/TX/FLAT/test9_avec_champ_commun.qvd](qvd);
exit script;
And here is the result:
| Nb of iterations | Actual QVD size (KB) | Expected QVD size (KB) | Ratio | Nb of different fields | Nb of fields |
|---|---|---|---|---|---|
| 1 | 156 252 | 156 252 | 1.00000 | 0 | 2 |
| 2 | 361 331 | 312 504 | 1.15624 | 2 | 3 |
| 3 | 605 472 | 468 756 | 1.29166 | 3 | 4 |
| 4 | 986 332 | 625 008 | 1.57811 | 4 | 5 |
| 5 | 1 367 392 | 781 260 | 1.75024 | 5 | 6 |
| 6 | 1 923 834 | 937 512 | 2.05206 | 4 | 7 |
| 7 | 2 441 412 | 1 093 764 | 2.23212 | 7 | 8 |
| 8 | 3 173 835 | 1 250 016 | 2.53904 | 5 | 9 |
| 9 | 3 828 132 | 1 406 268 | 2.72219 | 9 | 10 |
Quite interesting, isn't it?
Hello all,
Thanks for the answer and the level of technical detail here. I tried to reply with a further investigation, but the message was marked as spam for no reason; I am trying to recover it.
I think your second trial didn't change much. Yes, there is now one field in common, but in general it's quite the same approach, and therefore the growth of the file sizes remains more or less exponential rather than linear. I would also argue it should never be linear: if you try my suggestion you will also see growth that isn't linear, but with a factor smaller than 1. Linear growth in size is what you would expect from traditional databases or plain text files, but in Qlik, with its logic of storing only the distinct values in the system tables and connecting them to the data table with bit-stuffed pointers, the size-per-record should get smaller the more records are included.
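A small experiment along these lines (paths and names here are just examples, not from the original post): store the same number of rows once with only a few distinct values and once with all-distinct values. The first field needs only a tiny symbol table plus narrow bit-stuffed indexes, so its QVD should come out much smaller than the second one.
SET v_rows = 1000000;
// Only 100 distinct values: small symbol table, roughly 7 bits per record index.
few_symbols:
LOAD Mod(RecNo(), 100) AS Value AUTOGENERATE $(v_rows);
STORE few_symbols INTO [lib://DATA_02_RAW/TX/FLAT/few_symbols.qvd] (qvd);
DROP TABLE few_symbols;
// Every value distinct: one symbol per record, so the QVD grows with the raw data volume.
all_distinct:
LOAD RecNo() AS Value AUTOGENERATE $(v_rows);
STORE all_distinct INTO [lib://DATA_02_RAW/TX/FLAT/all_distinct.qvd] (qvd);
DROP TABLE all_distinct;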
The above holds for the "classical" recommended data model, using a star or snowflake schema with a single fact table. As soon as you deviate from it, you start making more or less obvious compromises with regard to sizes, load times, development effort, UI performance, readability and maintainability; and especially the last four points will increase the demands on the first two, and vice versa. This means that if a data model is a bit more complex, it quite often ends up with larger file sizes, because storage is usually the cheapest part of the environment and not the biggest bottleneck.
I see now that my suggestion didn't really fit your use case, but I nevertheless think it shows the direction in which your solution should be developed. Personally, I also use asynchronously concatenated fact tables quite often, and yes, I can admit they usually work very well; because of that I haven't really checked whether they could be optimized further, as that is something I usually only do when I have to. My data are usually a bit different from your example, though, because the different fact-table parts share most of their key values. This means, for example, that keys like date, store ID, product ID and so on are probably more than 90% identical across all parts, and only the identifiers (source) and the measures/KPIs (sales, budget, forecast, ...) differ. In the end most of the records/values are "natively" well connected to each other, so there isn't much extra overhead needed to connect them.
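A rough sketch of that kind of asynchronously concatenated fact table (field and file names are illustrative only): both parts share the key fields, and only the identifier and the measure differ, so the symbol tables are largely reused.
// Sales part: shares the key fields with the forecast part.
Facts:
LOAD
    'Sales' AS Source,
    Date,
    StoreId,
    ProductId,
    Amount AS Sales
FROM [lib://DATA_02_RAW/sales.qvd] (qvd);
// Forecast part: same key fields, only a different measure and identifier.
Concatenate (Facts)
LOAD
    'Forecast' AS Source,
    Date,
    StoreId,
    ProductId,
    Amount AS Forecast
FROM [lib://DATA_02_RAW/forecast.qvd] (qvd);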
I assume your real case is a bit different from what I describe, because your example, although quite generic here, points in another direction. A bit more elaboration would probably help to understand it better. Also, most of my statements aren't directly about a QVD but rather about a QVW, and even though both use the same main logic for handling data, it isn't necessarily exactly the same in the end.
Beyond all of the above, I don't create such QVDs at all; instead, all data parts are loaded and transformed separately within a QVD raw-data layer and a QVD transform layer (the biggest parts with incremental logic), and they are only merged within the data-model layer (the report layer just loads that binary). I think something like this would in any case be applicable in your environment to reduce the QVD file sizes. If none of this helps, you will probably need a different approach within your data models, joining/mapping all, or at least the essential, parts instead of concatenating them.
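As a sketch of that last suggestion, one could map a measure from one part into the other instead of concatenating a second fact part (names and paths are placeholders, and how the granularities line up would of course depend on the real model):
// Mapping table: one measure keyed by the fields both parts share.
ForecastMap:
MAPPING LOAD
    ProductId & '|' & MonthId AS Key,
    Forecast
FROM [lib://DATA_03_TRANSFORM/forecast_transformed.qvd] (qvd);
// Single fact table: the forecast is looked up per row instead of
// being carried as a separate concatenated part.
Facts:
LOAD
    ProductId,
    StoreId,
    Date,
    MonthId,
    Sales,
    ApplyMap('ForecastMap', ProductId & '|' & MonthId, Null()) AS Forecast
FROM [lib://DATA_03_TRANSFORM/sales_transformed.qvd] (qvd);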
- Marcus