Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi!
I am currently creating aggregated data files (sales grouped by week, month).
What I found is that function Sum (and, actually, any other aggregation function) works very slow in a load script.
I made simple test script. It creates a table "id - value" with 10 million lines and then calculates total sums per "id".
t:
LOAD
Mod(RecNo(), 10) as id,
Mod(RecNo(), 100) as sales
AutoGenerate 10000000;
total:
LOAD
id,
sum(sales) as total_sales
Resident t
Group by id;
Here is a fragment from the log-file generated by the script execution.
2018-09-28 15:46:58 0038 LOAD
2018-09-28 15:46:58 0039 Mod(RecNo(), 10) as id,
2018-09-28 15:46:58 0040 Mod(RecNo(), 100) as sales
2018-09-28 15:46:58 0041 AutoGenerate 10000000
2018-09-28 15:47:00 2 fields found: id, sales,
2018-09-28 15:47:00 10 000 000 lines fetched
2018-09-28 15:47:00 0043 total:
2018-09-28 15:47:00 0044 LOAD
2018-09-28 15:47:00 0045 id,
2018-09-28 15:47:00 0046 sum(sales) as total_sales
2018-09-28 15:47:00 0047 Resident t
2018-09-28 15:47:00 0048 Group by id
2018-09-28 15:47:10 2 fields found: id, total_sales,
2018-09-28 15:47:10 10 lines fetched
As we can see the execution of "LOAD - sum() - group by" statement takes about 10 seconds. But when it goes in the application in a straight table chart this the same "sum()" expression calculates instantly.
When aggregating real table (10 million records and 20 fields) with 7 sums and 1 item in "group by" clause script execution takes more than 2 minutes. But the calculation is still instant in a straight table!
Why my sums are so slow?
Please, tell me if it is something wrong with my Load. Is there any way to make my aggregations faster?
Thank you!
Victor.
Any object in QlikView is using values and pointers primarily (also vectors and states, etc.) as a result of the in memory data model that has been loaded.
During the execution of the script, though, all those values and pointers do not exist yet, as they are still being created while the script runs, so my wild guess goes to that the load script is doing a lot of other stuff -although not visible- that the object has already done.
I don't think anybody here but perhaps Henric or anyone from Qlik R&D can tell this for sure, but also likely, some operations in the script are single threaded for a longer period of time, as the script has to check all those temporary pointers and distinct values or create ones if they don't exist, while most of the things that happen when you use the QlikView app in memory are most often multithreaded.
Last but not least, until recently, reading RESIDENT was slower than reading from QVDs, for example, which could also impact how long it takes to just do the RESIDENT load, even without aggregations.
In general, I avoid doing aggregations and JOINs in the script when millions of rows are involved in either table, because QlikView is not a database (not in the sense we all are accustomed to) and likewise the script is not an ETL, although it can extract, transform and load.
Qlik is not so good in data aggregation on script level (it can't benefit it's engine yet) so it is better to do GROUP BY on the data warehouse side.
In some way you already noticed that it is not mandatory to aggregate the data within the script to get a fast performance within the UI. As far as you could use rather simple calculations by avoiding aggr() and if-loops a qlik application will be quite performant even with large datasets. Also the savings on the RAM side might not be so significantly to apply always an aggregation where you don't need the granularity below it.
Beside this you could also use an incremental load-approach by aggregating the data. On the end of the following posting you will find a lot of examples about incremental loadings by using exists(): Advanced topics for creating a qlik datamodel.
- Marcus
Marcus, thank you for providing me with the link!
These topics are very helpful for me and the other guys who are new with QlikView.
You have done a great job collecting all these useful links in one post.
Thank you!
Victor.
Well... I have noticed the same,
when I load data from DB it is much faster when I do aggregation on SQL lvl then it is done while I load the same data from QVD.
the difference on my data set exaple was 8 sec (SQL), 30 sec (QVD)