Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Join us in Toronto Sept 9th for Qlik's AI Reality Tour! Register Now
cancel
Showing results for 
Search instead for 
Did you mean: 
nstefaniuk
Creator III
Creator III

[10] Speed up group by operation in script

Hello All,

I am searching how to speed up a Group By operation in script.

In my opinion, the Group By operation in Qlikview is not really good:

- high RAM consumption

- only 1 CPU / Core used in server mode (it seems that in desktop all the CPU can be used in parallel. Someone can confirm this point ?)

- slow, especially regarding what a relational database can do, and how fast Qlikview can do a sum on the fly in a graph

In summary : if somebody knows how to speed up a Group By, please let me know. It could be

- avoid calculations

- avoid doing a group by direct from QVD

- avoid where clause

- do group by on ID, not on string

this kind of things.

Thanks

I have first suspected that the issue comes from the number of columns, so I have created a bench:

50 M rows

23 columns:

     id : the unique id of the row

     Col 1 : a column that contains 0 or 1, first row is 0, second is 1, third is 0, fourth is 1, etc.

     Col 2 : a column that contains 0 or 1, first row is 0, second is 0, third is 1, fourth is 1, etc.

     Col 3 : a column that contains 0 or 1, first row is 0, second is 0, third is 0, fourth is 0, fifth is 1, etc.

...

     Col 20 : same logic as above

     Col Key : an autonumber of Col 1 & Col2 & ... & Col 20 : about 1M distinct values

     Value : the value to sum : 1000 distinct values

The table looks like:

idCol1Col2Col3Col 20Col KeyValue
1000011
2100022
3010033
4110044
5001055
6101066
7011077
8111088
..
50M11111M1000

My test was, with Desktop:

     - do a sum(Value) Group By Col1,... Col 20 : 50M source, 1M in result, 4mn30s

     - do a sum(Value) Group By Col Key : 50M source, 1M in result, 5mn

the 2 Group BY have kept same result, same number of rows, so the data and the test are consistent.

Strangely, the Group By on 20 columns was FASTER than the Group By on only 1 column, even if the the order of the test was switched (to avoid bias from cache, if any)

So I have thought that maybe the algorithm used by Qlikview to do Group By in script was not the best one, maybe because some useful operators are allowed and wouldn't be usable with a more powerful algorithm.

So I have modified my tests:

     - Test in server, where I am sure only 1 CPU/Core is used for a refresh

     - do a sum(Value) Group By Col 1,... Col 20 : 50M source, 1M in result, 34mn

     - do a sum(Value) Group By Col 20,... Col 1 : 50M source, 1M in result, 29mn

     - do a sum(Value) Group By Col Key : 50M source, 1M in result, 26mn

My opinion is that the a Group By is more a job for a cube than for Qlikview, because all values are pre-calculated for all combinations of dimensions involved in the Group By. In a graph, Qlikview has identified which rows are involved in the calculation, thanks to the selection and the set analysis, and just sum these rows. It could explain why the sum is more efficient in a graph than in a Group By in script.

Moreover, I think that the number of distinct values of a column used in a Group By impact the Group By. I wonder if Qlikview do the Group BY row-by-row, with a recursive calculation on each column involved or if it uses a better algorithm.

Thanks

0 Replies