I am searching how to speed up a Group By operation in script.
In my opinion, the Group By operation in Qlikview is not really good:
- high RAM consumption
- only 1 CPU / Core used in server mode (it seems that in desktop all the CPU can be used in parallel. Someone can confirm this point ?)
- slow, especially regarding what a relational database can do, and how fast Qlikview can do a sum on the fly in a graph
In summary : if somebody knows how to speed up a Group By, please let me know. It could be
- avoid calculations
- avoid doing a group by direct from QVD
- avoid where clause
- do group by on ID, not on string
this kind of things.
Thanks
I have first suspected that the issue comes from the number of columns, so I have created a bench:
50 M rows
23 columns:
id : the unique id of the row
Col 1 : a column that contains 0 or 1, first row is 0, second is 1, third is 0, fourth is 1, etc.
Col 2 : a column that contains 0 or 1, first row is 0, second is 0, third is 1, fourth is 1, etc.
Col 3 : a column that contains 0 or 1, first row is 0, second is 0, third is 0, fourth is 0, fifth is 1, etc.
...
Col 20 : same logic as above
Col Key : an autonumber of Col 1 & Col2 & ... & Col 20 : about 1M distinct values
Value : the value to sum : 1000 distinct values
The table looks like:
id
Col1
Col2
Col3
…
Col 20
Col Key
Value
1
0
0
0
…
0
1
1
2
1
0
0
…
0
2
2
3
0
1
0
…
0
3
3
4
1
1
0
…
0
4
4
5
0
0
1
…
0
5
5
6
1
0
1
…
0
6
6
7
0
1
1
…
0
7
7
8
1
1
1
…
0
8
8
…
…
…
…
…
..
…
…
50M
1
1
1
…
1
1M
1000
My test was, with Desktop:
- do a sum(Value) Group By Col1,... Col 20 : 50M source, 1M in result, 4mn30s
- do a sum(Value) Group By Col Key : 50M source, 1M in result, 5mn
the 2 Group BY have kept same result, same number of rows, so the data and the test are consistent.
Strangely, the Group By on 20 columns was FASTER than the Group By on only 1 column, even if the the order of the test was switched (to avoid bias from cache, if any)
So I have thought that maybe the algorithm used by Qlikview to do Group By in script was not the best one, maybe because some useful operators are allowed and wouldn't be usable with a more powerful algorithm.
So I have modified my tests:
- Test in server, where I am sure only 1 CPU/Core is used for a refresh
- do a sum(Value) Group By Col 1,... Col 20 : 50M source, 1M in result, 34mn
- do a sum(Value) Group By Col 20,... Col 1 : 50M source, 1M in result, 29mn
- do a sum(Value) Group By Col Key : 50M source, 1M in result, 26mn
My opinion is that the a Group By is more a job for a cube than for Qlikview, because all values are pre-calculated for all combinations of dimensions involved in the Group By. In a graph, Qlikview has identified which rows are involved in the calculation, thanks to the selection and the set analysis, and just sum these rows. It could explain why the sum is more efficient in a graph than in a Group By in script.
Moreover, I think that the number of distinct values of a column used in a Group By impact the Group By. I wonder if Qlikview do the Group BY row-by-row, with a recursive calculation on each column involved or if it uses a better algorithm.