Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
One of our Qlik Community Buddy (Rupinder Singh) shared an awesome way to optimize the performance of group by statement but I don't think this got enough lime light which it should have. In order to give it better exposure, I am creating this document from this thread Optimize Group By
For below I am using QV version - 11.2 SR4
I have always had trouble with aggregating my data as "Group by" statements in scripts have always taken too long for my comfort.
For example, to group by 23 million records (2 columns - ID and Sales) it takes me 10 mins.
Table_Group:
Load ID,Sum(Sales)
Resident Table
Group By ID;
Instead of this, If I Order by my data by ID and then run the same Group by clause, I get results backs in less that 2 mins.
Table_Sort:
Load ID,Sales
Resident Table
Order by ID;
Table _Group:
NoConcatenate Load ID,Sum(Sales)
Resident Table_Sort
Group By Table_Sort;
I would have believed QV to internally perform this step, but explicitly implementing this has improved my query times by 500%.
I have tables with 40 columns (36 of which to be grouped by and 4 to be summed) and the query time has gone from 18 to 6 mins.
I have just ensured that my Order By column list is the same as Group By Column List.
So, If Order by is "Order By Column1,Column2,Column3" then Group by is "Group By Column1,Column2,Column3"
I would love to hear back if others in the community also have faced similar challenges and whether this or any other method has helped them improve Group by Performance.
Interesting. How long does it take to run the initial Order By portion of this?
I have not checked the individual component, but this is what I found
Only Group By - 15 minutes
Order By followed by Group By - 5 minutes
If you think you would want me to break it down, I can check the log files to see what the break up was, but seeing this improvement, I didn't even think it was necessary to check the Order By time vs Group By time
Not necessary Sunny, I can play and do some of my own analysis...I have some scenarios like this I should probably optimize this way. I find it very interesting though, and appreciate that I was reminded of this.
Hi,
I found this to be very interesting so I ran some tests on Qlik Sense November 2017. I used a 37 mil. dataset with 13 fields, 8 used in group by, 4 summed and 1 counted(distinct).
The Group By statement was really executed faster by almost a half, but the total time (with ordering the dataset first) was just a little less - 13:30 minutes instead of 15 minutes of just group by.
And, when I ran the two loads without the count(distinct) aggregation, the simple Group By was even faster then the total time with ordering first (10:10 instead of 10:30 minutes).
So I guess the more complex the aggregation (and more data), the more it makes sense to order first. And, there cannot be any other fields loaded, only those used in Group By or aggregated, because the ordering will actually take longer than what is "saved". And, any "Where" statement will kill the performance too.
Very good ! I'll test it !
Remind me of another post related to loading performance
Thanks for sharing this. I do use this technique a lot for doing qvw optimized loads.
Just want to understand working behind less time.
Why does taking a resident of sorted data before doing actual group by takes less time.
Below load statement
Table_Group:
Load ID,Sum(Sales)
Resident Table
Group By ID
order by ID;
will be performing the same operation won't it ?
I don't know what logic it takes, but there are some questions that can only be answered by testing or if somebody from Qlik can answer them. Let's see what hic thinks about this and if he can share some of his knowledge and expertise on this matter.
No.
As fas as I can see/find out, Order By in a Group By Load has NO effect.
Yes, I think that's strange too. But it would explain a lot of the mysteries surrounding this optimization.
Good documentstalwar1. I was also used this technique in my previous project.