Question about improving the Aggr function performance
We are currently working with a very large data set (more than 500 million rows) of Web Data. We want to recognize how often a unique visitor vists the website again. This works well but we have a performance problem with one of the elements within the document.
We have a problem with the frequency table. In this frequency table we want to display the number of times the visits have returned. The graph sometimes takes 50 seconds to load. We have a 24 core CPU with 200 gig of Memory, we can not upgrade this hardware due to support issues at our datacenter.
The frequency diagram consists out of one dimension:
We have already persued several different approaches to fix this. We first tried a group by user_id_impressions, but this caused a problem since we lost the ability to drill down on other dimensions.
The data field was also an string due to it sheer size (it are actually numbers but qlikview doesn't seem to auto detect this), we have tried to convert it using three different functions:
- We converted the values using the hash128() function, this produced another string and didn't improve the loading times
- We converted the values by using the num(num#) function, but this removed all the string data, thereby not creating the same result.
- We converted the values with using autonumberhash128() function and this didn't produce any performance gain and we think it will also cause problems in the incremental script (because the autonumber will start with 1 again, thereby creating issues with identifying unique users over time).
I don't think this is going to solve it, but I hope it can improve it:
The count (distinct something) is a very intense expression ( it's the same when you run something like this directly on a database). I'm not sure if this is the field you are trying to convert ( unique_impressions).
Buy you could create a field on the script to avoid the count distinct.
let us say that this field is part of a 'Fact' Table.
I would create another table like this:
load distinct unique_impressions, 1 as flag_impression
So there is only one value '1' for each distinct unique_impression. After this, you replace the count(distinct unique_impressions) by sum( flag_impression).