
Question about improving the Aggr function performance

Hi,

We are currently working with a very large data set (more than 500 million rows) of web data. We want to measure how often a unique visitor returns to the website. This works well, but we have a performance problem with one of the elements within the document.

The problem is with the frequency table, in which we want to display the number of times visitors have returned. The chart sometimes takes 50 seconds to load. We have a 24-core CPU with 200 GB of memory, and we cannot upgrade this hardware due to support issues at our datacenter.

The frequency chart consists of one dimension:

=Aggr(Count(unique_impressions), user_id)

and

one expression:

=count(distinct unique_impressions)

We have already pursued several different approaches to fix this. We first tried a group by on user_id_impressions, but this caused a problem since we lost the ability to drill down on other dimensions.

The data field was also a string due to its sheer size (the values are actually numbers, but QlikView doesn't seem to auto-detect this). We have tried to convert it using three different functions:

- We converted the values using the hash128() function; this produced another string and didn't improve the loading times.

- We converted the values using the num(num#()) function, but this removed all the string data, so it did not produce the same result.

- We converted the values using the autonumberhash128() function; this didn't produce any performance gain, and we think it will also cause problems in the incremental load script (because the autonumber sequence will start at 1 again, creating issues with identifying unique users over time).
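For reference, one variant of the num(num#()) approach that keeps the non-numeric values intact is to fall back to the original string with Alt(). A minimal sketch of the idea; the table name, field names, and QVD path here are placeholders, not taken from the original post:

```
// Convert numeric-looking strings to real numbers, but keep
// genuine text values unchanged instead of dropping them.
// Alt() returns the first argument that is a valid number,
// otherwise the last (string) alternative.
Fact:
LOAD
    Alt(Num#(unique_impressions), unique_impressions) AS unique_impressions,
    user_id
    // ... other fields ...
FROM facts.qvd (qvd);
```

This would only help if most values really are numeric; mixed numeric/string fields still cost more memory than a pure numeric field.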

Maybe you guys have a solution to this problem.

Kind Regards,

Victor

2 Replies
erichshiino
Partner - Master

I don't think this is going to solve it, but I hope it can improve it:

Count(distinct something) is a very intensive expression (the same is true when you run something like this directly on a database). I'm not sure if this is the field you are trying to convert (unique_impressions).

But you could create a field in the script to avoid the count distinct.

Let us say that this field is part of a 'Fact' table.

I would create another table like this:

load distinct unique_impressions, 1 as flag_impression

resident Fact;

So there is only one value '1' for each distinct unique_impression. After this, you replace count(distinct unique_impressions) with sum(flag_impression).
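Put together, the script fragment could look like this. This is only a sketch: the table label 'ImpressionFlags', the QVD path, and the field list are assumptions, not from the original post:

```
// Main fact table, loaded as usual (path is a placeholder).
Fact:
LOAD user_id,
     unique_impressions
     // ... the other dimension IDs ...
FROM facts.qvd (qvd);

// One row per distinct impression, each carrying a flag of 1.
// The table associates with Fact on unique_impressions.
ImpressionFlags:
LOAD DISTINCT
    unique_impressions,
    1 AS flag_impression
RESIDENT Fact;
```

In the chart expression, Count(distinct unique_impressions) is then replaced with Sum(flag_impression), which avoids the distinct-count work at chart render time.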

Hope it helps,

Erich Shiino


Hi Erich,

The unique_impressions are indeed part of the fact table. But wouldn't this solution kill the segmentation options (there are also 7 dimension IDs in the fact table)?

Victor