gmenoutis
Partner - Creator II

Set analysis vs precalculated column

I would like to know which of the following is more performant for calculating an expression on a certain data subset:

  • Using set analysis / a set modifier to select the desired subset
  • Creating a new, dedicated field in the data warehouse that has the data we need for the subset records and zero/null for the others.

Dummy example: calculate the SUM of all [Units] sold that have [Color] = Green:

1) SUM({$<[Color]={Green}>} [Units])

2) On script load, have an additional field [Green Units] calculated as If([Color]='Green', [Units], 0)

then the expression is simply SUM([Green Units])

Which is faster and/or lighter on resources?

1 Solution

Accepted Solutions
Miguel_Angel_Baeyens

Right, calculating additional columns and joining them in when possible is recommended if resources permit. The most expensive part here is the time it takes to JOIN; once it is done, the footprint in memory will not be much higher. It is the JOIN operation that is resource-consuming, not the result.

The more tables that are part of the calculation, the more CPU it will require. Another word of caution here: some parts of the calculation are single-threaded, so having more CPUs, or more powerful ones, will not guarantee a better result.

This is a very nice overview: The Calculation Engine
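
As a rough sketch of this recommendation (table, field and file names below are illustrative assumptions, not taken from the thread), the precalculated column from the question could be built once in the script and joined onto the fact table:

Facts:
LOAD OrderID, Color, Units
FROM [lib://Data/Facts.qvd] (qvd);

// The JOIN is the expensive step; the extra column adds little RAM afterwards.
LEFT JOIN (Facts)
LOAD OrderID,
     If(Color = 'Green', Units, 0) AS [Green Units]
RESIDENT Facts;

The chart expression then needs no set modifier: Sum([Green Units]).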


13 Replies
bwisealiahmad
Partner - Specialist

Hi,

I believe the second option would be more efficient in terms of performance, as it's a plain sum and the engine doesn't have to do anything else.

That being said, set analysis is very efficient, so it really depends. If you need to work with several different colors, then I think it's fine to use the set analysis expression instead of one column for each color.

Best,

Ali A

balabhaskarqlik

The second one is the faster approach: you're filtering out the non-required data in the script and decreasing the application size as well as the load time, so the downstream expressions won't consume more resources.

With set analysis we do predefined calculations as required after the data model has been created; only the required data is loaded into the application.

gmenoutis
Partner - Creator II
Author

I'm not sure I agree with what you say.

Bala Bhaskar wrote:

The second one is the faster approach: you're filtering out the non-required data in the script and decreasing the application size as well as the load time

I am doing nothing to filter data. The rows are the same, and, rather the opposite of what you describe, I am adding a column. I am not leaving out the original [Units] column, nor am I loading with WHERE [Green Units] >= 0 or anything like that. The application size should obviously get bigger.

sunny_talwar

I think option 2 will work faster, as it doesn't need to filter on anything and can get straight to the business of summing the field... but remember that by adding a new field (or fields) you are probably increasing the size of the dashboard. This might be okay for a few fields, but as their number grows, it might start to impact performance. So, even though option 1 might not be the fastest, I would still choose it over option 2.

Having said all this, I would love to hear what experts have to say about this

pcammaert, rwunderlich, hic

Peter_Cammaert
Partner - Champion III

If only speed is concerned, option 2 must be the fastest. The size of the difference depends on the amount of data to process: the cost of working out the set-analysis-controlled subset is spread over more rows as the table grows, while the fixed setup cost of analysing the set analysis expression stays the same and therefore weighs more heavily with fewer rows. But then we're talking about fractions of microseconds.

If performance in general is concerned, you should balance all effects just like Sunny did.

marcus_sommer

Which way is the most suitable depends on various things, but quite probably I would choose - like Sunny and Peter - the set analysis approach, because it's usually very fast even with large datasets. Further, I assume that there are other colors, too, and if you create a new field for each one, you will not only clutter your data model but quite probably also decrease performance.

If you have no serious performance issues, the set analysis will definitely work well. If you want to improve it a bit, you could swap the 'Green' string value for a numeric value like 1.

- Marcus
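
For illustration, a minimal sketch of that numeric-flag idea (table and field names are assumptions, not from the thread):

// Derive a numeric flag once in the script:
Colors:
LOAD Color,
     If(Color = 'Green', 1, 0) AS IsGreen
FROM [lib://Data/Colors.qvd] (qvd);

// The chart expression then compares numbers instead of strings:
// Sum({< IsGreen = {1} >} Units)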

Miguel_Angel_Baeyens

Subscribing to Sunny's, Peter's and Marcus's comments: is performance good enough, or does it require fine tuning?

The second is faster for the front-end developer (simpler expressions) and for the user, but has a tradeoff in loading time, assuming the operation is not as obvious as in the example and that the transformation happens in the loading script. Anyway, this tradeoff is usually preferable to longer response times for the users.

The size of the resulting tables, or the RAM footprint, will not be an issue unless we are talking about hundreds of millions of rows and dozens of new columns. As a rule of thumb, more columns generally perform better than more complex expressions.

Even set analysis requires filtering in addition to the aggregation, which takes time; more so if the fields used in the set analysis modifiers are in different data tables than the fields on which the aggregation happens (e.g. Color in a dimension table and Units in a fact table). It also uses somewhat more CPU than a straight aggregation. Cache/RAM will not vary substantially in either case.

And definitely, as suggested above, use numeric values as often as possible.
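
To illustrate both points, here is a sketch that keeps a numeric flag in the same table as Units (all table, field and mapping names are assumptions for the example):

// Map each Color to a numeric flag...
ColorFlagMap:
MAPPING LOAD Color,
     If(Color = 'Green', 1, 0) AS IsGreen
FROM [lib://Data/Colors.qvd] (qvd);

// ...and store the flag directly in the fact table, next to Units.
Facts:
LOAD OrderID,
     Color,
     Units,
     ApplyMap('ColorFlagMap', Color, 0) AS IsGreen
FROM [lib://Data/Facts.qvd] (qvd);

// The set modifier now references a field in the same table as the aggregation field:
// Sum({< IsGreen = {1} >} Units)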

gmenoutis
Partner - Creator II
Author

Nice info there. I wish you'd clarify the following, though:

assuming the operation is not as obvious as in the example    

Which operation are you referring to? I guess the data warehouse's If()? As a matter of fact, the example is very close to the real case (actually, only the field names change), because for the set analysis option, to make sure I use only one flag, there is already such a field calculated in the data source output (SQL).

Even set analysis requires filtering in addition to the aggregation, which takes time.

The "even" part confuses me. In the second case, nothing will be filtered, but internally there will be additions with zero performed...right?

Miguel_Angel_Baeyens

Which operation are you referring to? I guess the data warehouse's If()?

I'm referring to whichever transformations are required during or after the load of the data. Loading data as it is from the source, without any transformation, is faster than doing any transformation in the script.

The "even" part confuses me. In the second case, nothing will be filtered, but internally there will be additions with zero performed...right?

The "even" refers to the fact that set analysis, when properly used, is extremely fast, faster than its equivalent If() clause:

Sum({< Color = {'Green'} >} Units)

will be faster than

Sum(If(Color = 'Green', Units))

Yet, set analysis is still a filter which has to be calculated and applied to the chart.

In the second case you mention, it's an If() in the script and a plain Sum([Green Units]) in the front end; there is no set analysis there. That's why most of us recommend the second approach if response times are an issue (it saves the time the set analysis may take to calculate); if they are not, set analysis will work well enough and you avoid transformations on the data load.
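
To make that contrast concrete, a short sketch under the same assumptions as the original example (table and file names are illustrative):

// Option 2: the If() runs once, at load time...
Facts:
LOAD *,
     If(Color = 'Green', Units, 0) AS [Green Units]
FROM [lib://Data/Facts.qvd] (qvd);

// ...so the chart only needs a plain aggregation:
// Sum([Green Units])

// Option 1 keeps the script untouched but filters at chart time:
// Sum({< Color = {'Green'} >} Units)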