jonvitale
Creator III

Replacing If/Else in Expressions

Hello everybody,

I have an app that is not performing too well. In large part I believe this is due to overly complex expressions. To some degree this is inevitable based upon our business rules, but I think there are opportunities to optimize.

In particular, I would like to remove as many if/else statements from the expressions as possible, because I've read that these are processing-heavy. But I'm having difficulty removing them in some cases and wonder if anyone has a tip for these scenarios:

1. I have a requirement to hide measure values if the size of the group I'm measuring falls below a threshold (these are survey results, so we don't want to draw conclusions if the sample size is too low). I do something like this:

   If(Count({SOME_SET} Respondents) < 10, Null(), Avg({SOME_SET} Score))

2. I might have some broad categories for response levels, e.g., Excellent, Good, Fair, Poor, that I need to calculate on the fly based upon the average scores in a measure. So, I might have a measure like this:

If(Avg({SOME_SET} Score) < 5, 'Poor',
If(Avg({SOME_SET} Score) < 7, 'Fair',
If(Avg({SOME_SET} Score) < 9, 'Good', 'Excellent')))

I typically have Avg({SOME_SET} Score) in a variable, so that might help with caching, but maybe there is a better approach to this. I also use this approach for color categories, where there can sometimes be 10+ classes.

Thanks
9 Replies
marcus_sommer

Your first example probably can't be optimized much within the UI, but I think you could do this count within the script and/or create an appropriate 0/1 flag-field.
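
For illustration, a minimal script sketch - the table and field names (Responses, GroupID, RespondentID) are hypothetical placeholders for whatever your model uses:

GroupFlags:
LOAD
    GroupID,
    If(Count(DISTINCT RespondentID) >= 10, 1, 0) as ShowGroupFlag  // 1 = group meets the threshold
RESIDENT Responses
GROUP BY GroupID;

The UI expression then shrinks to set analysis, e.g. Avg({<ShowGroupFlag={1}>} Score), with no Count() at chart time - though this only fits if the threshold check doesn't depend on the current selections.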

But your second example could be replaced with something like this:

pick(floor(Avg({SOME_SET} Score)) + 1,
     'Poor', 'Poor', 'Poor', 'Poor', 'Poor',
     'Fair', 'Fair', 'Good', 'Good', 'Excellent', 'Excellent', ...)

which calculates the average only once, not multiple times, and then picks the appropriate return value from the list.
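
As a quick worked check, assuming the scores run from 0 to 10: an average of 6.4 gives floor(6.4) + 1 = 7, and the 7th entry in the list above is 'Fair'; an average of 9.0 gives index 10, which is 'Excellent'. The list simply needs one entry per integer step of the score range.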

- Marcus

jonvitale
Creator III
Author

Thank you @marcus_sommer ,

Yeah, that's a good suggestion for #2. Also, if I create a parameterized variable, which acts like a function, I might be able to create a reusable mapping from score => level.
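
Something like this, say in a variable v_score_level, where $1 is the aggregation passed in - untested, and note the passed expression can't contain commas, since commas delimit variable parameters:

pick(floor($1) + 1,
     'Poor', 'Poor', 'Poor', 'Poor', 'Poor',
     'Fair', 'Fair', 'Good', 'Good', 'Excellent', 'Excellent')

called like $(v_score_level(Avg({SOME_SET} Score))).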

Since I've got you here, with regard to #1, do you think doing something like the following would be worthwhile? This would not nullify values below the threshold but "zero-ify" them. Then again, maybe all these extra functions wouldn't achieve any performance benefit - I don't really have any conception of how much if-else hurts performance, i.e., is the juice worth the squeeze:

RangeMin(1, Floor(Count({SOME_SET} Respondents) / 10)) * Avg({SOME_SET} Score)

Explanation: I basically turn the "thresholding" expression into a 0 or 1 and multiply it by the actual expression I care about.
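
As a quick sanity check of the idea: with 7 respondents and an average score of 8.2, Floor(7/10) = 0, RangeMin(1, 0) = 0, and 0 * 8.2 = 0; with 23 respondents, Floor(23/10) = 2, RangeMin(1, 2) = 1, and the average passes through unchanged.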

marcus_sommer

I think it doesn't change much - at least not from a performance point of view, because it also executes both aggregations, which are the heavy parts of the calculation. Whether the final check is performed with a direct if() or with a boolean multiplier probably isn't significant - although with larger datasets ... you may just try it.

Instead of your rangemin approach you could also use:

-(Count({SOME_SET} Respondents) >= 10) * Avg({SOME_SET} Score)

The leading minus is needed because a true comparison returns -1 in Qlik. As already mentioned above, if the count-check doesn't depend mainly on the selections, you may transfer the logic into the script.

- Marcus

jonvitale
Creator III
Author

@marcus_sommer , thanks, that's a smart expression - much cleaner. Two quick follow-ups:

1. How does expression caching work with parts of an expression? For example, if Count({SOME_SET} Respondents) is being calculated somewhere else on the sheet (as it typically would be - we like to expose the denominators), does it cost less to use it within a more complicated expression? Would it help if it were embedded in a variable like v_count_respondents that is used in multiple expressions?

2. Do you have any links that address why if-else causes performance issues? I keep reading it stated verbatim, but I have no actual mental model of what's going on under the hood. My thought was that a somewhat more complicated but singular expression would produce a single hypercube and therefore not be such a burden, whereas the if-else statement creates multiple hypercubes and costs a lot more. That's a complete guess, though.

Thanks

marcus_sommer

Caching is only applied to complete expressions, not to expression parts - meaning that if you had such a check-count within multiple expressions, it would be calculated multiple times. Also important here is the dimensional context in which the expression is applied and, of course, the selection state - each change leads to a re-calculation, unless it was already done (anywhere within the application, and partly also within different applications - on the server, of course) and the hash value is identical (that's also the reason why, for example, =sum(Field) and sum(Field) and SUM(Field) are different expressions with regard to their hash value). Therefore it's useful to use variables for redundant expressions, to ensure they are identical in all places - but they are just a string replacement for the expressions and won't save any further performance by themselves.
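
To make that concrete - a small sketch using the v_count_respondents variable you suggested, defined as the plain string Count({SOME_SET} Respondents):

// object A, measure:
$(v_count_respondents)

// object B, the same measure - it expands to identical text and therefore
// an identical hash, so the cached result can be reused:
$(v_count_respondents)

// but embedded in a larger expression it becomes a different complete
// expression with a different hash, and is calculated again:
-($(v_count_respondents) >= 10) * Avg({SOME_SET} Score)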

A simple if() won't really cause performance issues, but with nested if()s that may change. The reason is that Qlik first evaluates every branch within the if-construct, and only in a second step compares the results to decide which conditions are TRUE or FALSE. This behaviour is quite different from other tools, which calculate the first condition, check whether it's true/false and exit the calculation if it's true, otherwise move on to the next condition, and so on.

Both approaches have their pros and cons. The Qlik way of doing it has the benefit that all included calculations can be done in one workflow, without the need to hop out to perform the check and back in to calculate the next ... and so on. With just a few conditions it's an advantage, but if there are dozens of conditions it can slow things down quite heavily. Just regard it as a friendly hint from Qlik that you shouldn't use (too) many conditions ... quite often there are better ways to solve the task behind it.

- Marcus

jonvitale
Creator III
Author

@marcus_sommer thanks again for the attention. I appreciate your response but I'm not quite sure I get it - could you walk me through this example below, and then I'll stop bothering you!

If(E1, E2, If(E3, E4, If(E5, E6, E7)))

It seems like you are saying that Qlik will evaluate all the way down to the bottom of the tree (i.e., E6 and E7) before it ever evaluates E1, which determines which branch to use, whereas other tools will evaluate E1 before descending into another level. Is that right?

marcus_sommer

Yes - at first E1 through E7 are all calculated, and then in a second step comes the evaluation of which condition is true, returning the appropriate THEN part.

AFAIK - though it's only hearsay - most other tools (not only BI tools but also databases, office software and programming/scripting languages) do it in a descending, iterative way, which means they treat the if-construct really like a loop routine. If you use for/while loops or similar in Qlik (within the script), they will also run step by step through each iteration until the exit condition is true.

The reason why Qlik treats an if() differently is that the function is modular - the calculations are performed by the calculation engine, which is part of the Qlik core, while the evaluation of whether a condition is true happens at least one software layer above it. This means Qlik would need to hop back and forth between the layers, which costs a few processing cycles each time until Qlik and the OS schedule the next calculation (I'm far from expert enough to explain in more detail how the CPU processing works at the lowest levels - but I hope you can comprehend the general direction).

Therefore Qlik's behaviour - or rather its very early design decisions - is the opposite of immature: it's completely dedicated to speed (much the same reason why they use a binary number format which only allows a maximum of about 15 significant digits per number). In many scenarios for which BI tools are used it's a real benefit, and the reason why Qlik applications are usually much faster than other tools. Of course it also has disadvantages, but no tool is really suitable for all use cases.

- Marcus

jonvitale
Creator III
Author

@marcus_sommer  (sorry, I said I'd stop pestering you, but...)

This is very helpful. One take-away I have from this: we've been doing something in-house called a "Master Variable", where we essentially have one parameterized variable that is used almost everywhere. So, we might apply parameters for what general kind of calculation it is (e.g., a count, avg, pct-of-total), the timeframe (cytd, pytd, all), and some subset of the data (e.g., male, female). Like this:

variable def:

if ($1 = 'avg', avg({<gender={'$3'}, ytd={$2}>} [score]),
if ($1 = 'count', sum({<gender={'$3'}, ytd={$2}>} [scored_flag]),
if ($1 = 'pct-of-total', sum({<gender={'$3'}, year={'$2'}>} [score]) / sum(total {<gender={'$3'}, year={'$2'}>} [score]) )))

used like:

$(v_mastercalc(avg, 2020, male))

It seems like, from what you are saying, this would be bad practice, because all of these expressions are going to get evaluated, not just the one for which the if-condition is TRUE. Is that right?

If so, it seems like our master variables are okay if they can use the parameters in the set analysis (e.g., gender, year), but not if they need to be separated into branches of an if-statement. True?

As a bit of an aside - something I've been wondering about parameterized variables (because the documentation on them is very minimal): how do parameterized variables get cached? For example let's say I have two measures in a chart which reuse the same parameterized variable, but with different parameters, like:

$(v_mastercalc(avg, 2020, male))

$(v_mastercalc(avg, 2020, female))

Does caching occur after all parameter placeholders ($1, $2, $3...) are substituted with actual values? In that case each of the above would get its own caching hash value, so if they are used in multiple locations they won't be calculated twice. Or maybe parameters throw the whole thing off and they don't get cached at all - in which case we really shouldn't use them.

Thanks again

marcus_sommer

In general it's good practice to use variables, and applying parametrized variables widens the possibilities a lot - as long as they support the main aim of variables, which is to reduce redundancy and to simplify things. Using variables for (nearly) all fields, labels, expressions and so on, and/or attempting to combine or even nest multiple variables, easily goes in the opposite direction by increasing the complexity. Even if something is technically possible, it might not be sensible at all - if the time to develop, maintain and document these logics is higher than just creating objects and expressions the "normal" way.

In your case it's not the general use of parametrized variables that is unsuitable, but rather the way it's done. As you assumed, all branches of the if() will be calculated/evaluated - but you don't need the if() at all, because you can directly define what should be calculated. For this you may change your variable to:

$1({<gender={'$3'}, ytd={$2}>} $4) / alt($1({<gender={'$3'}, ytd={$2}>} $5 $4), 1)

and then you call:

$(v_mastercalc(avg, 2020, male, [score], ERROR))

or

$(v_mastercalc(sum, 2020, female, [score], TOTAL))

This means there is no extra query on the value of $1 - the value is used directly as the aggregation function. Further, the field to be aggregated is also passed in ($4), and the last parameter ($5) handles the percent part: it either enables a TOTAL calculation or returns a forced error, which is captured by the alt().
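
For illustration, substituting the parameters into the template, the first call above expands to:

avg({<gender={'male'}, ytd={2020}>} [score]) / alt(avg({<gender={'male'}, ytd={2020}>} ERROR [score]), 1)

in which the ERROR token makes the denominator invalid, so the alt() returns 1 and only the plain average remains - whereas the TOTAL variant expands to a valid percent-of-total division.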

The percentage part is an example of what I meant above: it's technically possible but not very sensible, because it needs extra logic and always forces an extra calculation/evaluation of the denominator for all normal aggregations - even though that aggregation is never actually performed, because of the invalid syntax. My recommendation here would be to use two specialized variables.
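
A rough sketch of that split - the variable names are just suggestions:

// v_calc - plain aggregation:
$1({<gender={'$3'}, ytd={$2}>} $4)

// v_pct_total - share of total:
$1({<gender={'$3'}, ytd={$2}>} $4) / $1({<gender={'$3'}, ytd={$2}>} TOTAL $4)

called as $(v_calc(avg, 2020, male, [score])) respectively $(v_pct_total(sum, 2020, female, [score])).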

I'm not sure whether variables are cached at all - I never tried to measure it. If they are, it would only be significant for globally calculated variables, meaning variables which start with an equals sign, like:

= sum(Field)

in which case the variable contains the calculated value. As long as you use only:

sum(Field)

the variable contains just a string. This string is replaced later, within the object, when it is activated - or wherever else the variable is called. In this case the variable is just a placeholder for the real content. Further, all this variable substitution is evaluated before the real object is calculated - it becomes a part of the object. Therefore there is nothing at the variable level that could be cached.

Most of the caching happens not between multiple objects/expressions - that would mean they have redundant views/logics, which is rather seldom - but rather the caching is important for the back and forth of user selections and/or the activation of objects. The longer and more intensively a user works within an application/session, the more overlap with already-calculated views there will be.

- Marcus