Evaluating "sets" in the context of a dimension

    Below is a conversation between myself and John Witherspoon, which describes and validates how to use set analysis in an expression that reconciles to the individual values of a chart's dimensions.

    Gordon Savage:

    I've seen the comment a couple of times 'a set is only generated once for the entire table, not once per value of the dimensions'.

    I have a chart in a document which uses variable substitution, resulting in an expression containing set analysis that effectively resolves to the corresponding dimension value. The expression returns a count of a column in a table not directly related to the table in which the dimension exists. It's a bit complicated, but the users want to be able to de-select the 'heads' from the 'tails' in the data and vice versa, so the tables are not directly linked by a common field.

    This is the chart definition:

    Dimension:

    =if ([Wty Group Type] = 'Standard', [Month and Year Received])

    Expression:

    sum([Total REQ]) / $(vLinkVars)

    where vLinkVars is generated in the script from the referenced values as:

    if (LinkVar = 'v40483S', v40483S, if (LinkVar = 'v40391S', v40391S, if...........

    LinkVar is a column within the same table ('tails') as [Month and Year Received], where the numeric part represents the numeric value, e.g. v40483S:

    =ceil(sum(aggr(count({$<Link *= {'Nov 2010'}, [Wty Month Type] *= {'Standard'}>} distinct Machine)/Div, Div)))

    and where the columns 'Link' and '[Wty Month Type]' are within another table ('heads').

    How I think it works:

    Because the dimension [Month and Year Received] results in just one value of LinkVar in the expression, the conditional statement of vLinkVars will substitute the corresponding variable vNNNNNS, for which QlikView substitutes its associated expression. That expression definitively identifies the records to be counted, where the ('hard coded') value of 'Link' reflects the corresponding dimension without any actual aggregation occurring against the dimension (i.e. it returns a number that is indirectly related to the dimension value).

    Note that the maximum number of dimension values is 36 as it represents a rolling 3 year period. The number of rows subject to the set analysis is around 14,000.

    John Witherspoon:

    I'm not sure if I'm following, since it seems like your example has a lot of extraneous features. But the basic solution seems to be using a nested if to decide which set expression to use, and thus getting around the normal restriction of a set only being evaluated once for the entire table.

    To try to strip it down to basics, let's start with a much simpler data model:

    Sales:
    LOAD * INLINE [
    Customer1, Sales
    A, 10
    B, 20
    C, 30
    ];

    Customers:
    LOAD * INLINE [
    Customer2, Name
    A, Amy
    B, Bob
    C, Carla
    ];

    So here, Customer1 and Customer2 are not connected, even though they have the same values. Let's assume there's a good reason for leaving them disconnected, but in some particular chart, we want to display the customer ID, name AND sales together. People often try to do the connection between Customer1 and Customer2 with set analysis, perhaps like this:

    sum({<Customer1=Customer2>} Sales)

    That doesn't work, of course, unless we select a specific Customer2. So we often suggest this alternative:

    sum(if(Customer1=Customer2,Sales))

    That works, but if you have a lot of data, it's slow because it has to compare every row to every row (well, technically it doesn't have to, and maybe doesn't, but at best it's O(n log n), I believe, in the general case). So you're suggesting instead that we create a nested IF to determine which set to use. Something like this for our example data and chart, if we use Customer2 as a dimension:

    if(Customer2='A',sum({<Customer1={'A'}>} Sales)
    ,if(Customer2='B',sum({<Customer1={'B'}>} Sales)
    ,if(Customer2='C',sum({<Customer1={'C'}>} Sales)
    ,sum(Sales)))) // final sum(Sales) to get total value
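    The dispatch idea behind that nested IF can be sketched in Python (my own analogue for illustration, not QlikView): compute one aggregate per dimension value up front, then pick the matching aggregate for each dimension value, instead of comparing every fact row against every dimension value.

```python
# Rough Python analogue of "one set expression per dimension value" (not QlikView).
# "sales" plays the role of the Sales table with its disconnected Customer1 key.
sales = [("A", 10), ("B", 20), ("C", 30)]
customers = {"A": "Amy", "B": "Bob", "C": "Carla"}  # Customer2 -> Name

# One precomputed aggregate per value, like the generated
# sum({<Customer1={'A'}>} Sales) expressions.
per_customer = {}
for cust, amount in sales:
    per_customer[cust] = per_customer.get(cust, 0) + amount

# The chart: for each Customer2 value, dispatch to the matching aggregate,
# which is what the nested IF does per dimension value.
chart = [(c, name, per_customer.get(c, 0)) for c, name in customers.items()]
```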

    This expression can, and almost certainly should, be generated in script, like this:

    Expression:
    LOAD ' ' & concat(distinct 'if(Customer2=|' & Customer1 &'|,sum({<Customer1={|' & Customer1 & '|}>} Sales)','
    ,') & '
    ,sum(Sales)' & repeat(')',count(distinct Customer1)) as Expression
    RESIDENT Sales
    ;
    LET vExpression = replace(fieldvalue('Expression',1),'|',chr(39));
    DROP TABLE Expression;

    That seems to work, and I don't think it's a happy accident. It looks pretty straightforward to me, and like a reasonable technique.
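    The string-building step in that script has a straightforward analogue outside QlikView. As a sketch (plain Python, standing in for the concat()/fieldvalue() machinery), generating the nested-if text per distinct value looks like this:

```python
# Hypothetical analogue of generating the nested-if expression in script.
customers = ["A", "B", "C"]  # the distinct Customer1 values

# One "if(Customer2=...,sum({<Customer1={...}>} Sales)" branch per value...
parts = ["if(Customer2='" + c + "',sum({<Customer1={'" + c + "'}>} Sales)"
         for c in customers]

# ...joined with the ",", with a final sum(Sales) for the total row, and one
# closing paren per branch (the repeat(')', count(...)) in the QlikView script).
expression = "\n,".join(parts) + "\n,sum(Sales)" + ")" * len(customers)
print(expression)
```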

    Assuming I've understood what you're proposing, have you done any performance testing on it? I'm betting that for small numbers of values and large numbers of rows, this would be significantly faster than the normal sum(if(...)) approach, making it a true improvement over our standard advice, at least in some fairly common situations. I suspect that if the number of values is medium to large, it will be slower, but I think people typically want to do this when the number of values is small, such as on months, or customers, or types, or statuses - that kind of thing.

    The reason I think that is probably clear, but basically your new approach looks to be O(n), while the old approach is either O(n log n) or O(n^2) depending on how QlikTech coded it. Hmmm, it must be O(n log n) or it would simply fail on any reasonable size of data set. But there appears to be more overhead in the new approach, so assuming it's O(n) vs. O(n log n), it likely won't pull ahead until we get to some larger values of n.

    Using pick(match()) to select the set analysis expression might be slightly faster as it doesn't require nesting and should only need to evaluate the value of Customer2 once per row in the table. It's probably not different enough to show up in testing or ever be noticed, though.

    ,'if(dimensionality(),pick(match(Customer2,|' & concat(distinct Customer1,'|,|') & '|)'
    & concat(distinct '
    ,sum({<Customer1={|' & Customer1 & '|}>} Sales)') & ')
    ,sum(Sales))' as Expression2

    LET vExpression2 = replace(fieldvalue('Expression2',1),'|',chr(39));

    Gordon Savage:

    I don't know the mechanics of how QlikView constructs data matching with set analysis...

    Unfortunately I am never able to do any meaningful performance testing as the server running QlikView is not dedicated. I can say that before using set analysis I had a compound IF statement, and this took around 10x longer to render the chart.

    John Witherspoon:

    As far as the mechanics of data matching with set analysis, I think of it like actually making the selections. My understanding from a conversation with one of the senior tech people is that this is a reasonable way to think about it. So in my example of your conditional set analysis, think of it as "if Customer2=A, select A in Customer1, then sum(Sales), else if Customer2=B, select B in Customer1, then sum(Sales), else...". Since QlikView has a very fast engine for applying selections to the data set, it by default has a very fast engine for applying set analysis to the data set. But usually, the number of sets required to calculate a chart is small. That may not be the case here if our dimension has a lot of values, and QlikTech may not have anticipated using, say, 1000 separate sets to calculate a chart.
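    That "select, then aggregate" mental model can be written down as a toy sketch (my own illustration in Python; this is not how QlikView is actually implemented): applying a set expression is like filtering the rows to the selected values and then aggregating what's left.

```python
# Toy model: treat a set expression as a selection applied before aggregation.
sales = [("A", 10), ("A", 5), ("B", 20), ("C", 30)]  # (Customer1, Sales) rows

def sum_with_set(rows, selected_customer):
    """Analogue of sum({<Customer1={selected}>} Sales): select, then sum."""
    selection = [amount for cust, amount in rows if cust == selected_customer]
    return sum(selection)
```

Each distinct set in the chart triggers one such select-and-sum pass, which is why a dimension with many values means many separate sets.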

    John Witherspoon:

    I did some testing. At first, it was a performance disaster, and trying to figure out what was going wrong is what took me so long. Both approaches were clearly dominated internally by a loop within a loop - for each value, for all rows (not just matching rows), causing them to go VERY slowly as the number of values and rows increased.

    That IS the obvious way to process the sum(if()) internally:

    for each value
        set total to 0
        for each row
            if the row matches the value
                add to total
            end if
        end for
        display value and total in the chart
    end for
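    That pseudocode translates directly into runnable form. Below is a Python rendering of the naive loop, followed by one of the "less obvious but faster ways": a single pass over the rows instead of one full scan per value. (Both are my own sketches of the general technique, not QlikView internals.)

```python
# Naive sum(if()) processing: for each value, scan every row. O(values * rows).
def naive_totals(values, rows):
    totals = {}
    for value in values:
        total = 0
        for cust, amount in rows:
            if cust == value:  # compare every row to the current value
                total += amount
        totals[value] = total
    return totals

# Less obvious but faster: one pass over the rows, accumulating per value.
# O(values + rows) instead of O(values * rows).
def single_pass_totals(values, rows):
    totals = dict.fromkeys(values, 0)
    for cust, amount in rows:
        if cust in totals:
            totals[cust] += amount
    return totals
```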

    There exist less obvious but faster ways, though, so this still took me by surprise.

    The conditional set analysis took me even more by surprise. The most obvious way to process the conditional set analysis shouldn't exhibit this sort of behavior. Also, you were reporting a 10x improvement in speed, where I was initially seeing it going slower than the simple sum(if()).

    One feature of your solution that I hadn't used was variables. This seemed like an unnecessary complication on the surface, but it appears to be KEY to making your solution perform, even if I can't fully explain why. Once I started generating the variables, the performance picked right up. The more values and the more rows being processed, the better the conditional set analysis with variables compares to the sum(if()). It was going well over 10x faster as the numbers increased, right up to the point where my computer started choking on the size of everything. I was going up by factors of 10 on both values and rows. Performance massively degraded (though it still returned the answer eventually) with 1,000 values and 100,000 rows, or 100 values and 10,000,000 rows. Up to that point, I was seeing sub-second response times on the chart.

    I did all of the final testing in version 10, but saw enough of it in version 9 SR6 to be pretty confident that it works OK in that version as well. Version 10 was just faster in every way, particularly at loading up large amounts of random data, so it was just more practical to use it for the testing.

    My script is below. I should note that I eliminated the match() from my expressions to improve performance. In my case, I was using a sequential integer customer ID so this was easy. Out in the real world, you wouldn't have an integer value for each of your dimension values, but you could easily assign one to every unique combination of the chart's dimensions in script, so I think it's practical to eliminate the match() if you have enough values to degrade the performance of the match().
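    The "assign an integer to every unique combination of the chart's dimensions" step has a simple analogue. In QlikView script this is roughly what autonumber() gives you; the Python below is just an illustration of the idea, assigning a 1-based sequential ID per distinct combination so pick() can index it directly:

```python
# Assign a sequential integer ID to each unique combination of dimension values.
def assign_ids(combos):
    ids = {}
    for combo in combos:
        if combo not in ids:
            ids[combo] = len(ids) + 1  # 1-based, to match pick() indexing
    return ids
```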

    Sales:
    LOAD
    num(ceil(rand()*100)) as Customer1 // vary this number
    ,ceil(rand()*10) as Sales
    AUTOGENERATE 100000 // vary this number
    ;
    Customers:
    LOAD *
    ,'Customer ' & Customer2 as Name
    ;
    LOAD num(fieldvalue('Customer1',iterno())) as Customer2
    AUTOGENERATE 1
    WHILE len(fieldvalue('Customer1',iterno()))
    ;
    Expression:
    LOAD
    'pick(Customer2'
    & concat(distinct '
    ,sum({<Customer1={' & Customer2 & '}>} Sales)','',Customer2) & ')' as Expression
    ,'pick(Customer2' & concat(distinct ',vc' & Customer2,'',Customer2) & ')' as SetVarsExpression
    RESIDENT Customers
    ;
    LET vExpression = fieldvalue('Expression',1);
    LET vSetVarsExpression = fieldvalue('SetVarsExpression',1);
    DROP TABLE Expression;

    FOR I = 1 TO 1000
    LET vc$(I) = '=sum({<Customer1={' & $(I) & '}>} Sales)';
    NEXT

    John Witherspoon:

    One more thing worth mentioning:

    Conditional set analysis with variables was only marginally slower than using a separate expression for every set and having no dimension. Using a separate expression for every set and having no dimension represents (I believe) the lower bound on the amount of time required to get the data with any set analysis solution. I'm therefore confident that each set is only evaluated once, and that your solution is about as fast as any set analysis solution could possibly be.

    I think that's even more meaningful than saying "over 10x faster than sum(if()) for large numbers of values and dimensions". It isn't just faster, it seems to be about as fast as is possible. The only thing faster, I believe, would involve data model changes to directly connect the data, which is exactly what we're trying to avoid in this case.