Skip to main contentSign InHelp

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

Documents for QlikView related information.

- Qlik Community
- :
- Data Analytics Forums
- :
- QlikView
- :
- QlikView Documents
- :
- Evaluating "sets" in the context of a dimension

Options

- Article History
- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page
- Report Inappropriate Content

Not applicable

2011-01-04
08:10 AM

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

Evaluating "sets" in the context of a dimension

Below is a converstion between myself and John Witherspoon, which describes and validates how to use set analysis in an expression that reconciles to the individual values of a charts dimensions.

**Gordon Savage:**

I've seen the comment a couple of times 'a set is only generated once for the entire table, not once per value of the dimensions'.

I have a chart in a document which uses variable substitution which results in an expression containing set analysis effectively resolve to the corresponding dimension. The expression returns a count of a column in a table not directly related to the table in which the dimension exists. It's a bit complicated but the users want to be able to de-select the 'heads' from 'tails' in the data and vice versa so the tables are not directly linked by a common field.

This is the chart definition:

*Dimension:*

=if ([Wty Group Type] = 'Standard', [Month and Year Received])

*Expression:*

sum([Total REQ]) / $(vLinkVars)

where vLinkVars is generated in the script from the referenced values as:

if (LinkVar = 'v40483S', v40483S, if (LinkVar = 'v40391S', v40391S, if...........

LinkVar is a column within the same table ('tails') as [Month and Year Received] where the numeric part represents the numeric value e.g. v40483S:

=ceil(sum(aggr(count({$<Link *= {'Nov 2010'}, [Wty Month Type] *= {'Standard'}>} distinct Machine)/Div, Div)))

and where the columns 'Link' and '[Wty Month Type]' are within another table ('heads').

*How I think it works:*

Because the dimension [Month and Year Received] results in just one value of LinkVar in the expression, the conditional statement of vLinkVars will substitute the corresponding variable vNNNNNS for which Qlikview substitutes its associated expression. The expression definitively identifies the records to be counted where the ('hard coded') value of 'Link' reflects the corresponding dimension without any actual aggregation occurring to the dimension (ie it returns a number that is indirectly related to the dimension value).

Note that the maximum number of dimension values is 36 as it represents a rolling 3 year period. The number of rows subject to the set analysis is around 14,000.

**John Witherspoon:**

I'm not sure if I'm following, since it seems like your example has a lot of extraneous features. But the basic solution seems to be using a nested if to decide which set expression to use, and thus getting around the normal restriction of a set only being evaluated once for the entire table.

To try to strip it down to basics, let's start with a much simpler data model:

Sales:

LOAD * INLINE [

Customer1, Sales

A, 10

B, 20

C, 30

];

Customers:

LOAD * INLINE [

Customer2, Name

A, Amy

B, Bob

C, Carla

];

So here, Customer1 and Customer2 are not connected, even though they have the same values. Let's assume there's a good reason for leaving them disconnected, but in some particular chart, we want to display the customer ID, name AND sales together. People often try to do the connection between Customer1 and Customer2 with set analysis, perhaps like this:

sum({<Customer1=Customer2>} Sales)

That doesn't work, of course, unless we select a specific Customer2. So we often suggest this alternative:

sum(if(Customer1=Customer2,Sales))

That *works*, but if you have a lot of data, it's slow because it has to compare every row to every row (well, technically it doesn't *have* to, and maybe doesn't, but at best it's o(n log n), I believe, in the general case). So you're suggesting instead that we create a nested IF to determine which set to use. Something like this for our example data and chart, if we use Customer2 as a dimension:

if(Customer2='A',sum({<Customer1={'A'}>} Sales)

,if(Customer2='B',sum({<Customer1={'B'}>} Sales)

,if(Customer2='C',sum({<Customer1={'C'}>} Sales)

,sum(Sales)))) // final sum(Sales) to get total value

This expression can and almost certainly should be generated in script, such as like this:

Expression:

LOAD ' ' & concat(distinct 'if(Customer2=|' & Customer1 &'|,sum({<Customer1={|' & Customer1 & '|}>} Sales)','

,') & '

,sum(Sales)' & repeat(')',count(distinct Customer1)) as Expression

RESIDENT Sales

;

LET vExpression = replace(fieldvalue('Expression',1),'|',chr(39));

DROP TABLE Expression;

That seems to work, and I don't think it's a happy accident. It looks pretty straightforward to me. It looks like a reasonable technique.

Assuming I've understood what you're proposing, have you done any performance testing on it? I'm betting that for small numbers of values and large numbers of rows, this would be significantly faster than the normal sum(if(...)) approach, making it a true improvement over our standard advice, at least in some fairly common situations. I suspect that if the number of values is medium to large, it will be slower, but I think people typically want to do this when the number of values is small, such as on months, or customers, or types, or statuses - that kind of thing.

The reason I think that is probably clear, but basically your new approach looks to be o(n), while the old approach is either o(n log n) or o(n^2) depending on how QlikTech coded it. Hmmm, it *must* be o(n log n) or it would simply fail on any reasonable size of data set. But there appears to be more overhead in the new approach, so assuming it's o(n) vs. o(n log n), it likely won't pull ahead until we get to some larger values of n.

Using pick(match()) to select the set analysis expression might be slightly faster as it doesn't require nesting and should only need to evaluate the value of Customer2 once per row in the table. It's probably not different enough to show up in testing or ever be noticed, though.

,'if(dimensionality(),pick(match(Customer2,|' & concat(distinct Customer1,'|,|') & '|)'

& concat(distinct '

,sum({<Customer1={|' & Customer1 & '|}>} Sales)') & ')

,sum(Sales))' as Expression2

LET vExpression2 = replace(fieldvalue('Expression2',1),'|',chr(39));

**Gordon Savage:**

I dont know the mechanics of how Qlikview constructs data matching with set analysis….

Unfortunately I am never able to do any meaningful performance testing as the server running Qlikview is not dedicated. I can say that before using set analysis I had a compound IF statement and this took around 10x longer to render the chart.

**John Witherspoon:**

As far as the mechanics of data matching with set analysis, I think of it like actually making the selections. My understanding from a conversation with one of the senior tech people is that this is a reasonable way to think about it. So in my example of your conditional set analysis, think of it as "if Customer2=A, select A in Customer1, then sum(Sales), else if customer2=B, select B in customer 1, then sum(Sales), else...". Since QlikView has a very fast engine for applying selections to the data set, it by default has a very fast engine for applying set analysis to the data set. But usually, the number of sets required to calculate a chart is small. That may not be the case here if our dimension has a lot of values, and QlikTech may not have anticipated using, say, 1000 separate sets to calculate a chart.

**John Witherspoon:**

I did some testing. At first, it was a performance disaster, and trying to figure out what was going wrong is what took me so long. Both approaches were clearly dominated internally by a loop within a loop - for each value, for all rows (not just matching rows), causing them to go VERY slowly as the number of values and rows increased.

That IS the obvious way to process the sum(if()) internally:

for each value

set total to 0

for each row

if the row matches the value

add to total

end if

end for

display value and total in the chart

end for

There exist less obvious but faster ways, though, so this still took me by surprise.

The conditional set analysis took me even more by surprise. The most obvious way to process the conditional set analysis shouldn't exibit this sort of behavior. Also, you were reporting a 10x improvement in speed, where I was initially seeing it going slower than the simple sum(if()).

One feature of your solution that I hadn't used was variables. This seemed like an unnecessary complication on the surface, but it appears to be KEY to making your solution perform, even if I can't fully explain why. Once I started generating the variables, the performance picked right up. The more values and the more rows being processed, the better the conditional set analysis with variables compares to the sum(if()). It was going well over 10x faster as the numbers increased, right up to the point where my computer started choking on the size of everything. I was going up by factors of 10 on both values and rows. Performance massively degraded (though it still returned the answer eventually) with 1,000 values and 100,000 rows, or 100 values and 10,000,000 rows. Up to that point, I was seeing was sub-second response times on the chart.

I did all of the final testing in version 10, but saw enough of it in version 9 SR6 to be pretty confident that it works OK in that version as well. Version 10 was just faster in every way, particularly at loading up large amounts of random data, so it was just more practical to use it for the testing.

My script is below. I should note that I eliminated the match() from my expressions to improve performance. In my case, I was using a sequential integer customer ID so this was easy. Out in the real world, you wouldn't have an integer value for each of your dimension values, but you could easily assign one to every unique combination of the chart's dimensions in script, so I think it's practical to eliminate the match() if you have enough values to degrade the performance of the match().

Sales:

LOAD

num(ceil(rand()*100)) as Customer1 // vary this number

,ceil(rand()*10) as Sales

AUTOGENERATE 100000 // vary this number

;

Customers:

LOAD *

,'Customer ' & Customer2 as Name

;

LOAD num(fieldvalue('Customer1',iterno())) as Customer2

AUTOGENERATE 1

WHILE len(fieldvalue('Customer1',iterno()))

;

Expression:

LOAD

'pick(Customer2'

& concat(distinct '

,sum({<Customer1={' & Customer2 & '}>} Sales)','',Customer2) & ')' as Expression

,'pick(Customer2' & concat(distinct ',vc' & Customer2,'',Customer2) & ')' as SetVarsExpression

RESIDENT Customers

;

LET vExpression = fieldvalue('Expression',1);

LET vSetVarsExpression = fieldvalue('SetVarsExpression',1);

DROP TABLE Expression;

FOR I = 1 TO 1000

LET vc$(I) = '=sum({<Customer1={' & $(I) & '}>} Sales)';

NEXT

**John Witherspoon:**

One more thing worth mentioning:

Conditional set analysis with variables was only marginally slower than using a separate expression for every set and having no dimension. Using a separate expression for every set and having no dimension represents (I believe) the lower bound on the amount of time required to get the data with any set analysis solution. I'm therefore confident that each set is only evaluated once, and that your solution is about as fast as any set analysis solution could possibly be.

I think that's even more meaninful than saying "over 10x faster than sum(if()) for large numbers of values and dimensions". It isn't just faster, it seems to be about as fast as is possible. The only thing faster, I believe, would involve data model changes to directly connect the data, which is exactly what we're trying to avoid in this case.

Comments

eran_dayan

Contributor III

2012-03-08
03:52 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

Well done!! Very high level post

richard_pressan

Creator

2012-12-17
06:59 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

I don't known the perf implications of my solution but you can realize the set analysis inside a aggr and insert your expression inside an aggregation fonction e.g.

sum(

aggr(

sum {$<Customer2=P(Customer1)>} Sales

,

Customer1

)

)

Not applicable

2013-11-27
10:53 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

richard_pressan

Creator

2013-11-27
02:57 PM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

johnw

MVP

2014-04-17
09:47 PM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

Not applicable

2014-08-11
12:18 PM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

Note that the concat(distinct ...) function in the Expression table load will concat it's parameter in alphabetical order rather than the source table load order.

This, in my case, caused a problem for the pick() order for set expressions loaded from an Excel sheet. I solved this by adding a counter field comment in the script, the resulting solution looks like:

**Set Analysis Expression Source Table**

BudgetGroupId | BudgetHoursSetExpression |

1 | DepartmentCode=, OperationCode={'W015','W017','W018'}-{'W9*'}, EmployeeID= |

2 | DepartmentCode=, OperationCode={'W011','W014','W016'}-{'W9*'}, EmployeeID= |

3 | DepartmentCode={'W120'}, OperationCode-={'W9*'}, EmployeeID= |

**Load script**

Expression:

LOAD

'pick(BudgetGroupId'

& concat(distinct '

,**/* '& num(BudgetGroupId,'00000') &' */ **sum({1<' & BudgetHoursSetExpression &'>}ProductionHours)') & ')' as Expression

RESIDENT BudgetGroups

;

LET vBudgetHoursSetExpression = fieldvalue('Expression',1);

**Resulting variable**

pick(BudgetGroupId

,/* 00001 */sum({1<DepartmentCode=, OperationCode={'W015','W017','W018'}-{'W9*'}, EmployeeID=>}ProductionHours)

,/* 00002 */sum({1<DepartmentCode=, OperationCode={'W011','W014','W016'}-{'W9*'}, EmployeeID=>}ProductionHours)

,/* 00003 */sum({1<DepartmentCode={'W120'}, OperationCode-={'W9*'}, EmployeeID=>}ProductionHours)

)

Without the numerical comment, the line with W011 would come first and W015 second, messing up the pick order, because the line with W015 belongs to budgetgroup #1 and should be picked if BudgetGroupId=1.

Generally you can use the first pick parameter (pick(<firstparam>,etc) ) as the variable to supply to num().

(Unless this parameter does not have continuous values in the dataset)

jrdunson

Creator

2014-10-14
10:09 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

I have a follow-up question... I wanted to asked it here...but there is no 'attach' option, when adding comments...

Here is my question:

nizamsha

Specialist II

2015-10-31
05:45 AM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

Hi Gordon Savage,

This is the best post that i ever seen using set analysis it was awesome . like this kind of post i am searching for a long time i cant able to identify not even one thing ,kindly post some more post so that we can understand how and all we can use set analysis in different ways

b_garside

Partner

2017-06-05
08:34 PM

- Mark as Read
- Mark as New
- Bookmark
- Permalink
- Email to a Friend
- Report Inappropriate Content

Does this Aggr() allow it to evaluate more than once or per dimension?

Version history