Solved: Re: Optimizing sums with fields from two tables - Page 2 - Qlik Community

Report Inappropriate Content · ‎2015-05-25

Greetings everyone,

I am looking for ways to optimize a sum expression that uses fields from two tables.

As an example, take the expression Sum( NoOfUnits * UnitCost ), where NoOfUnits is in one fact table, and UnitCost is in another. As explained in this article,

https://community.qlik.com/blogs/qlikviewdesignblog/2013/08/20/the-calculation-engine

QlikView will generate all applicable combinations of the field values before summing them.

The problem in my case is that both tables can be very large, even though the field that links them such as ProductID, does not have that many distinct values. Generating all combinations will result in a very large amount of combinations to summarize.

Take for example this scenario: The first fact table contains order lines, with a number of dimensions such as product ID, customer ID, sales channel, and the number of units sold for that order line. The number of order lines can be many millions, even though the number of distinct products is only a few hundred.

The second fact table contains the unit cost for each product, broken down into many components that form the product, and other dimensions such as component supplier.

It would be possible to combine these tables in the load script, but that would create one extremely large fact table, with many billions of rows. The idea with two fact tables is that these represent a more normalized view of the data. In the example, the breakdown of product cost onto components is the same regardless of which customer buys that specific product, but it will be different for different products.

An example of a simple analysis is that the user selects a set of order lines, for example by choosing a customer, a customer group, or a sales channel. Based on the selected order lines, we want to display the component cost per component.

Both intuitively and mathematically, it should yield the same result, but in a more efficient way, if we first summarize NoOfUnits per product for the selected order lines, and then combine that with the fact table that contains unit cost per product and other dimensions, instead of taking each separate order line and combine with the fact table with unit cost. This has to be done after the selection has been made, since we would like the user to have full flexibility in how they choose the order lines.

For the most simple analysis, which just shows the total amount per product for the selection, it seems to work well to replace the slower Sum( NoOfUnits * UnitCost ) with the following expression:

Sum (aggr(sum(NoOfUnits)*sum(UnitCost), Product))

It summarizes the number of units per product and the unit cost per product, and then multiplies them.

However, I have been unable to create an expression that always works if I want to include dimensions from one or both of the fact tables in the analysis.

I'll try to attach a sample document that uses randomly generated data that can be used to show the problem. On my machine, when selecting one product group, it takes a few seconds to load the "Amount per product slow" analysis, while "Amount per product fast", which uses the more complicated expression with aggr above, loads more or less instantly. Below that is an "Amount per component" which I have not been able to get faster. The load script can easily be changed to see the effects of larger or smaller number of rows in the tables.

Here are some workarounds that I already know of, if it simply is not possible to create a more efficient expression:

Creating several documents for different kinds of analyses, where the fact tables are summarized to only contain the dimensions necessary for that particular analysis. For example, one analysis where the user must select a customer group. We can summarize the order lines per product and customer group, which will reduce the amount of data. This limits the flexibility of the user though.
Increase the computation resources (more/faster CPU cores, more memory, adjusting working set settings and so on). The computations are not performed more efficiently, but they can be done faster with more expensive hardware.
Pre-cache common selections after reloading the document.

Thanks in advance for any suggestions you may have,

Robert

Report Inappropriate Content · ‎2015-06-10

That makes sense, thanks for the explanation.

Report Inappropriate Content · ‎2015-06-10

Thank you for all your efforts, Oleg!

In a simple chart that only uses component as dimension, it did seem to work to use the expression I posted earlier. In that case, the dimensions from the order line fact table are not displayed in the chart, but used to filter the order lines in various ways. The chart then breaks down the selected order lines on the components that have been used by all products sold on those order lines. In the real application I have in mind, this kind of analysis makes sense.

Sum (aggr( NODISTINCT sum(NoOfUnits), Product)*aggr( sum(UnitCost), Product, Component))

I agree with your conclusion that these kind of expressions probably only work in limited scenarios. For some other scenarios where charts and selections only need fields from the order lines fact tables, it makes sense to add the total cost to the order lines fact table as you and Henric suggested. I think this is enough for the problem I need to solve now, even though it is always nice to know solutions that are as generic as possible.

Oleg_Troyansky · ‎2015-06-10

Yes, I came exactly to the same conclusions. I'm curious - for your chart by Component, would it be faster or slower to calculate the same using this formula:

Sum (aggr( sum(NoOfUnits) * sum(UnitCost), Product, Component))

Could you please compare the two formulas and post the result?

cheers,

Oleg

Oleg Troyansky
Call me for expert support, performance tuning, advanced training and more!

Report Inappropriate Content · ‎2015-06-10

Hi Oleg,

here are some test results for that chart. I did try both a stopwatch and checking the CalcTime on the sheet properties.

Unoptimized expression sum(NoOfUnits*UnitCost)
12 seconds / 11390 CalcTime
Sum (aggr( NODISTINCT sum(NoOfUnits), Product)*aggr( sum(UnitCost), Product, Component))
< 0.5 seconds / 109 CalcTime
Sum (aggr( sum(NoOfUnits) * sum(UnitCost), Product, Component))
20 seconds / 19594 CalcTime

As far as I can see, all formulas produce the same result for the test data. In my test, the formula you suggested is slower.

Robert

Oleg_Troyansky · ‎2015-06-10

This is so interesting! Thanks for sharing!

Oleg Troyansky
Call me for expert support, performance tuning, advanced training and more!

Report Inappropriate Content · ‎2015-06-24

After some more experimentation, I discovered that the following expression also seems to work well in charts that only display dimensions related to the second fact table. Having one less AGGR might be good for performance.

Sum (aggr( NODISTINCT sum(NoOfUnits), Product) * UnitCost)

Essentially this performs what I tried to explain in the original post, but did not know how to create an expression for.

Both intuitively and mathematically, it should yield the same result, but in a more efficient way, if we first summarize NoOfUnits per product for the selected order lines, and then combine that with the fact table that contains unit cost per product and other dimensions, instead of taking each separate order line and combine with the fact table with unit cost. This has to be done after the selection has been made, since we would like the user to have full flexibility in how they choose the order lines.

The limitation here is that the chart can only have dimensions from one of the fact tables, but it will use the selection in the other fact table to calculate the result, which is what I mainly needed.

I know it is difficult to understand the problem, but the performance difference is huge compared to joining the two fact tables when loading and producing a multi billion row fact table, or having QlikView produce all those combinations when creating a chart.

I'll mark this question as answered, since this solves the problem as I stated it. However, if anyone has further input now or in the future, feel free to add comments.Thanks to everyone that contributed to the discussion!