Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Greetings everyone,
I am looking for ways to optimize a sum expression that uses fields from two tables.
As an example, take the expression Sum( NoOfUnits * UnitCost ), where NoOfUnits is in one fact table, and UnitCost is in another. As explained in this article,
https://community.qlik.com/blogs/qlikviewdesignblog/2013/08/20/the-calculation-engine
QlikView will generate all applicable combinations of the field values before summing them.
The problem in my case is that both tables can be very large, even though the field that links them such as ProductID, does not have that many distinct values. Generating all combinations will result in a very large amount of combinations to summarize.
Take for example this scenario: The first fact table contains order lines, with a number of dimensions such as product ID, customer ID, sales channel, and the number of units sold for that order line. The number of order lines can be many millions, even though the number of distinct products is only a few hundred.
The second fact table contains the unit cost for each product, broken down into many components that form the product, and other dimensions such as component supplier.
It would be possible to combine these tables in the load script, but that would create one extremely large fact table, with many billions of rows. The idea with two fact tables is that these represent a more normalized view of the data. In the example, the breakdown of product cost onto components is the same regardless of which customer buys that specific product, but it will be different for different products.
An example of a simple analysis is that the user selects a set of order lines, for example by choosing a customer, a customer group, or a sales channel. Based on the selected order lines, we want to display the component cost per component.
Both intuitively and mathematically, it should yield the same result, but in a more efficient way, if we first summarize NoOfUnits per product for the selected order lines, and then combine that with the fact table that contains unit cost per product and other dimensions, instead of taking each separate order line and combine with the fact table with unit cost. This has to be done after the selection has been made, since we would like the user to have full flexibility in how they choose the order lines.
For the most simple analysis, which just shows the total amount per product for the selection, it seems to work well to replace the slower Sum( NoOfUnits * UnitCost ) with the following expression:
Sum (aggr(sum(NoOfUnits)*sum(UnitCost), Product))
It summarizes the number of units per product and the unit cost per product, and then multiplies them.
However, I have been unable to create an expression that always works if I want to include dimensions from one or both of the fact tables in the analysis.
I'll try to attach a sample document that uses randomly generated data that can be used to show the problem. On my machine, when selecting one product group, it takes a few seconds to load the "Amount per product slow" analysis, while "Amount per product fast", which uses the more complicated expression with aggr above, loads more or less instantly. Below that is an "Amount per component" which I have not been able to get faster. The load script can easily be changed to see the effects of larger or smaller number of rows in the tables.
Here are some workarounds that I already know of, if it simply is not possible to create a more efficient expression:
Thanks in advance for any suggestions you may have,
Robert
After some more experimentation, I discovered that the following expression also seems to work well in charts that only display dimensions related to the second fact table. Having one less AGGR might be good for performance.
Sum (aggr( NODISTINCT sum(NoOfUnits), Product) * UnitCost)
Essentially this performs what I tried to explain in the original post, but did not know how to create an expression for.
Both intuitively and mathematically, it should yield the same result, but in a more efficient way, if we first summarize NoOfUnits per product for the selected order lines, and then combine that with the fact table that contains unit cost per product and other dimensions, instead of taking each separate order line and combine with the fact table with unit cost. This has to be done after the selection has been made, since we would like the user to have full flexibility in how they choose the order lines.
The limitation here is that the chart can only have dimensions from one of the fact tables, but it will use the selection in the other fact table to calculate the result, which is what I mainly needed.
I know it is difficult to understand the problem, but the performance difference is huge compared to joining the two fact tables when loading and producing a multi billion row fact table, or having QlikView produce all those combinations when creating a chart.
I'll mark this question as answered, since this solves the problem as I stated it. However, if anyone has further input now or in the future, feel free to add comments.Thanks to everyone that contributed to the discussion!
Another way to try to describe what I would like to accomplish:
I'm happy to clarify anything in this question, just ask.
Robert,
your problem is one of the toughest ones, because there is no easy "book answer" to prescribe.
It is a known fact, and you mentioned it, that QlikView doesn't perform well when two operands within the same aggregation are coming from different tables:
sum(Field1*Field2).
In your case, because of the many-to-many relationship between the two tables, it's not trivial to bring the two fields together.
Your suggested solution with AGGR() is perhaps the best "proxy" for a better performance. I'd like to understand your comment where you mentioned that you weren't able to formulate a more advanced analysis involving other dimensions.
Other than that, I'd suggest to look for compromises and "work around" solutions:
- If total cost is used in many calculations (charts), I'd suggest to pre-calculate the total cost field in the fact table.
- If cost components can be summarized into a few commonly used buckets (for example, the famous Material, Labor, and Overhead), I'd also suggest pre-calculating these three fields in the fact table.
- Hopefully, these work around solutions can cover a large part of your needs. For truly dynamic calculations that allow selecting any combinations of cost components and can't be pre-calculated, you'd have to take the user to the "slow" sheet with a warning that this calculation may take longer than usual. Even here, I'd suggest finding a way of using your AGGR() solution at the appropriate level of detail. Just watch for the number of distinct values that you need to generate.
cheers,
Oleg Troyansky
Masters Summit for QlikView - take your QlikView skills to the next level!
This is a challenging one. I think you are stuck in "workaround" land, pre-calc as much as possible etc.
One modeling thing you can do is make sure all your DImension tables (including the ComponentFacts table) are DISTINCT.
-Rob
I agree with everything Rob and Oleg have said - this is not easy...
Most cases like this - also the example with Sum( NoOfUnits * UnitCost ) - are many-to-one relationships. Then the obvious answer is: Make a copy of UnitCost in the fact table. Then the above aggregation is fast. Your example with OrderLines is such a relationship: Each order line only has one cost. Then do what Oleg suggests: calculate the total cost in the OrderLines table, already in the script.
But if you have a true many-to-many relationship, I don't have a good answer. Your suggestion with Aggr() may be a good one, although my experience of Aggr() is that it is CPU-expensive...
HIC
Thanks a lot for your quick answers! I'll make a note of the additional compromises/work arounds that have been suggested.
I had already tried various aggr expressions, but I gave it another go today after your replies, and this expression actually seems to work faster while giving the same result, for a simple chart that uses the component dimension from the other fact table with unit cost:
Sum (aggr( NODISTINCT sum(NoOfUnits), Product)*aggr( sum(UnitCost), Product, Component))
I don't really understand why NODISTINCT makes a difference, but without it, the result is usually completely wrong. Conceptually this kind of calculation is what I had in mind - summarize the tables on only the dimensions necessary for linking them and for the chart dimensions before the multiplication, but until now didn't manage to get working. Do you think this kind of expression makes sense? It feels a little bit scary to have different dimensions in the two aggr parts, even though that is what theoretically is necessary.
Here are some of the other aggr expressions I've been trying:
Sum (aggr( sum(NoOfUnits), Product)*aggr( sum(UnitCost), Product, Component))
Fast, but gives completely wrong result.
Sum (aggr(sum(NoOfUnits)*sum(UnitCost), Product, Component))
Sum (aggr(sum(NoOfUnits)*sum(UnitCost), Component, Product))
Seems to give the same result but is not faster than the original simple expression. I assume that it still needs to generate the combinations.
Sum (aggr( sum(NoOfUnits), Product, Component)*aggr( sum(UnitCost), Product, Component))
Seems to give the same result but is slower than the original simple expression.
It should be noted that I tried to extend the working expression to include one more dimensions from the table with NoOfUnits, but that gave incorrect results so far.
I believe the NODISTINCT is needed because you have the mixed granularity between your two AGGR-generated "cubes." That is, you have one row per Product & Component for UnitCost; but only one row per Product for the NoOfUnits. The NODISTINCT is telling the engine to take NoOfUnits results and propagating them across every row of the more granular UnitCost "cube."
I've had a few challenges like your situation and I've typically used ONLY() as my inner aggregation function for the cost. Knowing that the aggr() is taking me down to the lowest level of granularity in my item master, then there could only be one value for that cost.
I've never tested the performance impact of the only() vs. a sum() ... perhaps Oleg or Henric could comment on that.
- Bill
Hi Robert,
I think you are on the right path, but the actual formulas that you listed seems to be incorrect. Here are some thoughts and some suggestions:
- I don't think you need to include Component into your AGGR dimensions, unless your chart has Component as a chart dimension. In all other cases, Component doesn't need to be there. If users select certain components, the available (associated) components will get aggregated
- In your aggr(), you definitely need to list Product and all chart dimensions for each chart. For example, in a chart by Customer, you should include Customer as the dimension of the AGGR().
I picture something like this:
sum(AGGR( sum(NoOfUnits) * sum(UnitCost), Product, ChartDim1, ChartDim2, ChartDim3, ...))
Now, this approach may or may not resolve the issue of building the possible combinations, but logically it should be correct.
cheers,
Oleg Troyansky
QlikView Your Business - check out my new book, coming out in August!
I have to post a correction about NODISTINCT - that was a dumb moment... AGGR() is distinct by default.
I did some testing on a somewhat similar data structure, and I have mixed results...
In one specific instance, I managed to gain a huge improvement in performance (from 15 sec down to 1.2 sec) when I used the AGGR() function, similar to the one that I suggested, with a single dimension that was close to your Product Dimension:
sum(AGGR( sum(NoOfUnits) * sum(UnitCost), Product))
However, as soon as I added another dimension, the AGGR() solution performed much worse than the original. The original formula showed a marginal increase in calculation time (19 sec.) while the AGGR() solution showed much worse results (82 sec. and worse).
Conclusion: You might have limited opportunities to improve performance with AGGR(), but only in some limited scenarios. Other than that, unless others offer some clever ideas, I'd revert to work-around solutions...