Not applicable

Duplicate sum values in a one-to-many join

Hello

I have two files, the first with a unique claim number (one record per claim) and a cost, and the second with non-unique claim numbers (possibly many records per claim) and a code. When I load them into QV and sum the cost by claim number, the values are multiplied up by the number of duplicates in the second file. For example:


status:
load * inline [
claim, cost
A, 100
B, 150
C, 120
D, 90
];

status2:
LEFT JOIN load * inline [
claim, rule
A, 1
B, 1
B, 2
C, 1
D, 2
D, 3];


In this example, if you create a table to sum cost by claim, claim B will show 300 and claim D will show 180.

How do I resolve this?

Many thanks!

11 Replies
Not applicable
Author

Did you try to use sum(DISTINCT ...) in your expression?
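With the joined sample data above, that would be something along these lines (a sketch, assuming a text object or a chart total):

sum(DISTINCT cost)

which gives 460 (100 + 150 + 120 + 90) for this particular sample, where no two claims share the same cost value.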

Philippe

Anonymous
Not applicable
Author

You can try this:
sum(aggr(cost, claim))
(Using distinct is dangerous - what if different claims have the same cost?)
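For the joined sample data, aggr(cost, claim) builds one cost value per claim (100, 150, 120, 90), so

sum(aggr(cost, claim))

returns 460 instead of the 700 a plain sum(cost) gives over the joined rows.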

Not applicable
Author

Good shot Michael,

I missed that one .....

Regards,

Philippe

Anonymous
Not applicable
Author

Thanks Philippe,
I've stepped on these rakes too... 🙂
The best way is not to join the tables, so the expressions stay clean and simple. But that's not always an available option.
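As a sketch using the inline tables from the original post, that just means dropping the LEFT JOIN and loading the second table under its own name:

status:
load * inline [
claim, cost
A, 100
B, 150
C, 120
D, 90
];

status2:
load * inline [
claim, rule
A, 1
B, 1
B, 2
C, 1
D, 2
D, 3
];

QlikView associates the two tables on the shared claim field, and a plain sum(cost) then returns 460 with no duplication.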

johnw
Champion III

I'm with Michael regarding the "right way" to do this. I definitely suggest not joining the tables. Your two inline tables represent a clean data model. If you loaded them as separate tables, you could just sum(cost) and it would work. By joining the tables, you broke the data model, duplicating the cost across multiple rows. Your table would look like this after the left join:

claim,cost,rule
A,100,1
B,150,1
B,150,2
C,120,1
D,90,2
D,90,3

To me, that table of data is simply wrong. That's why QlikView stopped reporting the right result. Now, you CAN still get the right result with the sum(aggr(cost,claim)). But to me, that's like breaking your own arm (the left join), putting a cast on it to straighten it out (the expression), and calling it good as new. Better to not break the data model in the first place.

(And yeah, I've messed this up myself in the past, and briefly did the sum(aggr()) approach until I realized it was a data modeling error. And by "the past" I mean "yesterday", and on a claims system as well.)

Not applicable
Author

Hi John

What you say makes a lot of sense. Usually when I try to just load the tables separately (not join them) I get an out-of-memory error. The tables I'm using are quite large. Have you ever had a similar problem?

Thanks

johnw
Champion III

I get my share of out-of-memory errors. I do most of my development on my 32-bit PC while targeting a 64-bit server. I'll develop on a 64-bit machine if I have to, but it's a little less convenient. And making everything fit on my own PC helps with my memory-management skills anyway.

I haven't noticed much difference in memory usage between unjoined and joined tables. My original expectation of QlikView was that a highly normalized model with lots and lots of tables would work best. QlikView DOES deal very well with that sort of data model. But over time, I've found it simpler in many cases to just join smaller tables into larger ones. This duplicates data, but QlikView's compression appears to keep that from having any significant effect on memory usage.

And QlikView's compression is, I think, the main point here. Regardless of how you model your data, of how many tables you use, as long as you're modeling correctly, every model contains the same exact information. Compression algorithms push the memory requirements down towards the minimum number of bits required to store that information. So a perfect compression algorithm would, I believe, require the exact same number of bits to store your data regardless of how you modeled it. There's probably no such thing as a perfect compression algorithm, so I'm sure you'll see differences in practice. But if you find that a correct data model with joined tables is using substantially less memory than a correct data model using separate tables, I would find that very surprising and counter to my own experience. I suspect that what's actually happening is that you're changing the meaning of the data when you join. In the simplistic example in this thread, the join DOES change the information. It actually adds (mis)information, so I would expect the memory requirement to go up slightly for that case.

Still, if your experience is that joins reduce the memory requirement, that's not a problem either. You just have to make sure you join carefully. In the claims application I was working on earlier this week, for instance, I DID join much of the information into a single table. It doesn't change anything fundamental, for instance, if I duplicate 10 different product-defining fields across every single item of that product that we sold. It's a poor data model for a relational database, but a perfectly good way of staging data in QlikView. But two of my QUANTITY fields needed to be stored separately on little side tables containing nothing other than a key and a quantity (claim ID vs. cost of claim, item ID vs. weight of the item).
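To illustrate the shape (with made-up field names, not the ones from my actual application), it looks roughly like this:

// wide table: descriptive fields joined in; the duplication is harmless
// (made-up example data)
Items:
load * inline [
ItemID, ClaimID, ProductID, ProductName, ProductGroup
1, A, P1, Widget, Hardware
2, A, P2, Gadget, Hardware
3, B, P1, Widget, Hardware
];

// quantity-type field kept on a small side table: one row per key
ClaimCosts:
load * inline [
ClaimID, ClaimCost
A, 100
B, 150
];

That way sum(ClaimCost) stays correct even though each claim appears on several item rows, because the cost is stored only once per ClaimID.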

Anonymous
Not applicable
Author

John,
I have to add that different versions of QV handle memory in different ways. Some of my applications with a large number of logical tables were working perfectly well before 8.5. In 8.5 I got severe problems with memory, and had to join the tables to change the design to a star schema. As a result, the application sizes got bigger and the expressions more complex, but the run-time performance was fine and there were no memory issues.
(BTW, with the old normalized data model, using set analysis in expressions made performance in 8.5 even worse.)

johnw
Champion III

The more I learn, it seems the less I understand about how QlikTech designs their software. It seems strange to me that in 8.5, having a large number of tables would suddenly start behaving badly for memory usage. Other than a programmer or designer "oops", I can't understand why that would be.

Same thing with set analysis. Had set analysis been designed the way I expected (I'm sure there was a good reason it wasn't), there would again be no reason for its performance to be worse on an old normalized data model. But tech support has told me that it isn't designed at all the way I expected. So I suppose performance problems with set analysis should no longer take me by surprise.

I suppose I didn't notice the change in 8.5 because by the time 8.5 came out, I was doing a lot more joining and a lot less normalization, and generally changing old applications along those lines whenever I maintained them. So I apparently naturally just started doing what 8.5 wanted.

Sounds like joining is often the way to go, then, and it just has to be done carefully. Have you looked at how joined and unjoined tables compare in 9.0? I have it downloaded, but haven't done much poking yet. We're not planning to upgrade for at least another month, probably more.