ben2r
Contributor III

Script Sum If With Groups Causes Duplication

I have a large table of transaction-level data (15m rows) over which I am trying to efficiently pre-calculate some of the client-level sum fields I want to use in the dashboard.

There are broadly 3 sums I am trying to do: sum where product = A, sum where product = B, and sum where product = A or B.

The only way I have been able to get this to work so far is to create 3 separate loads of the master table, saving the result in a results table: one for product A, one for B, and one for A or B. Example code is below (for the A-or-B case):

LOAD
    %CLIENT_KEY,
    SUM(AMOUNT) as AFIN_iCNR_2019
RESIDENT FACT_CROSS  // This is the source table
WHERE ASPECT = 'CNR' AND FACT_YEAR = '2019'
  AND (%PROD_LINE = 'PROD A' OR %PROD_LINE = 'PROD B')
GROUP BY %CLIENT_KEY;

I repeat this for the other 2 products individually and left join the results back to this table to give one table with 3 new sum columns.

This takes quite a long time to run and slows down my script execution considerably. Could anyone advise a better (faster) way to accomplish this? I have tried doing it all in one go, using IF(%PROD_LINE = 'PROD A', SUM(AMOUNT)) as SUM_A and IF(%PROD_LINE = 'PROD B', SUM(AMOUNT)) as SUM_B, but this creates duplicate rows in the results table (although it is much, much faster).

Using the example data attached - the result should be:

CLIENT_KEY    PROD A    PROD B
CLIENT 1         473       360

 

If I try to do it all in one table I get a result something like this:

CLIENT_KEY    PROD A    PROD B
CLIENT 1         473         0
CLIENT 1           0       360

 

I presume the group by is causing it not to combine into the same row. 
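
Roughly, the single-pass version I tried looks like the sketch below (the result-table name is just illustrative). I have to include %PROD_LINE in the GROUP BY for the outer IF() to be valid, and I suspect that is exactly what splits each client across one row per product:

// Single-pass attempt (sketch): because %PROD_LINE is referenced outside an
// aggregation it has to be in the GROUP BY, so the output has one row per
// client per product rather than one row per client.
RESULTS_ONEPASS:
LOAD
    %CLIENT_KEY,
    IF(%PROD_LINE = 'PROD A', SUM(AMOUNT)) as SUM_A,
    IF(%PROD_LINE = 'PROD B', SUM(AMOUNT)) as SUM_B
RESIDENT FACT_CROSS
WHERE ASPECT = 'CNR' AND FACT_YEAR = '2019'
GROUP BY %CLIENT_KEY, %PROD_LINE;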

Appreciate any guidance on this, I'm sure there is a simple answer I am missing!


9 Replies
Rodj
Luminary Alumni

Hi @ben2r ,

Left join performance issues are fairly common and there are a few ways around them. The first thing I like to consider is "do I really need to join these tables?". The Qlik engine is optimised to work in a different way: creating your mini sum tables already gives you associations on %CLIENT_KEY, so you could just leave them be as separate tables. There's a whole lot of data modelling religious argument that might be started by what I've just said, but I think it's worth thinking about; there's plenty of evidence in articles here on community that it will perform well and meet your needs.

Leaving the above aside, there are various options discussed here on community if you search on "left join performance" (I prefer to just search in Google, prefixing my search with "Qlik"). I'd take a look at the Keep prefix (there is a Left Keep) and at mapping loads, either of which might meet your needs. There's a telling line in the help for the Keep statement: "Explicit joins should therefore generally be avoided in Qlik Sense scripts".

I'm not suggesting an explicit solution here as everyone's data and environment is different and you might need to test a few things for yourself to see what your performance is like. I also don't know how fixed you are on having to have everything in the one table.

Cheers,

Rod

MikeW
Creator

What's in your Group By? It shouldn't create duplicates if you just have %CLIENT_KEY. 

Rodj
Luminary Alumni

It isn't creating duplicates as such: each source record only has a single value in %PROD_LINE, which can be 'PROD A' or 'PROD B'. Conceptually what needs to happen is a de-pivoting of the data, i.e. pivoting %PROD_LINE out into separate PROD A and PROD B sum columns.

There are a number of ways this can be dealt with, as I alluded to above, but if we need separate Sum_A and Sum_B columns they will probably need to be created separately. That said, we don't have any information on the grain of the original fact table, etc. My approach would be to simply create separate result tables, e.g.:

Sum_A:
LOAD
    %CLIENT_KEY,
    Sum(AMOUNT) as Sum_A
RESIDENT FACT_CROSS
WHERE %PROD_LINE = 'PROD A'   // filter to the relevant product for this sum
GROUP BY %CLIENT_KEY;

 

They don't need to be joined back to the fact; that is probably unnecessary work, as the QIX engine will handle the simple association easily and without any of the potential complications that would arise in a SQL-based world. Using Keep or a mapping load would then combine the tables either logically or physically if desired.
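
Purely as an illustration (the result-table names here are assumed; the fields are from the original post), the mapping-load route to physically combine the sums into a single row per client could look something like this:

// Illustrative sketch: build a map of %CLIENT_KEY -> Sum_B, then add it as a
// column while reloading the Sum_A table, giving one row per client.
SumB_Map:
MAPPING LOAD
    %CLIENT_KEY,
    Sum(AMOUNT)
RESIDENT FACT_CROSS
WHERE %PROD_LINE = 'PROD B'
GROUP BY %CLIENT_KEY;

Client_Sums:
LOAD
    %CLIENT_KEY,
    Sum_A,
    ApplyMap('SumB_Map', %CLIENT_KEY, 0) as Sum_B
RESIDENT Sum_A;

DROP TABLE Sum_A;   // keep only the combined table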

alex00321
Creator II

Hi, I would suggest not using SUM ... GROUP BY in the Qlik script given the high data volume you have. I once had a similar situation and it made the script take a long time to execute. What I do is avoid the sum in the back end and instead do it in the front end when building the chart. Thanks!
Rodj
Luminary Alumni

That's a good call too. Pre-calculating often isn't of real benefit, and the volumes being talked about here still aren't very big unless you are just developing on desktop.

alex00321
Creator II

True
MikeW
Creator

It might be caused by the If being outside the sum rather than inside the sum. 

 

If you have a high aggregation rate from this sum (i.e. going from 15m records to 10k records after the sum), then the fastest way is to do the Ifs inside the sums and to sort by the group-by key first; this part is multi-threaded. Then do the SUM ... GROUP BY on the sorted table using a resident load. GROUP BY is single-threaded for some weird reason, so it helps to front-load the work (i.e. any calculations inside the sums, sorting, WHERE filtering, combining many group-by fields into a single key) while it is still multi-threaded.
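
A rough sketch of that two-pass pattern, reusing the field names from your post (the intermediate and result table names are just illustrative):

// Pass 1 (multi-threaded): evaluate the row-level Ifs, apply the WHERE filter
// and sort by the grouping key up front.
PREP:
LOAD
    %CLIENT_KEY,
    If(%PROD_LINE = 'PROD A', AMOUNT, 0) as AMOUNT_A,
    If(%PROD_LINE = 'PROD B', AMOUNT, 0) as AMOUNT_B,
    If(Match(%PROD_LINE, 'PROD A', 'PROD B'), AMOUNT, 0) as AMOUNT_AB
RESIDENT FACT_CROSS
WHERE ASPECT = 'CNR' AND FACT_YEAR = '2019'
ORDER BY %CLIENT_KEY;

// Pass 2: the single-threaded GROUP BY now only has simple sums left to do.
CLIENT_SUMS:
LOAD
    %CLIENT_KEY,
    Sum(AMOUNT_A)  as SUM_A,
    Sum(AMOUNT_B)  as SUM_B,
    Sum(AMOUNT_AB) as SUM_AB
RESIDENT PREP
GROUP BY %CLIENT_KEY;

DROP TABLE PREP;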

 

For only 15m records, you might be better off doing the sum using set analysis on the front end, which is essentially a multi-threaded group by. If the sum is used in nested Aggr()s and reused many times in a non-cacheable way, then pre-calculation might make more sense. Definitely test whether the pre-calculation actually improves anything; there are lots of options for optimization on the front end as well.
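
For example, the front-end equivalents of the three script sums would be chart expressions along these lines (same field names assumed; ASPECT and FACT_YEAR would either be handled by selections or added to the same set), with the client field as the chart dimension:

// Illustrative chart expressions:
Sum({<[%PROD_LINE] = {'PROD A'}>} AMOUNT)              // product A only
Sum({<[%PROD_LINE] = {'PROD B'}>} AMOUNT)              // product B only
Sum({<[%PROD_LINE] = {'PROD A', 'PROD B'}>} AMOUNT)    // A or B combined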

ben2r
Contributor III
Author

@Rodj  thanks for your thoughts on this. 

This is part of a broader project to build a simple data model from many different sources, and I was trying to keep to a simple star-schema approach (in the past we have created monster snowflake DMs which seem to be terrible for dashboard performance).

I like the idea of the mapping load - I will give that a try. 

Running separate tables has not made a massive difference; it still adds 5 minutes onto a script that otherwise loads in ~45 seconds.

Elsewhere there was a question about granularity. The file I sent was slightly abridged: the main table of roughly 14m rows contains a row for each month over the last 24 months, and within each month multiple products per client with the value traded for each. The unique list of clients across those 14m rows is about 60k.

ben2r
Contributor III
Author
Accepted Solution

@MikeW thanks for the input. I was not aware of that about the Ifs. I have updated my load script and it reduced the load time slightly.

The front end certainly works, but in the past we have seen quite slow dashboard response when clicking through different cuts of the data. I am not certain how much of that was due to a badly designed DM (the uber-snowflake mentioned above).

Your multi-step suggestion works a treat here: the second pass brings everything down to one result row per client and keeps my data model looking nice and clean!

Thanks for the help!