dusasuman
Creator

Aggregation on huge data

Hi Community..

I have 11M records in one QVD with 54 fields. My requirement is to aggregate the data over all the dimension fields: 12 of the fields are amount fields and all the rest are dimensions. At the moment I am taking the sum of all the amount fields from the table, grouped by all the dimension fields, but that load just hangs. Can anyone suggest the right approach?
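Roughly what I am running, with placeholder field and file names (my real table has 42 dimension fields and 12 amount fields):

AggFacts:
LOAD
    Customer,                               // ... plus every other dimension field
    Item,
    Manufacturer,
    Sum(Sales)    as Sales,                 // ... one Sum() per amount field
    Sum(Discount) as Discount
FROM [Facts.qvd] (qvd)
GROUP BY Customer, Item, Manufacturer;      // all dimension fields must be listed here too

STORE AggFacts INTO [AggFacts.qvd] (qvd);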

8 Replies
paulferguson80
Contributor III

Hi,

Personally I would do any aggregation that is possible within the load script, so it is processed while the figures are loaded in. However, you also need to handle nulls within the load when aggregating, as they may cause problems; one way to handle that is sketched below.
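For example, something along these lines (untested, field and file names are placeholders): normalise empty or null dimension values to a token first, then aggregate the cleaned table:

// First pass: give NULL/empty dimension values an explicit token so every
// record lands in a well-defined group and still associates in the model.
Staging:
LOAD
    If(Len(Trim(Customer)) = 0, '<unknown>', Customer) as Customer,
    Item,
    Sales
FROM [Facts.qvd] (qvd);

// Second pass: aggregate the cleaned table.
AggFacts:
LOAD
    Customer,
    Item,
    Sum(Sales) as Sales
RESIDENT Staging
GROUP BY Customer, Item;

DROP TABLE Staging;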

Doing the aggregation up front allows you to keep the expressions as simple as possible when displaying the results within your dash.

There is a secondary option of breaking the data down into smaller chunks based on the dimensions and using a key table to link them, but this overcomplicates what should be a simple task.

I would also use the debugger while loading the data to make sure the code works before committing to the full reload.

Hope this helps

Ta Paul

effinty2112
Master

Hi Suman,

You'll need to have a good hard think and reduce the number of dimensions you want to aggregate over. If you aggregate 11M records over the 42 remaining dimensions, it's possible that you will hardly reduce the number of rows, if at all, and doing that is a LOT of work.

11M isn't such a lot of records. I would pick a few dimensions (start with only one, two or three) with the highest number of distinct values, aggregate on them, and do any further aggregations in charts. If the chart calculation time is still too long for you, aggregate over a couple more dimensions in the load until you find a happy medium between load time and chart calculation time.
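Just as a sketch with placeholder names: keep only a handful of high-cardinality dimensions in the script-side GROUP BY, drop the rest, and let the charts aggregate further over what remains:

AggSales:
LOAD
    Customer,                    // the two or three highest-cardinality dimensions
    Item,
    Sum(Sales)    as Sales,      // one Sum() per amount field
    Sum(Discount) as Discount
FROM [Facts.qvd] (qvd)
GROUP BY Customer, Item;         // every other dimension field is dropped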

Good luck

Andrew

miguelbraga
Partner - Specialist III

marcus_sommer

What is your aim? Why not simply load the table in a qvw and aggregate there? Are all those fields really needed? And what will the ratio of unaggregated records to aggregated records be?

Beyond that, I agree with Andrew that you should slice heavy loads horizontally and/or vertically (maybe one slice per YearMonth or something similar).
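A sketch of the horizontal slicing (untested, placeholder names): process one YearMonth at a time so no single GROUP BY has to chew through all 11M records at once. Loads into the same table name auto-concatenate because the field sets are identical:

FOR EACH vMonth IN '2016-07', '2016-08'

    AggFacts:
    LOAD
        Customer,
        Item,
        Sum(Sales) as Sales
    FROM [Facts.qvd] (qvd)
    WHERE YearMonth = '$(vMonth)'       // assumes a YearMonth field exists in the qvd
    GROUP BY Customer, Item;

NEXT vMonth

STORE AggFacts INTO [AggFacts.qvd] (qvd);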

- Marcus

dusasuman
Creator
Author

My data is at hour:minute level, and there are a lot of records per customer, item, manufacturer and so on; these are all dimension fields (nearly 30 fields). There are also sales, discount, cost etc., which are the amount fields (13 of them). All the amount fields should be summed over all the dimension fields, so as to bring the 1.5 GB of QVD data down to about half that size and store it in a new QVD.

marcus_sommer

Is the size of 1.5 GB really a problem? I hope not ...

Besides aggregation, you could reduce the file size by splitting high-cardinality fields like a timestamp (into date and time). Further, take a look at whether there are any descriptive fields like names or addresses which could be removed or replaced with their IDs.

Another point is to remove formatting from fields, for example using num(Date) instead of date(Date, 'AnyDateFormat'). Rounding fields to a certain number of digits (the long decimals that occur when fields are created by multiplication/division) can also minimise the size.
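Put together, a sketch of those reductions (untested, placeholder names):

Facts:
LOAD
    Floor(OrderTimestamp)               as OrderDate,  // pure date number, no format string
    Round(Frac(OrderTimestamp), 1/1440) as OrderTime,  // time of day, rounded to the minute
    CustomerID,                                        // keep the ID; move Name, Address etc. to a dimension table
    Round(Sales, 0.01)                  as Sales       // cap the digits created by multiplication/division
FROM [Facts.qvd] (qvd);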

- Marcus

dusasuman
Creator
Author

My requirement itself is to reduce the data by aggregation and store the reduced table in a separate QVD; finally, I need to use the reduced QVD in the application. The 1.5 GB covers only two periods (July 2016, August 2016) of data.

marcus_sommer

An aggregation (especially over many dimensions) won't necessarily reduce the file size very much. Even if the ratio of aggregated to unaggregated records is about 50%, you won't get a 50% reduction in file size; I would assume the reduction will only be 10% - 20%, and that at the cost of aggregating for hours (although incremental approaches could reduce that).
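An incremental variant could look like this (untested, placeholder names); it assumes new data arrives in whole periods that never overlap the already aggregated ones:

AggFacts:
LOAD * FROM [AggFacts.qvd] (qvd);       // previous aggregate, loads optimised

CONCATENATE (AggFacts)
LOAD
    Customer,
    Item,
    YearMonth,
    Sum(Sales) as Sales
FROM [Facts_NewPeriod.qvd] (qvd)        // only the newly arrived period
GROUP BY Customer, Item, YearMonth;

STORE AggFacts INTO [AggFacts.qvd] (qvd);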

I suggest reducing the file size with the approaches mentioned above, then testing the response times of the target application, and only aggregating the data if they are really too slow.

- Marcus