
How to 'batch' load a 192-million-row table with GROUP BY

Hello,

I am looking to do a GROUP BY (4 dimensions) on a relatively large table of 192 million rows. Computer says no.

So I thought to break it up into chunks of 10 million rows each, using RowNo(). However, it takes more than 10 hours to complete that command.

Is there a simple way to 'batch' every 10 million rows, repeating the process after each batch finishes until all 192 million rows have been processed by the GROUP BY statement?

Trial and error shows that 10 million rows can be handled. The script looks like:

Left Join (MainTable)
Load
    ID,
    KeyOne,
    KeyTwo,
    KeyThree,
    Sum(A*B) as NewVar
Resident MainTable
Group by ID, KeyOne, KeyTwo, KeyThree;

Another solution may be to create a MainTableNew and 'batch' with a concatenate; see the sketch below. Since the GROUP BY includes the most granular key, it effectively adds a NewVar to each line, so in total 192 million rows will be delivered.
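A minimal sketch of that batching idea, assuming MainTable holds exactly the fields in the script above; the chunk size, variable names and the If/Else concatenation pattern are illustrative, not a tested recipe. One caveat: key combinations that span two chunks leave partial sums, so a final GROUP BY over the (much smaller) result may still be needed.

// Sketch: group 10M-row chunks of MainTable and append each result to MainTableNew.
Let vChunkSize = 10000000;
Let vRows = NoOfRows('MainTable');

For vStart = 1 To $(vRows) Step $(vChunkSize)

    Chunk:
    NoConcatenate
    Load ID, KeyOne, KeyTwo, KeyThree,
        Sum(A*B) as NewVar
    Resident MainTable
    Where RecNo() >= $(vStart) and RecNo() < $(vStart) + $(vChunkSize)
    Group by ID, KeyOne, KeyTwo, KeyThree;

    // First chunk creates MainTableNew, later chunks are appended to it
    If $(vStart) = 1 Then
        MainTableNew:
        NoConcatenate Load * Resident Chunk;
    Else
        Concatenate (MainTableNew) Load * Resident Chunk;
    End If

    Drop Table Chunk;

Next vStart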

Looking forward to your suggestions.

Kind regards, Ed.

5 Replies
Anonymous

If you split the incoming data into discrete chunks, GROUP BY the individual chunks, and concatenate them together, you will get a different result than doing a single GROUP BY over all the data: rows with the same key combination that land in different chunks each produce their own partial sum. Would this cause an issue?

When "Computer says no" does it give an error message ?

Author

Hi Bill,

Thanks for responding.

a. I don't think (but am not sure) that it makes a difference, since the GROUP BY also includes the most granular key, meaning it will calculate the sum for each individual line, with the values of all four dimensions in it.

b. The laptop just stops. 8 GB of memory is not enough, and the disk is also too small. A new PC might be something to consider, but I need to be certain that it will perform to my needs. Looking into it.

Thanks again!

Kind regards,

Ed Borsboom


swuehl
MVP

If you calculate the sum(A*B) on each line, why do you need the GROUP BY and the JOIN, then?

Both will consume a lot of memory.

You should be able to calculate the product A*B in your original MainTable already.
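For example, a minimal sketch of that idea, assuming the field names from the original post and that each key combination is unique per row; the QVD file name is hypothetical:

// Sketch: when every (ID, KeyOne, KeyTwo, KeyThree) combination is unique,
// the per-group Sum(A*B) equals the per-row product, so neither a JOIN
// nor a GROUP BY is needed.
MainTable:
Load
    ID,
    KeyOne,
    KeyTwo,
    KeyThree,
    A,
    B,
    A*B as NewVar
From [Source.qvd] (qvd);    // hypothetical source file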

marcus_sommer

I think bill.markham is right, and it will make a big difference how you handle such a GROUP BY load. I would probably take this approach (sketched below):

- load and store 10M-row chunks from your source as QVDs, without transformations, then drop the table

- load each chunk, apply your GROUP BY, store the result as a QVD, then drop the table

- load all group-by QVDs and concatenate them
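A minimal sketch of those three steps, assuming the raw rows sit in one Source.qvd with the fields from the original post; the file names, the chunk size and the chunk count (20 chunks of 10M rows for 192M rows) are illustrative:

// Step 1: slice the source into ~10M-row QVD chunks, no transformations.
For i = 0 To 19
    Raw:
    NoConcatenate
    Load * From [Source.qvd] (qvd)
    Where RecNo() > $(i) * 10000000 and RecNo() <= ($(i) + 1) * 10000000;
    Store Raw into [Chunk_$(i).qvd] (qvd);
    Drop Table Raw;
Next i

// Step 2: group each chunk separately, store the result, drop the table.
For i = 0 To 19
    Grouped:
    NoConcatenate
    Load ID, KeyOne, KeyTwo, KeyThree, Sum(A*B) as NewVar
    From [Chunk_$(i).qvd] (qvd)
    Group by ID, KeyOne, KeyTwo, KeyThree;
    Store Grouped into [Grouped_$(i).qvd] (qvd);
    Drop Table Grouped;
Next i

// Step 3: a wildcard load concatenates all grouped chunks into one table.
Result:
Load * From [Grouped_*.qvd] (qvd);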

These are good prerequisites for incremental loads: Incremental Load

Maybe there are further possibilities, for example flagging your "A" and "B" fields if they are NULL or 0, or other criteria which you could filter out with a WHERE clause (a sketch below). Furthermore, do you need each key / each field in this table? Maybe you could move some fields/keys to other tables, or use other (numeric) keys, or ...
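As a minimal sketch of that filter idea (the condition and file name are illustrative): rows where A or B is NULL or 0 cannot contribute to Sum(A*B), so they can be dropped before grouping:

// Sketch: discard rows that cannot contribute to the aggregate.
Filtered:
Load ID, KeyOne, KeyTwo, KeyThree, A, B
From [Source.qvd] (qvd)    // hypothetical source file
Where Len(A) > 0 and A <> 0 and Len(B) > 0 and B <> 0;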

By the way: a laptop with 8 GB is not really a well-suited tool for this amount of data.

- Marcus

maxgro
MVP

Can you split the GROUP BY using KeyOne values (one KeyOne value or a group of KeyOne values)? This shouldn't change the result of the GROUP BY, because KeyOne is one of the grouping dimensions, so no group can span two slices.

I think something like:

loop
     create a table with some KeyOne values
     read from the qvd where the KeyOne values exist, and group by
     store in qvd
end loop

At the end you have some QVDs with the GROUP BY result.
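A minimal sketch of that loop, assuming the raw rows sit in one Source.qvd; the slice count, the Mod/AutoNumber partitioning of KeyOne values and the file names are all illustrative:

// Sketch: partition the distinct KeyOne values into slices, then group
// one slice of the source at a time.
Keys:
Load Distinct KeyOne From [Source.qvd] (qvd);

Let vSlices = 20;
For s = 0 To $(vSlices) - 1

    // A table with some KeyOne values: slice s of the distinct keys.
    SliceKeys:
    Load KeyOne as SliceKey
    Resident Keys
    Where Mod(AutoNumber(KeyOne), $(vSlices)) = $(s);

    // Read from the QVD only where KeyOne exists in this slice, and group.
    Grouped:
    NoConcatenate
    Load ID, KeyOne, KeyTwo, KeyThree, Sum(A*B) as NewVar
    From [Source.qvd] (qvd)
    Where Exists(SliceKey, KeyOne)
    Group by ID, KeyOne, KeyTwo, KeyThree;

    Store Grouped into [GroupedSlice_$(s).qvd] (qvd);
    Drop Tables SliceKeys, Grouped;

Next s

Drop Table Keys;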