Hello,
I am looking to do a group by (4 dimensions) on a relatively large table of 192 million rows. Computer says no.
So I thought to break it up into chunks of 10 million rows each, using RowNo(). However, it takes more than 10 hours to complete that command.
Is there a simple way to 'batch' every 10 million rows, repeating the process after each batch finishes until all 192 million rows are processed by the group by statement?
Trial and error shows that 10 million rows can be handled. The script looks like:
Left Join (MainTable)
Load
    ID,
    KeyOne,
    KeyTwo,
    KeyThree,
    Sum(A*B) as NewVar
Resident MainTable
Group By ID, KeyOne, KeyTwo, KeyThree;
Another solution may be to create a MainTableNew and 'batch' with a concatenate. The group by adds a NewVar within each line, so in total 192 million rows will be delivered.
Looking forward to your suggestions.
Kind regards, Ed.
If you split the incoming data into discrete chunks, group by the individual chunks, and concatenate them together, you will get a different result than doing a single group by over all the data. Would this cause an issue?
When "Computer says no", does it give an error message?
Hi Bill,
Thanks for responding.
a. I don't think (but I'm not sure) that it makes a difference, since the group by also includes the most granular key, meaning it will calculate the sum for each individual line, with the values of all four dimensions in it.
b. The laptop just stops. 8 GB of memory is not enough, and the disk is also too small. A new PC might be something to consider, but I need to be certain that it will perform to my needs. Looking into it.
Thanks again!
Kind regards,
Ed Borsboom
If you calculate sum(A*B) on each line, why do you need the GROUP BY and the JOIN, then?
Both will consume a lot of memory.
You should be able to calculate the product A*B in your original MainTable already.
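If the grouping keys really are unique per row, the aggregation and the join can be skipped entirely by computing the product in the original load. A minimal sketch (the QVD path and the A/B source fields are assumptions):

```qlik
// Sketch: compute the product during the original load,
// avoiding both the Resident reload and the Group By.
MainTable:
Load
    ID,
    KeyOne,
    KeyTwo,
    KeyThree,
    A,
    B,
    A * B as NewVar        // per-row product, no aggregation needed
From [lib://Data/MainTable.qvd] (qvd);
```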
I think bill.markham is right, and it can make a big difference how you handle such a group by load. I would probably do it this way:
- load and store as qvd 10M chunks from your source without transformations and then delete the table
- load and store as qvd each chunk with your group by and then delete the table
- load all group-by qvd's and concatenate them
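The three steps above might be sketched like this (the source path, chunk count, and file names are assumptions; note that a Where clause on RecNo() disables the optimized QVD load, so the chunking pass itself will not be fast):

```qlik
// Step 1: store the raw data in 10M-row chunks, without transformations.
For i = 0 To 19                        // 20 chunks x ~10M rows = 192M+
    Chunk:
    Load * From [lib://Data/Source.qvd] (qvd)
    Where RecNo() > $(i) * 10000000 and RecNo() <= ($(i) + 1) * 10000000;
    Store Chunk into [lib://Data/Chunk_$(i).qvd] (qvd);
    Drop Table Chunk;
Next i

// Step 2: aggregate each chunk and store the result.
For i = 0 To 19
    Agg:
    Load ID, KeyOne, KeyTwo, KeyThree, Sum(A*B) as NewVar
    From [lib://Data/Chunk_$(i).qvd] (qvd)
    Group By ID, KeyOne, KeyTwo, KeyThree;
    Store Agg into [lib://Data/Agg_$(i).qvd] (qvd);
    Drop Table Agg;
Next i

// Step 3: concatenate all aggregated chunks via a wildcard load.
Result:
Load * From [lib://Data/Agg_*.qvd] (qvd);
```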
These are also good prerequisites for incremental loads: Incremental Load
Maybe there are further possibilities, for example flagging your "A" and "B" fields if they are NULL or 0, or there may be other criteria which you could filter out with a where-clause. Also, do you need every key / every field in this table? Maybe you could move some fields/keys to other tables, or use other (numeric) keys, or ...
By the way: a laptop with 8 GB is not really a well suited tool for this amount of data.
- Marcus
Can you split the group by using KeyOne values (one KeyOne value, or a group of KeyOne values)? This shouldn't change the result of the group by.
I think something like:

loop
    create a table with some KeyOne values
    read from the qvd where Exists(KeyOne) and group by
    store into a qvd
end loop
At the end you have a set of QVDs that together hold the group by result.
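That loop might look roughly like this in script. This is only a sketch: the source path, batch size, and file names are assumptions, and it relies on loading the distinct keys under a different field name (AllKeyOne) so that Where Exists(KeyOne) only matches the current batch:

```qlik
// Sketch: group by in batches of KeyOne values, using Where Exists().
Let vBatchSize = 1000;

AllKeys:
Load Distinct KeyOne as AllKeyOne
From [lib://Data/Source.qvd] (qvd);
Let vTotal = NoOfRows('AllKeys');

For vStart = 0 To $(vTotal) - 1 Step $(vBatchSize)
    // Pick the next batch of KeyOne values (FieldValue is 1-based).
    BatchKeys:
    Load FieldValue('AllKeyOne', RecNo() + $(vStart)) as KeyOne
    AutoGenerate RangeMin($(vBatchSize), $(vTotal) - $(vStart));

    // Aggregate only the rows whose KeyOne is in the current batch.
    Agg:
    Load ID, KeyOne, KeyTwo, KeyThree, Sum(A*B) as NewVar
    From [lib://Data/Source.qvd] (qvd)
    Where Exists(KeyOne)
    Group By ID, KeyOne, KeyTwo, KeyThree;

    Store Agg into [lib://Data/Agg_$(vStart).qvd] (qvd);
    Drop Tables BatchKeys, Agg;   // clear KeyOne values for the next pass
Next vStart
```

A single Where Exists() on one field also keeps the QVD load optimized, which matters at this data volume.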