Hello guys,
I have a CSV file that uses only 4 columns, but the volume is quite big: 21 million lines.
The first two columns are the identifier.
ID1 | ID2 | INFO1 | INFO2 |
AA | 12 | Pop | 550 |
AA | 12 | Tim | 600 |
AA | 12 | Luck | 720 |
AA | 12 | Tom | 950 |
AA | 12 | Nina | 450 |
BB | 23 | Duke | 932 |
BB | 23 | Rod | 72 |
BB | 23 | Yub | 560 |
BB | 23 | Anna | 432 |
BB | 23 | Paul | 453 |
All I want to do is group all the INFO1 and INFO2 values with their corresponding identifier, and it works.
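To illustrate what I mean by grouping, here is a small Python sketch over the sample rows above (just to show the intent; this is not what the Talend job actually runs):

```python
from itertools import groupby

rows = [
    ("AA", "12", "Pop", "550"),
    ("AA", "12", "Tim", "600"),
    ("AA", "12", "Luck", "720"),
    ("AA", "12", "Tom", "950"),
    ("AA", "12", "Nina", "450"),
    ("BB", "23", "Duke", "932"),
    ("BB", "23", "Rod", "72"),
    ("BB", "23", "Yub", "560"),
    ("BB", "23", "Anna", "432"),
    ("BB", "23", "Paul", "453"),
]

# Collect INFO1 and INFO2 under each (ID1, ID2) identifier pair.
grouped = {}
for (id1, id2), group in groupby(rows, key=lambda r: (r[0], r[1])):
    g = list(group)
    grouped[(id1, id2)] = {
        "INFO1": [r[2] for r in g],
        "INFO2": [r[3] for r in g],
    }

print(grouped[("AA", "12")]["INFO1"])  # ['Pop', 'Tim', 'Luck', 'Tom', 'Nina']
```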
But the real problem is the amount of data.
Local machine: 8 GB RAM
I have increased the JVM -Xmx to 4096 MB and -Xms to 2048 MB.
Do you have any idea how it can be done or optimised, please?
Thank you.
Best regards,
asadasing
I'll be honest and say it doesn't look like you are doing anything immensely hard there. I believe the problem is entirely down to the last flow that runs. Have you tried running that on its own? I suspect it will fail, but can you test it? The next thing to try is to add a tSortRow after the tMap and then a tAggregateSortedRow after that to carry out the list operation. This *might* help. It will break the act of sorting and aggregating down into two tasks instead of one. You can use the "Sort on disk" option of the tSortRow (Advanced settings). This will remove the sorting from memory and should allow you to get through this.
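To show why sorting first should help: once the stream is sorted on the key, the aggregation only ever needs to hold one group in memory at a time, instead of a map of all 21 million rows. A rough Python sketch of the idea (an illustration of the sort-then-aggregate pattern, not Talend code; the function name is made up):

```python
from itertools import groupby

def aggregate_sorted(rows):
    """Aggregate a stream that is already sorted on (ID1, ID2).

    Only the current group is held in memory at any time, which is
    why sort-then-aggregate scales to millions of rows.
    """
    for key, group in groupby(rows, key=lambda r: (r[0], r[1])):
        info1, info2 = [], []
        for r in group:
            info1.append(r[2])
            info2.append(r[3])
        yield key, info1, info2

# Example on a tiny stream already sorted on the key:
sorted_rows = [
    ("AA", "12", "Pop", "550"),
    ("AA", "12", "Tim", "600"),
    ("BB", "23", "Duke", "932"),
]
for key, info1, info2 in aggregate_sorted(sorted_rows):
    print(key, info1, info2)
```

The tSortRow with "Sort on disk" plays the role of producing that sorted stream without keeping everything in RAM; the tAggregateSortedRow then only needs the streaming pass shown here.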
Hello @rhall
Thank you for your response.
As suggested, I have tried running the subjob on its own, but it failed at 7 million rows (same as before).
I have also tried the option of adding the tSortRow (sort on disk) and tAggregateSortedRow, but it stopped at approximately 5 million rows with an OutOfMemory error.
Is there any other way to solve this?
Thank you.
asadasing
Change the "Buffer size of external sort" to something like the number of rows you are working with. It is set to 1000000 by default. Maybe change this to 25000000.