topic Split source table based on record count for processing big Oracle tables in Talend Studio

Split source table based on record count for processing big Oracle tables

Anonymous — Tue, 06 Nov 2018 15:55:07 GMT

Hi,

I have a very massive table whose record count will be more than 10 billion.

1. I wanted to read the data from input table for first 2 billion, process it and write to output table.

Then fetch next 2 billion and do the same process and write to output table.

how could I design my job so that it loops to fetch 2 billion at a time. This design should also consider restartability at record level if it errors(checkpoint at each row if this is the best approach).

2. Can I run as a seperate process each 2 billion records?

Thanks

Re: Split source table based on record count for processing big Oracle tables

TRF — Tue, 06 Nov 2018 22:42:53 GMT

You may also read the table at once and redirect the content to delimited files with the desired number of records for each.
On a separate process, you can start to read the 1rst file as soon as the 2nd one is created and so on.
This way you can start to process the result before the query is finished and restart on the file of your choice in case of.
Restartability at record level is an other story and I'm not sure checkpoint for each row is an option with such a large volume.
Paralelization for insert should be possible in append mode to avoid locks on table (as far as I remember) but you probably need to limit the number of parallel processes.
That's just some ideas. Let us know the rest of the story. Good luck.

Re: Split source table based on record count for processing big Oracle tables

Anonymous — Thu, 08 Nov 2018 11:59:21 GMT

Hi,

Thanks for the suggestion.

I have used table to store restart and record split details which will be input to the child job.

And I have created parallel flow creating multiple job reading from the same table.

Thanks.

Re: Split source table based on record count for processing big Oracle tables

Anonymous — Sat, 10 Nov 2018 13:06:15 GMT

Hi TRF,

Like tCheckpoint customised component, is there any customised component that support parallel execution as well as restartability at job-level?

Thanks,

Revathy.