Anonymous
Not applicable

Parallel execution with restart mechanism

Hi,

 

There are around 200 tables with certain fields that require masking. About 10 of these tables hold around 10 billion rows each.

I would like to understand how to split the data, process the splits in parallel, and, when a run fails, restart from the last record processed.

 

For example, take a customer table where id is the primary key and there are 10 billion rows.

The job has a tDBInput component followed by tMap, tDataMasking, tDataShuffling, and tConvertType components, and ends with tDBOutput. It also has tPrejob and tPostjob for the connection and the commit.

I want to design the job so that the 10 billion rows are split and processed in parallel. If any of the parallel runs fails, a restart should resume from the last processed id.

 

Could you suggest the best approach in Talend? This job will run as a cron job on the Oracle server.

I have read "How to launch parallel iterations to read data", but it does not cover restart capability.

 

Thanks. 

1 Solution

Accepted Solutions
Anonymous
Not applicable
Author

Hello,

For parallel execution against a database, the "use parallel execution" option is usually supported on the t<DB>Output component. It enables high-speed data processing by handling multiple data flows simultaneously. Note that this feature depends on the ability of the database or application to handle multiple inserts in parallel, as well as on the number of CPUs assigned.

There is also the tParallelize component, which lets you synchronize the execution of a subjob with the execution of other subjobs in your main Job.

Would you mind posting screenshots of your current job design on the forum? They would help us get more information.

Best regards

Sabrina
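
The reply above covers parallel execution but not the restart requirement from the question. One possible way to get both is to split the id range into contiguous slices, process each slice in batches, and record the last committed id per slice so that a rerun resumes after it. The following is only a minimal sketch of that idea in plain Java, not a Talend feature or the approach recommended in this thread. All names (`RangeCheckpointSketch`, `IdRange`, `BatchHandler`, `split`, `resumeFrom`, `run`) are hypothetical, the in-memory map stands in for a checkpoint table that would have to live in the database, and the handler stands in for the read, mask, write, and commit steps of the job.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: id-range partitioning with a per-slice checkpoint.
class RangeCheckpointSketch {

    /** Inclusive id range assigned to one parallel worker. */
    static final class IdRange {
        final int worker;
        final long fromId;
        final long toId;

        IdRange(int worker, long fromId, long toId) {
            this.worker = worker;
            this.fromId = fromId;
            this.toId = toId;
        }
    }

    /** Stand-in for "read ids fromId..toId, mask them, write them, commit". */
    interface BatchHandler {
        void process(long fromId, long toId);
    }

    /** Splits [minId, maxId] into at most `workers` contiguous inclusive ranges. */
    static List<IdRange> split(long minId, long maxId, int workers) {
        if (workers < 1 || maxId < minId) {
            throw new IllegalArgumentException("invalid range or worker count");
        }
        long total = maxId - minId + 1;
        long base = total / workers;
        long extra = total % workers; // the first `extra` ranges get one more id
        List<IdRange> ranges = new ArrayList<>();
        long start = minId;
        for (int w = 0; w < workers; w++) {
            long size = base + (w < extra ? 1 : 0);
            if (size == 0) {
                continue; // fewer ids than workers
            }
            long end = start + size - 1;
            ranges.add(new IdRange(w, start, end));
            start = end + 1;
        }
        return ranges;
    }

    /** First id that still needs processing for this range. */
    static long resumeFrom(IdRange range, Map<Integer, Long> checkpoints) {
        Long last = checkpoints.get(range.worker);
        return (last == null) ? range.fromId : Math.max(range.fromId, last + 1);
    }

    /** Processes the range in batches, checkpointing after each successful batch. */
    static void run(IdRange range, long batchSize,
                    Map<Integer, Long> checkpoints, BatchHandler handler) {
        long next = resumeFrom(range, checkpoints);
        while (next <= range.toId) {
            long end = Math.min(range.toId, next + batchSize - 1);
            handler.process(next, end);         // throws if the batch fails
            checkpoints.put(range.worker, end); // reached only after success
            next = end + 1;
        }
    }

    public static void main(String[] args) {
        // Assumption: a real job would load and store this in a database table.
        Map<Integer, Long> checkpoints = new ConcurrentHashMap<>();
        // Run sequentially here for simplicity; each range could instead be
        // handed to its own thread or subjob.
        for (IdRange r : split(1L, 10_000L, 4)) {
            run(r, 1_000L, checkpoints, (from, to) ->
                System.out.println("worker " + r.worker + " ids " + from + ".." + to));
        }
    }
}
```

For a restart to be safe, the checkpoint would need to be written in the same transaction as the batch it describes. Otherwise a crash between the commit and the checkpoint update could cause a batch to be processed twice.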
