I need to create a job that ingests a list of tables by Sqooping data from a source RDBMS into Hadoop and then into Hive.
I put the list of tables in a file, then read and iterate over it to ingest each table.
Since I have 300+ tables to ingest, processing them one at a time in a single process would take too long, so I need to parallelize the work.
My current idea is that the job reads the list of tables and splits it into arrays of 10 tables each. Each array is then passed to a subjob for processing.
I have already implemented this logic in Spark Scala code (a simplified sketch is below). The problem is that we need to move it to a Talend job so that it is easier for the operations team to monitor and maintain, since Talend is the only tool they are familiar with, but I don't know how to implement this logic in Talend.
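For reference, here is a minimal sketch of the batching logic I need to reproduce. The real job runs under Spark; this plain-Scala version (assuming Scala 2.13 with the scala-parallel-collections module) just shows the shape of it. The file path, JDBC connection string, credentials, and Hive database name are all placeholders:

```scala
import scala.io.Source
import scala.sys.process._
import scala.collection.parallel.CollectionConverters._

object SqoopBatchIngest {
  def main(args: Array[String]): Unit = {
    // Read the table list, one table name per line (path is a placeholder)
    val tables = Source.fromFile("/path/to/tables.txt").getLines().toList

    // Split the ~300 tables into batches of 10
    val batches = tables.grouped(10).toList

    // Run the batches in parallel; each batch ingests its tables sequentially
    batches.par.foreach { batch =>
      batch.foreach { table =>
        // Shell out to Sqoop; connection details below are placeholders
        val cmd = Seq(
          "sqoop", "import",
          "--connect", "jdbc:oracle:thin:@//db-host:1521/ORCL",
          "--username", "etl_user",
          "--password-file", "/user/etl/.sqoop.pwd",
          "--table", table,
          "--hive-import",
          "--hive-database", "staging",
          "--num-mappers", "4"
        )
        val exitCode = cmd.!
        if (exitCode != 0)
          println(s"Sqoop import failed for table $table (exit $exitCode)")
      }
    }
  }
}
```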
I would appreciate any help. Thanks.