Hi,
I googled and found some information on this topic, but could not find an answer that matched my question, so I finally thought of posting it here.
Problem: Our ETL is going to be split into multiple jobs, so we want to understand the best way to pass data from job1 to job2 and so on.
Considerations:
1. In some cases the data volume will be tens of GBs
2. In other cases the data volume will be hundreds of MBs
3. We could have jobs running in parallel and then, once they are all finished, a single job that consolidates their output and loads it into Amazon Redshift (a rough sketch of the handoff we have in mind follows this list)
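
To make that consolidation step concrete, here is a minimal sketch of what we are imagining, assuming each parallel job stages its output under a shared S3 prefix and the final job issues a single Redshift COPY over that prefix. All of the names here (bucket, cluster, database, IAM role) are made up for illustration; we have not committed to any of this.

```python
import boto3

s3 = boto3.client("s3")
rsd = boto3.client("redshift-data")

STAGE_BUCKET = "my-etl-stage"              # hypothetical bucket

def stage_job_output(job_id: str, local_path: str) -> None:
    """Each parallel job uploads its output under its own prefix."""
    key = f"run-001/{job_id}/part-0000.csv.gz"
    s3.upload_file(local_path, STAGE_BUCKET, key)

def consolidate_into_redshift() -> None:
    """The final job issues one COPY over the shared prefix; Redshift
    pulls all the staged files from S3 in parallel."""
    rsd.execute_statement(
        ClusterIdentifier="my-cluster",    # hypothetical cluster
        Database="analytics",
        DbUser="etl_user",
        Sql=(
            "COPY target_table "
            f"FROM 's3://{STAGE_BUCKET}/run-001/' "
            "IAM_ROLE 'arn:aws:iam::123456789012:role/etl-copy-role' "
            "CSV GZIP;"
        ),
    )
```

The appeal is that COPY loads all the staged files in parallel, but we do not know whether this beats staging in Redshift itself, which is what the first question below is about.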
Questions:
1. Would it be better to stage intermediate output in Amazon Redshift, or would that drag performance down?
2. Is there anything available with which data can be made available to the next job without landing on disk? (First sketch after this list.)
3. Also, what would the restartability options be with these solutions? For example, in a design of, say, 10 jobs, if the first 5 succeed and job 6 aborts, can we rerun only the aborted job and the remaining jobs next time, skipping the ones that already completed? (Second sketch after this list.)
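
For question 2, one pattern we have come across (though we are unsure it scales to tens of GBs) is chaining the jobs as processes through an OS pipe, so the intermediate rows stay in memory and never land on local disk. The script names below are invented:

```python
import subprocess

# job1's stdout is piped straight into job2's stdin; the intermediate
# data is never written to disk. Both scripts are hypothetical.
job1 = subprocess.Popen(["python", "job1.py"], stdout=subprocess.PIPE)
job2 = subprocess.Popen(["python", "job2.py"], stdin=job1.stdout)
job1.stdout.close()   # let job2 own the pipe so job1 sees SIGPIPE if job2 dies
job2.wait()
job1.wait()
```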
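
For question 3, this is roughly the restart behaviour we are hoping for, sketched in plain Python: each job records its completion in a manifest, and a rerun skips anything already marked complete and resumes at the aborted job. The manifest file name and dummy job list are invented:

```python
import json
from pathlib import Path

MANIFEST = Path("run-001.manifest.json")   # hypothetical checkpoint file

def load_done() -> set:
    """Jobs already completed in a previous (possibly aborted) run."""
    return set(json.loads(MANIFEST.read_text())) if MANIFEST.exists() else set()

def mark_done(job_id: str, done: set) -> None:
    done.add(job_id)
    MANIFEST.write_text(json.dumps(sorted(done)))

def run_pipeline(jobs: dict) -> None:
    """Run jobs in order, skipping completed ones, so a rerun after a
    failure starts at the aborted job rather than from scratch."""
    done = load_done()
    for job_id, fn in jobs.items():
        if job_id in done:
            print(f"skipping {job_id} (already complete)")
            continue
        fn()              # raises on failure; earlier successes stay recorded
        mark_done(job_id, done)

if __name__ == "__main__":
    # 10 dummy jobs standing in for the real ETL steps
    run_pipeline({f"job{i}": (lambda i=i: print(f"running job{i}"))
                  for i in range(1, 11)})
```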
Any input will be appreciated.
- Kirti