Anonymous
Not applicable

What are the differences between Multi thread Execution and Parallelization with respect to Job performance in Talend?

Multi Thread Execution

0683p000009LufA.jpg

 

Parallelization

 

0683p000009LuI3.jpg

1 Reply
Anonymous
Not applicable
Author

Hi,

 

For the multi thread execution option, see https://help.talend.com/reader/rvo3qR70LGgNn7uawMvxeQ/zOtMbhdnTMKLpMaXfIMJUA

For Pipeline Parallelization, see https://help.talend.com/reader/Z6nEoVnqAU2j~MFxYHYeOg/BD4MmNBB4ZSG8cESdmH6Sw

 

The first is about multithreading, i.e. running logic in parallel. In Java, for instance, it relies on threads. For example, setting parallel execution to 4 on a tMySQLOutput means that 4 database connections are established and the rows are written across those 4 connections. It depends on your hardware, network and database, but in my personal experience any number higher than 4 produces a negligible speed improvement, because the database starts struggling with the throughput thrown at it.
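To make the "4 connections" idea concrete, here is a minimal sketch in Python rather than Talend's generated Java. It is not how Talend implements the option internally: `FakeConnection` is a hypothetical stand-in for a real database connection, and the round-robin split of rows is an assumption made only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class FakeConnection:
    """Hypothetical stand-in for one database connection."""

    def __init__(self, worker_id):
        self.worker_id = worker_id
        self.written = []

    def insert_many(self, rows):
        # A real connection would send INSERT statements here.
        self.written.extend(rows)


def write_in_parallel(rows, workers=4):
    # Assumed round-robin split: worker i gets rows i, i + workers, ...
    chunks = [rows[i::workers] for i in range(workers)]

    def write_chunk(worker_id):
        conn = FakeConnection(worker_id)  # one connection per thread
        conn.insert_many(chunks[worker_id])
        return conn

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(write_chunk, range(workers)))


connections = write_in_parallel(list(range(10)), workers=4)
```

With a real database, raising `workers` past the point where the server is saturated only adds contention, which is why the setting is worth benchmarking rather than maximizing.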

 

Pipeline parallelization uses hash keys to bucket your data into partitions and then processes the data partition by partition. Even though it sounds interesting, many operations require the data to be recollected and then repartitioned. In your example, to output the data to the tFileOutputDelimited, you need to recollect the data so that it is written in the right order based on your tSortRow. You will need to benchmark the performance. Parallelization doesn't automatically mean faster processing. In general it should, but you are adding overhead at potentially every component, because the job has to bucket the data, partition it and departition it. Hence it can sometimes have an adverse outcome, i.e. your job runs slower for small data sets. It depends on memory, CPU cores, the CPU cycles available, the actual data in each run of the job (i.e. how many partitions are created), and the components/operations in your job logic.
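The partition, process, recollect cycle can be sketched as follows. This is an illustration in Python under stated assumptions, not Talend's actual code: the function names, the CRC32 hash and the partition count of 3 are all hypothetical choices, and the partitions are sorted one after another here where a real job would handle them in parallel threads.

```python
import heapq
from zlib import crc32


def partition_by_key(rows, key, n_partitions):
    # Bucket each row by a hash of its key, so equal keys share a partition.
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        bucket = crc32(str(row[key]).encode("utf-8")) % n_partitions
        partitions[bucket].append(row)
    return partitions


def sort_and_recollect(rows, key, n_partitions=3):
    partitions = partition_by_key(rows, key, n_partitions)
    # Each partition is sorted independently (the parallelizable part).
    sorted_parts = [sorted(p, key=lambda r: r[key]) for p in partitions]
    # Recollecting: a merge step is needed to restore one global order
    # before the rows can be written to a single ordered output file.
    return list(heapq.merge(*sorted_parts, key=lambda r: r[key]))


rows = [{"id": i} for i in (5, 3, 9, 1, 7, 2)]
result = sort_and_recollect(rows, "id")
```

The hashing and the final merge are pure overhead compared with sorting the six rows directly, which is the reason small data sets can run slower with parallelization enabled.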

 

Another way to parallelize is to create multiple tasks in TAC and run them at the same time. I generally prefer this approach, combined with multi thread execution. I build my job to work on ranges of data, then deploy many instances of it as tasks in the TAC Job Conductor, with each task working on a different range. I also build intelligence into my jobs so that they know how far they have reached with data processing and how to restart. This way the jobs can be re-run and will pick up where they left off.
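A rough sketch of the range-plus-restart idea, again in Python and purely illustrative: `run_range_job`, the JSON checkpoint file and the `fail_at` parameter (used only to simulate a crash) are hypothetical names, not anything provided by Talend or TAC. In a real job the checkpoint would more likely live in a database table or context variable.

```python
import json
import os
import tempfile


def run_range_job(start, end, checkpoint_path, process, fail_at=None):
    # Resume from the last checkpoint if a previous run left one behind.
    next_id = start
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            next_id = json.load(f)["next_id"]

    for record_id in range(next_id, end):
        if record_id == fail_at:
            raise RuntimeError("simulated crash at record %d" % record_id)
        process(record_id)
        # Record progress only after the record was processed.
        with open(checkpoint_path, "w") as f:
            json.dump({"next_id": record_id + 1}, f)


processed = []
checkpoint = os.path.join(tempfile.mkdtemp(), "task_0_10.json")

try:
    run_range_job(0, 10, checkpoint, processed.append, fail_at=5)
except RuntimeError:
    pass  # the first run stops partway through its range

# Re-running the same task picks up where the first run stopped.
run_range_job(0, 10, checkpoint, processed.append)
```

Each task would get its own range and its own checkpoint, so instances never overlap and any one of them can be re-run on its own.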