
What are the differences between multi-thread execution and parallelization with respect to Job performance in Talend?
Multi Thread Execution
Parallelization

Hi,
For the multi thread execution option, see https://help.talend.com/reader/rvo3qR70LGgNn7uawMvxeQ/zOtMbhdnTMKLpMaXfIMJUA
For Pipeline Parallelization, see https://help.talend.com/reader/Z6nEoVnqAU2j~MFxYHYeOg/BD4MmNBB4ZSG8cESdmH6Sw
The first is about multithreading, i.e. running logic in parallel; in Java, it uses threads. For example, setting parallel execution to 4 on a tMySQLOutput means four database connections are established and you write over four separate connections. Depending on your hardware, network, and database, my personal experience is that any number higher than 4 produces a negligible speed improvement, because the database starts struggling with the throughput thrown at it.
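To make the idea concrete, here is a minimal Java sketch of fanning rows out to a fixed number of worker threads. It is not the code Talend generates; the class and method names (`ParallelWriteSketch`, `writeBatch`, `writeInParallel`) are hypothetical, and the "write" only counts rows where a real job would run an INSERT over its own connection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelWriteSketch {
    // Hypothetical stand-in for one database connection: it only counts rows.
    static int writeBatch(List<String> rows) {
        int written = 0;
        for (String row : rows) {
            written++; // a real job would execute an INSERT for `row` here
        }
        return written;
    }

    // Splits rows round-robin across `threads` workers and returns the total written.
    static int writeInParallel(List<String> rows, int threads) {
        List<List<String>> slices = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            slices.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            slices.get(i % threads).add(rows.get(i));
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<String> slice : slices) {
                results.add(pool.submit(() -> writeBatch(slice)));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // wait for each worker to finish
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add("row-" + i);
        }
        System.out.println(writeInParallel(rows, 4));
    }
}
```

Raising `threads` here only adds workers; whether it adds throughput depends on what sits behind `writeBatch`, which is why the number needs to be benchmarked against the actual database.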
Pipeline parallelization uses hash keys to bucket your data into partitions and processes the data partition by partition. Even though it sounds interesting, many operations require the data to be recollected and then repartitioned. In your example below, to output the data into the tFileOutputDelimited, you need to recollect the data so it is written in the right order based on your tSortRow. You will need to benchmark the performance. Parallelization doesn't mean faster processing! In general, it should! But you are adding extra overhead, potentially at each component, for the job to bucket, partition, and departition the data. Hence, this can sometimes have an adverse outcome, i.e. your job runs slower for small data sets. It depends on memory, CPU cores, the CPU cycles available, the actual data in each run of the job (i.e. how many partitions are created), and the components/operations in your job logic.
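The partition, process, recollect cycle can be sketched in plain Java as follows. This is only an illustration of the technique, not Talend's implementation; `PartitionSketch` and its methods are hypothetical names, and integer keys stand in for rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

class PartitionSketch {
    // Buckets each key into one of `partitions` lists based on its hash.
    static List<List<Integer>> partition(List<Integer> keys, int partitions) {
        List<List<Integer>> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Integer key : keys) {
            buckets.get(Math.floorMod(key.hashCode(), partitions)).add(key);
        }
        return buckets;
    }

    // Sorts each partition in parallel, then recollects the rows and sorts
    // again, because per-partition order does not give a global order.
    static List<Integer> sortViaPartitions(List<Integer> keys, int partitions) {
        List<List<Integer>> sortedBuckets = partition(keys, partitions)
                .parallelStream()
                .map(bucket -> bucket.stream().sorted().collect(Collectors.toList()))
                .collect(Collectors.toList());

        List<Integer> recollected = new ArrayList<>();
        for (List<Integer> bucket : sortedBuckets) {
            recollected.addAll(bucket);
        }
        Collections.sort(recollected); // the extra step a sorted output forces
        return recollected;
    }
}
```

The final `Collections.sort` is the overhead in question: the bucketing and the recollection both cost time, and for a small input they can outweigh whatever the parallel step saved.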
Another way to parallelize is to create multiple tasks in TAC and run them at the same time. I generally prefer this approach, in combination with multi-thread execution. I build my job to work on ranges of data, then deploy many instances of this job as tasks in TAC conductor, each task working on a different range. I also build intelligence into my jobs so they know how to restart and how far they have reached in the data processing. This way the jobs can be re-run and they will pick up where they left off.
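A rough sketch of the restartable range idea, assuming each task records the last processed id in its own checkpoint file. The names (`RangeJobSketch`, `resumePoint`, `run`) and the file-based checkpoint are assumptions for illustration; a real job might keep this state in a database table, and would likely checkpoint per batch rather than per row.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class RangeJobSketch {
    // Returns the next id to process: one past the checkpoint if it exists,
    // otherwise the start of the range assigned to this task.
    static long resumePoint(Path checkpoint, long rangeStart) {
        try {
            if (!Files.exists(checkpoint)) {
                return rangeStart;
            }
            String saved = new String(Files.readAllBytes(checkpoint), StandardCharsets.UTF_8).trim();
            return Long.parseLong(saved) + 1;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Processes ids up to rangeEnd (inclusive), stopping early after maxRows
    // to simulate an interrupted run. Returns the number of ids processed.
    static long run(Path checkpoint, long rangeStart, long rangeEnd, long maxRows) {
        long processed = 0;
        long id = resumePoint(checkpoint, rangeStart);
        try {
            while (id <= rangeEnd && processed < maxRows) {
                // a real job would process the row with this id here
                Files.write(checkpoint, Long.toString(id).getBytes(StandardCharsets.UTF_8));
                processed++;
                id++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return processed;
    }
}
```

Each TAC task would call something like `run` with its own range and its own checkpoint path, so a re-run of a failed task continues from its saved position without touching the ranges owned by other tasks.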
