Anonymous
Not applicable

parallel executions

I need to read data from 4 tables, each hosted in a separate DB. I want this to happen in parallel.
1. Should I create 4 DB connections, one for each DB, and then run 4 separate jobs? Bad option!!
2. Does context help me somewhere?

It would be good if I could have just one job and configure it to run against the different DB connections in parallel.
15 Replies
Anonymous
Not applicable
Author

Hi aviator,
do you have to merge the data from these 4 tables, or are you going to make 4 different jobs?
bye
Anonymous
Not applicable
Author

There is no general answer to your question with only the information you provided.
If you want to process the data from the 4 sources in one flow, then there is no way to run the operations in parallel (they will run semi-parallel, i.e. interleaved).
If you retrieve the data, cache it somewhere and process it afterwards, the data fetch can be done in parallel with 4 different unconnected flows in one job (if you enable the "Multi thread execution" option).
You need to keep in mind that access to global variables and subjob execution is synchronized, which limits how much you can multi-thread and parallelize without running totally independent processes.
To understand what you are looking for, you need to provide some additional details on what you want to process and what you expect from "parallel".
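To illustrate the point about synchronized global variables, here is a minimal Java sketch (this is not Talend-generated code; the shared map, the source names and the "fetch" step are made up for illustration). Four independent fetch tasks run in parallel, and the synchronized map, standing in for Talend's globalMap, is the one place where the threads serialize:

```java
import java.util.*;
import java.util.concurrent.*;

public class ParallelFetch {
    // stands in for Talend's globalMap: access is synchronized,
    // so all worker threads queue here even though the fetches themselves
    // run concurrently
    static final Map<String, Integer> globalMap =
        Collections.synchronizedMap(new HashMap<>());

    // placeholder for a real database read
    static int fetch(String source) {
        int rows = source.length() * 10;
        globalMap.put(source, rows);   // the synchronized bottleneck
        return rows;
    }

    // run one fetch task per source, all in parallel, and wait for them
    public static Map<String, Integer> runAll(List<String> sources) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(sources.size());
        List<Future<Integer>> futures = new ArrayList<>();
        for (String s : sources) {
            futures.add(pool.submit(() -> fetch(s)));
        }
        for (Future<Integer> f : futures) f.get();  // join all fetches
        pool.shutdown();
        return globalMap;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(Arrays.asList("db1", "db2", "db3", "db4")));
    }
}
```

The fetches overlap freely; only the writes into the shared map are serialized, which is why independent processes (instead of threads) remove the last bottleneck.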
Anonymous
Not applicable
Author

You can use the new 2.4 feature "iterate parallel" (see an example on my blog: Parallel executions on iterate links). Describe your database connections in a text file, one connection per line, then build for example: tFileInputDelimited --row--> tFlowToIterate --iterate--> tOracleInput --row--> tFileOutputDelimited.
You need to keep in mind that access to global variables and subjob execution is synchronized, which limits how much you can multi-thread and parallelize without running totally independent processes.

It's not deeply documented yet, but what's true for Java is not true for Perl. As described in another post on my blog, multithreading for Perl jobs, Java uses threads for parallelization while Perl uses processes. As a consequence, Perl parallel subjobs share nothing; this avoids problems but also brings its own kind of limitations.
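The iterate-parallel pattern above can be sketched in plain Java (a rough sketch only: the connection strings are made up, and the extract step merely mimics what a tOracleInput would do per iteration). Each line of the connections file becomes one task on its own worker thread:

```java
import java.util.*;
import java.util.concurrent.*;

public class IterateParallel {
    // stands in for the per-iteration subjob (e.g. a tOracleInput read)
    static String extract(String connLine) {
        return connLine + " -> extracted";
    }

    // one worker thread per connection line, as with a parallel iterate link
    public static List<String> run(List<String> connectionLines) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(connectionLines.size());
        List<Future<String>> futures = new ArrayList<>();
        for (String line : connectionLines) {
            futures.add(pool.submit(() -> extract(line)));
        }
        List<String> out = new ArrayList<>();
        for (Future<String> f : futures) out.add(f.get());  // collect in order
        pool.shutdown();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // this list plays the role of the delimited connections file
        List<String> lines = Arrays.asList(
            "jdbc:oracle:thin:@host1:1521:db1",
            "jdbc:oracle:thin:@host2:1521:db2");
        run(lines).forEach(System.out::println);
    }
}
```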
Anonymous
Not applicable
Author

How would you get the data streams together afterwards?
If I have to cache the results in a file, I think I can do it more simply with one component per connection.
Maybe Perl does much better with parallel iterate links. With Java I'm really struggling hard; I've created some bug reports and already got interesting feedback about the limitations.
Anonymous
Not applicable
Author

How would you get the data streams together afterwards?

You have to use a concurrency-compatible kind of output, a database for example, but not a single file. You often don't need to merge the streams; the flows are usually independent when talking about iterate links (say you have 1000 files to load into a database table, or 1000 files to read, transform and output as 1000 corresponding files).
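A minimal Java sketch of a concurrency-compatible sink (the file names and the "load" step are placeholders, not real Talend components): the parallel workers append to a thread-safe queue instead of sharing a single plain file handle, just as a database serializes concurrent inserts itself:

```java
import java.util.*;
import java.util.concurrent.*;

public class ConcurrentOutput {
    public static Queue<String> loadAll(List<String> files) throws Exception {
        // thread-safe sink: safe to write from several workers at once,
        // unlike a single shared file handle
        Queue<String> sink = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<?>> done = new ArrayList<>();
        for (String f : files) {
            done.add(pool.submit(() -> sink.add(f + ":loaded")));
        }
        for (Future<?> d : done) d.get();  // wait until every file is loaded
        pool.shutdown();
        return sink;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(loadAll(Arrays.asList("a.csv", "b.csv", "c.csv")).size());
    }
}
```

Note the results arrive in no particular order, which is exactly why this only works when the flows are independent.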
Anonymous
Not applicable
Author

It would be really nice to see some practice-relevant examples in a blog or an article (and not just about parallelism). I checked your link; it is very brief and straightforward.
I think TOS has lots of potential, but it is sometimes difficult to have the right idea to solve issues efficiently, especially when different aspects come together and the jobs get a bit more complicated.
For your example: if I have a list of files and need some additional information from the database, can I use an additional iterate link inside the parallelism?
If I do the transformation with a tMap inside the parallelism, will the lookup be loaded several times?
Would Java and Perl behave differently in these two cases?
I usually spend hours trying out several scenarios to answer questions like these, and I may not even be able to explore all the possibilities because I just don't get the right idea of how to do it best. Browsing through some best practices would be extremely helpful, I think.
Anonymous
Not applicable
Author

My problem statement:
I have a job A which reads data from an input database D1, does some aggregation (in a subjob), and writes the result into another table in Datawarehouse-1.
Now this job has to be performed for different input databases (D1, D2, D3, D4), with the aggregated data written into the same Datawarehouse-1.
Question:
Do I need to write the same job 4 times, configuring each job for one of the input DBs?
And then have 4 tRunJobs in a separate job, pointing to each of the 4 jobs created, and enable the multithreading option so that they can run in parallel?
(That's a pretty easy and obvious option.)
OR
Can I configure the same job for the 4 different input DBs and make them run in parallel?
Am I making sense?
Anonymous
Not applicable
Author

Also, if I have 4 tRunJobs like in the image, with the multithreading option ON:
does that mean they are running in parallel?
I hope that's not a stupid question.
Anonymous
Not applicable
Author

The answer to your question depends on the language you choose for your jobs.
With Perl you can implement it the way plegall described, and you would not need to handle the connections separately. It works in Perl because the different iterations run in separate processes. I've never built a job in Perl, but I think plegall knows very well how it works.
If you select Java, you can call the same child job from different threads, but they will not run in parallel because they are synchronized.
If you build 4 completely separate jobs, they will run in parallel (access to global variables is still synchronized between the jobs, so you cannot expect 4 times the performance).
The jobs in the picture in your other post will run in parallel (3 threads) if:
A: the language you use is Perl
B: your language is Java and all 6 jobs are different. If any job is shared, execution of that job will be sequenced, while the rest still run in parallel (with the exception of the synchronized sections like global variable access).
The question is not stupid; I spent hours trying to find out what works better and what does not work at all.
If you want maximum parallelism in Java: build your job for one data source, put the DB configuration into a context, and execute the java command 4 times with the different contexts. This should rarely be necessary, I think.
The choice is up to you. You have several options.
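The "one java command per context" approach can be sketched as follows (a rough sketch only: the jar name, class name and context flag are hypothetical; a real exported Talend job ships with its own launcher script and context syntax). Each context gets its own OS process, so nothing is shared or synchronized between the four runs:

```java
import java.util.*;

public class LaunchPerContext {
    // build one launch command per context; names here are illustrative
    public static List<List<String>> commands(List<String> contexts) {
        List<List<String>> cmds = new ArrayList<>();
        for (String ctx : contexts) {
            cmds.add(Arrays.asList(
                "java", "-cp", "job_a.jar", "project.job_a.JobA",
                "--context=" + ctx));
        }
        return cmds;
    }

    public static void main(String[] args) {
        for (List<String> cmd : commands(Arrays.asList("D1", "D2", "D3", "D4"))) {
            System.out.println(String.join(" ", cmd));
            // to actually launch each run as a separate OS process:
            // new ProcessBuilder(cmd).inheritIO().start();
        }
    }
}
```

Because these are processes rather than threads, even the global-variable synchronization mentioned above no longer applies, at the cost of starting four JVMs.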