Partitioning in Talend?

ankit7359 · ‎2018-11-20

Hi,

I was going through the Talend documentation and came across "Set Parallelization".

I read the documentation on "Set Parallelization", but I didn't really understand much of it.

Can anyone please help me with this?

Also, when I click on "Set Parallelization", I get the message "No Need for this Job". May I know why?

I also see that there are four related components: tCollector, tPartitioner, tDepartitioner and tRecollector. Do they correspond to the steps for implementing partitioning?

In the row settings, can anyone explain the Basic settings, Advanced settings, Breakpoint and Parallelization tabs?

How do I use the Parallelization tab in the row settings?

How do I configure them?

There is also a demo scenario that I have tried to implement.

The scenario is to load the input, together with the count of the records, to the output.

I ran into two things while doing this. When returning the count of the records, I see a warning on top of tAggregateRow which says: "the partitioning keys should be same with its partitioning connection".

And once I remove the count function and run the job, it executes successfully, but I don't see the output.

Can anyone please help?

Thanks in advance,

Ankit

[Attached screenshots: tFixedFlowInput; Job view; row1 schema settings; row1 advanced settings; row1 parallelization settings; tAggregateRow settings; Problems tab in Preferences; End_result in tLogRow; Breakpoint settings]

Anonymous · ‎2018-11-20

These components (i.e. tPartitioner, etc.) let you break up a large record set into chunks so you can process it in parallel. Basically, the components handle the "bookkeeping" involved in splitting up the records; that's why the workflow is somewhat complex.
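To make the idea concrete, here is a minimal plain-Java sketch of the split, process and regroup pattern. It is only an illustration of the concept, not Talend's generated code or the components' actual implementation; the class name `PartitionSketch`, its methods and the upper-casing step are all made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not part of Talend.
class PartitionSketch {

    // "Partition" step: split the records into `parts` chunks by hashing each record.
    static List<List<String>> partition(List<String> records, int parts) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            chunks.add(new ArrayList<>());
        }
        for (String r : records) {
            chunks.get(Math.floorMod(r.hashCode(), parts)).add(r);
        }
        return chunks;
    }

    // Process every chunk on its own thread, then gather the results into one list.
    // The merged list is not guaranteed to keep the input order.
    static List<String> processInParallel(List<String> records, int parts) {
        ExecutorService pool = Executors.newFixedThreadPool(parts);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (List<String> chunk : partition(records, parts)) {
                Callable<List<String>> task = () -> {
                    List<String> out = new ArrayList<>();
                    for (String r : chunk) {
                        out.add(r.toUpperCase()); // stand-in for the real per-row work
                    }
                    return out;
                };
                futures.add(pool.submit(task));
            }
            // "Regroup" step: wait for each chunk and append its rows.
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) {
                merged.addAll(f.get());
            }
            return merged;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(processInParallel(Arrays.asList("a", "b", "c", "d", "e"), 2));
    }
}
```

Even in this toy version, most of the code is bookkeeping (building chunks, tracking futures, merging) rather than the per-row work itself.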

What problem are you trying to solve with these components?

The 2 errors you're getting are type conversion errors; you can probably fix these by updating your schema.

ankit7359 · ‎2018-11-20

hi @DVSCHWAB,

Please note that the warnings in the Problems tab are for different jobs; the job in question is named files30.

What I basically want is a clear explanation of this.

Please help.

Thanks in advance

Ankit

Anonymous · ‎2018-11-20

I am not convinced by parallelization within a flow. It causes a lot of complexity that you will never be aware of, and there are many use cases in which this kind of parallelization can never work. Your example is one of them.

The aggregation cannot be done in parallel unless the key column is used as the partitioning key, but where would you set that up?
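A small plain-Java sketch may show why the key matters. It is an illustration under assumptions, not how tAggregateRow or the partitioning components are implemented; the class `KeyPartitionedCount` and its methods are hypothetical. It compares hash partitioning on the group-by key with round-robin partitioning and counts rows per key inside each partition.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: not part of Talend.
class KeyPartitionedCount {

    private static List<List<String>> emptyPartitions(int parts) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < parts; i++) {
            result.add(new ArrayList<>());
        }
        return result;
    }

    // Hash partitioning on the key: every row with the same key lands in the same partition.
    static List<List<String>> partitionByKey(List<String> keys, int parts) {
        List<List<String>> result = emptyPartitions(parts);
        for (String k : keys) {
            result.get(Math.floorMod(k.hashCode(), parts)).add(k);
        }
        return result;
    }

    // Round-robin partitioning: rows are dealt out in turn, so one key can be split up.
    static List<List<String>> partitionRoundRobin(List<String> keys, int parts) {
        List<List<String>> result = emptyPartitions(parts);
        for (int i = 0; i < keys.size(); i++) {
            result.get(i % parts).add(keys.get(i));
        }
        return result;
    }

    // Count per key inside one partition, i.e. what one parallel aggregate would see.
    static Map<String, Integer> countPerKey(List<String> partition) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String k : partition) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // Number of partitions in which the given key occurs at least once.
    static int partitionsContaining(List<List<String>> partitions, String key) {
        int n = 0;
        for (List<String> p : partitions) {
            if (p.contains(key)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("x", "x", "y", "y", "x", "y");
        for (List<String> p : partitionByKey(keys, 2)) {
            System.out.println("by key:      " + countPerKey(p));
        }
        for (List<String> p : partitionRoundRobin(keys, 2)) {
            System.out.println("round-robin: " + countPerKey(p));
        }
    }
}
```

With key-based partitioning each key is counted in exactly one place, so the per-partition counts are already final. With round-robin the same key shows up in several partitions with partial counts, and those would need a second merge step before the result is correct.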

I generally recommend running almost nothing in parallel inside a job, except calling other jobs through the parallel setting of the Iterate flow.
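As a rough plain-Java analogy for that coarse-grained approach (a sketch only, not what Talend generates for a parallel Iterate link), each independent unit of work runs as its own task on a bounded thread pool. `ParallelJobCalls`, `runChildJob` and the file names are hypothetical placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only: not part of Talend.
class ParallelJobCalls {

    // Placeholder for one self-contained child job run; here it just returns the name length.
    static int runChildJob(String input) {
        return input.length();
    }

    // Run one child job per input, with at most `maxParallel` running at the same time.
    static Map<String, Integer> runAll(List<String> inputs, int maxParallel) {
        ExecutorService pool = Executors.newFixedThreadPool(maxParallel);
        try {
            Map<String, Future<Integer>> pending = new LinkedHashMap<>();
            for (String input : inputs) {
                Callable<Integer> task = () -> runChildJob(input);
                pending.put(input, pool.submit(task));
            }
            // Collect results in submission order once each job has finished.
            Map<String, Integer> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<Integer>> e : pending.entrySet()) {
                results.put(e.getKey(), e.getValue().get());
            }
            return results;
        } catch (InterruptedException | ExecutionException ex) {
            throw new RuntimeException(ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("a.csv", "bb.csv", "ccc.csv"), 2));
    }
}
```

Because every task works on its own input and shares no rows with the others, there is no partitioning key to get wrong.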

ankit7359 · ‎2018-11-21

Hi @lli,

Are you saying that parallelization never works for any job, or only that it fails for certain jobs?

If it works for certain jobs, how should I design my job so that I can use parallelization?

Is there any prerequisite for enabling "Set Parallelization" at the job level?

Please help.

Thanks in advance.

Ankit

Talend Data Integration

v7.x