topic Re: performance improvement in Talend Studio

performance improvement

RakeshKumar1 — Fri, 15 Nov 2024 23:39:09 GMT

I am reading 21 .dat files that hold 22 million records and writing them into two target tables, plus reject files after a schema check. This is a migration project, and we are trying to match the run time of the DataStage job, which is 1.3 minutes, whereas in Talend the job takes 2.6 minutes.
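To put the gap in numbers, here is a small sketch of the throughput arithmetic. It assumes the 22 million records are the total across all 21 files; the class and method names are made up for illustration. Under that assumption the Talend job runs at roughly 141,000 records per second, and matching DataStage means roughly 282,000 records per second.

```java
class ThroughputEstimate {
    // Records per second for a given record count and run time in minutes.
    static double recordsPerSecond(long records, double minutes) {
        return records / (minutes * 60.0);
    }

    public static void main(String[] args) {
        long records = 22_000_000L;    // assumed total across the 21 files
        double talendMinutes = 2.6;    // current Talend run time from the post
        double dataStageMinutes = 1.3; // DataStage run time to match

        System.out.printf("Talend:    %.0f records/s%n",
                recordsPerSecond(records, talendMinutes));
        System.out.printf("DataStage: %.0f records/s%n",
                recordsPerSecond(records, dataStageMinutes));
    }
}
```

Since the target run time is half the current one, the job has to roughly double its end-to-end throughput.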

Job design

1) tFileList (reading 21 files) --> tFileInputDelimited --> tMap1

2) tDBInput1 --> tMap1

3) tMap1 --> splits into 2 flows:

flow 1 --> tSchemaComplianceCheck1 --> tDBOutput1 and tFileOutputDelimited1

flow 2 --> tSchemaComplianceCheck2 --> tDBOutput2 and tFileOutputDelimited2

The configuration below is what gets the run time to 2.6 minutes:

1) In Iterate link - enabled parallel execution to 4

2) Fetch Size in tDBInput1 - 10000

3) tDBOutput1 and tDBOutput2:

BATCH_SIZE - 100000

COMMIT EVERY - 50000

Parallel Execution - 12

Can anyone please suggest steps to improve the performance?

Re: performance improvement

RakeshKumar1 — Fri, 08 Oct 2021 09:18:24 GMT

Could someone please help with this question? How can I improve the performance?

Re: performance improvement

gjeremy1617088143 — Fri, 08 Oct 2021 09:42:03 GMT

Hi @Rakesh Kumar,

You can try to allocate more memory to the JVM: Run tab --> Advanced settings --> Use specific JVM arguments

-Xms<number>M: memory allocated at the launch of the job

-Xmx<number>M: maximum memory allocated.
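To check that the settings actually took effect, you can print what the JVM reports about its heap. This is a minimal sketch using the standard java.lang.Runtime API; the class name is made up, and running similar lines from a tJava component inside the job is an assumption, not something confirmed in this thread. Argument values such as -Xms1024M -Xmx4096M are only illustrative.

```java
class HeapCheck {
    // Convert a byte count to whole megabytes.
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() reflects the -Xmx limit (or the JVM default if unset).
        System.out.println("Max heap (MB):       " + toMegabytes(rt.maxMemory()));
        // totalMemory() is the heap currently committed by the JVM.
        System.out.println("Committed heap (MB): " + toMegabytes(rt.totalMemory()));
        // freeMemory() is the unused part of the committed heap.
        System.out.println("Free heap (MB):      " + toMegabytes(rt.freeMemory()));
    }
}
```

If the reported maximum does not match the -Xmx value you set, the arguments are probably not being passed to the job.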

In tDBOutput, do you use Insert or Update?

Also, if the two tSchemaComplianceCheck components do the same checks, use a single one and make the split after it.

Send me Love and kudos

Re: performance improvement

Anonymous — Fri, 08 Oct 2021 12:00:36 GMT

The tDBInput1 looks like a lookup table. If the dataset is always the same, you can write its content into a tHashOutput beforehand and reuse it with a tHashInput for the actual lookup in tMap_1.
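The idea behind this is to load the lookup rows into memory once and then resolve every main-flow record against that in-memory copy. The sketch below shows that pattern in plain Java; it is not how tHashOutput/tHashInput are implemented, and the class name, method names and sample rows are all hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LookupCacheSketch {
    // Load the lookup rows once into an in-memory map (key -> value).
    static Map<String, String> buildLookup(List<String[]> lookupRows) {
        Map<String, String> lookup = new HashMap<>(Math.max(16, lookupRows.size() * 2));
        for (String[] row : lookupRows) {
            lookup.put(row[0], row[1]);
        }
        return lookup;
    }

    // Resolve one main-flow key against the cached lookup; fallback on a miss.
    static String resolve(Map<String, String> lookup, String key, String fallback) {
        return lookup.getOrDefault(key, fallback);
    }

    public static void main(String[] args) {
        // Hypothetical rows standing in for the tDBInput1 result set.
        List<String[]> lookupRows = Arrays.asList(
                new String[] {"C001", "Retail"},
                new String[] {"C002", "Wholesale"});
        Map<String, String> lookup = buildLookup(lookupRows);

        // Hypothetical keys standing in for records read from the .dat files.
        for (String key : new String[] {"C001", "C002", "C999"}) {
            System.out.println(key + " -> " + resolve(lookup, key, "UNKNOWN"));
        }
    }
}
```

The map is built once, so each of the 22 million records costs only a hash lookup rather than another trip to the database.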

Re: performance improvement

Anonymous — Fri, 08 Oct 2021 12:01:49 GMT

Parallelisation in tDBOutput is often not a performance win; it can even kill performance because of deadlocks.