What are Performance Tuning steps one should follo... - Qlik Community

Anonymous · ‎2013-07-09

Hello All,
When dealing with huge data sets, What are the performance improvements steps you follow when developing Talend DI Job?
Need your expert inputs.

Anonymous · ‎2013-07-09

Hi,
I think it depends on job design.
1. If there is t<DB>output in job design, we usually suggested use "commit every" option to improve job performance.
2. If you DB support for bulk mode, talend also provide bulk component for user.
3. sql query is faster than tMap, because sql query is executed on DB engine(If your job refer to main, look up on huge data). In addition, ELT component is also a good option.
4. "use connection option" in t<DB>connection affect the job performance, for the batch mode is no longer activated and without this option performance significantly drops.
Hope it will be useful.
Best regards
Sabrina

Anonymous · ‎2013-07-09

Hi Sabrina,
your first hint is - on my opinion - totally wrong. Committing every single dataset is a performance pitfall!
I would also suggest to use the Cursor size greater then 1000 (or switch on the Streaming mode for MySQL) in input components.
The hint with joins in the database is also not a good rule of thumb. It depends on how large is the main table and how large is the lookup table. I would always recommend to do the lookup within them tMap is the amount of lookup datasets is not larger than 5 Mio datasets (per lookup). Joining huge tables with (even smaller) tables in the database can cause various problems like overrunning the shared memory area or overload the temporary tablespaces and so on.
ELT is a good option but has the disadvantage like all other database internal procedures: you never see any progress and therefore it feels much slower than doing that in a ETL mode (probably a bit slower) but with the option to be able to calculate the duration time. Mostly it is not the goal to get the last percent of performance, mostly it is more important to be able to calculate the needed duration time.
For huge datasets it is always a good thing to separate the datasets into smaller groups (like a partitioning key) and proceed the groups in parallel jobs.

Anonymous · ‎2013-07-10

Really it is Great TOPIC to discuss
-- 1. If there is t<DB>output in job design, we usually suggested use "commit every" option to improve job performance
Not sure what was exactly meant by that
By default it is 10000
To improve performance in Oracle (in Amazon)
I put to it smaller things like 50,100 for 2 reasons:
1) Transaction was too big if 10000 - and it affect performance badly - really it was NOT able to work at all
2) When 50 - I can see progress by queries like Select count(*) - to see speed
One more thing - let us say from 10 000 entries one is bad
if 10 000 - then whole transaction will fail
if 50 - then latest portion failed (previous will be saved)
It depends - I prefered 2-nd option.
ps I would read some best practise on performance
if there are articles - pls give references to them

RajeevKumarMeh · ‎2018-08-28

Here are some guidelines I have followed :

(1) Load data in chunks : Use cursor size while selecting data , it help in reducing memory requirement for job . Use commit size say 10K at output component . Keeping uncommitted data will need too much memory while job processing . Setting commit size too low like 50 or 100 will lead to many database hit .

(2). Avoid heavy lookups ,instead join the lookup table in source input component (+setting of cursor size).

(3). Create index on join attributes before data processing

(4). Load and Process only incremental data wherever possible

(5). Gather tables and index statistics , please have a scheduler which will gather stats in off business time

(6). Load jobs in parallel after identifying dependencies among them

.......and many more

Jesperrekuh · ‎2018-08-28

My 2 cents on this:

1 - Reduce the amount of needed data... if you want to join mln of records with mln of records, focus on having the matched correct output and manage the amount of (un)needed columns, minimum required columns... varchar/strings are memory drainers.

2 - And for matching I sometimes upload the (lookup) data to the database and join it in the database and then get the output. Use your db engine wise.

3 - On the hardware part, No wifi connections, prefer cable...

4 - Insert statements are the best performing, Insert/Update... you can plan your trip to mars, dickup some sand and get back and still have time for coffee

What are Performance Tuning steps one should follow?

Talend Data Integration

v5.x