<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: SQL in a tflowtoiterate performance degrading in Talend Studio</title>
    <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245668#M31446</link>
    <description>Here are the requested screenshots.&lt;BR /&gt;The red arrow represents the process that is degrading fast.&lt;BR /&gt;The black arrow leads to the (a), (b), (c) portion, which is irrelevant since it's very fast.&lt;BR /&gt;The green arrow is an OnSubJobOk that will finalize the ETL.</description>
    <pubDate>Fri, 30 Nov 2012 16:02:10 GMT</pubDate>
    <dc:creator>manueld</dc:creator>
    <dc:date>2012-11-30T16:02:10Z</dc:date>
    <item>
      <title>SQL in a tflowtoiterate performance degrading</title>
      <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245666#M31444</link>
      <description>Hello all, 
&lt;BR /&gt;Using Open Studio TOS_DI-Win32-r78327-V5.0.2 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;Context of the source 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;I'm stuck with a poorly designed PostgreSQL 8.3 table: 
&lt;BR /&gt;- 14 million rows covering only the last 3 months 
&lt;BR /&gt;- typically between 150k and 200k rows per business day, less on weekends 
&lt;BR /&gt;- 55 columns and 23 indexes 
&lt;BR /&gt;- vacuum is not well tuned (if tuned at all) 
&lt;BR /&gt;- every day, data is removed and added to keep only 3 months 
&lt;BR /&gt;- no index rebuilds, with enormous holes at the beginning of the table 
&lt;BR /&gt;Eventually all those design flaws will be addressed. Until then, I need to work with what I have. 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;The purpose of the ETL 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;I have to check data integrity on a daily basis, but sometimes past rows are 
&lt;BR /&gt;updated. Not many, but at least a few percent. Also, for the initial release of my sanity check, 
&lt;BR /&gt;I'll have to run the thing over 90 days. After that, I'll have to go back 10 days every day. 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;Solution so far 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;I tried to grab the whole thing in a single tInput... no luck there... so: 
&lt;BR /&gt;tInput ( generate dates from / to ) 
&lt;BR /&gt; --&amp;gt; tFlowToIterate 
&lt;BR /&gt; --&amp;gt; tInput ( with a dynamic where clause on an indexed timestamp - tried with buffer at 10k or 20k) 
&lt;BR /&gt; --&amp;gt; tOutput ( with a drop table if exists in a tmp schema - tried with buffer at 10k or 20k) 
&lt;BR /&gt; --&amp;gt; several other temp tables with aggregated content, then a few final small tHashOutput components 
&lt;BR /&gt;I don't use an existing connection; instead, I use context 
&lt;BR /&gt;variables for all the PostgreSQL components. 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;results 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;When the date range is only 1 day of processing, I can reach 3500 / 4000 rows per second, 
&lt;BR /&gt;which is fine, dealing with about a minute per day. That's my baseline. 
&lt;BR /&gt;The problem is that when I enlarge the date range (number of tFlowToIterate iterations goes up), performance degrades fast. 
&lt;BR /&gt;For 2 days, the first day is the same, but I drop to 1000 rows per second on the second. 
&lt;BR /&gt;On my last test over 6 days, the final day couldn't reach 500 rows per second. 
&lt;BR /&gt;I have not even tried the needed 10 days. 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;now what ! 
&lt;BR /&gt;---------------------- 
&lt;BR /&gt;I don't know what to do to improve the extract performance. 
&lt;BR /&gt;Should I call a complete subjob after the tFlowToIterate instead of keeping everything in the ETL? 
&lt;BR /&gt;There is surely a buffer that isn't being emptied, right? 
&lt;BR /&gt;Any help would be appreciated!! 
&lt;BR /&gt;Manuel</description>
      <pubDate>Fri, 30 Nov 2012 03:08:12 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245666#M31444</guid>
      <dc:creator>manueld</dc:creator>
      <dc:date>2012-11-30T03:08:12Z</dc:date>
    </item>
    <item>
      <title>Re: SQL in a tflowtoiterate performance degrading</title>
      <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245667#M31445</link>
      <description>Hi &lt;BR /&gt;Can you please upload some screenshots of the job, so that we can see what we can do to optimize the job design?</description>
      <pubDate>Fri, 30 Nov 2012 11:16:45 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245667#M31445</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-11-30T11:16:45Z</dc:date>
    </item>
    <item>
      <title>Re: SQL in a tflowtoiterate performance degrading</title>
      <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245668#M31446</link>
      <description>Here are the requested screenshots.&lt;BR /&gt;The red arrow represents the process that is degrading fast.&lt;BR /&gt;The black arrow leads to the (a), (b), (c) portion, which is irrelevant since it's very fast.&lt;BR /&gt;The green arrow is an OnSubJobOk that will finalize the ETL.</description>
      <pubDate>Fri, 30 Nov 2012 16:02:10 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245668#M31446</guid>
      <dc:creator>manueld</dc:creator>
      <dc:date>2012-11-30T16:02:10Z</dc:date>
    </item>
    <item>
      <title>Re: SQL in a tflowtoiterate performance degrading</title>
      <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245669#M31447</link>
      <description>Any ideas? Anyone?</description>
      <pubDate>Mon, 03 Dec 2012 19:07:28 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245669#M31447</guid>
      <dc:creator>manueld</dc:creator>
      <dc:date>2012-12-03T19:07:28Z</dc:date>
    </item>
    <item>
      <title>Re: SQL in a tflowtoiterate performance degrading</title>
      <link>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245670#M31448</link>
      <description>Hi 
&lt;BR /&gt;I have the following suggestions to improve the performance: 
&lt;BR /&gt;1. Enable parallel execution on the iterate link: click the iterate link between tFlowToIterate and the tJava component, check the box 'Enable parallel execution' and set the number of parallel executions. 
&lt;BR /&gt;2. Why do you use an iterate link after the tJava component? Do you really need an iterate? Use OnComponentOk instead of iterate. 
&lt;BR /&gt;3. Output the result to a temporary file instead of memory for large data sets. 
&lt;BR /&gt;Shong</description>
      <pubDate>Tue, 04 Dec 2012 04:54:45 GMT</pubDate>
      <guid>https://community.qlik.com/t5/Talend-Studio/SQL-in-a-tflowtoiterate-performance-degrading/m-p/2245670#M31448</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2012-12-04T04:54:45Z</dc:date>
    </item>
  </channel>
</rss>

