Anonymous
Not applicable

Performance issue with the design below

Gurus,
I'm new to Talend and stuck on a performance issue. Kindly help me fix it.
I have millions of records. There is no option to extract the data from a database; everything comes from files.
[job design screenshot: 0683p000009MDId.png]
>>Removing duplicates and retaining the record with the max date (using the tSortRow and tUniqRow components)
>>Used different filter conditions in tMap and tFilterRow.

The job failed with: Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>>Increased the JVM argument to -Xmx4096M.
But I still got the same error.
>>Wrote to a temp file in tMap and sorted on disk in tSortRow.
Got the same error.

My questions:
-->The sort is the main culprit. Are there other ways to sort the data (I don't have a staging database to sort in)?
-->I'm reading the same reference file twice, because I cannot route a single tFileInputDelimited to two tMap lookups. Is there any way to read the file only once?
-->How can the overall design be improved?
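Since Talend jobs compile to Java, the dedup step (keep only the record with the max date per key) doesn't strictly need a full sort: a single pass with a map keyed on the duplicate key uses memory proportional to the number of distinct keys, not the total row count. A minimal sketch, where the delimiter, column positions, and ISO date format are assumptions for illustration:

```java
import java.time.LocalDate;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MaxDateDedup {
    // Assumed row layout: key in column 0, ISO date (yyyy-MM-dd) in column 1.
    // One streaming pass: for each key, retain only the row with the latest date.
    static Map<String, String[]> dedup(Iterable<String> lines) {
        Map<String, String[]> latest = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split(";", -1);
            String key = cols[0];
            LocalDate date = LocalDate.parse(cols[1]);
            String[] prev = latest.get(key);
            if (prev == null || date.isAfter(LocalDate.parse(prev[1]))) {
                latest.put(key, cols);   // newer date wins
            }
        }
        return latest;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "A1;2020-01-05;x",
            "A1;2020-03-01;y",   // newer date for A1, so this row is kept
            "B7;2019-12-31;z");
        dedup(rows).values().forEach(c -> System.out.println(String.join(";", c)));
    }
}
```

If the number of distinct keys is itself too large for memory, this doesn't apply and a disk-based sort is still needed, but for typical key cardinalities it sidesteps the tSortRow memory pressure entirely.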
Any guidance would be greatly appreciated.
Thanks
11 Replies
Anonymous
Not applicable
Author

I am struggling with your job design. tMap_3 and tMap_4 have no outputs at all and are therefore useless.
Anonymous
Not applicable
Author

The reference files have 8 columns. I'm applying some filters and removing a few columns in tMap_3 & tMap_4 (I cannot hold all the unwanted columns in the lookup buffer). This could also be integrated into tMap_1 and tMap_2, but for the sake of debugging (trace counts on each link) I used tMap_3 & tMap_4. Do you think that will affect performance?
Thanks
lvsiva
Contributor

Hi rajmhn,
You don't need tMap_3 and tMap_4; you can filter those columns in tMap_1 & tMap_2 and enable the "sort on disk" option in tSortRow's advanced settings.
You can also try removing the tFilterRow step and doing the filtering in tMap (I don't expect a performance gain, but it's worth trying once).
Thanks,
Siva.
Anonymous
Not applicable
Author

Thanks Siva.
You don't need tMap_3 and tMap_4; you can filter those columns in tMap_1 & tMap_2
>>The reference has 8 columns, and only a few are required. I cannot take all the columns into the tMap lookup buffer; that's why I used the extra tMaps, and I'm also filtering records on a few conditions (though that could be implemented in tMap_1 & tMap_2). So I incorporated both functions into tMap_3 & tMap_4.
Do you think it can be accomplished without tMap_3 & tMap_4?
"sort on disk" option in tSortRow's advanced settings
>>I already enabled it.
The job was very resource-consuming. I'm getting 3 million records from the source and 2.5 million from each reference. With -Xmx16384M allocated, the job completed in 6 minutes.
One general question: what sort algorithm does tSortRow use?
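For context on what "sort on disk" implies: when a data set no longer fits in memory, the classic technique is an external merge sort, i.e. sort chunks in memory, spill each to a temp file, then k-way merge the sorted chunks. Whether tSortRow implements exactly this internally isn't stated anywhere in this thread; the sketch below (chunk size and temp-file handling are arbitrary choices) only illustrates the general technique:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class ExternalSort {
    // One open chunk file plus its current (smallest unread) line.
    static final class Head {
        String line; final BufferedReader reader;
        Head(BufferedReader r) throws IOException { reader = r; line = r.readLine(); }
    }

    // Phase 1: read the input in chunks that fit in memory,
    // sort each chunk, spill it to a sorted temp file.
    static List<Path> spillSortedChunks(BufferedReader in, int chunkSize) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buf = new ArrayList<>(chunkSize);
        for (String line; (line = in.readLine()) != null; ) {
            buf.add(line);
            if (buf.size() == chunkSize) { chunks.add(spill(buf)); buf.clear(); }
        }
        if (!buf.isEmpty()) chunks.add(spill(buf));
        return chunks;
    }

    private static Path spill(List<String> buf) throws IOException {
        Collections.sort(buf);                              // in-memory sort of one chunk
        Path tmp = Files.createTempFile("sort-chunk", ".txt");
        Files.write(tmp, buf);
        return tmp;
    }

    // Phase 2: k-way merge - a priority queue always yields the
    // smallest head line across all open chunk files.
    static List<String> merge(List<Path> chunks) throws IOException {
        PriorityQueue<Head> pq = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        for (Path p : chunks) {
            Head h = new Head(Files.newBufferedReader(p));
            if (h.line != null) pq.add(h);
        }
        List<String> out = new ArrayList<>();               // a real job would stream this out
        while (!pq.isEmpty()) {
            Head h = pq.poll();
            out.add(h.line);
            if ((h.line = h.reader.readLine()) != null) pq.add(h);
            else h.reader.close();
        }
        return out;
    }

    // Convenience wrapper: a tiny chunk size forces spilling even on small inputs.
    static List<String> sort(List<String> lines, int chunkSize) throws IOException {
        BufferedReader in = new BufferedReader(new StringReader(String.join("\n", lines)));
        return merge(spillSortedChunks(in, chunkSize));
    }
}
```

Only one chunk's worth of rows is ever held in memory at a time, which is why a disk-based sort trades speed for a bounded heap.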
Thanks
Anonymous
Not applicable
Author

Could someone please help me out?
Anonymous
Not applicable
Author

Hi rajmhn,
The reference has 8 columns and only a few are required. I cannot take all the columns into the tMap buffer; that's why I used tMap, and I'm also filtering records on a few conditions (though that could be implemented in tMap_1 & tMap_2). So I incorporated both functions into tMap_3 & tMap_4.

It can be achieved in tMap_1 and tMap_2 without using tMap_3 and tMap_4.
What is the current row rate (rows/s) during data processing?
Best regards
Sabrina
Anonymous
Not applicable
Author

Thanks Sabrina.
It can be achieved in tMap_1 and tMap_2 without using tmap3 and tmap4.
>>I have 8 columns: A, B, C, D, E, F, G, H. I'm filtering the records on C, D, E, F, G (tMap_3 & tMap_4) and taking only A, B, H into the reference buffer (tMap_1 & tMap_2). It could be accomplished without tMap_3 & tMap_4, but at the cost of taking all the columns A through H into the reference buffer. Correct me if I'm wrong.
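The column-trimming idea described here can be sketched in plain Java: filter each reference row on the condition columns while reading, and buffer only the columns the join needs, so the lookup map never holds the unneeded fields. The semicolon-delimited A–H layout and the "ACTIVE" filter condition below are assumptions, not the actual job's rules:

```java
import java.util.HashMap;
import java.util.Map;

public class SlimLookup {
    // Assumed reference row layout: A;B;C;D;E;F;G;H
    // Filter on a condition column during the read, then keep only
    // A (key), B and H (payload) in the lookup buffer.
    static Map<String, String[]> buildLookup(Iterable<String> referenceRows) {
        Map<String, String[]> lookup = new HashMap<>();
        for (String row : referenceRows) {
            String[] c = row.split(";", -1);
            if (!"ACTIVE".equals(c[2])) continue;          // illustrative filter on column C
            lookup.put(c[0], new String[] { c[1], c[7] }); // buffer only B and H under key A
        }
        return lookup;
    }
}
```

This is effectively what collapsing tMap_3/tMap_4 into tMap_1/tMap_2 does: the filter and the projection both run before anything reaches the lookup buffer, so the buffer never has to hold columns C through G.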
What's the current rows/s during the data processing(row rate)?
>>It was around 5000 rows/sec
Solutions to consider:
>>Split the job into two: one up to tMap_2, and another for sorting and removing duplicates.
>>Write temp data to disk and assign less JVM Xmx memory.
>>Assign more JVM Xmx memory.
Which one would be the most feasible?
Thanks
Anonymous
Not applicable
Author

I am struggling with your job design. tMap_3 and tMap_4 have no outputs at all and are therefore useless.

Thanks. I have 8 columns: A, B, C, D, E, F, G, H. I'm filtering the records on C, D, E, F, G (tMap_3 & tMap_4) and taking only A, B, H into the reference buffer (tMap_1 & tMap_2). It could be accomplished without tMap_3 & tMap_4, but at the cost of taking all the columns A through H into the reference buffer. Correct me if I'm wrong.
Anonymous
Not applicable
Author

Hi,
tMap is a cache component that consumes a lot of memory. For a large data set, try storing the data on disk. Are you still getting an OutOfMemory error on your end?
Have you already checked the documentation: TalendHelpCenter: Exception outOfMemory?
Would you mind uploading screenshots of your tMap_3 and tMap_4 map editors to the forum?
Best regards
Sabrina