tRuleSurvivorship & tMatchGroup performance issue

npatel · 2015-07-16

Hello,
As part of our ETL import, we want to identify duplicates in the input file. We are using tMatchGroup and tRuleSurvivorship to achieve this, and have succeeded in identifying duplicates and creating a new survivor row for each duplicate group.
When running this job on TAC, we are facing performance issues with these components. A file with 2600 records processed successfully but was sluggish (it took 5 minutes). When we run a file with 120K records, the job gets stuck on the subjob that contains tMatchGroup and tRuleSurvivorship and doesn't process the data at all.
We cannot set up parallelization on this subjob because of these components. After adding a level of logging, we have confirmed that these components are the bottleneck. Can someone suggest how to improve their performance?
We are using Talend Platform for Big Data 5.5.1.r118616. The JVM parameters for this job on TAC are set to -Xms1024M and -Xmx24576M.
Any advice on improving performance, or on a workaround for this logic, would be highly appreciated.
Thanks in advance.

Anonymous · 2015-07-20

Hi npatel,
Could you please open a ticket on the Talend Support Portal?
That way, we can give you remote assistance with your performance issue through the support cycle, with priority.
https://support.talend.com/otrs/customer.pl
Best regards
Sabrina

Data Quality

v5.x