Performance of running deduplication using tMatchGroup and Swoosh

Luis_Simoes — Sat, 16 Nov 2024 03:15:29 GMT

Hi,

Is anyone using tMatchGroup to deduplicate customer master data on Talend?

I am currently running the Swoosh algorithm on a dataset of 2M rows and I keep getting out-of-memory errors.

My server is an Azure VM with 16 GB of RAM and 4 vCores, which run at close to 100% for the whole processing time.

I am wondering why the performance is so poor, and how I will be able to scale this, since I still have 13 sources to add to the same dataset.

Any suggestions and considerations?

How should I size and estimate the processing requirements for this task?
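
To frame the sizing question, here is a rough sketch of how the worst-case number of pairwise comparisons grows with block size. It assumes that records are compared only within a block and that every pair inside a block is compared. The function name, the block count and the block sizes are hypothetical and are not measured from the actual job.

```python
def pairwise_comparisons(block_sizes):
    """Worst-case count of record pairs when matching only within blocks."""
    return sum(n * (n - 1) // 2 for n in block_sizes)


total_rows = 2_000_000

# No blocking at all: every record is compared with every other record.
unblocked = pairwise_comparisons([total_rows])

# Hypothetical blocking: 2,000 equal blocks of 1,000 rows each (assumed).
blocked = pairwise_comparisons([1_000] * 2_000)

print(f"unblocked pairs: {unblocked:,}")
print(f"blocked pairs:   {blocked:,}")
```

Under these assumptions the pair count is driven by the largest blocks rather than by the total row count, so a few oversized blocks could dominate both run time and memory.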

What can be tuned in the Run settings or in the component settings?

Thank you

Regards,
