Anonymous
Not applicable

tRecordMatching is very slow

Hi,

 

I am matching 25,000 records against 120,000 records (reference file) with the tRecordMatching component.

I have defined province as my Blocking key. You can see the rest of the configuration in the attached picture.

It has been running for 4 hours, and only 12,000 of the 25,000 records have been processed so far.

What should I do to increase performance?

 

 

(attached screenshot: 0683p000009M8Ar.png)

2 Replies
dprot
Contributor II

Hi,

IMO it could be related to two things:

 - Have you looked at the size of each of your blocks? If you have only a few provinces (say 10, for example), you will still have many comparisons to do: each record would be compared against approximately 12,000 reference records, which comes to around 300,000,000 comparisons in total.

 - How many tokens do you have in your address field? If there are more than 10 tokens, using the "Any Order" tokenized measure is risky, because it is a quite complex (combinatorial) method (see the comments on https://jira.talendforge.org/browse/TDQ-12121 for more details).
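The arithmetic behind the first point can be sketched quickly. This is not tRecordMatching code, just a back-of-the-envelope estimate assuming records are spread evenly across the blocks:

```java
// Rough estimate of pairwise comparisons for a given blocking setup.
// Figures are the ones from this thread: 25,000 input records matched
// against 120,000 reference records, split across blockCount blocks.
public class BlockingEstimate {

    // Assuming an even spread, each input record is compared against
    // roughly (referenceSize / blockCount) reference candidates.
    static long estimateComparisons(long inputSize, long referenceSize, long blockCount) {
        return inputSize * (referenceSize / blockCount);
    }

    public static void main(String[] args) {
        // ~10 provinces -> ~12,000 candidates per record
        System.out.println(estimateComparisons(25_000, 120_000, 10));    // 300000000
        // A finer key producing ~1,000 blocks would cut this dramatically
        System.out.println(estimateComparisons(25_000, 120_000, 1_000)); // 3000000
    }
}
```

So shrinking the blocks by two orders of magnitude shrinks the comparison count by the same factor, which usually matters far more than JVM tuning.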

 

Anonymous
Not applicable
Author

Thank you for the reply.

I changed "Any Order" to No and selected the "Store on disk" option, and the run time dropped from 9 hours to 5 hours, which is still very long. I thought about changing the blocking key from "province", but I couldn't find any other combination that would work for my case. I have first name, last name, address, province and postal code. What do you suggest? Could increasing the memory heap speed this up?

My .ini file is as below:

 

-vm
C:\Program Files\Java\jre1.8.0_231\bin
-vmargs
-Xms4G
-Xmx8G
-Dfile.encoding=UTF-8
-Dosgi.requiredJavaVersion=1.8
-XX:+UseG1GC
-XX:+UseStringDeduplication
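On the blocking question above, one common direction is a composite blocking key, e.g. province combined with a postal-code prefix, computed in a tMap feeding the Blocking column. The sketch below is hypothetical (the method and column names are assumptions, not a tRecordMatching API):

```java
// Hypothetical composite blocking key: province plus the first character
// of the postal code. Records only get compared when both parts match,
// so each province block is split into many smaller sub-blocks.
public class BlockKey {

    static String blockKey(String province, String postalCode) {
        // Fall back to "?" when the postal code is missing or empty
        String prefix = (postalCode == null || postalCode.isEmpty())
                ? "?"
                : postalCode.substring(0, 1).toUpperCase();
        // Normalize the province so case and stray spaces don't split blocks
        return province.trim().toUpperCase() + "|" + prefix;
    }

    public static void main(String[] args) {
        System.out.println(blockKey("Ontario", "M5V 2T6")); // ONTARIO|M
        System.out.println(blockKey(" ontario ", null));    // ONTARIO|?
    }
}
```

With Canadian postal codes this alone multiplies the number of blocks by up to 26 per province; a two- or three-character prefix would go further, at the cost of missing matches where the postal code itself is wrong.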


(attached screenshot: config.png)