Hi folks,
Background:
We're using tRecordMatching to match incoming files against our existing database. We match the company name provided by the client against company names in our database, using any address information provided to aid the matching. Any cleansing, parsing and standardization of company names and addresses, on both the incoming and lookup data, is done before this job.
The lookup data contains about 1.5m rows.
The file is matched multiple times over multiple jobs. All jobs look similar to the attached; the difference between them is the blocking keys used to match the data. For example, job 1 attempts a match on company names using City & Country combinations as blocking keys (created by the tGenKey components), job 2 matches using State and City, and job 3 matches on Country only.
At each stage, matches and possible matches (within a reasonably strict distance - we're currently testing with Levenshtein) are sent to a MySQL db, and the non-matches are sent to a tBufferOutput, which passes the data on to the next job.
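To make the question concrete, here is a minimal Java sketch of what we understand the blocking + Levenshtein approach to be doing (this is our own illustration, not Talend's internals - the class and method names are ours):

```java
import java.util.*;

// Sketch of blocking + Levenshtein matching: rows are grouped by a blocking
// key, and only rows sharing a key are compared pairwise. A coarse key
// (country only) produces huge blocks and therefore far more comparisons.
public class BlockingSketch {

    // Standard two-row dynamic-programming Levenshtein edit distance.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Group lookup rows by blocking key; r[0] = key, r[1] = company name.
    static Map<String, List<String>> block(List<String[]> rows) {
        Map<String, List<String>> blocks = new HashMap<>();
        for (String[] r : rows)
            blocks.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        return blocks;
    }

    // Match one incoming row against its own block only, within maxDist.
    static Optional<String> bestMatch(String key, String name,
                                      Map<String, List<String>> blocks,
                                      int maxDist) {
        String best = null;
        int bestDist = maxDist + 1;
        for (String candidate : blocks.getOrDefault(key, List.of())) {
            int d = levenshtein(name, candidate);
            if (d < bestDist) { bestDist = d; best = candidate; }
        }
        return Optional.ofNullable(best);
    }

    public static void main(String[] args) {
        List<String[]> lookup = List.of(
            new String[]{"GB|LONDON", "ACME LTD"},
            new String[]{"GB|LONDON", "GLOBEX CORP"},
            new String[]{"US|AUSTIN", "ACME LTD"});
        Map<String, List<String>> blocks = block(lookup);
        // A fine key like "GB|LONDON" scans only that block's candidates;
        // a country-only key would scan every GB row in the lookup.
        System.out.println(bestMatch("GB|LONDON", "ACME LIMITED", blocks, 5)
                               .orElse("no match"));
    }
}
```

The cost is roughly (incoming block size) x (lookup block size) per block, which is why country-only blocks against a 1.5m-row lookup hurt so much.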
Issue:
When dealing with a large number of rows in the incoming files (say 50,000+), if the incoming file contains limited information to use as a blocking key (e.g., country only instead of a full address), we see performance issues with the matching process - about 10 rows/second max. Each matching job takes an hour to process, and the whole 20-stage match can take half the night.
If the files we receive contain address info and we can block the records into smaller pots, performance is not a problem. However, in reality we aren't always going to receive any more than country to use for blocking.
Is there anything we can do about this? Is it just because of the blocking issue?
The match is currently being done on a local machine and not via the Job Server - will deploying to the server help? We believe the local machine and the Job Server have similar specs (similar processors, 8GB RAM each).
We're currently using the JVM arguments -Xms1024M, -Xmx5550M, -XX:+UseConcMarkSweepGC, -XX:+UseParNewGC - are these the best options?
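For context, this is the kind of variant we've been considering (the values here are illustrative guesses for an 8GB machine, not recommendations - the GC logging flags are just to check whether GC is actually the bottleneck):

```
-Xms4096M              # start the heap nearer its working size to avoid resize pauses
-Xmx5500M              # leave a couple of GB for the OS and off-heap use
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC       # ParNew is the usual young-gen collector paired with CMS
-XX:+PrintGCDetails    # log GC activity so we can see if pauses explain the slowdown
-Xloggc:jvm_gc.log
```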
Is it just a case of throwing more processing power at it?
Many thanks!