Hello,
I need help with my project. In my project I have 2 tables:
CUSTOMERS table (id, name, surname, street, street number, zip code, phone, email) about
100 000 records.
USERS table (id, name + surname, street, street number, postal code) about
1 million records.
I need to
pair with the
CUSTOMERS table in the
USERS table.
When a given
customer found in the table of
USERS , I set him a particular attribute. P
roblem is, even the street and names are not completely identical in both tables. Within the pairing needs to do a
fuzzy match, follow these steps:
postal code must match
street must have a levenshtein distance value max 3
The name must have a levenshtein distance value max 5
It is possible to do Fuzzy match on multiple columns? as adding a condition when equal psc && fuzzy match on the street is max 3 and then come on fuzzy match on the name of the CUSTOMERS ?
Respectively, I tried to make the columns in both tables as the
concatenation psc + street + street number + surname + number. Problem is the 100-000 records seeking one record in a table of size 1 million record. It's awfully slow, it would take about 5 days.
Is there any way to dynamically take all the records from the CUSTOMERS table with the value of a PSC, and to Him do dynamic table inner join USERS with the same psc, and from then on the table to do those little fuzzy join, and iteratively continue like this for all possible values listed psc in the CUSTOMERS table?
If you could, can you explain how to do this dynamically, respectively ja stock and set the component?
I use Talend Open Studio for Data Integration and PostgreSQL 9.3
Thanks for your help, I am at a loss.
Hi pantolik,
I don't thin that all of your 1 M records needs a fuzzy match... In order to optimize the process what you can do
- First you have simple inner join using tMap for all of your records in join...
Above step may give you some x records and 1M-x records are rejected due to non match... keep it in separate table of buffer...
- Next step to use this table and again join using your fuzzy logic.
In order to improve the joining performance, I would recommend to create proper indexes on database table
use file system in tMap for storing the lookup data than the memory...
Try this, certainly performance of your job would be enhanced... once done we will try some other techniques as well.
Vaibhav
Hi, I did it as follows, but not to restrict users, only around 8%. It would not somehow make iterative sent CUSTOMER into Fuzzy Match based on its postal_code and dynamically on CUSTOMER postal code made a inner join with USER postal code and use it like lookup?
There was little acceleration, but not very significant.
Maybe I did not fully understand how you previously thought.
Can you please explain how to connect components for example, at least in text style:
compA ---actionA--->
compB ---actionB--->
compC compD ---actionC---^