Identify duplicates and merge records

Anonymous · ‎2018-07-02

Hi, I am fairly new to talend and first time using this forum, so apologies in advance.

I use Talend open studio Big Data (free version) and as part of my job I have to carry certain integrations.

I have searched and tried several solutions mentioned by other users, but unfortunately it does not serve the purpose of my need

Currently I am working on a Project where I need to merge records where if duplication occurs, however, if the master and child record contain same data then no merge is required, this is based on the personal data.

We are using maria DB and our database contain few hundred tables and from those table, we are only interested in few tables that will help identify duplicated record. The concept of this to eliminate duplication.

So, we have a person table(personID, firstname, suranme), which links to the person address table (personaddid, personID, town, city, county etc), person email table (personemailID, personID, email, emailtype(home, private, work) and person telephone number (personTelephoneID, personID, telephone, telephonetype(home, private, work).

Based on these tables, I first need to identify if the record is duplicated, if so, only merge missing data to master record.

For example,

Person Table
PersonID	Firstname	Surname
1	Jo	Blog
2	Jo	Blog

Person Telephone
PersonTelephoneID	PersonID	Telephone
1	1	(+44)01234567890
2	1	01234567890
1	2	(+44) 01234567890
2	2	(+44)01234567891
3	2	01234567890

These two candidate are classed as duplicate because candidate 2 contains all data same as candidate 1 except with one unique number
So, I want to be able to identify these two candidates are duplicates and then based on I want to merge the candidate 2 data into the candidate 1, but that also needs to take in consideration the regular expressions, for example, candidate 1=(+44)01234567890 is same as candidate 2= 01234567890, therefore do not insert this number into candidate 1 (candidate 2 telephone number will read "merged" or even additional rows deleted)

Outcome candidate 1
PersonTelephoneID	PersonID	Telephone
1	1	(+44)01234567890
New id will be created when inserted in table	2	(+44)01234567891

Outcome candidate 2
PersonTelephoneID	PersonID	Telephone
1	2	"merged"

Anonymous · ‎2018-07-04

Hello,

If we understand your requirement very well, are you trying to insert in your database only the new data? If so, you can compare your tables by using tMap and set the "Catch lookup inner join reject" as true to get the changed data.

Feel free to correct us if we misunderstand your requirement.

Best regards

Sabrina

Anonymous · ‎2018-07-04

@xdshi Thank you for taking your time to reply.

Yes, you are correct to a certain degree. I want to only insert new data in the person telephone table if the duplicated candidate contains unique values. So, I will first need to identify the duplicate records based on telephone numbers, then from this duplication list I will need to establish which record from the duplication will be master and which will be child- depending on various scenarios, but as an example, let say latest activity against the record is classed as master. If the record contains same value in this case telephone number, then skip insert, otherwise insert new data in Master record (person telephone table), then remove child data or perhaps overwrite one of the values to "merged" in order to carry out deletion at later stage.

In order to identify duplicates I would have written query in the talend, however, I do not think this will help, as this will only pull through records where the numbers are same, I want duplication but also to see additional numbers against the record.

select
if(act.actDate >b.duplicateDate,'Master',
if(b.duplicateDate > act.actDate,'Master','Child')) Classification
,if(act.actDate >b.duplicateDate,a.PersonID,
if(b.duplicateDate > act.actDate,b.duplicateID,'Unsure')) ClassificationID
,concat(p.firstname,' ',p.surname) name
,a.PersonID
,a.CleanedTelephoneNumber
,b.duplicateName

,b.duplicateID

,b.duplicateTele
, act.actDate
,b.duplicateDate

from person_telephone a
join person p on p.PersonID = a.PersonID
join candidate c on c.Candidate_ID = p.PersonID
join
(
select
t.PersonID duplicateID,CleanedTelephoneNumber duplicateTele,concat(p.firstname,' ',p.surname) duplicateName,p.firstname,p.surname, act.actDate duplicateDate
from person_telephone t
join person p on p.PersonID = t.PersonID
join candidate c on c.Candidate_ID = p.PersonID
left join
(
select
j.JournalItemId
,j.ItemCreationDate actDate
,mx.can
from journal_item j
join
(
select max(je.journalitemid) ID
,je.actionid can
from journal_entity je
join journal_item j on j.JournalItemId = je.JournalItemId
where EntityType = 1
#and je.ActionId = 87009
group by je.ActionId
) mx on mx.id = j.JournalItemId
) act on act.can = c.candidate_id

) b
on b.duplicateTele = a.TelephoneNumber and b.duplicateID <> a.PersonID and b.surname = p.Surname and b.firstname = p.FirstName

left join
(
select
j.JournalItemId
,j.ItemCreationDate actDate
,mx.can
from journal_item j
join
(
select max(je.journalitemid) ID
,je.actionid can
from journal_entity je
join journal_item j on j.JournalItemId = je.JournalItemId
where EntityType = 1
#and je.ActionId = 87009
group by je.ActionId
) mx on mx.id = j.JournalItemId
) act on act.can = c.candidate_id

limit 10
;

Now from that I want only the unique phone number from child to go into master record in the person telephone table.

**Note: all our tables have auto incremented PK therefore the two duplicate records will contain different ID**

I hope this makes sense.

Anonymous · ‎2018-07-05

why dont you do a second query as a lookup to tmap

SELECT col1, col2, col3, col4

FROM table

GROUP BY col1, col2, col3, col4

HAVING COUNT(*) > 1

and in tMap have the condition row1.phoneumber != lookup.phonenumber

Anonymous · ‎2018-07-05

the above will wipe out all the duplicates but it gives you a path to follow or a way to cleanse your database

Anonymous · ‎2018-07-06

@jcruie Thank you for your reply.

@jcruie and @xdshi are both saying once I have written a script to identify the duplicate, I write the same query again and then use the tmap to do inner join , then make outcome action true? I am not sure i understand how that can be achieved as i need to insert all numbers that are unique from he duplicate record.

there is a merge query written in mysql,however this merges all the data rather than ignoring duplicate and inserting only the unique records.

//please not this code is not mine, it was written by ex employee

BEGIN

DECLARE iscont BIGINT default null;
DEClARE candloc BIGINT default null;
DECLARE candjob BIGINT default null;
DECLARE emptycursor BIGINT default 0;

DECLARE contcheck cursor for (select count(clientcontactid) from client_contact_relationship ccr
where ccr.clientcontactid=pchild and relationshipstatus = 8653);

DECLARE curloc cursor for (select candidate_preferred_location.location
from candidate_preferred_location
where candidate_preferred_location.CandidateiD=pchild
and candidate_preferred_location.location not in
(select location from candidate_preferred_location where candidate_preferred_location.candidateid=pmaster)
);

DECLARE jobcat cursor for (select candidate_job_categories.jobcategory
from candidate_job_categories
where candidate_job_categories.candidateid=pchild
and candidate_job_categories.jobcategory not in
(select jobcategory from candidate_job_categories where candidate_job_categories.candidateid=pmaster));

DECLARE EXIT HANDLER FOR SQLEXCEPTION ROLLBACK;
DECLARE EXIT HANDLER FOR SQLWARNING ROLLBACK;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET emptycursor = 1;

open curloc;

loc_loop: LOOP
FETCH curloc into candloc;
IF emptycursor = 1 THEN
LEAVE loc_loop;
else
update candidate_preferred_location set candidateid=pmaster where candidateid=pchild and candidate_preferred_location.location=candloc;

END IF;
END LOOP;

close curloc;

open jobcat;

job_loop:LOOP
set emptycursor = 0;
FETCH jobcat into candjob;
IF emptycursor = 1 THEN
LEAVE job_loop;
else
update candidate_job_categories set candidateid=pmaster where candidateid=pchild and jobcategory=candjob;
END IF;
END LOOP;

close jobcat;

START TRANSACTION;

open contcheck;
fetch contcheck into iscont;
if iscont > 0 then

select 'This Candidate is a live Contact, cannot be merged.';

else

update journal_entity set actionid=pmaster where actionid=pchild;
update arc_journal_entity set actionid=pmaster where actionid=pchild;
update requirement_candidate_application set Candidateid=pmaster where candidateid=pchild;
update candidate_assignment set candidateid=pmaster where candidateid=pchild;
update candidate_industries set PersonID=pmaster where PersonID=pchild;
update candidate_certificates set CandidateID=pmaster where CandidateID=pchild;
update candidate_companies_no_approach set CandidateID=pmaster where CandidateID=pchild;
update candidate_education set PersonID=pmaster where PersonID=pchild;
update candidate_reference_web_sites set CandidateID=pmaster where CandidateID=pchild;
update candidate_references set CandidateID=pmaster where CandidateID=pchild;
update candidate_experience set CandidateID=pmaster where CandidateID=pchild;
update candidate_lanquages_spoken set CandidateID=pmaster where CandidateID=pchild;
update candidate_payroll_detail set Candidate_ID=pmaster where Candidate_ID=pchild;
update candidate_ppe_issued set CandidateID=pmaster where CandidateID=pchild;
update candidate_qualifications set CandidateID=pmaster where CandidateID=pchild;
update candidate_unspent_convictions set CandidateID=pmaster where CandidateID=pchild;
update person_email set PersonID=pmaster where PersonID=pchild;
update person_address set PersonD=pmaster where PersonD=pchild;
update person_emergency_contact set PersonID=pmaster where PersonID=pchild;
update person_skills set PersonID=pmaster where PersonID=pchild;
update person_telephone set PersonID=pmaster where PersonID=pchild;
update document_attached_entities set entityid=pmaster where entityid=pchild and entityType_id=1;
update user_favourite set entityID=pmaster where entityID=pchild and entity_type=1;
update search_entry set entityId = pmaster where entityId=pchild and entityType=1;
update candidate set candidatestatus=9338,
timestamp=NOW()
where candidate_id=pchild;

insert into
merged_entities (entityid,entitytype, date_merged)
values
((select candidate_id from candidate where candidate.candidate_id=pchild),
'1',
NOW());

delete from selected_results where entityid=pchild and entityType=1;
delete from search_entry where entityid=pchild and entityType=1;
delete from sb_candidate_lookup where candidateid=pchild;
delete from candidate_permissions where candidateid=pchild;
delete from candidate_preferred_location where candidateid=pchild;
delete from candidate_job_categories where candidateid=pchild;
delete from user_favourite where entityid=pchild and entity_Type=1;

COMMIT;

END if;

if its easy could I put this script in talend and enter parameters then talend steps through the code and only inserts the unique record?

Anonymous · ‎2018-08-06

I have looked at insert ignore, however, that does not work for me as each email has a unique ID.

Anonymous · ‎2018-08-06

Hi Shaf,

This is a classic case of creating Master Data Management flow but the current rules seems to be joined only based on first name and last name. After looking at the phone country code (+44), I assume you are trying to create the customer master data for a UK based company.

There are many possible pitfalls if you are going to do hard match based on first name and last name.

a) A married woman can use Mrs. Adam Smith as her name in social circles. So if you are not considering titles, you may end up merging husband and wife data as single record.

b) Another scenario is both father and son are having same name with different suffix (eg:- Adam Smith Senior, Adam Smith Junior). Current rule will merge these records also.

c) Third scenario is that a person can give his shortened name and full name as two records (eg:- Robert Smith and Bob Smith). So you will require a matching Algorithm like Jaro-Winkler algorithm to match the names based on match percentage.

d) Date of Birth is another key data which has to be used while matching customer records.

There are thousands of common names in UK and if the match is not fine tuned based on multiple match rules, you may end up matching non related persons and it will end up in GDPR data breach.

I have touched the surface of the problem only and I would strongly suggest you to sit with your Data Owners and create an extensive plan to do data matching rules. You should ideally use MDM version of Talend which will help to create automatic match rules based on advanced components like tmatchgroup.

Another important step you will have to do is to standardize the child data like address, email and telephone. There are specific components (like taddressrowcloud, tStandardizePhoneNumber etc) in Talend to standardize these data before match process. You will have to employ these components also before going for the matching stage.

Warm Regards,

Nikhil Thampi

Anonymous · ‎2018-08-07

@nthampi

Thank you for your response. I agree with what you have said and these things are taken in consideration when looking at duplicate records, the examples I created was just to show what i was trying to achieve from what the data structure is.

My initial query takes in consideration all aspects in order to identify duplicates, I wrote a basic quick query to show what the data will be like. Starting with basic, as that will help to expand further, hence, why I am just concentrated with email address for now. this will expand to address, telephone number etc

I am just unsure how to insert unique rows into a record based on the identification of the duplication as insert ignore does not work as each email will have a primary key (Auto Incremented). I cannot use two input query and then do a tmap based on inner join as everything is within one query.

I want to be able to insert unique records from one record to another record in the same table and ignoring duplication.

REST

Talend Data Integration

v7.x