Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
DrB1
Contributor III
Contributor III

Measure Percentage of Duplicated Fields

I have these 10 data fields and would like to create a formula to identify potential duplicates with an index score of how strong the match is between any 2 records. For example if 10 of 10 match the two records would be flagged as 100% match. Here are the fields.

IndividualId (UniqueID),
FirstName,
MiddleName,
LastName,
PreferredName,
BirthDate,
PrimaryEmail,
Home_PhoneNumber,
Mobile_PhoneNumber,
Address1_PrimaryAddress

I would like to do this in a Qlik table during the load script if possible.  I've tried, and tried, and tried, this concept of creating the Individuals_Data table, a second table with same fields as in IndividualIdDup, FirstNameDup, ...etc. and then trying to join the tables with a where clause (where IndividualId<>IndividualIdDup). I've tried that in several different ways, but the where clause always fails - unable to find IndividualId or whichever field used to keep it from potential duplicate matching of a record to itself. 

Tried it with record numbers or row numbers using similar where clause, but it fails.  If anyone has an idea, I would love it!

// Load your data
Individuals_Data:
LOAD
RowNo() as RecNo,
IndividualId,
FirstName,
MiddleName,
LastName,
PreferredName,
BirthDate,
PrimaryEmail,
Home_PhoneNumber,
Mobile_PhoneNumber,
Address1_PrimaryAddress
FROM [myfile]
(txt, utf8, embedded labels, delimiter is ',', msq);

// Use Resident load to compare each record with all other records
DuplicateCheck:
LOAD
RowNo() as RecNo_Dup,
IndividualId as IndividualIdDup,
FirstName as FirstNameDup,
MiddleName as MiddleNameDup,
LastName as LastNameDup,
PreferredName as PreferredNameDup,
BirthDate as BirthDateDup,
PrimaryEmail as PrimaryEmailDup,
Home_PhoneNumber as Home_PhoneNumberDup,
Mobile_PhoneNumber as Mobile_PhoneNumberDup,
Address1_PrimaryAddress as Address1_PrimaryAddressDup
Resident Individuals_Data;

// Join the original data with the DuplicateCheck table to compare each record with its duplicates
Join (Individuals_Data)
LOAD
RecNo,
IndividualIdDup,
FirstNameDup,
MiddleNameDup,
LastNameDup,
PreferredNameDup,
BirthDateDup,
PrimaryEmailDup,
Home_PhoneNumberDup,
Mobile_PhoneNumberDup,
Address1_PrimaryAddressDup,
RecNo_Dup
Resident DuplicateCheck
Where RecNo <> RecNo_Dup and (
IndividualId <> IndividualIdDup or
FirstName <> FirstNameDup or
MiddleName <> MiddleNameDup or
LastName <> LastNameDup or
PreferredName <> PreferredNameDup or
BirthDate <> BirthDateDup or
PrimaryEmail <> PrimaryEmailDup or
Home_PhoneNumber <> Home_PhoneNumberDup or
Mobile_PhoneNumber <> Mobile_PhoneNumberDup or
Address1_PrimaryAddress <> Address1_PrimaryAddressDup
);

// Calculate the percentage of matching fields
Individuals_Data:
LOAD
*,
(
(If(IndividualId = IndividualIdDup, 1, 0) +
If(FirstName = FirstNameDup, 1, 0) +
If(MiddleName = MiddleNameDup, 1, 0) +
If(LastName = LastNameDup, 1, 0) +
If(PreferredName = PreferredNameDup, 1, 0) +
If(BirthDate = BirthDateDup, 1, 0) +
If(PrimaryEmail = PrimaryEmailDup, 1, 0) +
If(Home_PhoneNumber = Home_PhoneNumberDup, 1, 0) +
If(Mobile_PhoneNumber = Mobile_PhoneNumberDup, 1, 0) +
If(Address1_PrimaryAddress = Address1_PrimaryAddressDup, 1, 0)
) / 10
) * 100 as MatchPercentage
Resident Individuals_Data;

// Drop intermediate tables
Drop Table DuplicateCheck;

// Output the final data
Data_Final:
LOAD *
Resident Individuals_Data;

// Drop the original Data table
Drop Table Individuals_Data;

Labels (3)
2 Replies
marcus_sommer

I think I would go another way with checking the field-values with interrecord-functions like peek() or previous() within a sorted resident load, like:

load *,
        rangesum(-(F1=previous(F1)), -(F2=previous(F2)), -(F3=previous(F3)), ...) / 10 as Flag
resident X order by F1, F2, F3, ...;

It needs a pre-load within Qlik to be able to define an order by statement which needs to be suitable.

Simpler than this may be just to compare the PrimaryMail because mail-addresses are unique unless you have really a duplicated record. 

DrB1
Contributor III
Contributor III
Author

These records contain families so mail or email or phone is not unique. I have not worked with interrecord-functions before, so I will need to look into that. What happens here is that persons can register for various events as a "guest" in our system, which immediately assigns a new IndividualId to someone who may already be in the system records. I'm looking for a way to prioritize the review based on percent of matching fields.