Re: Dulicate Entries - Qlik Community

sunil-kumar5 · ‎2020-09-22

Hi Community,

Please see the below sample data where I need to find out the duplicate entries against the ID_Number. I have Name, Mobile_Number, Address and Email_ID.

Conditions:

** For Sunil the country written in capital letters but address is same. (See Sunil Kumar below)

** In case of Rohit mobile number is different, address is same and email has some numeric change. (See Rohit Verma below)

** In these column one value could be different like in case of Sohan mobile number is duplicate but address have zip code included that means this is duplicate and our system didn't capture it just because of zip code. (See Sohan Rajput below)

So on the basis of these information I have to identify how many new generated ID number have issued on same customer?

Customer Name	ID_Number	Mobile_Number	Address	Email_ID
Sunil	880277	975255096	Eindhoven, Noord-Brabant, Netherlands	Sunil.975255096@gmail.com
Rohit	883890	713290843	Roodepoort, Gauteng, South Africa	Rohit Verma.290331901@gmail.com
Sohan	482397	560111187	Chula Vista, CA, United States	Sohan.560111187@gmail.com
Shakti	114165	699024270	Riverside, NJ, United States	Shakti.699024270@gmail.com
Brijesh	181252	540748379	Julianstown, Meath, Ireland	Brijesh.540748379@gmail.com
Vicky	341775	125312405	Ottawa, Ontario, Canada	Vicky.125312405@gmail.com
Gaurov	657679	561511307	Hyderabad, Andhra Pradesh, India	Gaurov.561511307@gmail.com
Vishal	466095	383604275	London, England, United Kingdom	Vishal.383604275@gmail.com
Ravi	657857	966732561	Salt Lake City, UT, United States	Ravi.966732561@gmail.com
Ravi Nagar	139596	810888360	Manchester, England, United Kingdom	Ravi.Nagar.810888360@gmail.com
Sunil Kumar	509182	549076680	Eindhoven, Noord-Brabant, NETHERLAND	Sunil.Kumar.549075580@gmail.com
Rohit Verma	616393	290338765	Roodepoort, Gauteng, South Africa	Rohit.Verma.1986@gmail.com
Sohan Rajput	773090	560111187	Chula Vista, CA, United States, 87654	Sonu.Rajput@gmail.com
Parveen	735457	126601444	Flossmoor, IL, United States	Parveen.126601444@gmail.com`
Sonu	560252	498047665	Sugar Land, TX, United States	Sonu.498047665@gmail.com
Shyam	785482	863231142	New York, NY, United States	Shyam.863231142@gmail.com
Suraj	453526	378575123	London, England, United Kingdom	Suraj.378575123@gmail.com
Sameer	835841	332489377	Morton, IL, United States	Sameer.332489377@gmail.com
Sagar	751409	126389799	Los Gatos, CA, United States	Sagar.126389799@gmail.com
Rohan	449042	667105703	New York, NY, United States	Rohan.667105703@gmail.com
Akash	267057	487972340	Miami, FL, United States	Akash.487972340@gmail.com
Ankush	456568	251316156	Vesenaz, Geneve, Switzerland	Ankush.251316156@gmail.com

Please help!

Thanks in advance

stevedark · ‎2020-09-22

Hi @sunil-kumar5

I would start by creating clean versions of each of the fields, where case is kept the same and some characters are replaced that could be the same but different (& and and, for instance).

You can do this with a parameterised variable, you can put as many replaces, purgechars etc. as you like in here:

set vClean = lower(replace(replace([$1], ',', ''), ' ', ''));

Load the fields with both original and clean variants:

LOAD
  ID_Number,
  [Customer Name],
  Mobile_Number,
  Address,
  Email_ID,
  $(vClean(Customer Name)) as [Customer Clean],
  $(vClean(Mobile_Number)) as [Mobile Clean],
  $(vClean(Address)) as [Address Clean],
  $(vClean(Email_ID)) as [Email Clean]
FROM [... your data source ...];

You will then want to create a separate table for each of the cleaned fields which highlights any duplicates. It will not catch all things (more code into the clean variable may catch more) and there could be false positives, but it will be a good start.

The dimension in each table will be the Clean field, and the expression will be:

if(count(ID_Number) > 1, concat(distinct ID_Number, ', '), null())

This will list all of the ID numbers associated with the dupe. Clicking on the cleaned field value will select all of the rows where the original value is similar, but not exactly the same.

To catch email addresses which are the same except for the number you would need to do an extra purgechar step to remove all digits from email addresses (this seems a strange error to have though).

To compare across multiple fields at once you can add multiple dimensions to the table. This is going to provide a less sensitive test though, but would remove false positives.

Hope that helps.

Steve

sunil-kumar5 · ‎2020-09-28

Hi Steve,

Thanks for your revert,

I tried the same you suggest but its only picking the exact duplicate only not the requirement I mentioned above. If there is 60-70% match in value that should also pick.

Thanks

Sunil

stevedark · ‎2020-09-28

Hi,

You could look at the Levenshtein matching function, or add more replacements in the clean variable.

Steve