Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
sunil-kumar5
Creator II
Creator II

Dulicate Entries

Hi Community,

Please see the below sample data where I need to find out the duplicate entries against the ID_Number. I have Name, Mobile_Number, Address and Email_ID.

Conditions:

** For Sunil the country written in capital letters but address is same. (See Sunil Kumar below)

** In case of Rohit mobile number is different, address is same and email has some numeric change. (See Rohit Verma below)

** In these column one value could be different like in case of Sohan mobile number is duplicate but address have zip code included that means this is duplicate and our system didn't capture it just because of zip code. (See Sohan Rajput below)

So on the basis of these information I have to identify how many new generated ID number have issued on same customer?

Customer NameID_NumberMobile_NumberAddressEmail_ID
Sunil880277975255096Eindhoven, Noord-Brabant, NetherlandsSunil.975255096@gmail.com
Rohit883890713290843Roodepoort, Gauteng,  South AfricaRohit Verma.290331901@gmail.com
Sohan482397560111187Chula Vista, CA,  United StatesSohan.560111187@gmail.com
Shakti114165699024270Riverside, NJ,  United StatesShakti.699024270@gmail.com
Brijesh181252540748379Julianstown, Meath,  IrelandBrijesh.540748379@gmail.com
Vicky341775125312405Ottawa, Ontario,  CanadaVicky.125312405@gmail.com
Gaurov657679561511307Hyderabad, Andhra Pradesh, IndiaGaurov.561511307@gmail.com
Vishal466095383604275London, England,  United KingdomVishal.383604275@gmail.com
Ravi657857966732561Salt Lake City, UT,  United StatesRavi.966732561@gmail.com
Ravi Nagar139596810888360Manchester, England,  United KingdomRavi.Nagar.810888360@gmail.com
Sunil Kumar509182549076680Eindhoven, Noord-Brabant, NETHERLANDSunil.Kumar.549075580@gmail.com
Rohit Verma616393290338765Roodepoort, Gauteng,  South AfricaRohit.Verma.1986@gmail.com
Sohan Rajput773090560111187Chula Vista, CA,  United States, 87654Sonu.Rajput@gmail.com
Parveen735457126601444Flossmoor, IL, United StatesParveen.126601444@gmail.com`
Sonu560252498047665Sugar Land, TX,  United StatesSonu.498047665@gmail.com
Shyam785482863231142New York, NY,  United StatesShyam.863231142@gmail.com
Suraj453526378575123London, England,  United KingdomSuraj.378575123@gmail.com
Sameer835841332489377Morton, IL,  United StatesSameer.332489377@gmail.com
Sagar751409126389799Los Gatos, CA,  United StatesSagar.126389799@gmail.com
Rohan449042667105703New York, NY,  United StatesRohan.667105703@gmail.com
Akash267057487972340Miami, FL,  United StatesAkash.487972340@gmail.com
Ankush456568251316156Vesenaz, Geneve,  SwitzerlandAnkush.251316156@gmail.com

 

Please help!

 

Thanks in advance

 

3 Replies
stevedark
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi @sunil-kumar5 

I would start by creating clean versions of each of the fields, where case is kept the same and some characters are replaced that could be the same but different (& and and, for instance).

You can do this with a parameterised variable, you can put as many replaces, purgechars etc. as you like in here:

set vClean = lower(replace(replace([$1], ',', ''), ' ', ''));

Load the fields with both original and clean variants:

LOAD
  ID_Number,
  [Customer Name],
  Mobile_Number,
  Address,
  Email_ID,
  $(vClean(Customer Name)) as [Customer Clean],
  $(vClean(Mobile_Number)) as [Mobile Clean],
  $(vClean(Address)) as [Address Clean],
  $(vClean(Email_ID)) as [Email Clean]
FROM [... your data source ...];

 

You will then want to create a separate table for each of the cleaned fields which highlights any duplicates. It will not catch all things (more code into the clean variable may catch more) and there could be false positives, but it will be a good start.

The dimension in each table will be the Clean field, and the expression will be:

if(count(ID_Number) > 1, concat(distinct ID_Number, ', '), null())

This will list all of the ID numbers associated with the dupe. Clicking on the cleaned field value will select all of the rows where the original value is similar, but not exactly the same.

To catch email addresses which are the same except for the number you would need to do an extra purgechar step to remove all digits from email addresses (this seems a strange error to have though).

To compare across multiple fields at once you can add multiple dimensions to the table. This is going to provide a less sensitive test though, but would remove false positives.

Hope that helps.

Steve

 

sunil-kumar5
Creator II
Creator II
Author

Hi Steve,

Thanks for your revert,

I tried the same you suggest but its only picking the exact duplicate only not the requirement I mentioned above. If there is 60-70% match in value that should also pick.

 

Thanks

Sunil

stevedark
Partner Ambassador/MVP
Partner Ambassador/MVP

Hi,

You could look at the Levenshtein matching function, or add more replacements in the clean variable. 

Steve