Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi Community,
Please see the below sample data where I need to find out the duplicate entries against the ID_Number. I have Name, Mobile_Number, Address and Email_ID.
Conditions:
** For Sunil the country written in capital letters but address is same. (See Sunil Kumar below)
** In case of Rohit mobile number is different, address is same and email has some numeric change. (See Rohit Verma below)
** In these column one value could be different like in case of Sohan mobile number is duplicate but address have zip code included that means this is duplicate and our system didn't capture it just because of zip code. (See Sohan Rajput below)
So on the basis of these information I have to identify how many new generated ID number have issued on same customer?
Customer Name | ID_Number | Mobile_Number | Address | Email_ID |
Sunil | 880277 | 975255096 | Eindhoven, Noord-Brabant, Netherlands | Sunil.975255096@gmail.com |
Rohit | 883890 | 713290843 | Roodepoort, Gauteng, South Africa | Rohit Verma.290331901@gmail.com |
Sohan | 482397 | 560111187 | Chula Vista, CA, United States | Sohan.560111187@gmail.com |
Shakti | 114165 | 699024270 | Riverside, NJ, United States | Shakti.699024270@gmail.com |
Brijesh | 181252 | 540748379 | Julianstown, Meath, Ireland | Brijesh.540748379@gmail.com |
Vicky | 341775 | 125312405 | Ottawa, Ontario, Canada | Vicky.125312405@gmail.com |
Gaurov | 657679 | 561511307 | Hyderabad, Andhra Pradesh, India | Gaurov.561511307@gmail.com |
Vishal | 466095 | 383604275 | London, England, United Kingdom | Vishal.383604275@gmail.com |
Ravi | 657857 | 966732561 | Salt Lake City, UT, United States | Ravi.966732561@gmail.com |
Ravi Nagar | 139596 | 810888360 | Manchester, England, United Kingdom | Ravi.Nagar.810888360@gmail.com |
Sunil Kumar | 509182 | 549076680 | Eindhoven, Noord-Brabant, NETHERLAND | Sunil.Kumar.549075580@gmail.com |
Rohit Verma | 616393 | 290338765 | Roodepoort, Gauteng, South Africa | Rohit.Verma.1986@gmail.com |
Sohan Rajput | 773090 | 560111187 | Chula Vista, CA, United States, 87654 | Sonu.Rajput@gmail.com |
Parveen | 735457 | 126601444 | Flossmoor, IL, United States | Parveen.126601444@gmail.com` |
Sonu | 560252 | 498047665 | Sugar Land, TX, United States | Sonu.498047665@gmail.com |
Shyam | 785482 | 863231142 | New York, NY, United States | Shyam.863231142@gmail.com |
Suraj | 453526 | 378575123 | London, England, United Kingdom | Suraj.378575123@gmail.com |
Sameer | 835841 | 332489377 | Morton, IL, United States | Sameer.332489377@gmail.com |
Sagar | 751409 | 126389799 | Los Gatos, CA, United States | Sagar.126389799@gmail.com |
Rohan | 449042 | 667105703 | New York, NY, United States | Rohan.667105703@gmail.com |
Akash | 267057 | 487972340 | Miami, FL, United States | Akash.487972340@gmail.com |
Ankush | 456568 | 251316156 | Vesenaz, Geneve, Switzerland | Ankush.251316156@gmail.com |
Please help!
Thanks in advance
I would start by creating clean versions of each of the fields, where case is kept the same and some characters are replaced that could be the same but different (& and and, for instance).
You can do this with a parameterised variable, you can put as many replaces, purgechars etc. as you like in here:
set vClean = lower(replace(replace([$1], ',', ''), ' ', ''));
Load the fields with both original and clean variants:
LOAD
ID_Number,
[Customer Name],
Mobile_Number,
Address,
Email_ID,
$(vClean(Customer Name)) as [Customer Clean],
$(vClean(Mobile_Number)) as [Mobile Clean],
$(vClean(Address)) as [Address Clean],
$(vClean(Email_ID)) as [Email Clean]
FROM [... your data source ...];
You will then want to create a separate table for each of the cleaned fields which highlights any duplicates. It will not catch all things (more code into the clean variable may catch more) and there could be false positives, but it will be a good start.
The dimension in each table will be the Clean field, and the expression will be:
if(count(ID_Number) > 1, concat(distinct ID_Number, ', '), null())
This will list all of the ID numbers associated with the dupe. Clicking on the cleaned field value will select all of the rows where the original value is similar, but not exactly the same.
To catch email addresses which are the same except for the number you would need to do an extra purgechar step to remove all digits from email addresses (this seems a strange error to have though).
To compare across multiple fields at once you can add multiple dimensions to the table. This is going to provide a less sensitive test though, but would remove false positives.
Hope that helps.
Steve
Hi Steve,
Thanks for your revert,
I tried the same you suggest but its only picking the exact duplicate only not the requirement I mentioned above. If there is 60-70% match in value that should also pick.
Thanks
Sunil
Hi,
You could look at the Levenshtein matching function, or add more replacements in the clean variable.
Steve