Solved: Customer Data Cleansing including data pre-process... - Qlik Community

Anonymous · ‎2017-05-20

Apologies if this sounds like a stupid question; I have been searching relentlessly for a high-level answer. I am working on customer personal identifiers/information. Over a period of years, customer information such as name, address, email and phone has not been standardized or cleansed.

I am looking to:

1) First, standardize the data, i.e. remove any white space characters, ensure the email address is correct, and so on.

2) Thereafter, de-duplicate the data based on a weighted matching algorithm, for example with these weights:

Same Address or Payment Card: 0.3
Last Name SoundEx: 0.25
First Name SoundEx: 0.1
Title: 0.05
Email, Telephone, or Visitor Id : 0.3

Then define an upper and a lower threshold for the total score.

The weights above add up to 1.
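
For readers who want to see the arithmetic, the weighted scoring described above can be sketched outside Talend roughly as follows. This is only an illustration in Python, not how any Talend component works internally. The weights come from the post; the threshold values (0.6 and 0.3), the middle "Review" band, the record field names and all function names are assumptions made for the example.

```python
import re

# Weights copied from the post; they add up to 1.
WEIGHTS = {
    "address_or_card": 0.30,
    "last_name_soundex": 0.25,
    "first_name_soundex": 0.10,
    "title": 0.05,
    "email_phone_or_visitor": 0.30,
}

# Hypothetical thresholds: the post does not state any values.
UPPER, LOWER = 0.60, 0.30

# Letter -> digit table of the classic American Soundex scheme.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name):
    """Return the four-character Soundex code, or '' for an empty name."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # a vowel resets prev, so a repeated code counts again
    return (out + "000")[:4]


def sounds_alike(x, y):
    """True if both names are non-empty and share a Soundex code."""
    code = soundex(x or "")
    return bool(code) and code == soundex(y or "")


def shares_any(a, b, fields):
    """True if at least one field is non-empty and identical in both records."""
    return any(a.get(f) and a.get(f) == b.get(f) for f in fields)


def match_score(a, b):
    """Sum the weights of every criterion on which the two records agree."""
    score = 0.0
    if shares_any(a, b, ("address", "card")):
        score += WEIGHTS["address_or_card"]
    if sounds_alike(a.get("last_name"), b.get("last_name")):
        score += WEIGHTS["last_name_soundex"]
    if sounds_alike(a.get("first_name"), b.get("first_name")):
        score += WEIGHTS["first_name_soundex"]
    if shares_any(a, b, ("title",)):
        score += WEIGHTS["title"]
    if shares_any(a, b, ("email", "phone", "visitor_id")):
        score += WEIGHTS["email_phone_or_visitor"]
    return round(score, 2)  # rounding hides floating-point noise


def classify(score):
    """Map a score to a match type; the 'Review' band is an assumption."""
    if score >= UPPER:
        return "Same Customer"
    if score >= LOWER:
        return "Review"
    return "NoMatch"


rec_a = {"first_name": "John", "last_name": "Smith", "address": "1 High St",
         "card": "1111", "email": "js@example.com"}
rec_b = {"first_name": "Jon", "last_name": "Smyth", "address": "1 High St",
         "card": "1111", "email": "js@example.com"}
print(match_score(rec_a, rec_b), classify(match_score(rec_a, rec_b)))
```

With these assumed thresholds, two records that share only address/card and email (0.3 + 0.3) already reach the upper threshold, which matches the "Same Customer" outcome described for records A and B.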

So I need to compare every record, row by row, with all other records and come up with something like:

Customer Key1	Customer Key2	Match Type
A	B	Same Customer
A	C	Same Customer
D		NoMatch

In the above case, for example, A and B are two separate records, but they have the same payment card information, the same address and the same email, so they are the same customer.

Records A and C also match, as they have the same first name, last name and address. After that, within each group of matching records, I will create a golden record.
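
As a rough illustration of that last step, matched pairs such as A/B and A/C can be collapsed into groups, and one golden record can then be built per group. The sketch below is plain Python under stated assumptions, not a Talend feature: the function names are made up, and the survivorship rule (take the first non-empty value for each field) is a placeholder for whatever rule the business actually needs.

```python
def group_matches(keys, matched_pairs):
    """Collapse 'Same Customer' pairs into groups using a small union-find."""
    parent = {k: k for k in keys}

    def find(k):
        # Walk up to the group representative, shortening the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for k in keys:
        groups.setdefault(find(k), []).append(k)
    return list(groups.values())


def golden_record(records):
    """Hypothetical survivorship rule: first non-empty value per field wins."""
    golden = {}
    for rec in records:
        for field, value in rec.items():
            if value and not golden.get(field):
                golden[field] = value
    return golden


# The pairs from the table above: A-B and A-C match, D matches nothing.
groups = group_matches(["A", "B", "C", "D"], [("A", "B"), ("A", "C")])
print(groups)
```

Here A, B and C end up in one group and D stays on its own, so D simply becomes its own golden record.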

I have tried Talend for Data Quality, and as far as I can see it does data profiling only, not actual transformation. It gives you stats on how good or bad your data is.

I have also seen Talend for Data Preparation. There I can load a file, apply my basic preparations (remove white spaces, etc.) and use this preparation in a job.
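
To make "basic preparations" concrete, step 1 of the list above could look something like the following when written by hand. This is a minimal Python sketch and not what Talend Data Preparation generates; the field names, the function names and the deliberately loose email pattern are all assumptions, and a real email check would need stricter rules.

```python
import re

# Loose shape check only: something@something.something, no spaces.
_EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def clean_text(value):
    """Trim the value and collapse runs of whitespace into one space."""
    return " ".join((value or "").split())


def clean_email(value):
    """Trim and lower-case; return '' when the shape is not plausible."""
    email = (value or "").strip().lower()
    return email if _EMAIL_RE.match(email) else ""


def clean_phone(value):
    """Keep digits only, preserving a leading plus sign if present."""
    raw = (value or "").strip()
    digits = re.sub(r"\D", "", raw)
    return "+" + digits if raw.startswith("+") else digits


def standardize(record):
    """Return a cleaned copy of one customer record (hypothetical fields)."""
    return {
        "first_name": clean_text(record.get("first_name")),
        "last_name": clean_text(record.get("last_name")),
        "address": clean_text(record.get("address")),
        "email": clean_email(record.get("email")),
        "phone": clean_phone(record.get("phone")),
    }


raw = {"first_name": "  John ", "last_name": "Smith\t", "address": "1  High   St",
       "email": " John.Smith@Example.COM ", "phone": "+44 20 7946-0958"}
print(standardize(raw))
```

Running the cleansing before the matching step matters because exact comparisons on address, email or phone would otherwise fail on differences in spacing or letter case alone.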

My fundamental question is: where can I find a component in which I can define my weights and match thresholds, and then decide which records are my customer golden records?

I seem to have got lost.

I am looking to standardise/cleanse/merge to a golden customer record.

Any pointers will be greatly appreciated.

Please can you refer to this video

https://youtu.be/sozxWzAXLBM?list=PLZrVWXgbuqT5OEM_QwwgopJHlUHAZzp2i&t=1477

In this step-through in Talend, the key values used for matching are given weights.

Thanks

Anonymous · ‎2017-05-22

Hi,

What you see in the video are the Data Quality components that can be used in a Talend job (namely tMatchGroup here), which address your deduplication use case. These components are only available in the commercial version of Talend Data Quality, not in Talend Open Studio for Data Quality. See the feature matrix at https://www.talend.com/products/data-quality for more details.

Let me know if you need additional details.

Regards,

Gwendal

Anonymous · ‎2017-05-22

Thanks for your reply.

Now I understand: the component label has been renamed in the demo. We have a licensed version of Talend Open Studio for Big Data. I can see that the palette does have all the required Data Quality components; it would be great if you could re-confirm this.

Many Thanks for your quick reply.

TalendDQ Palette.png

Anonymous · ‎2017-05-22

Hi Ashish

Talend Open Studio for Data Quality is the free, open source studio; it does not contain the cleansing components you're looking for, such as tMatchGroup.
If you can find tMatchGroup in your palette, then you're on a subscription-based product.
HTH
Elisa

Anonymous · ‎2017-05-22

Thanks very much for all your responses.

Customer Data Cleansing including data pre-processing/standardization

Data Quality

v6.x