Apologies if this sounds like a stupid question. I have been searching relentlessly for a high-level answer. I am working on customer personal identifiers/information. Over a period of years, customer information such as name, address, email and phone has not been standardized or cleansed.
I am looking to:
1) First standardize the data, i.e. remove any white space characters, ensure email addresses are valid, and so on.
2) Thereafter, de-duplicate the data based on some algorithm: define weights and upper and lower thresholds...
The weights add up to one (1).
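To make the two steps concrete, here is a minimal Python sketch (not Talend code). The field names, the weights, the two threshold values and the "Possible Match" label for scores between the thresholds are all assumptions for illustration; the email check is a deliberately loose pattern, not a full validator.

```python
import re

# Hypothetical per-field weights; they add up to 1 as described above.
WEIGHTS = {"name": 0.3, "address": 0.3, "email": 0.4}
# Hypothetical upper and lower thresholds.
UPPER, LOWER = 0.75, 0.45

# Loose shape check only: something@something.something, no spaces.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize(record):
    """Trim, collapse runs of white space and lower-case every field."""
    clean = {}
    for field, value in record.items():
        clean[field] = " ".join(str(value).split()).lower()
    # Blank out emails that do not even look like an address.
    if not EMAIL_RE.match(clean.get("email", "")):
        clean["email"] = ""
    return clean


def match_score(a, b):
    """Sum the weights of the fields that are non-empty and equal."""
    return sum(
        weight
        for field, weight in WEIGHTS.items()
        if a.get(field) and a.get(field) == b.get(field)
    )


def classify(score):
    """Map a score to a match type using the two thresholds."""
    if score >= UPPER:
        return "Same Customer"
    if score >= LOWER:
        return "Possible Match"  # assumed label for the in-between band
    return "NoMatch"
```

Exact equality per field is the simplest possible comparison; a real matching tool would normally use fuzzy comparisons (for example an edit distance) per field and weight those similarity scores instead.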
So I need to compare every record, row by row, with all other records and come up with:
| Customer Key1 | Customer Key2 | Match Type |
| A | B | Same Customer |
| A | C | Same Customer |
| D | | NoMatch |
In the above case, for example, A and B are two separate records, but they have the same payment card information, the same address and the same email, so they are the same customer.
Records A and C also match, as they have the same first name, last name and address. After that, within each group of matching records, I will create a golden record.
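The example above can be sketched in Python as follows (again, not Talend code). The two rules mirror the A/B and A/C cases; the field names, the sample values and the survivorship rule ("first non-empty value wins") are assumptions for illustration only.

```python
from itertools import combinations

# Each rule is a set of fields that must all be non-empty and equal.
RULES = [
    ("card", "address", "email"),
    ("first_name", "last_name", "address"),
]


def same_customer(a, b):
    """True if any rule is fully satisfied by the two records."""
    return any(
        all(a.get(f) and a.get(f) == b.get(f) for f in rule) for rule in RULES
    )


def match_pairs(records):
    """Compare every pair of records; return (key1, key2, match type) rows."""
    rows, matched = [], set()
    for k1, k2 in combinations(sorted(records), 2):
        if same_customer(records[k1], records[k2]):
            rows.append((k1, k2, "Same Customer"))
            matched.update((k1, k2))
    # Records that matched nothing get a NoMatch row with an empty second key.
    for k in sorted(records):
        if k not in matched:
            rows.append((k, None, "NoMatch"))
    return rows


def golden_records(records, rows):
    """Group matched keys (union-find) and merge each group into one record."""
    parent = {k: k for k in records}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path compression
            k = parent[k]
        return k

    for k1, k2, match_type in rows:
        if match_type == "Same Customer":
            parent[find(k2)] = find(k1)

    groups = {}
    for k in sorted(records):
        groups.setdefault(find(k), []).append(k)

    golden = []
    for members in groups.values():
        merged = {}
        for k in members:
            for field, value in records[k].items():
                # Assumed survivorship rule: first non-empty value wins.
                if value and not merged.get(field):
                    merged[field] = value
        golden.append((members, merged))
    return golden


# Made-up, already standardized sample data.
records = {
    "A": {"first_name": "john", "last_name": "smith", "address": "1 high st",
          "email": "js@example.com", "card": "1111"},
    "B": {"first_name": "j", "last_name": "smith", "address": "1 high st",
          "email": "js@example.com", "card": "1111"},
    "C": {"first_name": "john", "last_name": "smith", "address": "1 high st",
          "email": "", "card": ""},
    "D": {"first_name": "mary", "last_name": "jones", "address": "2 low rd",
          "email": "mj@example.com", "card": "2222"},
}
rows = match_pairs(records)
golden = golden_records(records, rows)
```

Comparing every record with every other record grows quadratically with the number of records, so on a large customer table the comparison is usually restricted to records that share some coarse key.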
I have tried Talend for Data Quality and, as far as I can see, it does data profiling only, not actual transformation. It gives you stats on how good or bad your data is.
I have also seen Talend for Data Preparation. There I can load a file, apply my basic preparations (remove white spaces, etc.) and use this preparation in a job.
My fundamental question is: where can I find a component in which I can define my weights and match thresholds, and then decide which records are my customer golden records?
I seem to have got lost.
I am looking to standardise/cleanse/merge to a golden customer record.
Any pointers will be greatly appreciated.
Please can you refer to this video
https://youtu.be/sozxWzAXLBM?list=PLZrVWXgbuqT5OEM_QwwgopJHlUHAZzp2i&t=1477
At this point in the video, which steps through Talend, the key value match is given weights.
Thanks
Hi,
What you see in the video are the Data Quality components that can be leveraged in a Talend job (namely tMatchGroup here) and that address your deduplication use case. These components are only available in the commercial version of Talend Data Quality, not in Talend Open Studio for Data Quality. See the feature matrix at https://www.talend.com/products/data-quality for more details.
Let me know if you need additional details.
Regards,
Gwendal
Thanks for your reply.
Now I understand: the component label has been renamed in the demo. We have a licensed version of Talend Open Studio for Big Data. I can see that the palette has all the required Data Quality components; it would be great if you could confirm this.
Many Thanks for your quick reply.
Thanks very much for all your responses.