We're trying to sync records from multiple data sources into one place. We want to detect duplicate records and let the end user select the master record from each duplicate group. Here is my approach:
1. Get all of the records from the data sources, merge them with tUnite, and pass the result to tMatchGroup.
2. In tMatchGroup, group the duplicate records, then pass the grouped records to Data Stewardship so the user can pick the master record.
This works for a one-time sync. But if any data source has records created or updated afterwards, we still need to transfer them to the target data source, and we need to run a duplicate check on the new records as well.
With step #2, tMatchGroup will generate duplicate groups for all of the records again (including the old records). Is there any way to detect duplicate groups only for the new records? Or is there another good approach for this?
Hello,
Please have a look at the CDC (Change Data Capture) feature in Qlik Talend Studio, which quickly identifies and captures data that has been added to, updated in, or removed from database tables and makes this change data available for future use by applications or individuals. The CDC feature is available for Oracle, MySQL, DB2, PostgreSQL, Sybase, MS SQL Server, Informix, Ingres, Teradata, and AS/400.
Best regards
Sabrina
Thanks for the reply! It will help us to catch the new changes.
But my next issue is how to detect duplicates for the new changes. Should I just run tMatchGroup again over all of the records, including the old ones, or is there a way to get duplicate groups only for the newly changed records?
In our case, we always need to check whether the records currently being synced duplicate each other or duplicate records that are already persisted, and then let the user manually select a single master record.
For the new records, you need to select the matching records in your target and pass both sets together as a new match group.
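As a rough illustration of that suggestion (not Talend code), the sketch below picks out only the target records that share a blocking key with an incoming record and combines them with the new records as the input for matching. The field names `last_name` and `zip` and both function names are hypothetical. In a real job this selection would more likely be a lookup or a filtered query against the target.

```python
def blocking_key(record):
    """Hypothetical blocking key: normalized last name plus postal code."""
    return (record["last_name"].strip().lower(), record["zip"].strip())


def build_match_input(new_records, target_records):
    """Return the new records plus only the target records that share
    a blocking key with at least one new record."""
    new_keys = {blocking_key(r) for r in new_records}
    candidates = [r for r in target_records if blocking_key(r) in new_keys]
    return new_records + candidates


# Example with made-up data: only the target record that shares a key is kept.
new_records = [{"id": "n1", "last_name": "Smith", "zip": "10001"}]
target_records = [
    {"id": "t1", "last_name": "smith ", "zip": "10001"},
    {"id": "t2", "last_name": "Jones", "zip": "20002"},
]
match_input = build_match_input(new_records, target_records)
print([r["id"] for r in match_input])
```

The combined list is what would then be handed to the matching step, so target records with no plausible link to a new record never enter the comparison.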
Ok, thanks! I'm just curious about the performance.
We always need to compare the new records with the matching records. As I understand it, the matching records will also be compared with each other again. Is there a way to check only the new records for duplicates against the whole record set? That should perform better.
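To make the performance question concrete, here is a minimal sketch of the comparison set being asked for: every pair that involves at least one new record, and no existing-versus-existing pairs. It is only an illustration of the idea. Nothing in this thread confirms that tMatchGroup can be configured to skip pairs this way, and the function name and IDs are made up.

```python
from itertools import combinations, product


def pairs_to_compare(new_ids, existing_ids):
    """Yield only the pairs that involve at least one new record."""
    # New vs. new: duplicates inside the incoming batch.
    yield from combinations(new_ids, 2)
    # New vs. existing: duplicates against already persisted records.
    yield from product(new_ids, existing_ids)


new_ids = ["n1", "n2"]
existing_ids = ["e1", "e2", "e3"]
pairs = list(pairs_to_compare(new_ids, existing_ids))
print(len(pairs))  # 7 pairs, versus 10 if all 5 records were compared pairwise
```

With n new and m existing records this gives n(n-1)/2 + n*m comparisons instead of (n+m)(n+m-1)/2, so the saving grows as the persisted set gets larger relative to each incoming batch.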