
Duplicate Data Troubleshooting Guide


Last Update: Jan 20, 2022 11:02:45 AM

Updated By: Sonja_Bauernfeind

Created date: Apr 7, 2020 12:01:26 PM

What causes Duplicates in Replicate?

 

  • Some source database endpoints do not enforce uniqueness, while the target endpoint does. Rows that are accepted on the source then violate the target's constraint (see the sketch after this list).
  • Data types on the source are different from those on the target.
  • Conversion between different data types. It can also be related to character encoding, locale settings, UTF-8 conversion, etc.
  • A FLOAT column in the Primary Key (this is NOT a best practice). Because different databases handle precision differently, this can cause conflicts/duplicate records.
  • Problems in TRANSFORMATION rules on Primary Key/Unique Index column(s) can cause duplicates.
  • UNDOCUMENTED FEATURES 😉 (defects)
  • The task is resumed from a timestamp, which causes replication of already processed data.
  • A database utility was run on the table without writing changes to the database log. As a result, Replicate is unaware of the changes. This may cause inconsistency between the source and target endpoints and, among other problems, duplicate key errors. For example, in DB2z the LOAD REPLACE utility generates subcode 83. If a transaction from the log then comes along to insert a row, the insert succeeds on the source, but when Replicate moves the insert to the target the row can already be there, hence the DUPLICATE error.
  • During the processing of cached applied changes, duplicate key errors can occur. These can be ignored.
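A minimal sketch of the first cause above, using hypothetical table and column names and generic SQL:

    -- Source table with no PK or unique constraint: duplicate key values are allowed.
    CREATE TABLE src_orders (order_id INT, amount DECIMAL(10,2));
    INSERT INTO src_orders VALUES (1, 10.00);
    INSERT INTO src_orders VALUES (1, 12.50);   -- same key value, accepted by the source

    -- Target table that enforces uniqueness.
    CREATE TABLE tgt_orders (order_id INT PRIMARY KEY, amount DECIMAL(10,2));
    -- Replicating both source rows as INSERTs fails on the second row with a
    -- duplicate key error, because the target enforces the PRIMARY KEY.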

 

What to collect when troubleshooting Duplicate data errors?

 

  • Collect the CREATE DDL for the source and target tables that are having the duplicate issue.
  • Turn on Store Changes (under Task Settings --> Store Changes Settings) to see which column is affected and the before and after image values for that column.

    (screenshot: dd1.png)

  • Set SOURCE_CAPTURE and TARGET_APPLY to Verbose. (Sometimes you can also turn on Verbose logging for DATA_RECORD; this may expose the whole record. It is not available for all endpoints.)
  • Collect the Diagnostic Package, or at least the task JSON.
  • Collect the attrep_apply_exceptions and _ct tables to get the values/rows of data that are getting the error (see the example queries after this list).
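A sketch of the kind of queries involved, run on the target database. The control-table and change-table names and columns below follow common Replicate defaults and are assumptions; they may differ in your environment:

    -- Rows that hit apply errors (column names may vary by version).
    SELECT TASK_NAME, TABLE_NAME, ERROR_TIME, STATEMENT, ERROR
    FROM attrep_apply_exceptions
    WHERE TABLE_NAME = 'TRANSPORTER_PROFILE'      -- hypothetical table name
    ORDER BY ERROR_TIME DESC;

    -- Recent captured changes for the same table from its Store Changes table.
    SELECT *
    FROM TRANSPORTER_PROFILE__ct                  -- default change-table suffix; may differ
    ORDER BY header__change_seq DESC;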

 

What to do to prevent Duplicates when using Replicate?

  • Review source and target DDLs to make sure the keys are aligned
  • Use a data profiling tool to determine whether there are any duplicate rows in the source database, as some RDBMSs don’t enforce a unique constraint (see the example query after this list).
  • Use UPSERT mode: i.e. configure the task under Task Settings --> Error Handling --> Apply Conflicts --> ‘Duplicate key when applying INSERT:’ and set it to ‘UPDATE the existing target record’. (screenshot: dd2.jpg)
 

Note: the concept ‘UPSERT mode’ is NOT documented in the User Guide, i.e. it is not a term you can search for in the User Guide and it is not a keyword in the Replicate UI.
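In SQL terms, this Apply Conflicts setting behaves roughly like an upsert on the target. A minimal conceptual sketch, reusing the hypothetical tgt_orders table from the sketch above (generic MERGE syntax; this is not how Replicate implements it internally):

    -- Conceptual equivalent of "Duplicate key when applying INSERT: UPDATE the existing target record".
    MERGE INTO tgt_orders AS t
    USING (SELECT 1 AS order_id, 12.50 AS amount) AS s   -- hypothetical incoming change
       ON t.order_id = s.order_id
    WHEN MATCHED THEN UPDATE SET t.amount = s.amount
    WHEN NOT MATCHED THEN INSERT (order_id, amount) VALUES (s.order_id, s.amount);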

  • In Change Processing Tuning be aware of the following: “Applying cached events in transactional mode to endpoints that do not enforce constraints (such as Vertica and IBM Netezza), may result in duplicate records on the target. This is because such endpoints do not return duplicate errors.” (see the User Guide)
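A sketch of the data profiling check described above, reusing the hypothetical src_orders table from the first sketch (any duplicate-detection query over the key columns works):

    -- Key values that occur more than once on the source.
    SELECT order_id, COUNT(*) AS row_count
    FROM src_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1;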

 

Steps to take to resolve a Duplicate Data problem:

 

  • Check that the same primary keys or unique indexes exist on both endpoints (see the query sketch after this list).
  • If there are a lot of discrepancies between the source and target, the easiest way to resolve them is to do a Full Load of the table(s).
  • Adjust the Apply Conflicts settings to your needs.
  • If there is a transformation rule on Primary Key or Unique Index columns, check that the transformations are valid.
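A sketch of the first check using the standard INFORMATION_SCHEMA views (availability and exact catalog views vary by database; the table name is hypothetical). Run it against both endpoints and compare the results:

    -- List the columns of PRIMARY KEY / UNIQUE constraints for a table.
    SELECT tc.CONSTRAINT_TYPE, kcu.COLUMN_NAME, kcu.ORDINAL_POSITION
    FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
    JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
      ON kcu.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
     AND kcu.TABLE_NAME = tc.TABLE_NAME
    WHERE tc.CONSTRAINT_TYPE IN ('PRIMARY KEY', 'UNIQUE')
      AND tc.TABLE_NAME = 'TRANSPORTER_PROFILE'   -- hypothetical table name
    ORDER BY tc.CONSTRAINT_TYPE, kcu.ORDINAL_POSITION;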

 

Example Cases showing Duplicate Data:

 

Example #1:

Table 'SHIPPING'.'TRANSPORTER_PROFILE' (subtask 1 thread 1) is suspended. RetCode: SQL_SUCCESS SqlState: 23000 NativeError: 2627 Message: [Microsoft][SQL Server Native Client 11.0][SQL Server]Violation of PRIMARY KEY constraint 'TRANSPORTER_PROFILE_PK_TRANSPORTER_PROFILE'. Cannot insert duplicate key in object 'rol.TRANSPORTER_PROFILE'. The duplicate key value is (BNIN , BETTA, EMAIL).; Failed to commit 10000 rows to target 'rol.TRANSPORTER_PROFILE'

Resolution: Upon checking the record from the source, it contained spaces; after running a query that trims the records, it was confirmed to be a duplicate record. The customer is expecting duplicate records, so they will change the Apply error handling. As for Full Load, they can either change the PK on the target or filter out the "bad record" (see the sketch below).
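A minimal sketch of the kind of trim check described, with hypothetical key column names standing in for the table's real PK columns:

    -- Key values that collide once trailing spaces are trimmed.
    SELECT RTRIM(key_col1) AS k1, RTRIM(key_col2) AS k2, RTRIM(key_col3) AS k3, COUNT(*) AS row_count
    FROM SHIPPING.TRANSPORTER_PROFILE
    GROUP BY RTRIM(key_col1), RTRIM(key_col2), RTRIM(key_col3)
    HAVING COUNT(*) > 1;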

 

Example #2:

Resolved by customer: "I just found out the issue. I had to re-create the table late Friday night and when doing so applied the wrong PK to the table. I was experimenting with results in COMPOSE modelling in TEST and used wrong script when applying to PROD. I will reload prod and let you know whether I still see these issues. You can close this ticket. However, can you tell me whether it would be possible to see these warning in the notifications monitoring via AEM?"

Replicate allows you to set a notification for warnings (written to the log with "W:" as opposed to "E:" for errors). Also if you adjust the error handling settings, AEM can send a notification with the "all errors" notification option:

If there is a duplicate key when applying an INSERT and the Apply Conflicts handling option is set to “Log record to the exceptions table”, no notification will be sent. However, if it is set to “Stop task”, a notification will be sent.

 

Example #3:

ISSUE: We got duplicate data in Snowflake.
The duplicate data appeared around 02/24 12:00 PM.
Please advise how to avoid such an issue in the future.
I think I need to reload the whole task.

Resolution: The issue has to do with applying cached changes after the Full Load completes. The following is the recommended configuration for a Snowflake target: in Task Settings --> Change Processing Tuning, clear the checkbox “Apply batched changes to multiple tables concurrently”. In Error Handling --> Apply Conflicts, set “Duplicate key when applying INSERT” to UPDATE the existing target record, and “No record found for applying an UPDATE” to INSERT the missing target record.
