Handling Duplicate Rows when doing a Full Load in Replicate

Sonja_Bauernfeind · Nov 6, 2023 1:18:04 AM

This is a problem which, at first glance, should not (and, you would think, logically cannot) happen. It is therefore important to understand why it does happen, and what can be done to resolve it when it does.

The situation is that Replicate is doing a Full Load for a table (individually, or as part of a task that full loads many tables). The source and target tables have identical unique primary keys. There are no uppercasing or other character set issues in any of the columns that make up the key, which can sometimes cause duplication problems. Yet as the Full Load for the table progresses, probably nearing the end, you get a message indicating that Replicate has failed to insert a row into the target because of a duplicate. That is, there is already a row in the target table with the unique key of the row that it is trying to insert. The Full Load for that table is terminated (often after several hours), and if you try again the same error, perhaps for a different row, will often occur.

Logically this shouldn’t happen, but it does. The likelihood of it doing so depends on the source DBMS type, the type of columns in the source table, and you will find it is always for a table that is being updated (SQL UPDATEs) as Replicate copies it. The higher the update rate and the bigger the table, the more likely it is to happen.

Note: This article discusses duplicates that occur in TARGET_LOAD and not in TARGET_APPLY; that is, during Full Load and before the cached changes start to be applied.

Analysis:

To understand the fix we first need to understand why the problem occurs, and this involves understanding some of the internal workings of most conventional Relational Database Management Systems.

RDBMSs tend to employ different terminology for things that exist in all of them. I’m going to use DB2 terminology and explain each term the first time I use it. With a different RDBMS the terminology may be different, but the concepts are generally the same.

The first concept to introduce is the Tablespace. That’s what it’s called in DB2, but it exists for all databases and is the physical area where the rows that make up the table are stored. Logically it can be considered as a single contiguous data area, split up into blocks, numbered in ascending order.

This is where your database puts the row data when you INSERT rows into the table. What’s also important is that it tries to update the existing data for a row in place when you do an UPDATE, but may not always be able to do so. If that is the case then it will move the updated row to another place in the tablespace, usually at what is then the highest used (the endpoint) block in the tablespace area.

The next point concerns how the DBMS decides to access data from the tablespace in resolving your SQL calls. Each RDBMS has an optimiser, or something similar that makes these decisions. The role of indexes with a relational database is somewhat strange. They are not really part of the standard Relational Database model, although in practice they are used to guarantee uniqueness and support referential integrity. Other than for these roles, they exist only to help the optimiser come up with faster ways of retrieving rows that satisfy your SELECT (database read) statements.

When any piece of SQL (we’ll focus on simple SELECT statements here) is presented to the optimiser, it decides what method to use to search for and retrieve any matching rows from the tablespace. The default method is to search through all the rows directly in the tablespace, looking for rows that match any selection criteria; this is known as a Tablespace Scan.

A Tablespace Scan may be the best way to access rows from a table, particularly if it is likely that many or most of the rows in the table will match the selection criteria. For other SELECTs though that are more specific about what row(s) are required, a suitable matching index may be used (if one exists) to go directly to the row(s) in the tablespace.

The sort of SQL that Replicate generates to execute against the source table when it is doing a Full Load is of the form SELECT * FROM table_name, or SELECT col1, col2, … FROM table_name. Neither of these has any row specific selection criteria, which is to be expected, as a Full Load is in general intended to select all rows from the source table.

As a result the database optimiser is not likely to choose to use an index (even if a unique index on the table exists) to resolve this type of SELECT statement, and instead a Tablespace Scan of the whole tablespace area will take place. This, as you will see later, can be inconvenient to us but is in fact the fastest way of processing all the rows in the table.

When we do a Full Load copy for a table that is ‘live’ (being updated as we copy it), the result we end up with when the SELECT against the source has been completed and we have inserted all the rows into the target is not likely to be consistent with what is then in the source table. The extent of the differences is dependent on the rate of updates and how long the Full Load for that table takes. For high update rates on big tables that take many hours for a Full Load the extent of the differences can be quite considerable.

This all sounds very worrying but it is not as the CDC (Change Data Capture) part of Replicate takes care of this. CDC is mainly known for Replicating changes from source to target after the initial Full Load has been taken, keeping the target copies up to date and in line with the changing source tables. However CDC processing has an equally important role to play in the Full Load process itself, especially when this is being done on ‘live’ tables subject to updates as the Full Load is being processed.

In fact, CDC processing does not start when Full Load finishes, but before Full Load starts. This is so that it can collect details of changes that are occurring at the source while the Full Load (and its associated SELECT statement) is taking place. The changes collected during this period are known as the ‘cached changes’, and they are applied to the newly populated target table before switching into normal ongoing CDC mode to capture all subsequent changes.
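The effect of applying cached changes can be pictured with a small conceptual model. This is only a sketch and not Replicate's implementation: the change tuple format, the function name apply_cached_changes, and the choice to treat INSERT and UPDATE as the same "write the latest value" operation are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical change record: (operation, primary key, new value or None).
Change = Tuple[str, int, Optional[str]]


def apply_cached_changes(target: Dict[int, str], changes: List[Change]) -> Dict[int, str]:
    """Replay, in capture order, the changes collected while the Full Load ran."""
    for op, key, value in changes:
        if op == "DELETE":
            # The row may or may not have been copied by the Full Load.
            target.pop(key, None)
        else:
            # INSERT or UPDATE: assumed here to overwrite with the latest value.
            target[key] = value
    return target


# What the Full Load SELECT happened to read while the source was changing.
target = {1: "a-old", 2: "b", 3: "c"}
# Changes captured from before the Full Load started until it finished.
cached = [("UPDATE", 1, "a-new"), ("DELETE", 3, None), ("INSERT", 4, "d")]
# State of the source table once the Full Load has finished.
source_now = {1: "a-new", 2: "b", 4: "d"}

result = apply_cached_changes(target, cached)
print(result == source_now)
```

In this model the target only becomes consistent with the source after the replay, which is why the inconsistencies picked up during the copy are normally harmless.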

This takes care of and fixes all of the table row data inconsistencies that are likely to occur during a table Full Load, but there is one particular situation that can occur and catch us out before the Full Load completes and the cached changes can be applied. This results in Replicate trying to insert details for the same row more than once in the target table; triggering the duplicates error that we are talking about here.

Consider this situation:

We obtain a copy of a row with a particular key as the Tablespace Scan satisfying our SELECT statement passes by its location in the source tablespace.
Subsequently that row is updated in the source table. After the update it no longer fits into its original place in the tablespace, so the row for that key is moved to an empty block at the current high point in the tablespace.
As the Tablespace Scan continues, it comes across the new copy of the row for that same key and returns it as part of the result set for the SELECT statement. Now we have 2 rows for the same key that we are going to try to insert into the target table.

That is how the problem occurs. Variable length columns and binary object columns in the source table make this movement of a row to a new location in the tablespace much more likely, and with it the duplicate insert problem.
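The three steps above can be reproduced with a toy model. This is a simplified sketch, not how any particular DBMS stores data: the tablespace is modelled as a Python list with one row per block, and the function name and its parameters are hypothetical.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (primary key, payload)


def scan_with_relocation(
    blocks: List[Optional[Row]], update_after: int, key: int, new_payload: str
) -> List[Row]:
    """Read blocks in ascending order, as a Tablespace Scan would.

    After `update_after` blocks have been read, an UPDATE enlarges the row
    for `key`, so it is moved to a new block at the end of the tablespace.
    """
    result: List[Row] = []
    position = 0
    while position < len(blocks):
        row = blocks[position]
        if row is not None:
            result.append(row)
        position += 1
        if position == update_after:
            old = next(i for i, r in enumerate(blocks) if r is not None and r[0] == key)
            blocks[old] = None                  # original location is freed
            blocks.append((key, new_payload))   # row reappears at the high point
    return result


tablespace: List[Optional[Row]] = [(1, "a"), (2, "b"), (3, "c")]
rows = scan_with_relocation(tablespace, update_after=2, key=1, new_payload="a-much-longer")
keys = [k for k, _ in rows]
print(keys)  # [1, 2, 3, 1]
```

Key 1 is returned twice: once from its original block and once from its new block at the end, which is exactly the pair of rows that collides on the target's unique key.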

Resolution:

So how do we fix this, or at least find a way to stop it happening?

The solution is to persuade the optimiser in the source database to use the unique index on the table to access the rows in the table’s tablespace rather than scanning sequentially through it. The index (which is unique) will only provide one row to read for each key as the execution of our SELECT statement progresses. We don’t have to worry about whether it is the ‘latest’ version of the row or not because that will be taken care of later by the application of the cached changes.
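To see why index-driven access avoids the duplicate, the same kind of toy model can be read through a unique index instead of by block order. Again this is only an illustrative sketch: the index is modelled as a dictionary from key to block number, and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[int, str]  # (primary key, payload)


def read_via_unique_index(
    blocks: List[Optional[Row]],
    index: Dict[int, int],
    update_after: int,
    key: int,
    new_payload: str,
) -> List[Row]:
    """Visit each key once, in index order, and fetch its row from wherever it is now.

    After `update_after` keys have been read, the row for `key` is relocated
    to a new block at the end of the tablespace and the index is repointed.
    """
    result: List[Row] = []
    for visited, k in enumerate(sorted(index), start=1):
        row = blocks[index[k]]
        assert row is not None  # the index always points at a live row
        result.append(row)
        if visited == update_after:
            blocks[index[key]] = None
            blocks.append((key, new_payload))
            index[key] = len(blocks) - 1
    return result


tablespace: List[Optional[Row]] = [(1, "a"), (2, "b"), (3, "c")]
unique_index = {1: 0, 2: 1, 3: 2}
rows = read_via_unique_index(tablespace, unique_index, 2, key=1, new_payload="a-much-longer")
keys = [k for k, _ in rows]
print(keys)  # [1, 2, 3]
```

Each key appears exactly once. The copy of key 1 that was read is the older version, which is acceptable because the cached changes bring it up to date afterwards.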

The optimiser can (generally) be persuaded to use the unique index on the source table if the SELECT statement requires the rows in the result set to be returned in the order given by that index. This means having a SELECT statement with an ORDER BY clause matching the columns in the unique index, something of the form SELECT * FROM table_name ORDER BY col1, col2, col3, etc., where col1, col2, col3, etc. are the columns that make up the table’s unique primary index.

But how can we do this? Replicate has an undocumented facility that allows the user to configure extra text to be added to the end of the generated SQL for a particular table during Full Load processing. It is intended specifically for adding a WHERE clause that determines which rows are included in and excluded from a Full Load extract.

This is not exactly what we want to do (we want to include all rows), but this ‘FILTER’ facility also provides the option to extend the content of the SELECT statement that is generated after the WHERE part of the statement has been added. So we can use it to add the ORDER BY part of the statement that we require.

Here is the format of the FILTER statement that you need to add.

--FILTER: 1=1) ORDER BY col1, col2, coln --

This is inserted in the ‘Record Selection Condition’ box on the individual table filter screen when configuring the Replicate task. If you want to do this for multiple tables in the Replicate task then you need to set up a FILTER for each table individually.

To explain, the --FILTER: keyword indicates the beginning of filter information that is expected to begin with a WHERE clause (which is generated automatically).

The 1=1) component completes that WHERE clause in a way that all rows are selected (you could put in something to limit the rows selected if required, but that’s not what we are trying to achieve here).

It is then possible to add other clauses and parameters before terminating the additional text with the final --

In this case an ORDER BY clause is added that will guarantee that rows are returned in the order selected. This causes the unique index on the table to be used to retrieve rows at the source, assuming that you code col1, col2, etc. to match the columns and their order in the index. If the index has some columns in descending order (rather than ascending), make sure that is coded in the ORDER BY clause as well.
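When many tables need this treatment, it can help to generate the filter text rather than type it by hand. The following is a minimal sketch: the helper name build_order_by_filter and its input format are hypothetical and not part of Replicate, and the output simply follows the format given in this article, including the space before the closing two dashes.

```python
from typing import Iterable, Tuple


def build_order_by_filter(key_columns: Iterable[Tuple[str, str]]) -> str:
    """Build the text for the 'Record Selection Condition' box.

    key_columns: (column name, "ASC" or "DESC") pairs, listed in the same
    order and direction as the columns of the table's unique index.
    """
    parts = []
    for name, direction in key_columns:
        direction = direction.upper()
        if direction not in ("ASC", "DESC"):
            raise ValueError(f"unknown sort direction for {name}: {direction}")
        # ASC is the SQL default, so only DESC is spelled out.
        parts.append(name if direction == "ASC" else f"{name} DESC")
    if not parts:
        raise ValueError("at least one key column is required")
    # "1=1)" completes the generated WHERE clause so that all rows are selected.
    return "--FILTER: 1=1) ORDER BY " + ", ".join(parts) + " --"


example = build_order_by_filter([("col1", "ASC"), ("col2", "DESC")])
print(example)  # --FILTER: 1=1) ORDER BY col1, col2 DESC --
```

The generated string still has to be pasted into each table's filter individually, and it is worth checking the SELECT statement in the task log after the first run.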

If you code things incorrectly the generated SELECT statement will fail and you will be able to see and debug this through the log.

jjames · ‎2020-12-16

Hello,

I was trying out this method for doing a full load. Source and Target endpoint are MS SQL Server.

Parse expression is giving out the below error message

"SQLite general error. Code <1>, Message <unrecognized token: ":">.

I ran the task and am still getting the above error in the task log.

Questions:

- Is this supported for endpoints other than DB2?

- Is this still a supported feature on v6.6+ of Qlik Replicate?

Thanks,

jjames


TomB · ‎2020-12-29

Similarly, I'm trying to get this to work with a Hana data source. I'm accessing the hidden 'Fullload Passthru Filter' by Ctrl-clicking the table and navigating to Filter (it is an added feature in the lower left).

Here are two examples that do not work.

Example 1:
Fullload Passthru Filter:
MANDT='010' ORDER BY MANDT, VBELN

Yields SQL Statement (from log):

SELECT ....
FROM "****"."VBRK"
WHERE (MANDT='010' ORDER BY MANDT, VBELN)

looks great with the exception of the parenthesis!

Example 2 (as documented above in this blog post)
Fullload Passthru Filter:
"MANDT"='010')) ORDER BY "MANDT", "VBELN"

Yields SQL Statement (from log):
SELECT ....
FROM "****"."VBRK"
WHERE ("MANDT"='010')) ORDER BY "MANDT", "VBELN")

It's the last close parenthesis ')' that always gets in the way! Why is it there?

TomB · ‎2020-12-29

Replicate 6.6.0.384 for examples above.

Jeff_Schwartz · ‎2021-01-20

Jjames - Are you using two dashes and not one? The keyword is '- -FILTER' (I've added a space between the 2 dashes for emphasis) and not '-FILTER'. With the 2 dashes (and without the space) it works for me.

jjames · ‎2021-01-21

Hello Jeff,

I tried with two dashes and used one parenthesis instead of two, and that fixed it. Thank you.

--FILTER: 1=1) ORDER BY Col1--

Testing was successful on MS SQL Server and DB2 z/OS.

Thanks,

jjames

john_wang · ‎2021-01-22

Hello @jjames ,

This is correct. I think R&D changed something or even fixed a minor bug, and that led Replicate to behave differently.

Regards,

John.

KellyHobson · ‎2022-05-27

Another example from a recent customer:

Paste

--FILTER: 1=1) ORDER BY FIRST_NAME, LANGUAGE, LAST_NAME --

directly into the record selection condition.

Table Settings -> Filter -> Record Selection Condition (paste it in the box on bottom right and do not go to Expression Builder)

KellyHobson · ‎2023-07-10

++ Another example of a syntax issue that a customer had.

Filter causing issue:

--FILTER: 1=1) ORDER BY USERID--

Resolved with:

--FILTER: 1=1) ORDER BY USERID --

They needed an extra space after the PK field, before the final --

JacobTews · ‎2023-11-03

@Sonja_Bauernfeind this is perhaps the best technical article I've ever read. Kudos to you for the clear, thorough documentation of the issue and its solution!
