JacobTews
Creator

Getting duplicates in data warehouse from source table with no primary key

I would be grateful if someone can see what I'm missing here. I'll explain the scenario:

  • Table ITEMHIST on the source has no PK; every record is accepted as unique
  • Records in ITEMHIST may be updated; the source achieves this by a DELETE followed by an INSERT
  • Here is the current state of one record on the source:

JacobTews_0-1734045246938.png

  • Here is that same single record in the data warehouse target, which has been duplicated in the load process:

JacobTews_1-1734045348610.png

(Given the LastCommitDatestamp values, which flow from Replicate, it appears the first line is the original record and the second line is the current state of the record in the source.)

  • Here is the archive metadata from the change table for that same record:

JacobTews_2-1734045500435.png

 

My questions:

  1. Why does the archive not show the original insert of the record? It seems to begin with a DELETE, which doesn't make sense to me. Is this because the initial load of a table is not captured?
  2. Am I getting the duplicate because of a combination of how the source handles updates and the fact that I've set the PK in Compose as the RRN? No column or combination of columns in the source can serve as a key; even selecting all columns as a PK doesn't work, since legitimate duplicate transactions are possible (e.g. two identical moves of inventory occurring simultaneously). Using the RRN from Replicate was my workaround.
  3. How can I best handle this scenario? Obviously I can run a post-load custom script to just delete all but the newest record, but I have to believe there's a better way that I'm just missing at the moment.
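To illustrate, here's the kind of post-load cleanup I have in mind (a minimal Python sketch; the column names and values are made up, and the real table has more business columns):

```python
# Sketch of the post-load dedup idea: group rows by every business column,
# then keep only the row with the newest LastCommitDatestamp per group.
# Column names ("item", "qty", "last_commit_ts") are illustrative.
from itertools import groupby
from operator import itemgetter

rows = [
    # Same business values loaded twice (original + post-DELETE/INSERT copy)
    {"item": "A100", "qty": 5, "last_commit_ts": "2024-12-01T10:00:00"},
    {"item": "A100", "qty": 5, "last_commit_ts": "2024-12-12T09:30:00"},
    {"item": "B200", "qty": 1, "last_commit_ts": "2024-12-05T08:00:00"},
]

business_cols = itemgetter("item", "qty")

def keep_newest(rows):
    """Keep the newest row per business-column group."""
    deduped = []
    for _, grp in groupby(sorted(rows, key=business_cols), key=business_cols):
        deduped.append(max(grp, key=itemgetter("last_commit_ts")))
    return deduped

result = keep_newest(rows)
# A100 survives once (the 2024-12-12 version), plus B200.
```

Of course this has exactly the weakness I mentioned: two legitimately identical transactions would also be collapsed, which is why I suspect I'm missing a better approach.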

 

Looking forward to learning from y'all about how to deal with this type of situation!

Qlik Compose for Data Warehouses Qlik Compose Qlik Replicate 

1 Reply
TimGarrod
Employee

Hi, 

In the Change Table you only see the activity that occurred in the DB log. The way Replicate works with Compose is to send the full / initial load, which would contain the original record.
If you skipped the full load, then you would not see the 'original insert': you only get records from the time you start the Replicate task.

Since your source does updates via a DELETE/INSERT pair, that is what you see in the change table.
In general, Compose ignores deletes. (Just because a record is deleted does not mean it should be removed from the data warehouse; it is still applicable for analytics. Think of archival processes that run on source systems, or a product you DID sell but don't anymore, which should still be available for historical analysis in the DW.)

 

In order to manage history / updates to data, Compose needs some type of 'logical key'. If you are using an RRN, that is why you are seeing these 'duplicates': because the source does a DELETE/INSERT instead of an update, the RRN is changing.

The issue here is that you have data with no discernible 'logical key' which is updated via delete+insert. Any downstream process (not just Compose) would need some form of logical key in order to update the target.
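To sketch that mechanism concretely (illustrative RRNs and column names; a toy stand-in for the real apply logic, not Compose's actual implementation):

```python
# Why an RRN-based key produces duplicates under DELETE/INSERT.
# The target is keyed on RRN; a source "update" deletes the old row and
# inserts a new one, and the new row arrives with a brand-new RRN.

def apply_changes(dw, changes):
    """Toy merge: upsert by key, ignore deletes (as Compose does)."""
    for op, rrn, row in changes:
        if op == "DELETE":
            continue  # deletes are ignored; history stays in the DW
        dw[rrn] = row  # new RRN means new key, so this is an insert, not an update
    return dw

dw = {1001: {"item": "A100", "qty": 5}}  # initial load, RRN 1001

# Source "updates" the row via DELETE + INSERT; the insert gets RRN 2002.
changes = [
    ("DELETE", 1001, None),
    ("INSERT", 2002, {"item": "A100", "qty": 7}),
]

apply_changes(dw, changes)
# dw now holds RRNs 1001 and 2002: the same business record twice.
```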

I'm curious: you state "I can run a post-load custom script to just delete all but the newest record".
How would you handle that delete? If it's possible that there are legitimate duplicates in the dataset (even based on using every business column you have), how would you determine the 'latest' record that should be kept or deleted?

If you can explain that, we can think about the best way to solve this.