
Best Practices for Managing SOFT and HARD Deletes with Qlik Compose for Data Warehouses

TimGarrod
Employee


Data warehouses are optimized to store historical data for complex analytics and reporting. Historical data can include transactional history (e.g. storing all transactions for seven years) or slowly changing dimension attribute history (e.g. managing changes to specific attributes, such as name or address, over time). Consequently, we rarely purge data from the data warehouse.

However, the operational systems that feed the data warehouse frequently delete records. Some systems allow for complete record removal, while others impose restrictions. As a result, the data warehouse can experience side effects of record deletion from the source systems.

In short, managing record deletion in a data warehouse is complicated, and this paper describes the best practices for handling various data deletion scenarios with Qlik Data Integration.

 

<<UPDATED 3/5/2021 with additional comments on how to handle data marts>>

Comments
aj96
Contributor III
How do you manage soft deletes if the records are re-inserted with the same key but different data in the other columns?
TimGarrod
Employee

Hi Aj96 - are you saying the primary key may be the same, but the data really represents something completely different? 

For example:

ID | Col1 | Col2
1  | Jane | Doe    - gets deleted, and a new record gets inserted with
1  | John | Smith  - which really represents a different instance of the entity?

 

Can you provide an example as that will help us provide guidance... 

 

aj96
Contributor III

Yeah, exactly. If I'm right, on the Replicate side the first record will come through as an update and the second as an insert. In that case the new record won't be inserted into the table since the keys are the same, and this would lead to incorrect data in the replicated tables. It could also lead to bad data if we use that table as a lookup. What can be done in such a situation?

TimGarrod
Employee

Is this related to Compose DW or just Replicate?

If it is Replicate only and you are only concerned with the replicated tables, then the first change will be seen as an UPDATE. However, for the INSERT to then be processed as an update instead of an INSERT (which could fail, as you point out), you can configure Replicate to handle Apply Conflicts.

In the Replicate task configuration - go to Task Settings.  In the Error Handling section select Apply Conflicts and configure Replicate to handle duplicate keys / no records found.  (You likely only need to handle duplicate keys in this scenario). 

This will allow Replicate to detect duplicate PK issues and resolve them by updating the existing record.

See the screenshot below of the Apply Conflicts settings:

[Screenshot: TimGarrod_0-1600986216001.png]

 

This will solve Replication scenarios. 
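Conceptually, that duplicate-key resolution behaves like an upsert on the target. Here is a rough SQL sketch of the behavior - the table and column names are placeholders, and Replicate generates its own platform-specific statements, so treat this purely as an illustration:

-- Duplicate key on INSERT -> apply the change as an UPDATE instead.
-- customer_target / incoming_change / id / first_name / last_name are hypothetical names.
MERGE INTO customer_target AS tgt
USING incoming_change      AS src        -- the INSERT that collided on the PK
   ON tgt.id = src.id
WHEN MATCHED THEN                        -- duplicate key: update the existing row
    UPDATE SET tgt.first_name = src.first_name,
               tgt.last_name  = src.last_name
WHEN NOT MATCHED THEN                    -- no record found: fall back to an insert
    INSERT (id, first_name, last_name)
    VALUES (src.id, src.first_name, src.last_name);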

If you are using Compose, you may need to either use a different (more natural) key or add some single-table / multi-table ETL to update the current key columns in the staging and HUB tables.

Let me know if the Replicate configuration solves your issue. If not and it's Compose, I'll try to put together an example.

 

 

 

aj96
Contributor III

Thanks Tim 🙂 . The Replicate configuration did solve my issue in the replicated tables, but for Compose it updated the records correctly most of the time; once or twice it did not, even before I changed the Apply Conflicts setting. Does Compose use the order in which the records arrived and apply them accordingly?

TimGarrod
Employee

If you are running Compose CDC ETL Sets - they are consuming from the __CT Tables. 

The __CT table has every change - so you'll see the UPDATE <which was really the DELETE> then you'll see the INSERT.   

If the header__timestamp column is the same in both rows (which it may be if the DELETE/INSERT occur within the same millisecond on the source), then Compose will pull the more recent record. (It uses the header__change_seq_num, which Replicate generates and which tells us the order of operations in the source transaction log.)

If the header__timestamp column is not the same, but you've defined the attribute as Type 1, Compose will perform the same operation.
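If you want to check which row Compose would treat as the most recent for each key, a rough way is to rank the __CT rows yourself. A hedged sketch - customer__ct and p_id are placeholder names based on your example, and the header column names are the ones discussed above (exact names can vary by task):

-- Rank the change rows per key so the latest change (by timestamp, then change
-- sequence number) comes out on top.
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY c.p_id
               ORDER BY c.header__timestamp DESC,
                        c.header__change_seq_num DESC
           ) AS rn
    FROM customer__ct c
) ranked
WHERE rn = 1;    -- the row that should "win" for each key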

 

The complexity here (it seems to me) is that you essentially have a "PK change". The "old" PK is no longer really valid. With or without Compose, you would need to identify that the delete occurred and essentially change the key in Compose for the now "deleted" record. Off the top of my head, this would require some custom SQL as a SINGLE-TABLE ETL that would detect these "deletes" and then UPDATE both the TSTG table AND the current HUB to alter the key value.

 

That would allow the new INSERT to be inserted and the old data to remain, but with something that identifies it as being deleted (and likely some date value so if it happens again you can retain all the history). 
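To make that concrete, here is a minimal sketch of what such a single-table ETL could look like, assuming hypothetical table and column names (HUB_CUSTOMER, TSTG_CUSTOMER, customer_id, is_deleted) and a character-typed key; the real statements depend on your Compose model and naming:

-- Hypothetical single-table ETL sketch (not Compose-generated code).
-- Re-key the "deleted" record so the re-used business key can arrive as a new
-- entity while the old row and its history are retained and identifiable.
-- In practice you would also guard against re-keying rows that were already re-keyed.
UPDATE HUB_CUSTOMER
SET    customer_id = customer_id || '_DEL_' || TO_CHAR(CURRENT_DATE, 'YYYYMMDD')
WHERE  is_deleted = 'Y';

UPDATE TSTG_CUSTOMER
SET    customer_id = customer_id || '_DEL_' || TO_CHAR(CURRENT_DATE, 'YYYYMMDD')
WHERE  is_deleted = 'Y';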

Feel free to PM me - perhaps we can jump on a call to discuss your options and what you are trying to do?

aj96
Contributor III

Absolutely. 

I have tried this out with the idea that the updates are applied according to the header__change_seq_num. The __CT tables have the same timestamp, as you said, and according to the header__change_seq_num, the transactions with the delete flag 'Y' came first and the transactions with the delete flag 'N' came after those. So, for example, the __CT data looks like this:

HEADER_SEQ_NUM | OP_IND | P_ID | NAME  | DELETE_FLAG
1 | "U" | 1000 | SAM   | Y
2 | "U" | 1001 | JOHN  | Y
3 | "U" | 1002 | DOE   | Y
4 | "I" | 1000 | SAM1  | N
5 | "I" | 1001 | JOHN1 | N
6 | "I" | 1002 | DOE1  | N

Now I have selected 'handle duplicates' in Compose, and after the run I would expect the records to be updated from the __CT rows having the delete flag 'N'. But 1 out of the 4 times I tried updating the same way, the records were updated with the delete flag 'Y' rows; the other times they were updated correctly. The one time that happened, the seq_num was in the order shown in the example above.

 

 
