
Best Practices for Managing SOFT and HARD Deletes with Qlik Compose for Data Warehouses

TimGarrod
Employee


Data warehouses are optimized to store historical data for complex analytics and reporting. Historical data can include transactional history (e.g. storing all transactions for seven years) or slowly changing dimension attribute history (e.g. managing changes to specific attributes, such as name or address, over time). Consequently, we rarely purge data from the data warehouse.

However, the operational systems that feed the data warehouse frequently delete records. Some systems allow for complete record removal, while others impose restrictions. As a result, the data warehouse can experience side effects of record deletion from the source systems.

In short, managing record deletion in a data warehouse is complicated, and this paper describes the best practices for handling various data deletion scenarios with Qlik Data Integration.

 

<<UPDATED 3/5/2021 with additional comments on how to handle data marts>>

Comments
aj96
Contributor III
How do you manage soft deletes if the records are re-inserted with the same key but different data in the other columns?
TimGarrod
Employee

Hi Aj96 - are you saying the primary key may be the same, but the data really represents something completely different? 

For example:

ID | Col1 | Col2
1  | Jane | Doe    - gets deleted, and a new record gets inserted with
1  | John | Smith  - which really represents a different instance of the entity?

 

Can you provide an example as that will help us provide guidance... 

 

aj96
Contributor III

Yeah, exactly. If I'm right, on the Replicate side the first record will come through as an update and the second as an insert. In that case the new record won't be inserted into the table since the keys are the same, and this would lead to incorrect data in the replicated tables. It could also lead to bad data if we use that table as a lookup. What can be done in such a situation?

TimGarrod
Employee

Is this related to Compose DW or just Replicate?

If it is Replicate only and you are only concerned with the replicated tables, then the first change will be seen as an UPDATE. However, for the INSERT to then be processed as an update instead of an INSERT (which could fail, as you point out), you can configure Replicate to handle Apply Conflicts.

In the Replicate task configuration - go to Task Settings.  In the Error Handling section select Apply Conflicts and configure Replicate to handle duplicate keys / no records found.  (You likely only need to handle duplicate keys in this scenario). 

This will allow Replicate to detect duplicate PK issues and resolve them by updating the existing record.

See the screenshot below of the Apply Conflicts settings:

[Screenshot: TimGarrod_0-1600986216001.png]

 

This will solve Replication scenarios. 
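Conceptually, that duplicate-key resolution behaves like an upsert on the target. Here is a rough SQL sketch of the behavior - the table and column names are placeholders, and Replicate generates its own platform-specific statements, so treat this purely as an illustration:

-- Duplicate key on INSERT -> apply the change as an UPDATE instead.
-- customer_target / incoming_change / id / first_name / last_name are hypothetical names.
MERGE INTO customer_target AS tgt
USING incoming_change      AS src        -- the INSERT that collided on the PK
   ON tgt.id = src.id
WHEN MATCHED THEN                        -- duplicate key: update the existing row
    UPDATE SET tgt.first_name = src.first_name,
               tgt.last_name  = src.last_name
WHEN NOT MATCHED THEN                    -- no record found: fall back to an insert
    INSERT (id, first_name, last_name)
    VALUES (src.id, src.first_name, src.last_name);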

If you are using Compose, you may need to either use a different (more natural) key or add some single-table / multi-table ETL to update the current key columns in the staging and HUB tables.

Let me know if the Replicate configuration solves your issue. If not and it's Compose, I'll try to put together an example.

 

 

 

aj96
Contributor III

Thanks Tim 🙂 . The Replicate configuration did solve my issue in the replicated tables, but for Compose it updated the records correctly most of the time; once or twice it did not, even before I changed the Apply Conflicts setting. Does Compose use the order in which the records arrived and apply them accordingly?

TimGarrod
Employee

If you are running Compose CDC ETL Sets - they are consuming from the __CT Tables. 

The __CT table has every change - so you'll see the UPDATE <which was really the DELETE> then you'll see the INSERT.   

If the header__timestamp column is the same in both rows (which it may be if the DELETE/INSERT occur within the same millisecond on the source), then Compose will pull the more recent record. (It uses the header__change_seq_num, which Replicate generates and which tells us the order of operations in the source transaction log.)

If the header__timestamp column is not the same, but you've defined the attribute as Type 1, Compose will perform the same operation.
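If you want to check which row Compose would treat as the most recent for each key, a rough way is to rank the __CT rows yourself. A hedged sketch - customer__ct and p_id are placeholder names based on your example, and the header column names are the ones discussed above (exact names can vary by task):

-- Rank the change rows per key so the latest change (by timestamp, then change
-- sequence number) comes out on top.
SELECT *
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (
               PARTITION BY c.p_id
               ORDER BY c.header__timestamp DESC,
                        c.header__change_seq_num DESC
           ) AS rn
    FROM customer__ct c
) ranked
WHERE rn = 1;    -- the row that should "win" for each key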

 

The complexity here (it seems to me) is that you essentially have a "PK change". The "old" PK is no longer really valid. With or without Compose, you would need to identify that the delete occurred and essentially change the key in Compose for the now "deleted" record. Off the top of my head, this would require some custom SQL as a SINGLE-TABLE ETL that would detect these "deletes" and then UPDATE both the TSTG table AND the current HUB to alter the key value.

 

That would allow the new INSERT to be inserted and the old data to remain, but with something that identifies it as being deleted (and likely some date value so if it happens again you can retain all the history). 
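To make that concrete, here is a minimal sketch of what such a single-table ETL could look like, assuming hypothetical table and column names (HUB_CUSTOMER, TSTG_CUSTOMER, customer_id, is_deleted) and a character-typed key; the real statements depend on your Compose model and naming:

-- Hypothetical single-table ETL sketch (not Compose-generated code).
-- Re-key the "deleted" record so the re-used business key can arrive as a new
-- entity while the old row and its history are retained and identifiable.
-- In practice you would also guard against re-keying rows that were already re-keyed.
UPDATE HUB_CUSTOMER
SET    customer_id = customer_id || '_DEL_' || TO_CHAR(CURRENT_DATE, 'YYYYMMDD')
WHERE  is_deleted = 'Y';

UPDATE TSTG_CUSTOMER
SET    customer_id = customer_id || '_DEL_' || TO_CHAR(CURRENT_DATE, 'YYYYMMDD')
WHERE  is_deleted = 'Y';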

Feel free to PM me - perhaps we can jump on a call to discuss your options and what you are trying to do?

aj96
Contributor III

Absolutely. 

I have tried this out with the idea that the updates are applied according to the header__change_seq_num. The __CT tables have the same timestamp, as you said, and according to the header__change_seq_num, the transactions with the delete flag 'Y' came first and the transactions with the delete flag 'N' came after those. So, for example, the __CT data looks like this:

HEADER_SEQ_NUM | OP_IND | P_ID | NAME  | DELETE_FLAG
1 | "U" | 1000 | SAM   | Y
2 | "U" | 1001 | JOHN  | Y
3 | "U" | 1002 | DOE   | Y
4 | "I" | 1000 | SAM1  | N
5 | "I" | 1001 | JOHN1 | N
6 | "I" | 1002 | DOE1  | N

Now I have selected 'handle duplicates' in Compose, and after the run I would expect the records to be updated from the __CT rows having the delete flag 'N'. But 1 out of the 4 times I tried updating the same way, the records were updated with the delete flag 'Y' rows; the other times they were updated correctly. The one time that happened, the seq_num was in the order shown in the example above.

 

 
