We have an Oracle source table coming from EBS that has no primary key and contains many records for which all column values are exactly the same. Unfortunately, we need to keep all of these duplicate records and are having a lot of trouble figuring out how to handle them. We are currently storing changes for this table and see DML of all types coming in. The metadata columns in the change table show 'I', 'D', 'U' and 'B' operation types for this table. How would we use these operations to apply the stored changes to the base table without having a key? Is there some sort of internal key intrinsic to each record that can be used to identify the record in the target?
For example, if the Store Changes table showed a delete operation, how would I determine which record in the base table should be deleted? I couldn't delete all records whose column values matched the column values in the changes table, because that might delete some duplicate records that we need to keep.
Unfortunately, this is a large table, so we don't have the option of doing a full load; we need to figure out how to apply the changes. How can this be done?
Thanks in advance for any advice, I'm fairly new to Qlik Replicate.
Hello @hcmjensen ,
Welcome to Qlik Community forum and thanks for reaching out here!
In the Store Changes Table, the operation types are represented as follows:
'I' for INSERT, 'D' for DELETE, 'U' for UPDATE (afterImage), and 'B' for UPDATE (beforeImage). Your own program processes these change records from the Store Changes table and applies the corresponding operations to your target systems, such as databases, streaming platforms, or file systems.
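For illustration only (these names are assumptions, not defaults I can vouch for: change_table stands for your Store Changes table, header__change_oper and header__change_seq for its header columns, and col1, col2, col3 for your business columns), reading the stored changes in order might look like this:

-- Read the stored changes in commit order and branch on the operation type.
SELECT header__change_seq,
       header__change_oper,   -- 'I', 'D', 'U' (afterImage) or 'B' (beforeImage)
       col1, col2, col3
FROM   change_table
ORDER  BY header__change_seq;

-- The 'D' (delete) records are the ones your apply logic has to resolve
-- against the target without a primary key.
SELECT col1, col2, col3, COUNT(*) AS delete_count
FROM   change_table
WHERE  header__change_oper = 'D'
GROUP  BY col1, col2, col3;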
Regarding deleting only one row out of a set of duplicate rows: this approach is reasonable, and for some target databases Qlik Replicate already handles it this way. Could you share the type of your target database? This information will allow us to provide a more precise and tailored solution for your requirements.
Regards,
John.
Hi, John. Thanks for your response. The target system is ADLS parquet files that we are loading into Azure Databricks delta tables.
Regards,
Hannah
Hello Hannah, @hcmjensen
I'm not sure whether the ADLS Parquet file data is being loaded into Databricks Delta by your own programs or by SQL scripts.
My two cents: to address this scenario in Databricks Delta, where you perform the MERGE INTO operation yourself but need duplicate rows handled correctly (specifically, deleting only one instance of a duplicate group rather than all of the duplicates), you can combine Delta Lake's WHEN MATCHED clause with a ROW_NUMBER() or RANK() window function. The ROW_NUMBER() function ensures that only one row is deleted for each duplicate group.
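If you end up writing that apply logic in SQL yourself, a minimal sketch of the ROW_NUMBER() idea could look like the following. Every name here is hypothetical (base_table for the Delta target, delete_changes for the 'D' records from the Store Changes table, col1, col2, col3 for the full column list), and because there is no key to match on, this sketch removes the extra copies by rewriting the table with INSERT OVERWRITE rather than by a keyed MERGE, so treat it as a starting point only:

-- ROW_NUMBER() tags each physical copy inside a duplicate group.
-- Rows whose number is <= the count of matching 'D' records are dropped,
-- so exactly one copy disappears per delete record and the rest are kept.
INSERT OVERWRITE base_table
SELECT col1, col2, col3
FROM (
  SELECT t.col1, t.col2, t.col3,
         ROW_NUMBER() OVER (PARTITION BY t.col1, t.col2, t.col3
                            ORDER BY t.col1) AS rn,
         d.del_cnt
  FROM base_table AS t
  LEFT JOIN (
    SELECT col1, col2, col3, COUNT(*) AS del_cnt
    FROM delete_changes
    GROUP BY col1, col2, col3
  ) AS d
    ON  t.col1 <=> d.col1
    AND t.col2 <=> d.col2
    AND t.col3 <=> d.col3
)
WHERE del_cnt IS NULL OR rn > del_cnt;

A full rewrite like this is heavy on a large table, so in practice you would want to restrict it to the affected partitions if the table is partitioned.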
However, I'm not a Databricks expert, so please check with Databricks for expert advice.
Regards,
John.
Hi, John. Thank you very much for your two cents. So in the strategy you suggested there is no internal key of any sort; I would just need to make sure that in the case of an update or delete only one record is affected. I will give this a shot! Thanks again.
Hannah
Hello Hannah, @hcmjensen
It's just like the virtual column ROWNUM in an Oracle result set:
DELETE FROM ... AND ROWNUM = 1;
while in Databricks the ROW_NUMBER() function is used to identify a single row within a duplicate group.
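For illustration (hypothetical table and column names), this is how ROW_NUMBER() singles out one row per duplicate group in Databricks SQL:

-- Every column of the table goes into PARTITION BY, since identical rows
-- can only be told apart by the number assigned within their group.
SELECT col1, col2, col3,
       ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1) AS rn
FROM   base_table;

-- rn = 1 marks exactly one row in each group of identical rows, playing the
-- same role as "ROWNUM = 1" in the Oracle example above.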
Regards,
John.
Hi, John. What if, instead of using a Change Table with ADLS as the target, we set up a Full Load and Apply Changes task with Databricks Delta tables as the target? With this configuration, would Qlik handle the duplicates correctly?
Thanks,
Hannah
Hello Hannah, @hcmjensen
No, unfortunately not. In Databricks, primary keys (PK) are not enforced; they are informational only. As a result, primary keys cannot prevent duplicate entries from being written into Databricks Delta tables. This behavior is inherent to Databricks Delta.
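As a small illustration of that point (hypothetical names, and assuming a Unity Catalog table, where Databricks accepts an informational PRIMARY KEY constraint):

CREATE TABLE main.demo.pk_demo (
  id  INT NOT NULL,
  val STRING,
  CONSTRAINT pk_demo_pk PRIMARY KEY (id)
);

-- Both inserts succeed even though they share the same key value,
-- because the constraint is informational only and never enforced.
INSERT INTO main.demo.pk_demo VALUES (1, 'first copy');
INSERT INTO main.demo.pk_demo VALUES (1, 'second copy');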
Regards,
John.