Hello Qlik Community,
Here is a video where I discuss an optimised "Change Data Capture" strategy with Qlik, leveraging its bulk (optimised) insert to speed up your ETL process from the source DB, through all the transformation layers, and into your front-end application.
I also include easy-to-use sub-routines that will help you implement this process if you have not already done so; they are available here and are discussed/explained in the video.
The text of the video is below, if you prefer reading:
I have seen quite a number of instances where an "incremental load" in Qlik was barely more than fetching the data from the database in what can be called an incremental manner.
Basically, a Where-clause in the SQL statement extracts a limited amount of data, which is then appended to a QVD, typically one that gets larger and larger. Since this QVD is one big data file, all subsequent transformation processes have to go through the whole data set to produce the final product. Hence, even though some of the data is historical and unchanged from the previous day, all the corresponding fields need to be recalculated or re-transformed, since there is only one QVD file. This makes ETL processes unnecessarily long, because the whole data set needs to be loaded and transformed rather than just the portion of data that is new - the incremental, "change data capture" part of the data.
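To make that pattern concrete, here is a minimal sketch of it in Qlik load script. The connection, table and field names (SRC.ORDERS, ORDER_ID, LAST_MODIFIED_DATE), the QVD path and the variable vLastReloadTime are purely illustrative, not taken from any real model:

// Naive incremental pattern: one ever-growing QVD (illustrative names only)
Orders:
SQL SELECT ORDER_ID, ORDER_AMOUNT, LAST_MODIFIED_DATE
FROM SRC.ORDERS
WHERE LAST_MODIFIED_DATE >= '$(vLastReloadTime)';      // the incremental Where-clause

// Append the full history already on disk (kept optimised by the single-field Exists)
Concatenate (Orders)
LOAD * FROM [lib://QVDs/Orders.qvd] (qvd)
WHERE Not Exists(ORDER_ID);

// The single QVD keeps growing, so this STORE, and every downstream transformation, gets slower every day
STORE Orders INTO [lib://QVDs/Orders.qvd] (qvd);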
The "Change Data Capture" strategy I elaborate upon in this video, is a process that will minimise the amount of time the QVD takes to get stored and also minimise the subsequent ETL processes to get to the final data set that is required for the front-end application and analysis.
Here I show the incremental ETL I have so often witnessed: fetch the change-data from the DB and append it to the existing QVD, resulting in an ever-growing QVD which, in some instances I have seen, can take up to 20-25 minutes to write to disk.
The "Change Data Capture" strategy I want to elaborate on, is fetching the data incrementally, and THEN write out a Year-Month QVD, based on a date-time stamp, hopefully on that is akin to a "Last_Modified_Date" for that record.
The key to this strategy is identifying a Primary_Key in the source DB data and a Date-time field upon which the Year-Month QVD slices will be based on.
This minimises the QVD store time, because within any given month the store time only maxes out on the last day of that month. Then, for the next month, the store time drops and builds up to a new maximum for that month, drops again the month after that, and so on.
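Below is a rough sketch of what that monthly slicing could look like in the load script. It reuses the same illustrative names as above and glosses over edge cases that the subs listed further down handle more generically:

// Fetch today's change-data only
Delta:
SQL SELECT ORDER_ID, ORDER_AMOUNT, LAST_MODIFIED_DATE
FROM SRC.ORDERS
WHERE LAST_MODIFIED_DATE >= '$(vLastReloadTime)';

// Which YYYYMM slices does this delta touch?
Months:
LOAD Distinct Date(MonthStart(LAST_MODIFIED_DATE), 'YYYYMM') AS YM
RESIDENT Delta;

FOR i = 0 TO NoOfRows('Months') - 1
    LET vYM = Peek('YM', $(i), 'Months');

    Slice:
    NoConcatenate
    LOAD * RESIDENT Delta
    WHERE Date(MonthStart(LAST_MODIFIED_DATE), 'YYYYMM') = '$(vYM)';

    // Merge with the existing month slice, if there is one, keeping the newest record per key
    IF Not IsNull(FileSize('lib://QVDs/Orders_$(vYM).qvd')) THEN
        Concatenate (Slice)
        LOAD * FROM [lib://QVDs/Orders_$(vYM).qvd] (qvd)
        WHERE Not Exists(ORDER_ID);
    END IF

    // Only the months touched by the delta are re-stored, so the store time stays small
    STORE Slice INTO [lib://QVDs/Orders_$(vYM).qvd] (qvd);
    DROP TABLE Slice;
NEXT i

DROP TABLES Delta, Months;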
Each subsequent transformation layer can follow this Year-Month QVD logic, thereby ensuring the full ETL takes a minimum amount of time to complete.
Then, if you ensure the data is ordered reverse chronologically within the QVDs, you can LOAD all the QVDs in reverse chronological order with WHERE NOT(EXISTS(Primary_Key)). Hence, if there are any duplicates of that Primary_Key, Qlik will only load its latest record. And naturally, since the WHERE-NOT-EXISTS is based on ONE field, the loads stay optimised and execute very quickly.
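A hedged sketch of that read-back step, again with illustrative folder and field names; the QVDFileNamesList/QVDFileNamesLoad subs listed below wrap the same idea more generically:

// Collect the YYYYMM QVD file names, then sort them newest-first
FOR EACH vFile IN FileList('lib://QVDs/Orders_*.qvd')
    FileNames:
    LOAD '$(vFile)' AS FilePath AutoGenerate 1;     // auto-concatenates into one table
NEXT vFile

SortedFiles:
NoConcatenate
LOAD FilePath RESIDENT FileNames ORDER BY FilePath DESC;   // zero-padded YYYYMM sorts chronologically
DROP TABLE FileNames;

// Load the slices newest-first; the single-field Exists() keeps every QVD load optimised,
// and any ORDER_ID already seen in a newer slice suppresses its older versions
FOR i = 0 TO NoOfRows('SortedFiles') - 1
    LET vFile = Peek('FilePath', $(i), 'SortedFiles');
    Orders:
    LOAD * FROM [$(vFile)] (qvd)
    WHERE Not Exists(ORDER_ID);
NEXT i

DROP TABLE SortedFiles;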
This is the essence of an optimised "Change Data Capture" strategy with Qlik, leveraging its bulk (optimised) insert to speed up your ETL process.
I have included a few subs that can be used in this process and that might help you quickly set up this sort of dataflow architecture; example calls are sketched after the list below.
1) FullLoad_Oracle(parDB_Connection,parDB_Owner,parDB_Table,parDB_WHERE_Clause)
Sub to load and store a full DB table.
N.B. The WHERE clause you include must start with 'WHERE [....]'.
2) FullyAutomatedIncrementalLoad_Oracle(parDB_Connection,parDB_Owner,parDB_Table,parDB_WHERE_Clause,parPrimaryKey,parDateTimeField,parDBTableFieldsChecker_YES)
Sub to load and store a DB table, extracting incrementally and storing only the latest YYYYMM QVD.
N.B. The WHERE clause you include must start with 'AND [....]'.
The key requirements for this sub are the identification of the PRIMARY-KEY in the table and the DATE-TIME field to use for incremental loading,
which is also the field by which the script will store the YYYYMM QVDs.
3) QVDFileNamesList(parQVDTargetFolder,parQVDNamePrefix,parLoadLastNFiles)
When a folder has been created full of YYYYMM QVDs, load the LIST of file names in reverse-chronological order.
4) QVDFileNamesLoad(parQVDTargetFolder,parQVDNamePrefix,parLoadLastNFiles,parOPTIONALsubName)
When a folder has been created full of YYYYMM QVDs, load the QVD FILES in reverse-chronological order.
999) QVDCreateYMSlicesFromQVD(parSourceQVDFolder,parSourceQVDFile,parTargetFolder,parDateFieldNameToCreateQVDSlices)
If you have a large QVD with a lot of data, use this sub to create the initial YYYYMM QVD slices,
so that the incremental loads with the above subs can take over from there.
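To illustrate how these subs might be wired together, here is a hypothetical sequence of calls. The include-file location, connection name, owner, table, field names and parameter values are all assumptions on my part; the actual sub implementations come from the attached script discussed in the video:

// Bring in the sub-routine definitions (assumed saved as a .qvs text file)
$(Must_Include=lib://Scripts/CDC_Subs.qvs);

// 999) One-off: slice an existing big QVD into the initial YYYYMM QVDs
CALL QVDCreateYMSlicesFromQVD('lib://QVDs/Full', 'Orders.qvd', 'lib://QVDs/Orders', 'LAST_MODIFIED_DATE');

// 2) Daily incremental extract, storing only the latest YYYYMM slice(s)
CALL FullyAutomatedIncrementalLoad_Oracle('MyOracleConnection', 'SRC', 'ORDERS', 'AND ORDER_AMOUNT >= 0', 'ORDER_ID', 'LAST_MODIFIED_DATE', 'YES');

// 4) In the next layer: load, say, the last 36 monthly QVDs in reverse-chronological order
CALL QVDFileNamesLoad('lib://QVDs/Orders', 'ORDERS', 36, '');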