The information in this article is provided as-is and is to be used at your own discretion. Depending on the tool(s) used, customization(s), and/or other factors, ongoing support for the solution below may not be provided by Qlik Support.
When larger tables participate in the Full Load replication phase in Qlik Replicate, you can accelerate the replication of these tables by splitting them into logical segments and loading these segments in parallel. Tables can be segmented by data ranges, by partitions, or by sub-partitions.
This article provides guidance on finding adequate data segment boundaries based on the actual data stored in those tables on the source.
As a first step, it is essential to understand the size of the data tables.
Not every long table is big, and not every short table is small. A table's average uncompressed data size can be estimated with the formula below:
table_size_MB = (number of records in the table * avg. record length in bytes) / 1,048,576
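For example, on an Oracle source the optimizer statistics can give a quick estimate without scanning the table. This is an illustrative sketch only: it assumes the statistics are reasonably current, uses the standard ALL_TABLES dictionary view, and <SCHEMA> and <TABLE> are placeholders for your own object names:
SELECT num_rows * avg_row_len / 1024 / 1024 AS approx_table_size_mb
FROM all_tables
WHERE owner = '<SCHEMA>' AND table_name = '<TABLE>';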
The table size directly affects the unload duration. The longer the unload takes, the bigger its impact on the following:
It is good practice to unload the source table in "digestible" chunks that can each be loaded in less than 30 minutes. The amount of data in each segment can vary between tens of MB and a few GB and will heavily depend on the network quality and bandwidth, the source database load at any given time, the source database processing capability, and other environmental aspects. The best way to gauge the overall Full Load performance is to test how long it takes to Full Load a medium-size table in your environment and extrapolate from there.
Each segment should contain a reasonable number of records. Depending on the table width, the "reasonable" number of records in each batch may vary. As a rule of thumb, aim for each partition/segment load duration to be in the ballpark of 15-30 minutes.
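As an illustration (the numbers below are assumptions, not benchmarks): if a 5 GB test table Full Loads in roughly 25 minutes, the effective throughput is about 200 MB per minute; at that rate a 100 GB table would be split into roughly 20 segments of about 5 GB each to keep every segment inside the 15-30 minute window.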
This is important for the following reasons:
At this point we assume that we already know the approximate number of logical "partitions" (or segments) we want to use for the parallel unload. Perform the following to calculate the partition size and identify the boundary values (see the illustrative sketch after the platform-specific queries below):
When identifying the data ranges for SAP tables, the MANDT field should appear in the subqueries' WHERE clause.
ORACLE:
SELECT T1.GEN1, T1.THIS_ROW ...
DB2i / DB2LUW / DB2z:
SELECT T1.GEN1, T1.THIS_ROW ...
SQL SERVER:
SELECT T1.THIS_ROW, T1.GEN1 ...
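As a rough illustration of what such a boundary query does (a sketch under assumptions, not necessarily the exact query used above): take the partition size as the table's total record count divided by the desired number of segments, number the rows ordered by the segment column(s), and return the column value at every multiple of the partition size. On Oracle that idea could look like the following, where <KEY_COL>, <SCHEMA>.<TABLE>, and <PARTITION_SIZE> are placeholders:
SELECT T1.GEN1, T1.THIS_ROW
FROM (SELECT ROWNUM AS THIS_ROW, GEN1
      FROM (SELECT <KEY_COL> AS GEN1
            FROM <SCHEMA>.<TABLE>
            ORDER BY <KEY_COL>)) T1
WHERE MOD(T1.THIS_ROW, <PARTITION_SIZE>) = 0
ORDER BY T1.THIS_ROW;
Each returned GEN1 value can then be used as a segment boundary in the Parallel Load dialog; on SQL Server the same idea can be expressed with ROW_NUMBER() OVER (ORDER BY <KEY_COL>).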
Once the query results are returned, the tasks will need to be set up consistently in all environments. Note that the segment definitions in non-Prod environments will then be identical to the PROD ones, so if any testing needs to be done, the boundary values should be adjusted specifically to the environment the task is running in.
6. Click “Select Segment Columns”, pick the columns you are going to segment the table by, and click OK. By default, the PK columns will be selected automatically.
For SAP Application (DB) endpoints, the MANDT field is implicitly taken from the endpoint's CLIENT field and will not appear in the column selection criteria.
7. Populate the segments with the ranges returned by the queries in ascending order and click “Validate”:
The “Validate” button only checks that all the input fields have been populated and that the data in those fields matches the field data type.
8. Correct any errors that are reported and click “OK”.
Segment boundaries must always be in ascending order. Failure to comply may result in an inconsistent data unload that generates duplicate records and/or missing data.
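For example (illustrative values only, assuming a numeric key column): if the boundary query returned the values 250000, 500000, and 750000, they would be entered in that order, producing four segments: rows up to 250000, 250001-500000, 500001-750000, and everything above 750000.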
Another use case where parallel load can be helpful is when you are getting ORA-01555 "snapshot too old" errors, as parallel load uses multiple queries over smaller data sets rather than one large "select * from" query.
Hi,
I don't really understand how the SQL code works to return an ascending range of numbers. Can someone please explain? Also, why shouldn't the partitions be evenly sized?
thanks,
Mohammed
Hello @MoeE
To get more direct assistance, please post your question directly to our Qlik Replicate forums. This will give your question more visibility across our customers and active support engineers.
All the best,
Sonja
Hello Team,
In the Oracle query I can see that DESC is used in the ORDER BY clause. Is it necessary? The results of DESC and ASC differ completely. Could you please clarify this?
Thanks,
Hello @dhina
Please post your query in the Qlik Replicate forum to gain access to your Qlik peers and our active Support Agents. Include all version information for your products when raising it.
All the best,
Sonja