topic Re: How do you define upper segment boundary for parallel load? in Qlik Replicate

How do you define upper segment boundary for parallel load?

desmondchew — Fri, 02 Jun 2023 00:42:38 GMT

We have a table having one primary key (SNP_ID). TABLE: TRANS_TBL

select max(snp_id) from TRANS_TBL
--- 58,251,744

select min(snp_id) from TRANS_TBL
--1

Assuming we are using 6 segments.

Each segment would look like

1 - 10000000

2 - 20000000

3 - 30000000

4 - 40000000

5 - 50000000

6 - 60000000

Is this correct? I ever tested in descending order however the full loading took ages like going into a loop loading much more records that it should.

Please let me know if this setup is workable?

Thank you.
Desmond

Re: How do you define upper segment boundary for parallel load?

john_wang — Fri, 02 Jun 2023 05:35:02 GMT

Hello @desmondchew ,

Thanks for reaching out!

If I understood the issue correctly, you input the segments values in decreasing order, eg 60000000, 50000000,...10000000 in the Parallel Load Segments Boundaries window of the table setting. Then you get extra rows in target side after full load done.

This is reasonable because the segments values order is sensitive, Replicate will compose the unload SQL by using these values and their order, the latter value is treated as "the upper data range". Please input the values in increasing order, let's say 10000000,20000000...60000000.

Typical the correct unload SQL looks like:

SELECT ... FROM ... WHERE ( ("ID" <= 1000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 1000) )) AND ( ("ID" <= 2000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 2000) )) AND ( ("ID" <= 3000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 3000) )) AND ( ("ID" <= 4000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 4000) )) AND ( ("ID" <= 5000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 5000) )) AND ( ("ID" <= 6000) )
SELECT ... FROM ... WHERE (NOT (("ID" <= 6000) ))

If you like you can set SOURCE_UNLOAD to Trace and then check the lines in task log file to understand further:

Select statement for UNLOAD is ...

You will see all SQLs and their values.

Hope this helps.

Regards,

John.

Re: How do you define upper segment boundary for parallel load?

desmondchew — Fri, 02 Jun 2023 09:00:05 GMT

Hi John,

What if the table has two columns primary keys (ID, EMPNO)? How can I define the multiple segments?

Thank you.
Desmond

Re: How do you define upper segment boundary for parallel load?

Heinvandenheuvel — Fri, 02 Jun 2023 21:09:05 GMT

>> What if the table has two columns primary keys (ID, EMPNO)? How can I define the multiple segment

The "select segment columns" is plural. You can pick more than 1 and you'll be given the option to provide values for both.

For some cases that obviously makes sense, but for an example you indicate you pick either. If the primary (cluster) order is a highly granular ID then pick that for easy source search. Now if ID is not at all selective, but perhaps a country code, then you might want ranges within given ID and need to provide both. You wouldn't want to skip a first-order segment as that upsets the source engine. It would have to do much more scanning.

Hein

Re: How do you define upper segment boundary for parallel load?

desmondchew — Fri, 02 Jun 2023 23:51:25 GMT

I am reading the guide in link https://help.qlik.com/en-US/replicate/November2022/Content/Global_Common/Content/SharedEMReplicate/Customize%20Tasks/Parallel_Load.htm

So we have three columns as a composite primary keys. We know that first columns needs to be in ascending order. Does column two needs to be in descending order as per the guide? While the third column can be any order but the last segment needs to be the least value within the third column?

Desmond

Re: How do you define upper segment boundary for parallel load?

Heinvandenheuvel — Sun, 04 Jun 2023 20:45:05 GMT

>> Does column two needs to be in descending order as per the guide?

I do not see this indicated in the guide, nor in the link you provided. Can you provided a specific reference?

Let's take a step back. What is the goal of the segmentation? It is not a filter as such, so all the data will be loaded, but there is an attempt to select segments/chunks which can be processed in parallel to speed up the load. Ideally each segment would have a similar load time, which typically means a similar number of rows.

It is best if the source database can 'find' the segments readily. That in turns means it is best if the segments are along PK boundaries such that the DB can directly select a starting point for a segment, and does not need a scan nor sort. The number of PK columns you use should just be 'just enough' to make similar chunks, and no more. 'Similar' depends on how many segment you target but is probably such that the row count for each segment is within 10% or even 20% of the average rows count.

That's easy for a slowly increasing numeric PK such as we might see for a table with sequence number as PK.

It's also easy for a DATE or YEAR as PK for example order number which start with YYNNNNN each year starting at 000001. Just pick a 'comfortable' number of years. Maybe 10 years in the first segment, 3 years in the next few and every year for the segments after business started booming :-).

Now if the PK is something a phone number or SSN this gets tricky - some 'area codes' will have many more than others. You may just want to analyze, counting for each area code and putting it in a table/program to see approx which area codes down to exchange?) to select to get similar segment. Ditto for 'last-name'. Just count once for every 3 or 4 character start, add a running count, and select your boundaries.

On target every index will be ascending (ref guide: "A unique index consisting of several ascending and descending columns will always be replicated to the target as ascending columns. In other words, the descending columns will become ascending columns. ". On the source segment selection you should stick to choice that allow the source DB to make 'easy' select start + scan possible.

The example in the Replicate Reference guide is just crazy IMHO. Who here has any clue as to what the goals is there? Maybe if you are more specific about your columns with a few examples, and about your goals, someone can help better? I'd say KISS - Keep It Simple S...ir. ( 🙂 ).

Hope this helps some, Hein.

Re: How do you define upper segment boundary for parallel load?

vlokeswa — Tue, 18 Jun 2024 06:56:25 GMT

Hi, My source is oracle, I am trying to perform full load for one of a big table unfortunatly the table doesn't have any key column but my manager asking me to use row id as segments. will it possible to use row id as parallel segment? please confirm.

Re: How do you define upper segment boundary for parallel load?

john_wang — Wed, 19 Jun 2024 03:15:16 GMT

Hello @vlokeswa ,

Welcome to Qlik Community forum and thanks for reaching out here!

Key (or index) columns are not mandatory for Parallel Load. Any column can be used as a segment column as long as it can divide the data into segments. However, if the table has a primary key (PK), the performance will be better.

Currently, Oracle data types ROWID/UROWID are not supported in current major versions of Qlik Replicate. As a result, columns with these data types will not be visible in the Qlik Replicate GUI, making it impossible to utilize ROWID directly in Parallel Load design Window.

I am working on an article that demonstrates how to use ROWID in Parallel Load. I will share the detailed steps with you shortly.

Regards,

John.

Re: How do you define upper segment boundary for parallel load?

john_wang — Wed, 19 Jun 2024 14:08:29 GMT

Hello @vlokeswa ,

Please check below article to see the detailed steps:

Qlik Replicate and Oracle source endpoint: use ROWID as segment column in Parallel Load

Regards,

John.