Skip to main content
Announcements
See what Drew Clarke has to say about the Qlik Talend Cloud launch! READ THE BLOG
cancel
Showing results for 
Search instead for 
Did you mean: 
Mr_H_Huang
Contributor III
Contributor III

Solved: a script to resolve the performance issue for syntaxes HierarchyBelongsTo and Hierarchy

Syntaxes HierarchyBelongsTo and Hierarchy are good for demonstrate the hierarchy relationship for Bill of Materials (BoM) table – a finished product_x has used certain quantity of component_a and component_b …, while a component_a is consisted by other component_l, component_m, and component_n etc… Syntaxes HierarchyBelongsTo and Hierarchy can show how many Trace_sequence/Steps/Levels and what they are between a finished product_x and a source component_n.

I have an existing Qlik app using syntax HierarchyBelongsTo. It was working well until recent date, the reload time increased dramatically from 1 minute to over 3 hours and it is still taking forever to reload (the server just stopped reloading because of running out of memory – the memory being occupied is the same size of the QVD (Qlik storage file) being generated). I found out if the data’s received date is from 20230101 to 20231001, the app can reload in 1 minute. If the data’s received date is from 20230101 to 20231201, the app takes forever to reload. However, the size/rows of source data doesn’t increase that much from 20231001 to 20231201.

After investigation, it is the values of the data that causing many DepthDiff (or say TraceSequence or Steps) for the same pair of childID (NodeID) and parentID. Also it is the values of the data that causing thousands of duplicated lines for the same pair of nodeID and parentID. The question turns to how can I apply WHERE clause and DISTINCT during the running iteration of syntax HierarchyBelongsTo.

In my original data set, apart from nodeID and parentID, there is an additional field named dept_id. If nodeID and parentID can be treated as the dimension data as they are not always changing or being created. The dept_id then is the fact data like transaction id that means which department transaction id is being applied to a pair or multiple pair of nodeID and parentID. This particular dept_id - fact data contributes to the performance issue significantly. But this dept_id is important to keep and align to nodeID and parentID for further link to other transaction data.

The original script below, putting the parent_id in to the place of Ancestor_id that makes the parent_id can be traced as an ancestor to all of its child and grand-child (and grand-grand-child etc.) IDs.

TRACE_PATH_1:
HierarchyBelongsTo (node_id, parent_id, node_name, parent_id, TraceNo,DepthDiff)
LOAD node_id as node_id,
     parent_id as parent_id,
     node_id as node_name,
     dept_id
RESIDENT  TRACE_PATH_0;
HierarchyBelongsTo (NodeID, ParentID, NodeName, AncestorID, AncestorName, [DepthDiff])(loadstatement | selectstatement)

By testing different scripts and syntaxes bit by bit, I found out if I don’t put dept_id, the script is running fast even with full data from 202301 to 20231201. As long as I put the dept_id into the HierarchyBelongsTo, the script becomes very slow wherever I put dept_id in to NodeName, AncestorID, AncestorName or as extra field.

The script below without dept_id can finish the reload in 50 minutes.

TRACE_PATH_1:
HierarchyBelongsTo (NodeID, ParentID, NodeName, ParentID, AncestorName, DepthDiff)
LOAD  DISTINCT
     node_id as NodeID,
     parent_id as ParentID,
     node_id as NodeName
RESIDENT TRACE_PATH_0
WHERE NOT ISNULL(node_id) AND LEN(node_id) <> 0;

 

The next question is how do I join back the dept_id after the syntax HierarchyBelongsTo’s fast reload without putting dept_id. The relationship is 2 to 1+ for each node_id and each parent_id to one or multiple dept_id in the source table TRACE_PATH_0.

But since I used the syntax HierarchyBelongsTo, it creates a lot of new parent_id against the original node_id. The new parent_id is in between the original parent_id and original node_id that telling the Trace_sequence/Steps/Levels and the component id that could be in between the finished and source id.

So now I need to trace back what is the source id or the source parent_id for the newly generated parent_id after syntax HierarchyBelongsTo.

In order to achieve this goal, I have tried using: method 1, reversed HierarchyBelongsTo by putting node_id as parent_id, parent_id as node_id; method 2, I have tried looping for parent_id until is not mapped (https://community.qlik.com/t5/New-to-Qlik-Sense/How-to-integrate-a-do-while-loop-within-a-for-each-l...). But those no success (I think the loop somehow should work).

I found out I can use syntax Hierarchy to bring back the source parent_id for the newly generated parent_id.

Hierarchy (NodeID, ParentID, NodeName, [ParentName, [ParentSource, [PathName, [PathDelimiter, Depth]]]])(loadstatement | selectstatement) 

If I use the full set of Hierarchy, it will run as slow as the issue of HierarchyBelongsTo I am resolving. So I only put the very necessary fields in. The nature of syntax Hierarchy will create NodeNameNo as many as the levels have. So I use CrossTable to transform from the horizontal to vertical table. The source parent_id is now NodeIDValue, the newly generated parent_id is ParentID_transformed (or ALT( ParentID_transformed, NodeIDValue) incase ParentID_transformed is null ) the node_id is NodeID_transformed.

I use the result table of HierarchyBelongsTo as the base, the bring back the source parent_id (NodeIDValue), against the newly generated parent_id (ParentID_transformed). We join back the original source node_id and parent_id, so that we can join back the dept_id to my result table. Because each pair of node_id and parent_id could have one or multiple dept_id, so we use OUTER JOIN and we remove the row if node_id is null or LEN(node_id) <>0.

Hierarchy (NodeID, ParentID, NodeName)

Original script that causing performance issue:

TRACE_PATH_1:
HierarchyBelongsTo (node_id, parent_id, node_name, parent_id, TraceNo,DepthDiff)
LOAD node_id as node_id,
     parent_id as parent_id,
     node_id as node_name,
     dept_id
RESIDENT  TRACE_PATH_0;

 

Source table:

Mr_H_Huang_0-1702037935528.png

Targe table:

Mr_H_Huang_1-1702038009909.png



Full resloved script:

TRACE_PATH_1:
HierarchyBelongsTo (NodeID, ParentID, NodeName, ParentID, AncestorName, DepthDiff)
LOAD  DISTINCT
     node_id as NodeID,
     parent_id as ParentID,
     node_id as NodeName
RESIDENT TRACE_PATH_0
WHERE NOT ISNULL(node_id) AND LEN(node_id) <> 0;

Hierarchy_temp:
Hierarchy (NodeID, ParentID, NodeName)
LOAD distinct
     parent_id as ParentID,
     node_id as NodeID,
     node_id as NodeName
RESIDENT TRACE_PATH_0
WHERE NOT ISNULL(node_id);

Hierarchy_transformed:
CrossTable(NodeNameNo, NodeIDValue, 2)
LOAD DISTINCT
      NodeID AS NodeID_transformed,
     ParentID AS ParentID_transformed,
      *
RESIDENT Hierarchy_temp
WHERE LEN(ParentID) <> 0;
DROP TABLE Hierarchy_temp;

TRACE_PATH_TEMP:
LOAD DISTINCT
     NodeID &'|'& ParentID AS KEY_Hierarchy_transformed_source_ParentID,
     NodeID AS KEY_normal_order_origin_node_dept_id,
     NodeID AS TRACE_PATH.NODE_ID,
     ParentID as TRACE_PATH.PARENT_ID,
     DepthDiff as TRACE_PATH.TRACE_SEQ
RESIDENT TRACE_PATH_1
WHERE NOT ISNULL(NodeID) AND LEN(NodeID) <> 0;
DROP TABLE TRACE_PATH_1;

LEFT JOIN(TRACE_PATH_TEMP)
Load DISTINCT
     NodeID_transformed &'|'& NodeIDValue AS KEY_Hierarchy_transformed_source_ParentID,
     ALT(ParentID_transformed, NodeIDValue) AS SOURCE_ParentID
RESIDENT Hierarchy_transformed;
DROP TABLE Hierarchy_transformed;

TRACE_PATH_step_1:
LOAD DISTINCT
     KEY_Hierarchy_transformed_source_ParentID,
     TRACE_PATH.NODE_ID &'|'& SOURCE_ParentID AS KEY_normal_order_origin_node_parent_dept_id,
     KEY_normal_order_origin_node_dept_id,
     TRACE_PATH.NODE_ID,
     TRACE_PATH.PARENT_ID,
     TRACE_PATH.TRACE_SEQ
RESIDENT TRACE_PATH_TEMP
WHERE NOT ISNULL(TRACE_PATH.NODE_ID) AND LEN(TRACE_PATH.NODE_ID) <> 0;

OUTER JOIN(TRACE_PATH_step_1)
Load DISTINCT
     node_id &'|'& parent_id AS KEY_normal_order_origin_node_parent_dept_id,
     dept_id AS dept_id_node_parent
RESIDENT TRACE_PATH_0;

OUTER JOIN(TRACE_PATH_step_1)
Load DISTINCT
     node_id AS KEY_normal_order_origin_node_dept_id,
     dept_id AS dept_id_node
RESIDENT TRACE_PATH_0;

TRACE_PATH:
LOAD DISTINCT
     ALT(dept_id_node_parent, dept_id_node) AS JOB_ID,
     TRACE_PATH.NODE_ID,
     TRACE_PATH.PARENT_ID,
     TRACE_PATH.TRACE_SEQ
RESIDENT TRACE_PATH_step_1
WHERE NOT ISNULL(TRACE_PATH.NODE_ID) AND LEN(TRACE_PATH.NODE_ID) <> 0;

DROP TABLE TRACE_PATH_0, TRACE_PATH_TEMP, TRACE_PATH_step_1;

The result is nice, the reload time down from over 3 hours and forever to 1 hours 50 minutes, the size of the QVW file down from 5 GB to 27 MB, the lines fetched down from billions (maximum in Hierarchy_transformed table) to 6 million lines.

 

In short, if we meet the performance issue of syntax HierarchyBelongsTo and Hierarchy, we may use as simple as possible of the syntax HierarchyBelongsTo and Hierarchy together, apply DISTICT and WHERE clause to the result table that created by these HierarchyBelongsTo and Hierarchy (to remove duplication and unnecessary lines), then join back whatever we need afterward. If you know another way like loop to achieve better performance please do let the community know. 

Labels (1)
1 Reply
Mr_H_Huang
Contributor III
Contributor III
Author

After two weeks I applied my script above, I came across an issue where the Qlik server's memory being consumed at maximum about 90% while data growing larger. This caused the Qlik server down. I found out it is the script below taking the largest memory and creating biggest rows.

After investigation I realised I don't need to CrossTable all of the fields that created by syntax Hierarchy. Depends on you data values, Hierarchy could have different Trace_sequence/Steps/Levels. Like say if you have 50 levels, the Hierarchy creates 50 additional fields named from NodeName1 to NodeName50. 

In my above script, I CrossTable all of 50 fields to straight table, the process itself creates billions rows (although I removed the duplicated rows in later script). But it my case, the fields from NodeName2 ... to NodeName50 are not necessary. Because all of this values have already been included in NodeName1. The values in NodeName2 ... to NodeName50 are just recursive to create the longest/highest Trace_sequence/Steps/Levels incrementally.

I just need the first 2 fields - NodeName and NodeName1. This saves server memory being consumed from maximum 90% down to 30%.

Amended from

 

Hierarchy_transformed:
CrossTable(NodeNameNo, NodeIDValue, 2)
LOAD DISTINCT
      NodeID AS NodeID_transformed,
     ParentID AS ParentID_transformed,
      *
RESIDENT Hierarchy_temp
WHERE LEN(ParentID) <> 0;
DROP TABLE Hierarchy_temp;

 

to

 

Hierarchy_transformed:
CrossTable(NodeNameNo, NodeIDValue, 2)
LOAD DISTINCT
     NodeID AS NodeID_transformed,
     ParentID AS ParentID_transformed,
     NodeName,
     NodeName1
RESIDENT Hierarchy_temp
WHERE LEN(ParentID) <> 0;
DROP TABLE Hierarchy_temp;