Measuring the speed of loading a new calculated fi... - Qlik Community

ahaahaaha · ‎2017-02-06

Colleagues,

during one of the discussions of the Qlik Community there was an idea to test in practice the following question. Often, in practice, there is a task to add to an existing table a new calculation field. The discussions often occur two variants.

Variant 1. Loading table starting (using From, Inline or Autogenerate). Next, using NoConcatenate to create a second table with all the fields of start table and add a new field. At the end of the starting table is deleted.

Variant 2. Loading table starting (using From, Inline or Autogenerate). Next, using Left Join add a new field.

Question: Which variant is faster when much data?

The following code was used for variant 1.

//load the start table

StartTable:

LOAD

RecNo() as ID,

Div(Rand()*100, 1) as Volume

AutoGenerate 100000000;

LET StartTime = Now(); //Start time

NoConcatenate

FinishTable:

LOAD*,

Volume - 50 as Volume1

Resident StartTable;

DROP Table StartTable;

LET FinishTime = Now(); //Finish time

LET ResultTime = Time(FinishTime - StartTime);

The following code was used for variant 2.

//load the start table

StartTable:

LOAD

RecNo() as ID,

Div(Rand()*100, 1) as Volume

AutoGenerate 100000000;

LET StartTime = Now(); //Start time

left Join

LOAD*,

Volume - 50 as Volume1

Resident StartTable;

LET FinishTime = Now(); //Finish time

LET ResultTime = Time(FinishTime - StartTime);

The measurements were performed on a single computer. For each variant, created from scratch a new clean database with no data. Create 100 million records in the starting table. Carried eight downloads the script of each variant.

The results were obtained as follows.

Variant 1

Example picture

The average download time was ~27,9 seconds.

Variant 2

Example picture

The average download time was ~33,0 seconds.

Conclusion: The rate of data download variant 1 above variant 2 download speed by ~18,3%. For a small data size speed difference is likely to be not noticeable, it may be used any of the variants. For large amounts of data difference of speed should be considered depending on the circumstances.

I am glad, if these measures show someone interesting.

Regards,

Andrey

v_petrenko · ‎2017-03-03

Hi Andrey, you have to be concerned that your second approach has a severe vulnerability to data inconsistency risk. Consider that your fact table to which you are adding a new calculated field has several fully duplicated rows. This case is rather likely in real environments, and if you make a join like the one you mentioned you'll inevitably come with a set of unintentionally multiplicated records which will misrepresent your calculations. The code below is an illustration.

So I would strongly NOT recommend using Left Join for a task of adding a calculated field to a logical table. The safest and most practical method would still be using a temporary table with resident load from a previously loaded table and subsequent deletion of the latter. An alternative approach would be Preceeding Load feature, though it has it's own limitations, see Rob Wunderlich's article The Cost of Preceding Load | Qlikview Cookbook

/* load 3 records of data with 2 of them beeing fully similar */

[fact table]:

LOAD

*

Inline

[

Dim1 , Dim2 , Measure1

DimValue1-1 , DimValue2-1 , 1

DimValue1-2 , DimValue2-2 , 2

]

;

/* adding a calculated field through Left Join */

Left Join ([fact table])

LOAD

*,

( Measure1 * 10 ) as Measure2

Resident

[fact table]

;

/* !!! ending up with 5 records of data, which is an undesired result */

ahaahaaha · ‎2017-03-03

Hi Vasily,

thank you for your reply and your thoughts. It was only on the comparison of download speed in those cases where the use of the left connection is possible (key fields by definition can not have two identical values, for example).

And of course you're right. We have to understand the applicability of the LEFT JOIN use at load.

Thank you for your helpful comment.

Regards,

Andrey

ahaahaaha · ‎2017-03-08

Although the above task is not within the theme this document, let's we will solve it - the task is set and has not been solved. As variant, to the source code we will add two lines (the source code is left unchanged, the added one is highlighted in red)

/* load 3 records of data with 2 of them beeing fully similar */

[fact table]:

LOAD

RecNo() as ID, //Add artificial key field

*

Inline

[

Dim1 , Dim2 , Measure1

DimValue1-1 , DimValue2-1 , 1

DimValue1-2 , DimValue2-2 , 2

]

;

/* adding a calculated field through Left Join */

Left Join ([fact table])

LOAD

*,

( Measure1 * 10 ) as Measure2

Resident

[fact table]

;

DROP Field ID From [fact table]; //Delete artificial key field

/* !!! ending up with 5 records of data, which is an undesired result */

Result

v_petrenko · ‎2017-03-17

Well, now you ended up with adding calculated field twice, first one being artificial key ))) This trick might double reload time of this part of the script. I cannot think of any scenario where it could be useful. However, thanks for your research, now I'm quite sure that the simplier method with temporary resident is also the most effective and safe. No doubt.

Measuring the speed of loading a new calculated field when large volumes of data