Skip to main content
Announcements
Have questions about Qlik Connect? Join us live on April 10th, at 11 AM ET: SIGN UP NOW
cancel
Showing results for 
Search instead for 
Did you mean: 
Pulkit_Thukral
Partner - Creator II
Partner - Creator II

Optimized Incremental Load

Hi All,

Need support in writing a script for incremental data that uses minimum system resources(CPU and Memory) and also gets executed faster.

Below is the original Script :-

A:

Load

Max( Date) as Date

from Previous_Complete.qvd(qvd);

Let Max_Date=peek('Date',0,'A');

drop table A;

Load *

from Incremental

where Date >$(Max_Date);

Here ,

1st Issue :-Max function takes huge load time as data set consists of millions of rows.

Also ,

2nd Issue:-CPU reaches 70 % mark while comparing dates in 2nd part of the script.

Hence, i tried with few optimized scripts :

For 1st Issue:-

Purpose:- Finding max date from existing QVD.

I used the below script :-

First 1

A:

Load

Date

From

Previous_Complete.qvd(qvd);

Order by Date Desc;

It decreases Load time comprehensively BUT same time CPU get utilized completely for ORDER function.

I also have used another script using Field Value and auto number record function which again take some time for loading.

Any Other Alternative to this Script ?

I need Quick Load time with least system resources usage.

For 2nd Issue:-

Purpose:- Date comparison (a<b) takes huge time and CPU Utilization

Please provide an alternative this this script.

I feel '<' takes a lot of time. Any Workaround for this ?


9 Replies
marcus_sommer

Take a look on the suggestion from Rob: “Fastest” Method to Read max(field) From a QVD | Qlikview Cookbook.

- Marcus

prieper
Master II
Master II

Do you have a unique ID in your dataset?

Would it be possible to construct such ID?

If so, you may work the the EXISTS-funktion:

// load existing data

Data: LOAD * FROM MyQVD.QVD (QVD);

// load only new data, for which ID is not known

CONCATENATE (Data) LOAD * FROM NewData WHERE NOT EXISTS(ID);

// overwrite existing QVD with updated database

STORE Data INTO MyQVD.QVD (QVD);

Pulkit_Thukral
Partner - Creator II
Partner - Creator II
Author

Hi Peter,

Incremental Data contains data for all existing servers for each day.Hence, incremental logic has to be based on a date(timestamp).

Pulkit_Thukral
Partner - Creator II
Partner - Creator II
Author

Hi Marcus,

I used this script as well but it was taking some time to get completed.

'First' and 'Order' function was at least 20 times faster than these scripts but with the added overhead of utilizing complete CPU.

prieper
Master II
Master II

You may use a combination of both:

AllDates: LOAD DISTINCT MyDate FROM Data.qvd;

MaxDate: LOAD MAX(MyDate)  AS LastDate RESIDENT AllDates;

and then continue with variable and LOAD ... WHERE for the new data.

The first part will be significantly faster, can't help on the second.

edit: It may be worth to try

AllDates: LOAD DISTINCT MyDate FROM Data.qvd;

LOAD MyDate, .... FROM NewData WHERE NOT EXISTS (MyDate);

This is not exactly, what you needed (NewDates should be greater than the existing ones, here you may also get some smaller dates), but you still may load the data optimized. You may add another filter on the smaller (new) dataset in a second step (not preceding load):

Peter

marcus_sommer

I would claim that could not be. Did you really an optimized qvd-load of the field [Date] and read afterwards the max. value from the field which only read the distinct fieldvalues from the symbol-tables? Even by quite large qvd's it will be really fast.

Beside them I wouldn't do it at this point. I would do it within the script which this qvd creates and storing it into another qvd  - because why reading these data a second time?

Further I would use then this max-date qvd for an optimized where exists() load like suggested from Peter. More to the topic of incremental loadings and optimzed loads with exists could you find here: Advanced topics for creating a qlik datamodel.

In addition to it you shouldn't be worried about a high CPU Utilization because it only showed that the script worked efficiency. Normally no other tools should run on a machine parallel to qlik and if this isn't possible than there are not much possibilities. One could be to find different time-frames in which the tools run and/or to disable some cores within the qmc - also on OS level are possibilties to restrict cores and/or setting priorities to processes.

- Marcus

Pulkit_Thukral
Partner - Creator II
Partner - Creator II
Author

Thanks Marcus for providing detailed information.

I understood your point of extracting data through an optimized load.I am doing the same thing.

Not sure why is it taking a lot of time.

Also,i would have to consider CPU utilization factor as i am doing entire implementation on a single server i.e. Qlik Desktop, Server and Publisher on the same server

Hence, when a script is using 100 % CPU , users may face some issues for accessing the application.

Pulkit_Thukral
Partner - Creator II
Partner - Creator II
Author

Thanks Peter for this information.

Unfortunately, i don't have a single unique field in this data.

However, i can try to create 1 by joining 2 different fields (maybe Name and Date).

Though not sure if this would work faster as field content is increased.

But i feel it's worth a try.

I will share back the results.

marcus_sommer

If this approach is rather slow than there is most often anywhere a bottleneck to the system-resources. Mostly caused by the lack of RAM (take also a look on the working set configurations within the qmc) but it could be also the speed of your (network) storage. I suggest you try to monitor the resource consumption within the task-manager maybe also by running some tests - for example by limiting the number of records for those loadings or moving the source-files to a local drive or ...

I have the same situation that I have only a single server for all services and understand the problem very well. But if you couldn't separate the tasks from the server and the publisher to different time-frames you will have to life with the consequences - slow loadings / responses and sometimes even the crash of the services or the whole machine. Personally I try to use my local machine for update tasks which went wrong in any way and which must be update as soon as possible - but it's limited of course. The only real solution is to increase the hardware maybe to several clusters.

- Marcus