rebelfox
Creator

ETL Incremental Loads

I have a report that needs updating daily. I could do this by loading all the data each day in a scheduled overnight job that overwrites the previous day's copy, and letting the user run the report on demand.

However, best practice is to use incremental loads and a QVD middle layer.

In other words, the report would run against the data in the QVD.

The QVD would be updated daily with just the rows that had changed on the source system in the last 24 hours using a source system change date field.
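Roughly the pattern I have in mind, as a sketch (table and field names are invented, and vLastReloadDate would be worked out earlier in the script):

// Pull only the rows changed since the last reload
Orders:
LOAD OrderID, CustomerID, Amount, ChangeDate;
SQL SELECT OrderID, CustomerID, Amount, ChangeDate
FROM SourceDB.Orders
WHERE ChangeDate >= '$(vLastReloadDate)';

// Add back the unchanged history from yesterday's QVD;
// a single-field Exists() keeps this QVD read optimized
Concatenate (Orders)
LOAD OrderID, CustomerID, Amount, ChangeDate
FROM Orders.qvd (qvd)
WHERE NOT Exists(OrderID);

// Write the refreshed table back for tomorrow's run
STORE Orders INTO Orders.qvd (qvd);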

The only thing that niggles at me with this approach is that, to identify the rows that have changed in the last 24 hours, your SQL-style SELECT statement presumably still has to read through the entire source table in the first place.

So does this approach really save much time compared to reading all the rows and simply rebuilding the report from scratch?


11 Replies
Peter_Cammaert
Partner - Champion III

Yes, in almost all cases.

A couple of remarks:

  • Your RDBMS will always be far better at optimizing the extraction of specific rows from one or more tables than QV will ever be. Whatever trick you need is best delegated to the database engine, which for products like Oracle or SQL Server is exceptionally powerful.
  • A SELECT with a WHERE clause that says "get me everything that is more recent than yesterday" doesn't need to scan the whole table.
  • In the whole sequence DB TABLE READ -> DATA TRANSFER TO QV -> QLIKVIEW DATA TRANSFORM, the middle part is usually the slowest (networks and the like). There is an almost linear relation between the delay and the amount of data to transfer. So why transfer the whole table when your RDBMS can already decide which records don't matter to you right now?

I apologize if I'm being too technical here. Just let me know.

Best,

Peter

maxgro
MVP

"your SQL-style SELECT statement still has to read through the entire source table"

The database can have an index or a partition on the change-date column, so you read only the data you need, not the entire source table.
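For example, on the database side (hypothetical names; exact syntax varies by RDBMS), an index on the change-date column turns the daily extract into an index range scan instead of a full table scan:

CREATE INDEX idx_orders_change_date ON orders (change_date);

SELECT * FROM orders
WHERE change_date >= CURRENT DATE - 1 DAY;  -- DB2-style date arithmetic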

Another difference: with an incremental load you transfer (from the DB server to the Qlik server) only the changed records, not the entire table; you read most of the data from the QVD (possibly as an optimized QVD read) and only a few records from the database.

"So does this approach really save much time compared to reading all the rows and simply rebuilding the report from scratch?"

For the Qlik developer it isn't a time saving; an incremental load is not as simple as a SELECT * FROM .....

For execution time it depends on:

- the number of records in the entire table (I usually think about incremental loads for tables of some millions of rows, say >= 5-10 million, or when it is a must because the client wants it);

- the ratio between the number of records in the incremental load and the number of records in the entire table.

Peter_Cammaert
Partner - Champion III

Good point from Massimo that puts everything in perspective: it's a cost/benefit calculation.

Imagine that your document needs two rather small tables of x and y rows that take about 5 minutes each to perform a full load (every time). An incremental load would reduce this to half a minute. Is it worth it if you have a window of about 5 hours every night to reload your document, knowing that your script will become more complex with each incremental load? I guess not.

Now imagine that the next document needs the same two tables + 48 more. Your document update will take 50 x 5 minutes = 4+ hours. Incremental loads suddenly become much more appealing for some or all tables (they share common mechanics), because they'll take only about 25 minutes for 50 tables; see the sketch below.
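The "common mechanics" can be scripted once and reused for every table, which is what makes 50 tables manageable. A rough sketch (assuming, purely for illustration, that every table has an ID key field and a ChangeDate column, and that vLastReloadDate was set earlier in the script):

FOR EACH vTable IN 'Orders', 'Invoices', 'Customers'

  [$(vTable)]:
  LOAD *;
  SQL SELECT * FROM SourceDB.$(vTable)
  WHERE ChangeDate >= '$(vLastReloadDate)';

  // Append the unchanged rows from the previous QVD, if one exists
  IF NOT IsNull(QvdCreateTime('$(vTable).qvd')) THEN
    Concatenate ([$(vTable)])
    LOAD * FROM [$(vTable).qvd] (qvd)
    WHERE NOT Exists(ID);
  END IF

  STORE [$(vTable)] INTO [$(vTable).qvd] (qvd);
  DROP TABLE [$(vTable)];  // keep memory free for the next table

NEXT vTable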

So is this approach really saving time? It depends...

Peter

rebelfox
Creator
Author

Thanks, that helps me understand the issues. Not too technical at all.

So reducing the number of rows with a WHERE clause is highly beneficial when pulling the data into the QlikView environment. Seems obvious now that I think about it.

The source system does not in fact read all rows (I didn't realise that), and if appropriate indexes are available on the source tables, the selection will be even quicker.

rebelfox
Creator
Author

In the case I am dealing with there is a table with 3 million records.

I am trying to replace a Crystal Report that was written very badly.

The current report links 6 tables (one of which, I have just discovered, is not needed anyway) and takes a very long time to run. The joins seem to be a big bottleneck. Are joins also best done on the source system?

My working theory is that I should either

i) reduce the 3 million row table into a QVD with a WHERE selection and then do the joins from the QVD to the other tables in the source,

OR

ii) reduce the 3 million rows into a QVD with a WHERE selection, then grab the other tables into QVD files (one per table) and then load and join (sketched below).
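In script terms, option ii) might look something like this (a sketch only; table, field and variable names are invented):

// Stage 1: cut the big table down with a WHERE selection and store it
BigTable:
LOAD *;
SQL SELECT * FROM SourceDB.BigTable
WHERE ChangeDate >= '$(vCutoffDate)';
STORE BigTable INTO BigTable.qvd (qvd);
DROP TABLE BigTable;

// ... repeat the same extract-and-STORE pattern for each other table ...

// Stage 2: load from the QVDs and join in QlikView
Report:
LOAD * FROM BigTable.qvd (qvd);

LEFT JOIN (Report)
LOAD * FROM OtherTable.qvd (qvd);  // joins on the fields the tables share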

If I had access to change the source system, I would consider a stored procedure to create a temporary work file of the 3 million rows cut down by the WHERE clause, and then link the other tables before returning the dataset to QlikView. But I am not sure if QlikView accepts a stored procedure as a data source?

Peter_Cammaert
Partner - Champion III

Just a few quick ones, before I admit that I don't understand the difference between your choices (must be the weekend).

Joins in QV are slow for big tables and a memory hog. Add to that the delay when QV starts paging due to lack of memory, and your reload becomes many times slower. Best practice says to replace JOINs with ApplyMap() calls if at all possible; see the sketch below.
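For instance, a lookup that would otherwise be a LEFT JOIN adding one column can be written with a mapping table instead (a sketch with invented names):

// Mapping table: exactly two columns, key then value
CustomerNameMap:
MAPPING LOAD CustomerID, CustomerName
FROM Customers.qvd (qvd);

// ApplyMap() instead of a LEFT JOIN; the third argument is the
// default used when the key is not found
Orders:
LOAD OrderID,
     CustomerID,
     ApplyMap('CustomerNameMap', CustomerID, 'Unknown') AS CustomerName,
     Amount
FROM Orders.qvd (qvd);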

Joins can be delegated to the source system in their entirety, like:

SQL SELECT ...
FROM ...
INNER JOIN ... ON ...;

In light of the previous remarks, that may be the best choice. Depending on the type of JOIN and the way you may aggregate rows in your source system, an incremental load may become tricky to define.

QlikView has no problems whatsoever with executing Stored Procedures in the source system. As long as they return rows, it's OK. You'll use something like:

SQL EXEC StoredProcedureName, parameter1, parameter2, ..., parameterN;

Peter

rebelfox
Creator
Author

Thanks for the info.

I think I have a long way to go before understanding all the performance issues.

I will strive to reduce network traffic by using a WHERE clause in the SELECT statements issued to the database server.

I now deduce from your comments that JOIN statements should also be done on the database server wherever possible unless you have a really big QlikView server to do the work.


rebelfox
Creator
Author

In this case I am accessing data from an AS400.

For anybody interested, the complete syntax for joining on the server instead of in QlikView is as per standard SQL:

ODBC CONNECT TO QLIKVIEW_AS400_ODBC;

Test_Table:
LOAD
    // Header Fields
    ALG01, ALG02, ALG03, ALG04, ALG05, ALG06, ALG07, ALG08, ALG09, ALG10,
    // Position Fields
    APO01, APO02, APO02T, APO03, APO03T, APO04, APO04V, APO05, APO06, APO07, APO08;
SQL SELECT *
FROM COAS400.DBLIB.HEADER
INNER JOIN COAS400.DBLIB.POSITION
    ON ALG05 = APO05
    AND ALG06 = APO06
WHERE APO12C = '1'
    AND APO12J >= '13';

Good to know.  Thanks.

Peter_Cammaert
Partner - Champion III

Well, it may not be useful to you now, but again: "it depends". Indeed, there are various factors that may influence the performance of a document reload. The good thing is that you have a choice of techniques to handle less-than-optimal performance. Your situation will dictate which technique is best. A bit of trial and discovery will help you out in most cases.

QlikView JOINS have alternatives:

  • applymap() for simple joins, e.g. those that add one or two columns to an existing table.
  • If your data situation permits, push JOINs to the DB Server.
  • Otherwise, don't hesitate to let QlikView do its own JOINs. But keep an eye on your performance indicators (CPU & RAM in Task Manager). The bigger the tables to JOIN, the more resources QlikView will need to successfully perform a JOIN.

In any case, avoid paging at all costs.

Best,

Peter