13 Replies Latest reply: Apr 25, 2016 11:53 AM by Bill Markham

    An ETL method for large tables faster than Incremental Load?

    Rean Fadyl

      Hello experts,

      I was wondering if anyone has any ideas for a faster method to make an entire table worth of data available to Qlik than an Incremental Load?

      http://www.quickintelligence.co.uk/qlikview-incremental-load/

       

      Currently my process is:

      1. Load records updated today

      2. Concatenate with a QVD optimized load of records updated before today (identified through WHERE NOT EXISTS).
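
      A minimal sketch of what this looks like in script, assuming invented table and field names (TransactionID as the primary key; the date filter syntax will depend on your database):

      LET vToday = Date(Today(), 'YYYY-MM-DD');

      Transactions:
      LOAD TransactionID, LastModified, Amount;
      SQL SELECT TransactionID, LastModified, Amount
      FROM dbo.Transactions
      WHERE LastModified >= '$(vToday)';        // 1. records updated today

      Concatenate (Transactions)
      LOAD TransactionID, LastModified, Amount
      FROM Transactions.qvd (qvd)
      WHERE NOT EXISTS (TransactionID);         // 2. older records from the QVD

      STORE Transactions INTO Transactions.qvd (qvd);

      The second load stays QVD-optimized as long as the WHERE NOT EXISTS uses a single field and no fields are renamed or transformed.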

       

      The process takes about 12 secs per million records, and I am doing this across several tables, bringing my overall ETL to just over 3 minutes.

       

      But this is slower than the business requirement for data availability allows for.  The customers need their charts up to date within 2 minutes.

       

      You might ask… do they really need all million records? The answer is “no”; on average about 10k records per table are needed to serve all our current charts, but the tricky bit is determining exactly which records those are.

       

      For example, in most cases the record required will have been updated within the last 60 days, so we can use an SQL select based on the last modified date to isolate about 97.5% of the records we need.  But that still leaves 2.5% missing!
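
      For illustration, the date-bounded extract is essentially something like this (SQL Server style syntax, table and field names invented):

      Orders:
      SQL SELECT *
      FROM dbo.Orders
      WHERE LastModified >= DATEADD(day, -60, GETDATE());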

       

      We have made attempts at extending the timeframe, or using more complex SQL statements to isolate the pool of records required from the larger tables, but there always seems to be a handful of records we miss. Besides, needing to design complex SQL for every table we want to extract doesn't seem a very Qlik way of doing things.

       

      Look forward to your ideas folks.

        • Re: An ETL method for large tables faster than Incremental Load?
          Bill Markham

          For the 2 main bits:

           

          1. Load records updated today

          2. Concatenate with a QVD optimized load of records updated before today (identified through WHERE NOT EXISTS).


          How long does each one take?  As in, which bit takes the most time?

            • Re: An ETL method for large tables faster than Incremental Load?
              Rean Fadyl

              Part 2 takes the most time.  The duration depends on the size of the table being processed.

                • Re: An ETL method for large tables faster than Incremental Load?
                  Bill Markham

                   I think that having got the Incremental Load update down to 3 minutes means your QV script and db sql are pretty much as good as they are going to get.

                   

                   Sometimes splitting a large qvd into smaller qvd's, maybe a discrete qvd per month, can help.  But I somehow guess that may not be a viable option for you as you say "We have made attempts at extending the timeframe, or using more complex SQL statements to isolate the pool of records required from the larger tables, but there always seems to be a handful of records we miss."
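
                   A one-off restructuring along these lines could do the split, so the daily refresh afterwards only has to rewrite the current month's qvd (names are just illustrative, and note the WHERE clause makes this particular load non-optimized):

                   FOR Each vMonth in '2016-01', '2016-02', '2016-03'
                       MonthData:
                       LOAD * FROM Transactions.qvd (qvd)
                       WHERE Text(Date(MonthStart(LastModified), 'YYYY-MM')) = '$(vMonth)';

                       STORE MonthData INTO [Transactions_$(vMonth).qvd] (qvd);
                       DROP Table MonthData;
                   NEXT vMonth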

                   

                   The only way I can think of to speed up your Optimised Load, which takes up most of the time, would be at the physical level - as in maybe holding the qvd on fast solid state discs with a fast i/o route from them to the processor / RAM.  This will cost money and [cynical hat on now] it is amazing how often, when the person complaining about speed is advised that the solution would need money spending, they change their mind and decide that accepting the status quo is better than denting their budget.

                   

                   From another stance, whilst QlikView is an excellent OLAP / BI / Data Discovery tool, it may well not be a suitable tool for a 2 minute refresh of a large dataset.  I would class that as an "Operational Reporting" or OLTP requirement, needing an OLTP solution rather than an OLAP solution.

              • Re: An ETL method for large tables faster than Incremental Load?
                Marcus Sommer

                 If the qvd-loading is already optimized and better hardware is rather a future dream, then it could be helpful to consider whether you really need all the fields (it sounds silly, but quite often you don't) and whether the fields themselves could be optimized, for example by splitting a timestamp into a date and a time field - see for this: The Importance Of Being Distinct (and this logic could be applied to other fields, too).
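
                 For example, the split could be done in the extract itself (field names are invented):

                 Transactions:
                 LOAD TransactionID,
                      Date(Floor(LastModified)) as ModifiedDate,
                      Time(Frac(LastModified))  as ModifiedTime;
                 SQL SELECT TransactionID, LastModified
                 FROM dbo.Transactions;

                 The date part then has only a few hundred distinct values and the time part at most 86,400, instead of potentially millions of distinct timestamps.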

                 

                 Another way might be to replace large string fields like hash fields with their original content, or to replace them with numerical values - not with autonumber(), which won't work with incremental loads, but rather by building up an additional dimension table in the end. The idea behind this is to reduce the space and RAM consumption.
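
                 A very rough sketch of that idea, with invented names (TodaysRecords is the freshly extracted data, and first-run handling is left out), keeping the assigned keys stable across incremental loads by storing them in their own qvd:

                 HashDim:
                 LOAD HashValue as DimHashValue, HashKey
                 FROM HashDim.qvd (qvd);                      // keys assigned in earlier runs

                 TempMax:
                 LOAD Max(HashKey) as MaxKey Resident HashDim;
                 LET vMaxKey = Alt(Peek('MaxKey'), 0);
                 DROP Table TempMax;

                 Concatenate (HashDim)                        // new keys only for unseen hash values
                 LOAD HashValue as DimHashValue,
                      $(vMaxKey) + RowNo() as HashKey;
                 LOAD Distinct HashValue
                 Resident TodaysRecords
                 Where Not Exists(DimHashValue, HashValue);

                 STORE HashDim INTO HashDim.qvd (qvd);

                 HashMap:
                 MAPPING LOAD DimHashValue, HashKey Resident HashDim;

                 In the fact table the long hash string is then replaced by ApplyMap('HashMap', HashValue), so only the small numeric key is stored there and each string lives once in the dimension table.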

                 

                 Also, Bill's idea of slicing the qvd's might be useful - maybe not in a horizontal direction but in a vertical direction, i.e. by fields rather than by rows.
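
                 For a vertical split the field list on STORE could be used, each slice keeping the key field so the pieces can be joined back up (again invented names):

                 STORE TransactionID, Amount, LastModified FROM Transactions INTO Transactions_Numbers.qvd (qvd);
                 STORE TransactionID, LongComment, HashValue FROM Transactions INTO Transactions_Texts.qvd (qvd);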

                 

                 This, as well as the idea of creating additional dimension tables or replacing fields, will add some overhead and make things more complex, but it might speed up the individual qvd loads and could be faster overall - especially if you could split the update task into several update tasks running in parallel (check first what your most important hardware bottleneck is).

                 

                - Marcus

                  • Re: An ETL method for large tables faster than Incremental Load?
                    Rean Fadyl

                    Thanks Marcus,

                     

                     All good points, and with regards to the fields I expect pruning here could save an additional 10-20 secs.

                     

                     The other points suggest a transformation of the data, which we don't do until the next step of the process after Extract.

                     

                     This does raise the idea of doing our incremental load from the Transform stage onward, but I suspect it wouldn't work in our case, as we need the wider dataset available to the transform code in order for the Apply Maps to work etc.

                      • Re: An ETL method for large tables faster than Incremental Load?
                        Marcus Sommer

                         I must admit that my thoughts (apart from the first two, about checking whether fields could be dropped or optimized) are rather a worst case - trying to speed up the reload times at the cost of the data model and/or the GUI performance, and with an unknown outcome it's not certain that those efforts would save enough reload time - in your case about 33%.

                         

                         Another thought is to use a two-step update process - during business hours, make a binary load of the data model and add the new SQL data to it (maybe also prepared as a qvd), and during the nightly update window the whole data model is rebuilt.
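
                         Roughly like this for the intraday task; the binary statement must be the very first statement in the script, and the names are only examples:

                         Binary DataModel.qvw;                       // start from last night's finished data model

                         LET vToday = Date(Today(), 'YYYY-MM-DD');

                         Concatenate (Transactions)
                         LOAD TransactionID, LastModified, Amount;
                         SQL SELECT TransactionID, LastModified, Amount
                         FROM dbo.Transactions
                         WHERE LastModified >= '$(vToday)';          // only today's new or changed rows

                         The nightly task then rebuilds the whole data model from the qvd's as usual, removing any duplicates this shortcut may have introduced.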

                         

                        - Marcus

                    • Re: An ETL method for large tables faster than Incremental Load?
                      Stefan Wühl

                      Maybe have another look at your complex SQL queries.

                       

                      If you only need to load some 10k rows instead of millions per table, this may not only reduce your load time, but also improve the overall performance in the front end.

                       

                      And even if designing complex SQL may not be 'a very Qlik way of doing things', using only the amount of data really needed is.

                        • Re: An ETL method for large tables faster than Incremental Load?
                          Rean Fadyl

                          Thanks.  Our solution as it stands at the moment is actually a combo of the incremental load and SQL methods.  Where the SQL queries do return all the records required, that is what we use, as it is faster.  We are using the incremental load as a fallback position.

                           

                          Part of the issue is the amount of time available for development; if we had more time then perfecting the SQL for every table would be the highest-performance way to go.  However, this would also add to the overhead of maintaining the solution into the future, as the SQL for each table may need to be reviewed in light of any new requirements or changes.  I also suspect that the SQL itself might start to look quite unwieldy, which was one of the downsides of the system we are replacing with Qlik.

                        • Re: An ETL method for large tables faster than Incremental Load?
                          Peter Cammaert

                          If I read your OP correctly, your Incremental load experiences problems with loading the right set of records. Shouldn't you fix that one first?

                          IMHO there are two steps in every ETL approach:

                           

                          • Step 1 - Select a LOAD strategy and make it work 100% of the time. It doesn't really matter that a new strategy doesn't produce sub-2 min reloads immediately. However, it must provide room for improvement. For example, use Direct Discovery to import data that is only needed at specific moments, and piece by piece (not by the millions) - see the sketch after this list.
                          • Step 2 - Optimize the LOAD times without changing the LOAD strategy. This can be done best by influencing external factors like hardware (SSDs!), I/O bandwidth, restructuring and parallelizing queries and loads etc. Usually there is lots of room for improvement, with only physics and budget to take into account.
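
                          The Direct Discovery table mentioned in Step 1 would look roughly like this (QlikView 11.2+ syntax, names invented); the detail rows stay in the database and are only fetched when a chart actually needs them:

                          DIRECT QUERY
                              DIMENSION TransactionID, CustomerID, OrderDate
                              MEASURE Amount, Quantity
                          FROM dbo.Transactions;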

                           

                          If the second step doesn't produce the required improvements, start over again by selecting a new strategy and making that strategy work.

                           

                          And in the end, there is still a hard limit (now, tomorrow or in the near future): can a refresh of all this data be done in under 2 mins?

                           

                          Peter