reanfadyl
Partner - Creator

An ETL method for large tables faster than Incremental Load?

Hello experts,

I was wondering if anyone has ideas for a method of making an entire table's worth of data available to Qlik that is faster than an Incremental Load?

http://www.quickintelligence.co.uk/qlikview-incremental-load/

Currently my process is:

1. Load records updated today.

2. Concatenate with a QVD-optimized load of records updated before today (identified through WHERE NOT EXISTS); see the sketch below.
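For reference, the script is roughly along these lines (simplified sketch; the table, field and QVD names - Orders, OrderID, LastModified, Amount - and the vToday variable are just placeholders):

    Orders:
    LOAD OrderID, LastModified, Amount;
    SQL SELECT OrderID, LastModified, Amount
    FROM Orders
    WHERE LastModified >= '$(vToday)';     // step 1: records updated today

    Concatenate (Orders)                   // step 2: older history from the QVD
    LOAD OrderID, LastModified, Amount
    FROM Orders.qvd (qvd)
    WHERE NOT EXISTS (OrderID);            // single-field Exists keeps the QVD load optimized

    STORE Orders INTO Orders.qvd (qvd);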

The process takes about 12 seconds per million records, and I am doing this across several tables, bringing my overall ETL to just over 3 minutes.

But this is slower than the business requirement for data availability allows. The customers need their charts up to date within 2 minutes.

You might ask… do they really need all million records? The answer is “no”; on average we only need about 10k records per table to serve all our current charts, but the tricky bit is determining exactly which records those are.

For example, in most cases the records required will have been updated within the last 60 days, so we can use an SQL select based on the last modified date to isolate about 97.5% of the records we need. But that still leaves 2.5% missing!
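That select is along these lines (sketch only; names are placeholders and the date functions are the SQL Server flavour - they differ per source database):

    Recent:
    LOAD OrderID, LastModified, Amount;
    SQL SELECT OrderID, LastModified, Amount
    FROM Orders
    WHERE LastModified >= DATEADD(day, -60, GETDATE());   // catches roughly 97.5% of the needed records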

We have made attempts at extending the timeframe and at using more complex SQL statements to isolate the pool of records required from the larger tables, but there always seems to be a handful of records we miss. Besides, needing to design complex SQL for every table we want to extract doesn't seem like a very Qlik way of doing things.

Look forward to your ideas folks.

13 Replies
marcus_sommer

I must admit that my thoughts (except the first two, regarding the check whether fields could be left out or optimized) are rather worst-case options - trying to speed up the reload times at the cost of the datamodel and/or the GUI performance. Especially with the outcome unknown, it's not certain that those efforts would save enough reload time - in your case about 33%.

Another thought is to use a two-step update process: during business hours, make a binary load of the existing datamodel and add the new SQL data to it (maybe also prepared as a QVD), and then rebuild the whole datamodel during the nightly update window.
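Roughly like this for the intraday step (untested sketch; the file, table and field names and the vLastReload variable are placeholders):

    Binary [DataModel.qvw];                  // must be the first statement - reuses the existing datamodel

    Concatenate (Orders)
    LOAD OrderID, LastModified, Amount;
    SQL SELECT OrderID, LastModified, Amount
    FROM Orders
    WHERE LastModified >= '$(vLastReload)';  // only the rows changed since the last reload

Rows updated since the last full build would then exist in both the old and the new version until the nightly rebuild of the whole datamodel cleans that up.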

- Marcus

Peter_Cammaert
Partner - Champion III

If I read your OP correctly, your Incremental load experiences problems with loading the right set of records. Shouldn't you fix that one first?

IMHO there are two steps in every ETL approach:

  • Step 1 - Select a LOAD strategy and make it work 100% of the time. It doesn't really matter that a new strategy doesn't produce sub-2 min reloads immediately. However, it must provide room for improvement. For example, use Direct Discovery to import data that is only needed at specific moments, piece by piece (not by the millions); see the sketch after this list.
  • Step 2 - Optimize the LOAD times without changing the LOAD strategy. This is best done by influencing external factors like hardware (SSDs!), I/O bandwidth, restructuring and parallelizing queries and loads, etc. Usually there is lots of room for improvement, with only physics and budget to take into account.
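For illustration, a Direct Discovery table is declared roughly like this (table and field names are just placeholders); the DIMENSION fields are loaded into memory while the MEASURE fields stay in the database and are aggregated on demand:

    DIRECT QUERY
        DIMENSION OrderID, CustomerID, OrderDate
        MEASURE Amount, Quantity
    FROM Orders;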

If the second step doesn't produce the required improvements, start over again by selecting a new strategy and making that strategy work.

And in the end, there is still a hard limit (now, tomorrow or in the near future): can a refresh of all this data be done in under 2 minutes at all?

Peter

reanfadyl
Partner - Creator
Author

Thanks Peter. Just to clarify: the Incremental Load method does return all the records I need (and more). It was the SQL method that was failing to return all the records we need.


Appreciate the other input. To be honest I haven't done much exploration of Direct Discovery across multiple tables yet. Are there any threads you recommend on this topic?



Anonymous
Not applicable

Personally I have never had much joy with Direct Discovery, so I gave up trying to use it. Maybe that's just me though.