13 Replies Latest reply: Apr 25, 2016 11:53 AM by Bill Markham

    An ETL method for large tables faster than Incremental Load?

    Rean Fadyl

      Hello experts,

      I was wondering if anyone has any ideas for a faster method to make an entire table worth of data available to Qlik than an Incremental Load?

      http://www.quickintelligence.co.uk/qlikview-incremental-load/

       

      Currently my process is:

      1. Load records updated today

      2. Concatenate with a QVD optimized load of records updated before today (identified through WHERE NOT EXISTS).
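
      A minimal sketch of what this looks like in script, assuming invented table and field names (TransactionID as the primary key; the date filter syntax will depend on your database):

      LET vToday = Date(Today(), 'YYYY-MM-DD');

      Transactions:
      LOAD TransactionID, LastModified, Amount;
      SQL SELECT TransactionID, LastModified, Amount
      FROM dbo.Transactions
      WHERE LastModified >= '$(vToday)';        // 1. records updated today

      Concatenate (Transactions)
      LOAD TransactionID, LastModified, Amount
      FROM Transactions.qvd (qvd)
      WHERE NOT EXISTS (TransactionID);         // 2. older records from the QVD

      STORE Transactions INTO Transactions.qvd (qvd);

      The second load stays QVD-optimized as long as the WHERE NOT EXISTS uses a single field and no fields are renamed or transformed.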

       

      The process takes about 12 secs per million records, and I am doing this across several tables, bringing my overall ETL to just over 3 minutes.

       

      But this is slower than the business requirement for data availability allows for.  The customers need their charts up to date within 2 minutes.

       

      You might ask… do they really need all million records? The answer is “no”; on average about 10k records per table are needed to serve all our current charts, but the tricky bit is determining exactly which records those are.

       

      For example, in most cases the record required will have been updated within the last 60 days, so we can use an SQL select based on the last modified date to isolate about 97.5% of the records we need.  But that still leaves 2.5% missing!
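
      For illustration, the date-bounded extract is essentially something like this (SQL Server style syntax, table and field names invented):

      Orders:
      SQL SELECT *
      FROM dbo.Orders
      WHERE LastModified >= DATEADD(day, -60, GETDATE());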

       

      We have made attempts at extending the timeframe, or using more complex SQL statements to isolate the pool of records required from the larger tables, but there always seems to be a handful of records we miss. Besides, needing to design complex SQL for every table we want to extract doesn't seem a very Qlik way of doing things.

       

      Look forward to your ideas folks.

        • Re: An ETL method for large tables faster than Incremental Load?
          Bill Markham

          For the 2 main bits:

           

          1. Load records updated today

          2. Concatenate with a QVD optimized load of records updated before today (identified through WHERE NOT EXISTS).


          How long does each one take?  As in, which bit takes the most time?

            • Re: An ETL method for large tables faster than Incremental Load?
              Rean Fadyl

              Part 2 takes the most time.  The duration depends on the size of the table being processed.

                • Re: An ETL method for large tables faster than Incremental Load?
                  Bill Markham

                   I think that having got the Incremental Load update down to 3 minutes means your QV script and db sql are pretty much as good as they are going to get.

                   

                   Sometimes splitting a large qvd into smaller qvd's, maybe a discrete qvd per month, can help.  But I somehow guess that may not be a viable option for you as you say "We have made attempts at extending the timeframe, or using more complex SQL statements to isolate the pool of records required from the larger tables, but there always seems to be a handful of records we miss."
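
                   A one-off restructuring along these lines could do the split, so the daily refresh afterwards only has to rewrite the current month's qvd (names are just illustrative, and note the WHERE clause makes this particular load non-optimized):

                   FOR Each vMonth in '2016-01', '2016-02', '2016-03'
                       MonthData:
                       LOAD * FROM Transactions.qvd (qvd)
                       WHERE Text(Date(MonthStart(LastModified), 'YYYY-MM')) = '$(vMonth)';

                       STORE MonthData INTO [Transactions_$(vMonth).qvd] (qvd);
                       DROP Table MonthData;
                   NEXT vMonth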

                   

                   The only way I can think of to speed up your Optimised Load, which takes up most of the time, would be at the physical level - as in maybe holding the qvd on fast solid state discs with a fast i/o route from them to the processor / RAM.  This will cost money and [cynical hat on now] it is amazing how often, when the person complaining about speed is advised that the solution would need money spending, they change their mind and decide that accepting the status quo is better than denting their budget.

                   

                   From another stance, whilst QlikView is an excellent OLAP / BI / Data Discovery tool, it may well not be a suitable tool for a 2 minute refresh of a large dataset.  I would class that as an "Operational Reporting" or OLTP requirement, needing an OLTP solution rather than an OLAP solution.

              • Re: An ETL method for large tables faster than Incremental Load?
                Marcus Sommer

                 If the qvd-loading is already optimized and better hardware is rather a future dream, then it could be helpful to consider whether you really need all the fields (it sounds silly, but quite often you don't) and whether the fields themselves could be optimized, for example by splitting a timestamp into a date and a time field - see for this: The Importance Of Being Distinct (and this logic could be applied to other fields, too).
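
                 For example, the split could be done in the extract itself (field names are invented):

                 Transactions:
                 LOAD TransactionID,
                      Date(Floor(LastModified)) as ModifiedDate,
                      Time(Frac(LastModified))  as ModifiedTime;
                 SQL SELECT TransactionID, LastModified
                 FROM dbo.Transactions;

                 The date part then has only a few hundred distinct values and the time part at most 86,400, instead of potentially millions of distinct timestamps.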

                 

                 Another way might be to replace large string fields like hash fields with their original content, or to replace them with numerical values - not with autonumber(), which won't work with incremental loads, but rather by building up an additional dimension table in the end. The idea behind this is to reduce the space and RAM consumption.
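
                 A very rough sketch of that idea, with invented names (TodaysRecords is the freshly extracted data, and first-run handling is left out), keeping the assigned keys stable across incremental loads by storing them in their own qvd:

                 HashDim:
                 LOAD HashValue as DimHashValue, HashKey
                 FROM HashDim.qvd (qvd);                      // keys assigned in earlier runs

                 TempMax:
                 LOAD Max(HashKey) as MaxKey Resident HashDim;
                 LET vMaxKey = Alt(Peek('MaxKey'), 0);
                 DROP Table TempMax;

                 Concatenate (HashDim)                        // new keys only for unseen hash values
                 LOAD HashValue as DimHashValue,
                      $(vMaxKey) + RowNo() as HashKey;
                 LOAD Distinct HashValue
                 Resident TodaysRecords
                 Where Not Exists(DimHashValue, HashValue);

                 STORE HashDim INTO HashDim.qvd (qvd);

                 HashMap:
                 MAPPING LOAD DimHashValue, HashKey Resident HashDim;

                 In the fact table the long hash string is then replaced by ApplyMap('HashMap', HashValue), so only the small numeric key is stored there and each string lives once in the dimension table.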

                 

                 Also, Bill's idea of slicing the qvd's might be useful - maybe not in a horizontal direction but in a vertical direction, i.e. by fields rather than by rows.
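
                 For a vertical split the field list on STORE could be used, each slice keeping the key field so the pieces can be joined back up (again invented names):

                 STORE TransactionID, Amount, LastModified FROM Transactions INTO Transactions_Numbers.qvd (qvd);
                 STORE TransactionID, LongComment, HashValue FROM Transactions INTO Transactions_Texts.qvd (qvd);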

                 

                 This, as well as the idea of creating additional dimension tables or replacing fields, will add some overhead and make things more complex, but it might speed up the individual qvd loads and could be faster overall - especially if you could split the update task into several update tasks running in parallel (check first what your most important hardware bottleneck is).

                 

                - Marcus

                  • Re: An ETL method for large tables faster than Incremental Load?
                    Rean Fadyl

                    Thanks Marcus,

                     

                     All good points, and with regards to the fields I expect pruning here could save an additional 10-20 secs.

                     

                     The other points suggest a transformation of the data, which we don't do until the next step of the process after Extract.

                     

                     This does raise the idea of doing our incremental load from the Transform stage onward, but I suspect it wouldn't work in our case, as we need the wider dataset available to the transform code in order for the Apply Maps to work etc.

                      • Re: An ETL method for large tables faster than Incremental Load?
                        Marcus Sommer

                         I must admit that my thoughts (apart from the first two, about checking whether fields could be dropped or optimized) are rather a worst case - trying to speed up the reload times at the cost of the data model and/or the GUI performance, and with an unknown outcome it's not certain that those efforts would save enough reload time - in your case about 33%.

                         

                         Another thought is to use a two-step update process - during business hours, make a binary load of the data model and add the new SQL data to it (maybe also prepared as a qvd), and during the nightly update window the whole data model is rebuilt.
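
                         Roughly like this for the intraday task; the binary statement must be the very first statement in the script, and the names are only examples:

                         Binary DataModel.qvw;                       // start from last night's finished data model

                         LET vToday = Date(Today(), 'YYYY-MM-DD');

                         Concatenate (Transactions)
                         LOAD TransactionID, LastModified, Amount;
                         SQL SELECT TransactionID, LastModified, Amount
                         FROM dbo.Transactions
                         WHERE LastModified >= '$(vToday)';          // only today's new or changed rows

                         The nightly task then rebuilds the whole data model from the qvd's as usual, removing any duplicates this shortcut may have introduced.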

                         

                        - Marcus

                    • Re: An ETL method for large tables faster than Incremental Load?
                      Stefan Wühl

                      Maybe have another look at your complex SQL queries.

                       

                      If you only need to load some 10k rows instead of millions per table, this may not only reduce your load time, but also improve the overall performance in the front end.

                       

                      And even if designing complex SQL may not be 'a very Qlik way of doing things', using only the amount of data really needed is.

                        • Re: An ETL method for large tables faster than Incremental Load?
                          Rean Fadyl

                          Thanks.  Our solution as it stands at the moment is actually a combo of the incremental load and SQL methods.  Where the SQL queries do return all the records required, that is what we use, as it is faster.  We are using the incremental load as a fallback position.

                           

                          Part of the issue is the amount of time available for development; if we had more time then perfecting the SQL for every table would be the highest-performance way to go.  However, this would also add to the overhead of maintaining the solution into the future, as the SQL for each table may need to be reviewed in light of any new requirements or changes.  I also suspect that the SQL itself might start to look quite unwieldy, which was one of the downsides of the system we are replacing with Qlik.

                        • Re: An ETL method for large tables faster than Incremental Load?
                          Peter Cammaert

                          If I read your OP correctly, your Incremental load experiences problems with loading the right set of records. Shouldn't you fix that one first?

                          IMHO there are two steps in every ETL approach:

                           

                          • Step 1 - Select a LOAD strategy and make it work 100% of the time. It doesn't really matter that a new strategy doesn't produce sub-2 min reloads immediately. However, it must provide room for improvement. For example, use Direct Discovery to import data that is only needed at specific moments, and piece by piece (not by the millions) - see the sketch after this list.
                          • Step 2 - Optimize the LOAD times without changing the LOAD strategy. This can be done best by influencing external factors like hardware (SSDs!), I/O bandwidth, restructuring and parallelizing queries and loads etc. Usually there is lots of room for improvement, with only physics and budget to take into account.
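
                          The Direct Discovery table mentioned in Step 1 would look roughly like this (QlikView 11.2+ syntax, names invented); the detail rows stay in the database and are only fetched when a chart actually needs them:

                          DIRECT QUERY
                              DIMENSION TransactionID, CustomerID, OrderDate
                              MEASURE Amount, Quantity
                          FROM dbo.Transactions;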

                           

                          If the second step doesn't produce the required improvements, start over again by selecting a new strategy and making that strategy work.

                           

                          And in the end, there is still a hard limit (now, tomorrow or in the near future): can a refresh of all this data be done in under 2 mins?

                           

                          Peter