7 Replies   Latest reply: Apr 26, 2013 4:28 AM by Friedrich Hofmann

    Advanced performance tuning

    Nick Borunov
      Here is the deal.
      This is the statement which I need to tune:
      TempDates:
           LOAD
                 'D' as AggrLevel,
                 *,
                  if(dmDateHour = MonthStart(dmDateHour), ES) as MonthES, // calc ES for Monthly aggregation
      
                  if(dmDateHour = WeekStart(dmDateHour), ES) as WeekES, // calc ES for Weekly aggregation
      
                 MonthStart(dmDateHour) as MonthStartDay,
                 WeekStart(dmDateHour) as WeekStartDay
           ;
           LOAD
                 DPId,
                 TZId,
                 StationId,
                 DayDateHour as dmDateHour,
                 AggrCountry,
                 sum(TLH) as TLH,
                 sum(SS) as SS,
                  sum(if(DPESDateHour = dmDateHour, ES)) as ES // calc ES for Daily aggregation
           Group by
                 DPId,
                 TZId,
                 StationId,
                 DayDateHour,
                 AggrCountry
           ;
           LOAD
                 DPId,
                 TZId,
                 StationId,
                 dmDateHour,
                 AggrCountry,
                 DayDateHour,
                 sum(TLH) as TLH,
                 sum(SS) as SS,
                 sum(ES) as ES,
                  Min(dmDateHour) as DPESDateHour // first DateHour for a day part
           Group by 
                 DPId,
                 TZId,
                 StationId,
                 dmDateHour,
                 AggrCountry,
                 DayDateHour 
           ;
           LOAD
                 DPId,
                 TZId,
                 StationId,
                 dmDateHour,
                 AggrCountry,
                 DayDateHour,
                 TLH,
                 SS,
                 ES
           Resident Temp
            Where DayDateHour >= $(vDateStart) // daily loop condition
            and DayDateHour < $(vDateEnd); 
      
      Temp table:
      The table has aggregated information about sessions on different stations for predefined day parts.
      
        Field         Description
        DPId          30 possible day parts (e.g. "Whole Week/Whole Day", "Monday-Friday/10-13", "Tuesday-Sunday/9-20", etc.)
        TZId          5 possible time zones (e.g. GMT, EST, etc.)
        StationId     About 16,500 station IDs
        dmDateHour    A timestamp rounded down to the hour (e.g. 1/1/2011 10:00:00); in QV syntax: floor(timestamp * 24)/24
        AggrCountry   2 possible countries: Main and Others
        DayDateHour   The beginning of the day, derived from dmDateHour (= floor(dmDateHour))
        TLH           Total hours of sessions during dmDateHour (> 0)
        SS            Number of sessions started during dmDateHour
        ES            Number of sessions entering the current dmDateHour from the previous hour
      
      The table holds 1 month of data. As you can see, it may have 30 * 5 * 16500 * (30 * 24) * 2
      = 3,564,000,000 records. Usually it is not that huge, because not all stations have
      activity during a month; in practice it is about 30-40 million records.
      
      The main problem is related to ES. I need to derive the number of Active Sessions (AS), which
      is Sum(SS) + Sum(ES of the first dmDateHour in a day part), and I need to aggregate it to the
      day level. That is why I have these multi-level aggregations.
      
      I've put this condition at the beginning
      
      Where DayDateHour >= $(vDateStart)
            and DayDateHour < $(vDateEnd);
      
      to reduce the number of records for the aggregation. I have a loop where I derive vDateStart
      and vDateEnd for each day of the month. If I remove this condition, it exhausts the whole
      memory on the server (98 GB).
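
      For reference, the loop is along these lines (only a sketch; the real start date and the
      surrounding code are simplified):

           FOR vDay = 0 TO 29 // one iteration per day of the month
                LET vDateStart = Num(MakeDate(2011, 1, 1) + $(vDay));
                LET vDateEnd = $(vDateStart) + 1;
                // ... the TempDates load above runs here, filtered by
                // $(vDateStart) and $(vDateEnd) ...
           NEXT vDay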
      
      So, the daily aggregation takes 45 minutes per day (about 22 hours for the whole month).
      
      If I have just one day of data in the Temp table, the same aggregation for a day takes about
      3 minutes. But the most interesting thing is that, in this daily loop, each of the remaining
      days of the month takes 90 seconds to retrieve NO DATA from the Temp table. That means the
      aggregation itself takes about 90 seconds, and another 90 seconds go to retrieving the data
      for the condition.
      
      My assumption is that the more data I have in the table, the longer this data retrieval
      takes. (It's a pity that we don't have indexes in QV.)
      
      Is there another way to make this retrieval faster? Or maybe there is a better way to write
      my statement?
      
      Please help.
      
      P.S. The Word version is attached.
      
        • Re: Advanced performance tuning
          Daniel Rozental

          Nick, it's really hard to give a solution without being able to test things with the real volume and really understand why you're doing each thing.

          I think it will be best to work with a consultant on site.


          A few ideas that might help:

          1) Instead of doing something like this


          if(dmDateHour = MonthStart(dmDateHour), ES) as MonthES, // calc ES for Monthly aggregation
          if(dmDateHour = WeekStart(dmDateHour), ES) as WeekES, // calc ES for Weekly aggregation
          MonthStart(dmDateHour) as MonthStartDay,
          WeekStart(dmDateHour) as WeekStartDay


          Perhaps you can have a table, built in the script using AUTOGENERATE, that only has records for MonthStart and WeekStart dates.


          So you can do something like


          LEFT JOIN(TempDates)
          LOAD
                    dmDateHour,
                    MonthStart(dmDateHour) as MonthStartDay,
                    WeekStart(dmDateHour) as WeekStartDay,
                    1 as FLG_MonthES,
                    1 as FLG_WeekES
          RESIDENT CALENDAR;

          And then use the FLGs in the calculations instead of creating new fields, something like SUM(ES*FLG_MonthES).
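
          For concreteness, here is a minimal sketch of how such a CALENDAR table could be
          generated with AUTOGENERATE (the range variables are assumptions, and the flags are
          computed conditionally here so that month starts and week starts can be told apart):

          CALENDAR:
          LOAD
                    dmDateHour,
                    If(dmDateHour = MonthStart(dmDateHour), 1, 0) as FLG_MonthES,
                    If(dmDateHour = WeekStart(dmDateHour), 1, 0) as FLG_WeekES;
          LOAD
                    $(vDateStart) + (RecNo() - 1) / 24 as dmDateHour // one row per hour in the range
          AUTOGENERATE ($(vDateEnd) - $(vDateStart)) * 24;

          Because this runs over at most one row per distinct hour, the If() conditions are
          evaluated far fewer times than they would be against the full fact table.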


          2) Instead of doing this


                 Where DayDateHour >= $(vDateStart) // daily loop condition
                  and DayDateHour < $(vDateEnd);

          You can, again, create a table with the dates you want to obtain, i.e. all dates in the week or all dates in the month, and do something like this


          LOAD
                    Date AS DayDateHour 
          RESIDENT CALENDAR
          Where Date >=$(vDateStart)
             and Date < $(vDateEnd);
          
          LOAD
              DPId,
              TZId,
              StationId,
              dmDateHour,
              AggrCountry,
              DayDateHour,
              TLH,
              SS,
              ES
          FROM File.qvd (qvd)
          WHERE Exists(DayDateHour);

          This will cause the load to be optimized and will also filter the records to the desired dates. (A QVD load stays optimized only when the sole WHERE clause is a single Exists() on one field, which is why this filter is so cheap.)

          • Re: Advanced performance tuning

            You should comment out the top portion and load just the first LOAD statement, and record the time. Then uncomment each preceding LOAD statement one at a time and reload to capture the time. Use the time differences to figure out where you really start to lose performance. I would also monitor CPU and RAM and write down some notes at each step of the way.
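
            If it helps, the elapsed time per step can be logged from the script itself, along
            these lines (the variable names here are made up):

            LET vStepStart = Num(Now());
            // ... one LOAD step goes here ...
            LET vStepElapsed = Interval(Now() - $(vStepStart), 'hh:mm:ss');
            TRACE Step elapsed: $(vStepElapsed); // written to the reload log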


            Also, as stated above, it's hard to really know the impact of a suggestion without using your hardware and data, but I would try the following:


            1) Separate Day, Date and Hour into different fields; this reduces the number of distinct values per field and may decrease your memory footprint. Probably only a slight difference, but it may add up for you.
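
            A minimal sketch of that split, reusing the Temp table's fields (the new field names
            are made up):

            SplitTemp:
            LOAD
                DPId,
                TZId,
                StationId,
                Floor(dmDateHour) as dmDate, // date part: at most 31 distinct values per month
                Hour(dmDateHour) as dmHour,  // hour part: only 24 distinct values
                AggrCountry,
                TLH,
                SS,
                ES
            Resident Temp;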


            2) All of your date functions return dual values; try putting a Num() around them so only the numeric value is passed to the next load. I'm not sure whether the overhead of the Num() will outweigh the removal of the dual value; only testing will tell.
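
            Applied to the expressions in the original script, that could look like this (a sketch):

            TempDates:
            LOAD
                *,
                Num(MonthStart(dmDateHour)) as MonthStartDay, // numeric value only, no dual text
                Num(WeekStart(dmDateHour)) as WeekStartDay
            Resident Temp;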


            3) You may want to try using numbers as your IDs rather than codes: 1, 2, ... may perform better than EST, GMT, ... Again, I would test.
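
            AutoNumber() can do that conversion in the script; a sketch (the counter IDs in the
            second argument are arbitrary labels):

            KeyedTemp:
            LOAD
                AutoNumber(TZId, 'TZ') as TZId, // 1, 2, ... instead of 'GMT', 'EST', ...
                AutoNumber(DPId, 'DP') as DPId,
                StationId,
                dmDateHour,
                AggrCountry,
                DayDateHour,
                TLH,
                SS,
                ES
            Resident Temp;

            Keep in mind that AutoNumber() values are only stable within a single reload, so they
            should not be compared across separately generated QVDs.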


            4) Try saving the Temp resident table to a QVD, dropping it, and then loading it back into memory with the WHERE statement. This may sound crazy, but it causes QlikView to compress the data and frees up the memory. Loading a large resident table into another resident table is often a bad idea, because until you drop that Temp table (which won't be needed in the end) you're using twice the memory.
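
            A sketch of that round trip (the file name is made up):

            STORE Temp INTO Temp.qvd (qvd);
            DROP TABLE Temp;
            Temp:
            LOAD *
            FROM Temp.qvd (qvd)
            WHERE DayDateHour >= $(vDateStart)
              and DayDateHour < $(vDateEnd);
            // note: a WHERE clause other than a single Exists() makes this an
            // unoptimized QVD load, but it still frees the resident copy first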


            I can't promise any of these suggestions will work, but I would try each one and record the results. Please share the results; I would be interested to know whether each suggestion has a negative or positive impact. There are also other things to consider, like your hardware and software specs. I have seen machines where the QlikView load process slows down if the CPU affinity is too high (this only started occurring after multi-threaded loads were introduced).

              • Re: Advanced performance tuning
                Friedrich Hofmann

                Hi,


                 Is there a document or a blog around with general performance tuning tips and tricks, everything one can do to make different parts of a QlikView document run faster?

                Thanks a lot!

                Best regards,


                DataNibbler


                 P.S.: I've bookmarked one thread where someone put together a collection of such things. Still, it seems like a good idea to put all that into a document...