14 Replies · Latest reply: Jan 10, 2018 10:21 AM by Juan Moreno

    Storing millions of records in QVD files

    Juan Moreno

      Hi,

       

      I have a table in my DB with approximately 220 million records, and it's growing every day.

       

      In my data script, I decided to store the records in one QVD per month, but reading the data from the QVDs is taking more time than I expected (around 15 minutes) and is consuming a lot of resources on my server.

       

      Would it be better if I had one QVD per year? Is it faster for Qlik to read a single QVD than several of them?

       

      Thanks,

       

      Juan

        • Re: Storing millions of records in QVD files
          Tim Driller

          Hello Juan,

           

          In general, it doesn't matter for performance which way you go (monthly or yearly QVDs).

          Both ways need the same RAM and CPU if you load all the data.

           

          I think monthly QVDs are a good way to go; that way you can shrink the data model if a client only needs the last 6 months.

           

          I think what you need to check here is whether your load is an optimized load;

          if not, you can save a lot of time:

          Turning Unoptimized Loads into Optimized Loads
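
          For illustration, a minimal sketch of the difference (the path and the link_date field are assumed here, not taken from Juan's script): a straight field list - at most renamed, with a single-field Exists() filter - reads the QVD in optimized mode, while any calculated field drops down to a row-by-row standard load.

          // Optimized: plain field list, at most one single-field Exists() filter
          [traffic]:
          LOAD *
          FROM [lib://data/traffic_2017-12.qvd] (qvd)
          WHERE Exists(link_date);

          // Unoptimized: the expression forces Qlik to unpack every row
          [traffic_slow]:
          LOAD *,
            Year(link_date) AS link_year
          FROM [lib://data/traffic_2017-12.qvd] (qvd);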

           

          regards

          tim

            • Re: Storing millions of records in QVD files
              Juan Moreno

              For the monthly QVDs I'm using a temp table where I put the records for the month; then I store it in a QVD, drop the temp table, and load (concatenate) the QVD into my main table.

               

              It looks like a lot of steps. Do you think this could be done in a simpler way? I'm using the temp table to avoid storing records from previous months in the monthly QVDs.

                • Re: Storing millions of records in QVD files
                  pradosh thakur

                  1. Are you using an incremental load?

                  2. Are any aggregation functions with GROUP BY used in the script?

                  Check this out, it may help:

                  https://community.qlik.com/docs/DOC-19450
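
                  For reference, a minimal incremental-load sketch (every name here is a placeholder - the key field id, assumed unique per row, the timestamp start_date, the table and the paths): only rows newer than the last reload are pulled from the database, and the history is appended from the QVD with a single-field NOT Exists(), which typically keeps that QVD read optimized.

                  // 1. Pull only the rows added since the last reload
                  [traffic_new]:
                  LOAD id, start_date, speed_mph;
                  SQL SELECT id, start_date, speed_mph
                  FROM measure_link_15min
                  WHERE start_date > '$(vLastReload)';

                  // 2. Append the history; the single-field Exists() test keeps this read optimized
                  CONCATENATE ([traffic_new])
                  LOAD id, start_date, speed_mph
                  FROM [lib://data/traffic_history.qvd] (qvd)
                  WHERE NOT Exists(id);

                  // 3. Store the combined result back for the next run
                  STORE traffic_new INTO [lib://data/traffic_history.qvd] (qvd);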

                  • Re: Storing millions of records in QVD files
                    Marcus Sommer

                    It sounds like your temp-table approach isn't necessary; it could be replaced by loading the data with a WHERE Exists() clause to keep the loads optimized. Take a closer look at the link Tim provided.
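
                    A minimal sketch of that idea (assuming each monthly QVD carries a link_month field; paths and values are placeholders): load the wanted months into a one-field table first, then read the QVDs with a single-field Exists(), which keeps the reads optimized - no temp table needed on the read side.

                    // The months the app should contain
                    [months_wanted]:
                    LOAD * INLINE [
                    link_month
                    2017-11
                    2017-12
                    2018-01
                    ];

                    // Optimized read: only rows whose link_month is in months_wanted survive
                    [traffic]:
                    LOAD *
                    FROM [lib://data/traffic_*.qvd] (qvd)
                    WHERE Exists(link_month);

                    DROP TABLE [months_wanted];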

                     

                    - Marcus

                      • Re: Storing millions of records in QVD files
                        Juan Moreno

                        Hello all,

                         

                        Thanks a lot for your replies. I've had a look at the doc, but I don't see how it applies to my case. I don't want to overwhelm you with code, but I'd like to share the piece where I'm doing the following:

                         

                        1. If the QVD for the month exists, I concatenate its data into the main table (traffic).

                        2. If the QVD does not exist yet, I put the data in a temporary table (temp_traffic):

                          2.1. STORE temp_traffic into a new QVD

                          2.2. DROP the table temp_traffic

                          2.3. Concatenate this new QVD into the main table (traffic)

                         

                        It takes a lot of time, which is why it seems to me I must be doing something wrong.

                         

                        Any idea is welcome! Thanks!

                         

                        --

                         

                        The code:

                         

                        // If the monthly QVD exists, concatenate its data to the main table.
                        // Note the dollar expansion: the variable must be expanded inside the expression.
                        IF NOT IsNull(QvdCreateTime('$(_vFileName)')) THEN

                          [traffic]:
                          CONCATENATE LOAD
                            "link_timestamp",
                            "link_date",
                            "link_time",
                            "link_time_15min",
                            "link_minute",
                            "link_hour",
                            [link.id],
                            speed,
                            "speed_mph",
                            TravelTimeIndex,
                            traveltime
                          FROM $(_vFileName) (qvd);

                        ELSE

                          // Otherwise build the month in a temp table. A plain LOAD is used here:
                          // a CONCATENATE at this point would append the rows to the previously
                          // loaded table instead of creating temp_traffic.
                          [temp_traffic]:
                          LOAD
                            "start_date" AS "link_timestamp",
                            Date(Floor("start_date")) AS "link_date",
                            Time("start_date", 'hh:mm') AS "link_time",
                            Time(Round(Frac(Time("start_date", 'hh:mm')), 1/96), 'hh:mm') AS "link_time_15min",
                            Minute("start_date") AS "link_minute",
                            Hour("start_date") AS "link_hour",
                            "item_id" AS [link.id],
                            speed,
                            "speed_mph",
                            Round(freeflow_speed / speed_mph, 0.1) AS TravelTimeIndex,
                            traveltime;
                          SQL SELECT
                            "start_date",
                            "item_id",
                            speed,
                            "speed_mph",
                            traveltime,
                            link.freeflow_speed
                          FROM pems."public"."measure_link_15min" tc
                          LEFT JOIN link ON tc.item_id = link.link_id
                          WHERE (start_date > '$(vStartDate)' AND start_date <= '$(vEndDate)') AND item_id = 11001069;

                          STORE temp_traffic INTO $(_vFileName) (qvd);

                          DROP TABLE temp_traffic;

                          // Read the freshly stored QVD back into the main table
                          [traffic]:
                          CONCATENATE LOAD
                            "link_timestamp",
                            "link_date",
                            "link_time",
                            "link_time_15min",
                            "link_minute",
                            "link_hour",
                            [link.id],
                            speed,
                            "speed_mph",
                            TravelTimeIndex,
                            traveltime
                          FROM $(_vFileName) (qvd);

                        END IF

                  • Re: Storing millions of records in QVD files
                    Juan Moreno

                    Well, I'm supposed to make all records available, which is why I read the QVDs of all months. The problem I'm seeing is that when I load the 220 million records into the table, the app stops working (I can't open it; I see the animated circles indefinitely).

                      • Re: Storing millions of records in QVD files
                        Martin Pohl

                        Try debugging the script with a few records to check that the data model is OK (no synthetic keys or loops).
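
                        A quick way to do that (a sketch; the path is a placeholder): the FIRST prefix caps how many rows the following LOAD reads, so the data model can be built and inspected in seconds.

                        // Load only a sample while debugging the model
                        FIRST 1000
                        LOAD *
                        FROM [lib://data/traffic_2017-12.qvd] (qvd);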

                        • Re: Storing millions of records in QVD files
                          Petter Skjolden

                          When you have more than a million records, it is always important to look at how you can help Qlik get the most out of its built-in compression mechanisms.

                           

                          1. The #1 method is to look at the cardinality of each field. If you can turn a high-cardinality field into a low-cardinality field, you can save a lot of space (memory), and it reduces calculation time accordingly.
                          2. Don't store redundant data - fields that contain just variants of each other. You have several of them:

                          "link_timestamp",

                          "link_date",

                          "link_time",

                          "link_time_15min",

                          "link_minute",

                          "link_hour",

                          [link.id],

                          speed,

                          "speed_mph",

                          TravelTimeIndex,

                          traveltime

                           

                            • link_timestamp might be stored with high precision, so it is very often a high-cardinality field. You can often round it off to a much lower resolution and save a lot of memory - sometimes as much as a 6-20x reduction in storage.
                            • Maybe you don't need link_timestamp at all - why do you keep it? Don't the other fields contain the same information if you combine them?
                            • link_time and link_time_15min are redundant. If you need to keep link_time, at least round it off to the lowest acceptable resolution, like seconds or minutes. You might save a lot there too.
                            • Be sure that link_date contains only the date part, so truncate it to an integer with Floor(link_date).
                            • speed and speed_mph are redundant. Why not just use a conversion factor when you need mph? It's a constant of 1.60934 (km/h per mph) and is easy to add to any expression.
                            • Be sure to round off the speed to the lowest resolution possible to save memory. Do you really need decimals? If not, make it an integer - you can round it off too.

                           

                          Splitting a high-cardinality field like a timestamp into its constituent parts - date, hour, minute, second, tenths of a second - will help tremendously, but don't keep the original timestamp. This enables much, much higher compression under the hood for Qlik, and you might save a lot of memory.
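
                          To make that concrete, here is a sketch of such a load (an illustration reusing this thread's field names; the QVD path is assumed): the timestamp is split into a date part and a 15-minute time-of-day part, the raw timestamp itself is dropped, and the speed is rounded to an integer.

                          [traffic]:
                          LOAD
                            Date(Floor(start_date)) AS link_date,                            // integer date: ~365 values per year
                            Time(Round(Frac(start_date), 1/96), 'hh:mm') AS link_time_15min, // at most 96 distinct values
                            item_id AS [link.id],
                            Round(speed_mph) AS speed_mph,                                   // integer speed, low cardinality
                            traveltime
                            // note: start_date itself is deliberately not kept
                          FROM [lib://data/measure_link_15min.qvd] (qvd);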

                            • Re: Storing millions of records in QVD files
                              Juan Moreno

                              Thanks, Petter. I've been editing the script following your advice and I've seen some improvement. I'm keeping the timestamp; I think its cardinality is not that high, since in my DB table all values are 15-minute samples (9:00, 9:15, 9:30, 9:45, 10:00, 10:15...). But I've got rid of all the other "link" fields and of speed, and I've rounded speed_mph as you suggested.

                               

                              Still, it takes a long time to load, but I'm now looking at the resources of the machine where I have Qlik Sense installed. I'm using two cores, and CPU usage is at 100% during the load. I'm going to increase to 4 and see what happens.

                               

                              Same with the memory, but this time not when loading, rather while editing or using the app. The 8 GB I have are in use most of the time, so I'm going to increase the memory too.

                               

                              Thanks for the help, and I'll let you all know how it goes.

                               

                              Regards,

                               

                              Juan