4 Replies Latest reply: Mar 30, 2015 10:55 AM by Peter Turner

    How to optimize when loading large amount of data from txt file

    Rogerio Faria

      Hi all,

       

      I need to load a big text file, generate a QVD, and then a grouped QVD.

      It will be like this:  TXT --> QVD --> GroupedQVD

       

      The txt file is very big, more than 40 GB. (Using the file is mandatory; I can't connect to a database.)

      So I decided to split the file and load only 1 million rows at a time; this way many files will be generated.

      1 TXT --> Many QVD files --> Many Grouped QVD files

       

      This is ok.

      The problem is the processing time. It gets slower as it creates more files. It takes about 10 hours:

        - First Hour, 16 files generated

        - Second Hour, 9 files

        - Third Hour, 7 files

        - Fourth Hour, 5 files

        - Fifth Hour, 5 files

        - and so on....


      I know this happens because of the parameter "header is $(cont) lines" in the load script: on every iteration QlikView has to read the file from the beginning again just to skip those lines, so each pass takes longer than the last.

      Is there anything I can do, or any tip, to optimize this process?

       

      My code is below:

       

      let seq = 1;
      let step = 1000000;
      let cont = 0;
      let leave = false();

      do while leave = false()

          tab:
          First $(step)
          LOAD *
          FROM c:\MyBigFile.txt
          (txt, codepage is 1252, no labels, delimiter is ';', header is $(cont) lines);

          store tab into tab_$(seq).qvd (qvd);

          LET seq = seq + 1;
          LET cont = cont + step;
          LET regs = NoOfRows('tab');
          drop table tab;

          if (regs < step) then
              LET leave = true();
          endif

      loop

       

      Rogério

        • Re: How to optimize when loading large amount of data from txt file
          Marcus Sommer

          Have you tried loading the txt file in one step? The only limit is the maximum amount of RAM, which might not be enough. Also, do you need all the fields from this file? Are they all really fact fields, or are there descriptions alongside some IDs that could be linked in later via dimension tables?

           

          Another thought is to increase the number of rows per chunk to maybe 5M or 10M; 1M isn't really big. The splitting logic could also be changed from record level to content level, with a where-clause on periods or on categories such as products or countries, and/or the data could be split vertically, too.

           

          Also, if you are working across a network, the loads and stores will be faster if you transfer the file to a local hard drive.

           

          - Marcus

            • Re: How to optimize when loading large amount of data from txt file
              Rogerio Faria

              Thanks for the answer Marcus,

               

              Yes, loading in one step is a little faster, but I need the intermediate files: the mid-tier changes sometimes, so I reprocess only the mid-tier, which is fast, and I don't need to read the whole txt file again.

              Yes, I need all the fields (25); this file was created only for QlikView to read.

              Some fields are dimensions, but they are time dimensions, so they cannot be linked in later.

               

              I didn't test it with a different number of rows. I will do so and share the results within a few days.

              I don't know how I could split the files using a dimension, say Category, without taking more time than today. Could you go into detail? Would it be something like this?

              allrows: load * from c:\MyBigFile.txt (...);

              cat1: load * Resident allrows where Category=1; store cat1 into cat1.qvd(qvd) ; drop table cat1;

              cat2: load * Resident allrows where Category=2; store cat2 into cat2.qvd(qvd) ; drop table cat2;

              cat3: load * Resident allrows where Category=3; store cat3 into cat3.qvd(qvd) ; drop table cat3;

              drop table allrows;

               

              And all the files used, for load or store, are local.

              Thanks again...

                • Re: How to optimize when loading large amount of data from txt file
                  Marcus Sommer

                  Yes, I meant something like this: load everything in one step, and then in a second loop store it in slices, maybe by year, as QVDs.
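                  A rough sketch of that two-step approach (assuming the file has a Year field; all names here are hypothetical). Note the NoConcatenate, because the slice has the same fields as the source table and would otherwise be auto-concatenated into it:

                      allrows:
                      LOAD * FROM c:\MyBigFile.txt
                      (txt, codepage is 1252, no labels, delimiter is ';');

                      // one qvd per year, sliced from the resident table
                      for each vYear in FieldValueList('Year')
                          slice:
                          NoConcatenate
                          LOAD * Resident allrows where Year = $(vYear);
                          store slice into year_$(vYear).qvd (qvd);
                          drop table slice;
                      next vYear

                      drop table allrows;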

                   

                  It could also help to add some transformation steps within these loads. For example, if some of those time dimensions are timestamps, splitting each into a date field and a time field, using floor() and frac(), and perhaps dropping the milliseconds, would save a lot of RAM, shrink the QVD file sizes, and could be faster overall. Maybe the same logic can be applied to other fields, too:

                   

                  http://community.qlik.com/blogs/qlikviewdesignblog/2014/03/11/the-importance-of-being-distinct
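                  A minimal sketch of that timestamp split (the field name TS is hypothetical). Rounding the fractional part to 1/86400 of a day truncates everything below whole seconds, which sharply reduces the number of distinct values:

                      tab:
                      LOAD
                          // integer part of a timestamp is the date
                          Date(Floor(TS)) as Date,
                          // fractional part is the time of day, rounded to whole seconds
                          Time(Round(Frac(TS), 1/86400)) as Time
                      FROM source.qvd (qvd);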

                   

                  - Marcus

                  • Re: How to optimize when loading large amount of data from txt file
                    Peter Turner

                    Hello Rogerio,

                     

                    If you are able to load that large txt file into RAM in one step, the splitting part can be done faster like this:

                    Loading MyBigFile.txt into QV table Allrows,

                    Perform any transformation/modelling here if possible

                    Store Allrows into MyBigFile.qvd, then drop table Allrows.

                     

                    Then load the large QVD using a where exists() on your Category filter field to get a smaller subset of the file.

                    This is an optimised QVD load, and you can then store the subset back into a smaller QVD (cat1.qvd) if needed.

                     

                    The advantage of this over a resident load for each of your filter categories is that an optimised QVD load is faster than a resident load, and a resident load also needs extra RAM to hold the copy of the data.
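                    A sketch of that pattern (table and field names are hypothetical). The one-field inline table pre-loads the wanted Category values, so the following QVD load with a bare exists(Category) and no other clauses stays optimised:

                        // one-field filter table holding the wanted category value(s)
                        Filter:
                        LOAD * INLINE [
                        Category
                        1
                        ];

                        // optimised QVD load: a bare exists() on a single field
                        // keeps the load optimised
                        cat1:
                        LOAD * FROM MyBigFile.qvd (qvd)
                        where exists(Category);

                        store cat1 into cat1.qvd (qvd);
                        drop tables Filter, cat1;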

                    Hope that helps!