Skip to main content
Announcements
Have questions about Qlik Connect? Join us live on April 10th, at 11 AM ET: SIGN UP NOW
cancel
Showing results for 
Search instead for 
Did you mean: 
Fernando_Sanchez
Partner - Contributor III
Partner - Contributor III

Any better way to split a huge table into differente qvds

Hi all!

I am sure this topic has been discussed tons of time but can´t find a proper answer.

Imagine you have table with Millions of registers (for example CDRs calls) and you want to split it into different qvds each of if for a certaing day.

Right now, what I am doing is a simple loop and store, something like this:

 

FOR vDay=  $(vDayMin) to $(vDayMax)

//Load original file
NoConcatenate
CDRS:
load *
from ..\work\CDRS.qvd (qvd)

//Load files for vDay
NoConcatenate
CDRS_DATE:
load *
Resident CDRS
where day = $(vDay)

//Store day file
store CDRS_DATE into ..\output\$(vDay)_CDRS.qvd (qvd)

drop table CDRS_DATE;

//Decrease original file without already stored registers

NoConcatenate
CDRS_PENDING:
load *
Resident CDRS
where day > $(vDay)

drop table CDRS;

store CDRS_PENDING into ..\work\CDRS.qvd (qvd);

drop table CDRS_PENDING;

next;

 

 

With this code, everything works fine, but it taked too long. CDRS original file might be over 100 million registers.

 

Is there anyway to do it better? some way to iterate vDay within CDRS table only once and "create" on memory CDRS_DAY tables and afterwards make a loop and store all of them?

 

Thanks a lot in advance.

 

Regards.

 

Labels (1)
6 Replies
Vegar
MVP
MVP

You are reading from and storing to file more than neccessary in your solution. What about keeping your initial load throughout the whole process? Like this?

 

//Load original file
CDRS:
load *
from ..\work\CDRS.qvd (qvd);

for each vDay in fieldvaluelist('Day')

  CDRS_DATE:
  NoConcatenate load *
  Resident CDRS
  where day = '$(vDay)';

  //Store day file
  store CDRS_DATE into ..\output\$(vDay)_CDRS.qvd (qvd);
  drop table CDRS_DATE;
next vDate

drop table CDRS;

 

Fernando_Sanchez
Partner - Contributor III
Partner - Contributor III
Author

Hi Vegar!

thanks for the answer!

that would be the straightest way, but beware CDRS is about hundred millions lines, so every time you would search vDay against those millions lines.

CDR_PENDING is there to decrease pending registers everytime you loop vDay (store and load qvds is very fast).

With the amount of data this solution is much more efficient.

My question is if theres any way to split a table based in a field without using loop.

thanks!

Vegar
MVP
MVP

I think you would need to loop in some way, but you should try to ensure an optimized load of your qvd file in order to save time. 

If you have trouble holding/handling the amount of data in my previous suggestion then you might consider to never load all the values at once. Consider this suggestion:

 

 

FOR vDay= $(vDayMin) to $(vDayMax)
  DayScope:
  LOAD
 
    '$(vDay)' as Day
  ];  
 

  //Optimized loag of single day data from qvd file
  CDRS_DATE:
  load *
  from ..\work\CDRS.qvd (qvd)
  where
     exists(Day) //Exists does not break optimized load
   ;
 

  Drop table DayScope;


  //Store day file
  store CDRS_DATE into ..\output\$(vDay)_CDRS.qvd (qvd);
  drop table CDRS_DATE;
 next vDate

Fernando_Sanchez
Partner - Contributor III
Partner - Contributor III
Author

Thanks Vegar, 

unfortunatly the solution needs to loop and check within the table everytime we iterate a day.

My question is if there is any possible way to do it without looping 😞

marcus_sommer

I don't think that this could be done without a loop - at least not if you start with such qvd. Maybe this split might be better done in beforehand? And if this isn't very practically a split on a weekly/monthly level could reduce the afterwards efforts.

Compared with your origin method and using a classical where-clause should the where exists() suggestion from @Vegar perform much better because now the loads would be optimized. Especially if you there only perform the daily split and not the decrease of the CDRS_PENDING because this could be done after the loop because a exists-filter-field could be created twice - one time for the daily split which is dropped again in each iteration and the second time each day is concatenated in the field and after the loop is a where not exists() performed (the second field needs of course another fieldname which is then appropriate renamed before the reducing-load.

Further you may speed up this task by doing the split- and reducing-loads on a resident table and not loading from a qvd. AFAIK think the most people that an optimized qvd-load is faster as an optimized resident-load but I'm not so sure that's really true especially not if the storage or the network is rather slow and a bigger bottleneck as RAM or CPU. IMO it's a try worth.

- Marcus 

Fernando_Sanchez
Partner - Contributor III
Partner - Contributor III
Author

Great answer Marcus.

Definititly would give a try to @Vegar proposed solution.

The preload of a qvd was made up for the example. In the real world I have to upload a bunch of csv files.

I would give a run in paralel what is better and consumes less time.

Thanks a lof for your kindness.

Regards.