markperrone
Contributor III

Odd number when loading QVDs

We're loading a collection of daily QVDs, scrubbing each one and appending it into a final QVD. After appending, we drop the temp table. Inside the loop...

//optimized load
TempTable:
NOCONCATENATE
Load
$(vBaseFieldList)
From $(vDailyFile) (qvd)
Where EXISTS(FILTER);

//scrub data using resident load
IF firstfile = 1 THEN
FinalTable:
NOCONCATENATE
Load
$(vFieldList)
RESIDENT TempTable;
LET firstfile = 0;
ELSE
CONCATENATE(FinalTable)
Load
$(vFieldList)
RESIDENT TempTable;
END IF
DROP TABLE TempTable;

Something unusual starts happening after 150 or so files. The row count shown while loading the QVD (123,100,000 below) is wrong. Once this number shows up, it keeps getting higher and the script starts slowing down.

10/7/2022 8:32:46 AM [lib://OurData\DailyFile__20211201.qvd] 160 of 387
TempTable << RQ039__20211201
(QVD (row-based) optimized)
Lines fetched: 256,638
FinalTable << TempTable
Lines fetched: 19,106,104

TempTable << RQ039__20211202
(QVD (row-based) optimized)
123,100,000

Then it shows normally again (overwriting the 123,100,000):
Lines fetched: 84,198
FinalTable << TempTable
Lines fetched: 19,190,302

8 Replies
JustinDallas
Specialist III

I'm thinking something might be going on in your loop concerning the CONCATENATE statement. If the field list of table Foo has the same names as the field list of table Bar, then the concatenation will happen automatically.

So I think your table is somehow being concatenated to itself, over and over again, and that is causing the row count explosion.
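
Just to illustrate what I mean (table and field names here are made up): two loads with identical field lists silently end up in one table.

// Hypothetical illustration only.
// Both loads have the identical field list, so the second load is
// auto-concatenated into Foo; no table named Bar is ever created.
Foo:
Load * Inline [
Key,Value
1,A
2,B
];

Bar:
Load * Inline [
Key,Value
3,C
];

Trace Foo now has $(=NoOfRows('Foo')) rows;   // 3 rows, not 2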

When I have to do what you're doing, I start my table with a dummy column that I later drop.

[Starter Table]:
Load * Inline
[
Dummy Column
1
];

//Loop and concatenate to the [Starter Table] here

Drop Field [Dummy Column] From [Starter Table];
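
With the starter table already in place, every load inside the loop can be a plain CONCATENATE([Starter Table]), so you don't need any first-file branching at all.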
markperrone
Contributor III
Author

So you're saying: create the table with a dummy field, then concatenate to it and drop the temp table in each loop iteration?

Interesting. I'll have to give that a try.

Originally, we would loop to build the "temp" table and then do a resident load of everything afterwards. That runs the risk of two very large tables stressing memory, but the row count explosion never happened.

do while ...

if firstfile = 1 then
   temptable:
   noconcatenate
   load $(basefieldlist) from $(qvdfile) (qvd);
   let firstfile = 0;
else
   concatenate(temptable)
   load $(basefieldlist) from $(qvdfile) (qvd);
end if

loop

finaltable:
load $(fieldlist) resident temptable;
drop table temptable;
marcus_sommer

I suggest changing the implicit file-wildcard load loop to an explicit filelist() loop, like:

for each file in filelist('$(qvdfile)')
   ... firstfile logic ...
   t: load ... from [$(file)] (qvd);
   let x = $(x) - noofrows('t'); let y = noofrows('t');
   Trace $(y) ... $(x);
next

With such logic you could control each single load by checking all kinds of information and maybe branching into various measures.
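
Filled out a bit, it might look like the following - vFirstFile and the path are only placeholders, while $(vBaseFieldList), $(vFieldList) and FILTER are taken from your script:

// rough sketch only - names and paths are placeholders
SET vFirstFile = 1;

FOR EACH vFile IN FileList('lib://OurData\DailyFile__*.qvd')

   TempTable:
   NOCONCATENATE
   LOAD $(vBaseFieldList)
   FROM [$(vFile)] (qvd)
   WHERE EXISTS(FILTER);

   TRACE $(vFile) -> $(=NoOfRows('TempTable')) rows;

   IF $(vFirstFile) = 1 THEN
      FinalTable:
      NOCONCATENATE
      LOAD $(vFieldList) RESIDENT TempTable;
      SET vFirstFile = 0;
   ELSE
      CONCATENATE(FinalTable)
      LOAD $(vFieldList) RESIDENT TempTable;
   END IF

   DROP TABLE TempTable;

   TRACE FinalTable -> $(=NoOfRows('FinalTable')) rows;

NEXT vFile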

- Marcus

vinieme12
Champion III

If all your DailyFiles have the same structure, i.e. the same columns, you won't even need to loop through all the files.

You can simply make use of auto-concatenation:

FinalTable:
Load * From [lib://OurData\DailyFile__*.qvd] (qvd);

Vineeth Pujari
If a post helps to resolve your issue, please accept it as a Solution.
markperrone
Contributor III
Author

Would love to do that, but there are other files in the folder. We only want to load the files from the last 13 months.

marcus_sommer

Within the above suggested filelist() loop, such things are quite easily possible by extracting the relevant information from the file name, looking at the file time, or similar measures.
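
For example (purely a sketch - the DailyFile__YYYYMMDD.qvd naming is taken from your log, everything else is a placeholder):

// only process files whose embedded YYYYMMDD date is within the last 13 months
LET vCutoff = Num(MonthStart(AddMonths(Today(), -13)));

FOR EACH vFile IN FileList('lib://OurData\DailyFile__*.qvd')

   // pull the YYYYMMDD part out of e.g. ...DailyFile__20211201.qvd
   LET vFileDate = Num(Date#(TextBetween('$(vFile)', '__', '.qvd'), 'YYYYMMDD'));

   IF $(vFileDate) >= $(vCutoff) THEN
      TRACE loading $(vFile);
      // ... load / concatenate this file as in the loop above ...
   END IF

NEXT vFile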

- Marcus

markperrone
Contributor III
Author

Tried that, but the numbers are correct. Wish you could see the script running. Originally, we did an optimized load of all files and the resident load after the loop, but this taxed memory with 500M rows in each table. When we monitored that load, the odd row count never appeared.

One thing we considered was loading in chunks and storing each chunk as a QVD. After the loop, simply read all the chunks back.
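
Roughly, the idea would be something like this (chunk size, paths and variable names here are just placeholders, and the resident scrub step is left out for brevity):

// sketch only - chunk size, paths and variable names are made up
SET vChunkNo = 1;
SET vFilesInChunk = 0;
SET firstfile = 1;

FOR EACH vFile IN FileList('lib://OurData\DailyFile__*.qvd')

   IF $(firstfile) = 1 THEN
      FinalTable:
      NOCONCATENATE
      LOAD $(vBaseFieldList) FROM [$(vFile)] (qvd) WHERE EXISTS(FILTER);
      SET firstfile = 0;
   ELSE
      CONCATENATE(FinalTable)
      LOAD $(vBaseFieldList) FROM [$(vFile)] (qvd) WHERE EXISTS(FILTER);
   END IF

   LET vFilesInChunk = $(vFilesInChunk) + 1;

   // every 50 files, flush the accumulated rows into a chunk qvd
   IF $(vFilesInChunk) >= 50 THEN
      STORE FinalTable INTO [lib://OurData/Chunks/Final_Chunk_$(vChunkNo).qvd] (qvd);
      DROP TABLE FinalTable;
      LET vChunkNo = $(vChunkNo) + 1;
      SET vFilesInChunk = 0;
      SET firstfile = 1;
   END IF

NEXT vFile

// whatever is left after the loop goes into the last partial chunk
IF $(firstfile) = 0 THEN
   STORE FinalTable INTO [lib://OurData/Chunks/Final_Chunk_$(vChunkNo).qvd] (qvd);
   DROP TABLE FinalTable;
END IF

// then read all the chunks back with a single wildcard load
FinalTable:
LOAD * FROM [lib://OurData/Chunks/Final_Chunk_*.qvd] (qvd);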

marcus_sommer

Coming back to your original description, I'm not sure that there are two issues; it might be just one, and the others are an after-effect. Meaning: first you hit some limitation, for example not having enough available RAM, then the performance slow-down starts, and with it also the display of the wrong record number.

You could now check whether a lack of RAM is really causing your issue and whether you are able to increase this resource.

Besides this, I suggest checking whether you really load only the relevant information from the QVDs (there might be more fields and records than needed) and whether this information is stored properly (no record counters, timestamps, formatted values and similar stuff) to minimize the RAM footprint. Also thinkable is applying another kind of data layer, as extra data or as a replacement - this means storing the data as periodic data per YYYYMM or similar. Another option is to split your ETL into several steps to avoid loading too much data within a single application. Of course it will add some overhead, but each single step is simpler, easier to develop and maintain, simpler to troubleshoot, and can run in different time frames or in parallel ...
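
As a rough illustration of the periodic data layer (paths and table names are placeholders; the DailyFile__YYYYMMDD naming, $(vFieldList) and FILTER are taken from your script and log):

// build (or rebuild) the qvd for a single period, e.g. the previous month
LET vPeriod = Date(AddMonths(Today(), -1), 'YYYYMM');

MonthTable:
LOAD $(vFieldList)
FROM [lib://OurData\DailyFile__$(vPeriod)*.qvd] (qvd)
WHERE EXISTS(FILTER);

STORE MonthTable INTO [lib://OurData/Monthly/Final_$(vPeriod).qvd] (qvd);
DROP TABLE MonthTable;

The final application then only needs to load the 13 relevant monthly QVDs instead of several hundred daily ones.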

- Marcus