evan_kurowski
Specialist

Adding ETL to a .QVD LOAD statement decreases load times compared to (qvd optimized)

Hello Qlik community,

Up to this point, the premise that optimized loads are faster than non-optimized loads has always held true.  From my understanding, the only ETL syntax compatible with optimization is WHERE EXISTS() with a single field argument, plus aliasing of any field except the one involved in WHERE EXISTS().  Has this changed?
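For clarity, here is a minimal sketch of the distinction I mean (the field and file names are invented for illustration, not taken from our actual model):

// Stays (qvd optimized): plain field list, aliasing, and a single-field WHERE EXISTS()
// (assumes CustomerID values have already been loaded earlier in the script)
Customers_Opt:
LOAD CustomerID,
     CustomerName as Name
FROM Customers.qvd (qvd)
WHERE EXISTS(CustomerID);

// Forces a non-optimized load: any expression applied to a field
Customers_NonOpt:
LOAD CustomerID,
     Upper(CustomerName) as Name
FROM Customers.qvd (qvd);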

I've encountered .QVDs which, in side-by-side testing, take less time to load when ETL statements are added to the script.  A sample set of about 19 million rows loaded optimized in about 2 minutes, and normally, when optimization is deactivated, that load time might expand to around 10 minutes.  In this case, adding ETL produced shorter times (in the example, Upper(), KeepChar() & a string concatenation were added as "deactivating" syntax).

Does anyone know what kind of circumstances could reverse load performance and make (qvd optimized) the non-optimal result?

9 Replies
Gysbert_Wassenaar

evan_kurowski
Specialist
Author

Hello Gysbert,

Thank you for your response.  The two articles you referenced reinforce my understanding of optimized vs. non-optimized loads, and I agree with all of the premises outlined in them.  What has happened is that I have encountered a situation where they did not hold true.

We took the same .QVD, loaded the field list without any ETL expressions or syntax whatsoever, and during reload it showed the '(qvd optimized)' message.

We then took the same script, added ETL syntax to a few fields to deliberately deactivate optimization.  The output window showed a non-optimized load with a faster time than the optimized one.

marcus_sommer

Which ETL syntax have you applied? Maybe there are more undocumented functions which don't break the optimized logic but aren't recognized as such.

Further, I could imagine that if the storage delivers the data more slowly than QlikView can process it, including some ETL overhead, then a non-optimized load needn't be slower than an optimized load.

- Marcus

evan_kurowski
Specialist
Author

I was showing a junior developer the whole "Let me demonstrate the awesome power of optimized load" spiel that I've done a dozen times in the past (and that has always worked).  We were working in a brand new app, a one-off created solely to demonstrate this concept.

The original .QVDs had about 200 fields with pretty high cardinality.  Most examples of optimization had operated as expected up to that point, but this particular session did not.


Just to make sure I was adding enough ETL to truly de-optimize, I added syntax to 3 fields like so:

Upper([Field1]) as [Field1],
Left([Field2], 4) & 'zzz' as [Field2],
[Field3] & '_' & [Field4] as [Newfield],
...

In my QlikView experience up to this point, anything beyond aliasing or WHERE EXISTS() with a single field argument has deactivated the optimized load and always resulted in longer load times.  Going by past performance, with the optimized load of our 20 million rows coming in at about 2:30, I was expecting north of 10 minutes for the non-optimized version.  Instead, adding the syntax above shaved off about 20 seconds.  Am I on candid camera?

marcus_sommer

This sounds very strange. Maybe hic could give some insight into what is going on.

- Marcus

Gysbert_Wassenaar

Did you load the exact same qvd file from the same location? And if so, are you sure that no caching of that data occurs anywhere, either by the OS or by the hardware of your system (like the hard disk)?


talk is cheap, supply exceeds demand
evan_kurowski
Specialist
Author

Yes, it was the same .QVD, sourced from a hard disk on the local machine, using the Desktop to perform reloads (so network and server issues should be non-factors). We took a blank app, used the 'Table Files' wizard to script a .QVD load, and on back-to-back reloads performed the optimized load first, modified the script to de-optimize, and ran again.

Thinking the first example was an anomaly, we repeated the test on a 20 million row QVD, a 2 million row QVD, and a 500k row QVD.  Each time the non-optimized versions were coming in with script completion times approximately 10% shorter.

 

As far as caching goes, if there's a caching technique that can write QlikView data to hard disk and retrieve it faster than an optimized QVD load, doesn't that imply that .QVDs have something to learn from this technique? (i.e. the caching technique would be a more optimal way of moving QlikView data than a QVD)

Or are you referring to persistently caching result sets in RAM the way a .QVW does? (so that the data model isn't clearing out on full reloads?)  I'm not aware of how to control that kind of caching.

Still, that would not make sense.  The ETL syntax being added was random, and we kept switching it up, adding more complexity to "prove" non-optimized was slower (which wasn't happening).  If we were forming new fields on the fly, there's no way a cached record set could have anticipated these new fields and had them prepared and ready.


Gysbert_Wassenaar

Your hard disk might be caching the data in its onboard cache, assuming it has one, as most do these days. Or your OS may be caching the data so that a second request for it can be answered without having to access the disk.

You can eliminate such effects by running the test lots of times. First do the optimized load 12 times, discard the slowest and fastest runs, and average the remaining 10 runs. Then do the same for the unoptimized load. Then compare the averages of the optimized and unoptimized loads. This could be done with a batch file that starts qv.exe from the command line to reload the document, so that every run uses a fresh instance of QlikView.
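For what it's worth, you can also record timings from inside the script itself (just a rough sketch with placeholder file and field names; it does not replace the batch file approach, since the same qv.exe instance and OS cache are reused between the two loads):

// Record start and end times around each load and write them to the script log
Let vStart = Timestamp(Now(1));

Optimized:
LOAD * FROM Sample.qvd (qvd);              // should report (qvd optimized)

Let vEnd = Timestamp(Now(1));
TRACE Optimized load: started $(vStart), finished $(vEnd);

DROP TABLE Optimized;

Let vStart = Timestamp(Now(1));

NonOptimized:
LOAD *, Upper([Field1]) as [Field1Upper]   // any expression forces a non-optimized load
FROM Sample.qvd (qvd);

Let vEnd = Timestamp(Now(1));
TRACE Non-optimized load: started $(vStart), finished $(vEnd);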


talk is cheap, supply exceeds demand
Not applicable

I wonder, could these statements be changing the sizes of your symbol tables? I'm specifically referring to the change to upper case and the substring. These could both be reducing the number of distinct values you need to support (as you've mentioned, you have high cardinality in the data).

The Upper() function would effectively reduce AbCd, aBcD, and abcD to just ABCD, thus reducing the size of the symbol table generated for that field.

Similarly, the Left() substring you're using might also result in fewer distinct values, reducing the size of your symbol tables.

These might be the reason you're seeing a faster load even though it's not optimized.
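As a rough illustration of the idea (made-up values, not your actual data):

// Five raw values collapse to two distinct symbols once Upper() is applied,
// so the symbol table for the derived field is smaller.
Example:
LOAD Raw,
     Upper(Raw) as Uppered
INLINE [
Raw
AbCd
aBcD
abcD
xyz
XYZ
];
// Distinct values in Raw:     5
// Distinct values in Uppered: 2  (ABCD and XYZ)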

Of course, I'm just guessing here, but thought it might be useful.