I'm in the process of creating an incremental loading framework based upon the Rob Wunderlich Cookbook: http://robwunderlich.com/downloads/
My framework makes it easy to incrementally load files automatically from a database and includes logging features as well. However, in my testing I have come across an anomaly. It seems that when I load part of a QVD into a QVW to compare for the incremental load, the smaller the portion of the QVD that I load, the longer the load takes.
The server I am testing with is a two quad-core processor server with 32GB of RAM and locally attached 15K SCSI drives in RAID 5, and the QVD file is 1.51GB.
The query is very basic.
FactTable:
Load
…Fields…
From FactTable.qvd (qvd);
The field I am using in my Where clause is a date field, but I don't think that should matter. The selective load format would be like this:
FactTable:
Load
…Fields…
From FactTable.qvd (qvd)
WHERE (Date < '1/1/2011');
Each test was performed 3 times and I'm presenting the average time in seconds to load.
Method | Seconds
---|---
Full QVD Load (Optimized) | 55
Extract 100% non-optimized | 137
Extract 90% with Where | 162
Extract 50% with Where | 207
Extract 10% with Where | 270
This makes no sense to me. If anything, I would think that the "smaller" extractions would be faster or at least no slower.
My first question would be: does anyone know why QlikView loads the larger extracts faster than the smaller extracts?
I then performed a 10% extract with a resident load. This involved a full optimized load, then loading the selected records from the in-memory resident table into another table, and finally dropping the original table. That method took the most time at 313 seconds. Also, the behavior during this load was consistent and "strange."
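In script terms, that test looked roughly like this (just a sketch, using the same simplified field list as above):

TempFact:
Load
…Fields…
From FactTable.qvd (qvd); // full optimized load of the QVD

FactTable:
NoConcatenate Load
…Fields…
Resident TempFact
WHERE (Date < '1/1/2011'); // keep only the selected records

DROP TABLE TempFact; // drop the original full table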
So this raises another question: if all of this table manipulation takes place in memory, why does it take QlikView 4 minutes to create the new smaller table and then drop the original, while it only takes 1 minute to load the entire table from disk?
Any help or advice on what to research and test would be greatly appreciated.
I don't know why the larger extracts are faster than the smaller ones, but you should be using an optimized load for this data. Implement the where statement in some other way. For your example, I'd do something like this:
FactTable:
LOAD date(makedate(2011,1,1)+1-recno()) as Date
AUTOGENERATE 10000 // or whatever is appropriate for your data set
;
INNER JOIN (FactTable)
LOAD
...Fields...
FROM FactTable.qvd (QVD)
WHERE exists(Date)
;
That approach keeps the load optimized, but it only handles one filter, which should be your most restrictive one. To apply additional filters, try doing them as inner joins after the main load, as in the sketch below. That tends to be much faster than loading from a resident table with a where clause. Loading from a resident table is often slower than an optimized load. With an optimized load, the entire data structure of your table is already on disk; there's nothing to generate. It just slams what's on disk into memory. A resident load builds a new table, generating whatever it is that QlikView needs to generate to support that table, and that's often slower. I don't know the details because they're proprietary, but that's the behavior we've observed.
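For example, if you also needed to filter on a second field (Region here is purely a made-up example), you could keep the main load optimized and apply the extra filter afterwards:

INNER JOIN (FactTable)
LOAD * INLINE [
Region
East
West
]; // keeps only the FactTable rows whose Region matches one of these values

The inner join drops any rows that don't match, so it behaves like a where clause applied after the fact.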
Thank you for your response, John, I greatly appreciate it. However, it doesn't help me in this situation, as my incremental loads may not be by date.
Since I'm creating a framework based upon Rob's Cookbook, making the where clause with a date load faster isn't really the issue, since dates are not typically primary keys. The basic code which Rob uses in the cookbook to update is:
Directory;
// If incremental reload was enabled, load previous data and concatenate to data just read.
IF $(vQvdExists) THEN
// Use CONCATENATE in case we've added any new fields.
CONCATENATE ($(vTableName)) LOAD * FROM $(vQvdFile) (qvd)
WHERE NOT exists($(vPK)) // Load only QVD rows that were not already loaded in the data load.
;
END IF
I'm trying to make the framework run as quickly as possible so I was trying to figure out why some of the QVD files took so long to load.
I guess I'm not sure where to go from here unless there is a surefire way of editing my script so it loads faster no matter what data type the primary key is.
Well, there's probably no surefire method that works for every possible case.
That said, I suspect that MOST incremental loads can work by date. If you have a daily load, you want to load all changes since yesterday. A database that changes will typically timestamp the changes. Then you load changed records, say >= yesterday on the change timestamp, and then you load from the QVD, < yesterday on the change timestamp (and for performance, using the date approach I mentioned). There's then no need to refer to primary keys in a not exists, so you won't need to kill the optimized load. Alternatively, you may be able to store the exact time of the previous load, only load forward from there, and load EVERYTHING from the QVD, which can obviously be an optimized load. Rob's template is not the only way to do incremental loads (it looks like it's "Case 3" in the help file, which I think could typically be modified as I mentioned).
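As a rough sketch of that last alternative, assuming records are only ever inserted (never updated) and using placeholder names for the table, fields, and variables:

// Assumes a CONNECT earlier in the script, and that vLastReload was stored at the end of the previous run
FactTable:
SQL SELECT *
FROM FactTable
WHERE CreatedDate >= '$(vLastReload)'; // adjust the date quoting/format to your database

CONCATENATE (FactTable)
LOAD * FROM FactTable.qvd (qvd); // no where clause, so this stays optimized

STORE FactTable INTO FactTable.qvd (qvd);
LET vLastReload = timestamp(now()); // remember for the next run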
And THAT said, if you're doing an incremental load, you'd normally expect to be loading, say, 99% or more of the data from the QVD. If you're only loading a small amount of data from the existing QVD, like 10%, why bother? It's simpler and quite possibly faster to just load ALL data from the original source.
I think the help file does a pretty good job of explaining incremental loads. Search for "incremental", and you should see "Using QVD files for incremental load". It doesn't cover every case, such as my suggestion of generating a list of dates to load rather than using a condition on the QVD, but it gives the basic ideas. You may have already read it, of course.
Perhaps I wasn't clear. I do realize that often the changed records are selected from the database based upon date. However, please look at Rob's code.
Directory;
// If incremental reload was enabled, load previous data and concatenate to data just read.
IF $(vQvdExists) THEN
// Use CONCATENATE in case we've added any new fields.
CONCATENATE ($(vTableName)) LOAD * FROM $(vQvdFile) (qvd)
WHERE NOT exists($(vPK)) // Load only QVD rows that were not already loaded in the data load.
;
END IF
The Where clause in his cookbook is based upon the primary key of the database table, which is not going to be a date. Therefore, incremental extraction from a QVD, and improving that extraction through the use of a date field, isn't really pertinent here.
I could have simply used an identity column instead and said
Where Identity_Column >= 800000
The point is that QV behaves strangely based upon the number of records which are actually extracted from a QVD. I was hoping that someone could tell me why.
Or suggest a faster way of doing that incremental extraction from the QVD using a primary key value, which is typically some form of integer type but could often be a character string as well.
I appreciate your help, but focusing on the date aspect of this incremental load isn't going to lead me to a solution to my problem which is how to get the framework to perform optimally.
Yeah, never mind, I'm being stupid. You can't use dates for the QVD load because the QVD doesn't know the record has been modified, so you'd still pull it in even when looking for just old modification dates. You could do an inner join with primary key and max(ModDate) after the fact, but chances are that that would be slower than just doing an unoptimized load in the first place. (It might be worth testing, of course.)
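If you did want to test it, the after-the-fact join would look something like this (a sketch; %PK and ModDate are placeholder names, and it assumes the key plus ModDate identifies the newest version of each row):

// After concatenating the changed rows from the database and everything from the QVD,
// keep only the newest version of each key
INNER JOIN (FactTable)
LOAD
%PK,
max(ModDate) as ModDate
RESIDENT FactTable
GROUP BY %PK;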
As far as making the framework perform optimally, it's just going to be slow since it isn't using an optimized load. However, you could in many cases avoid the initial load to get the max(ModDate). Instead, if you have a daily load, say, give yourself a day or two extra for safety, and do this:
LET vIncrementalExpression = 'WHERE ModDate >= ' & date(today()-2);
It'll load more data from your data source than the max(ModDate) approach, but the time spent there may be minor compared to the time saved avoiding the initial QVD load. Worth testing, in any case.
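The variable would then just be expanded in the source extract, something like this (the table name is a placeholder, and your database will likely want the date quoted or formatted differently):

FactTable:
SQL SELECT *
FROM FactTable
$(vIncrementalExpression); // expands to: WHERE ModDate >= <two days ago>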
For the incremental load of our largest QVD, we have a bunch of partial QVDs based on date that we combine together at the end. Each separate QVD is incrementally loaded in a loop. It looks something like this (oversimplified, but giving the basic idea):
for each QVD
    load from database any records changed since previous run
    if number of rows retrieved > 0
        concatenate from partial QVD where not exists(key)
        store table into partial QVD
    end if
    drop table
next
load all partial QVDs into same table
store table into final QVD
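In actual QlikView script, the loop looks roughly like this (heavily simplified, with placeholder names such as Fact_$(iYear).qvd, %PK, and vLastRun; one partial QVD per year is just for illustration, and the first run that creates the partial QVDs is omitted):

FOR iYear = 2005 TO year(today())
    Changes:
    SQL SELECT * FROM FactTable
    WHERE ChangeDate >= '$(vLastRun)' AND ChangeYear = $(iYear);

    IF NoOfRows('Changes') > 0 THEN
        CONCATENATE (Changes)
        LOAD * FROM Fact_$(iYear).qvd (qvd)
        WHERE NOT exists(%PK); // only rows not already picked up from the database
        STORE Changes INTO Fact_$(iYear).qvd (qvd);
        DROP TABLE Changes; // also clears the %PK values so exists() starts fresh next pass
    END IF
NEXT

FactTable:
LOAD * FROM Fact_*.qvd (qvd); // partial QVDs auto-concatenate into one table
STORE FactTable INTO FactTable.qvd (qvd);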
Now, we did this out of necessity back when we were running on 32-bit, and didn't have the memory to load the whole thing at once. But I believe this is also paying dividends on performance. We have years and years of data, and changes to older data are increasingly rare. So much so that typically, only the most recent partial QVD is being reloaded, and the rest are just being ignored. So on a typical run, we're only doing an unoptimized load on a fairly small QVD, the most recent partial QVD. I suspect it would only slow the load down if we combined it all together now that we have 64-bit and plenty of memory.
So as with other ideas, something like this might be worth testing.
Edit: We have an index on the database to support "changed since previous run". Otherwise, breaking it into pieces like this would kill performance, since we'd be table-scanning the database multiple times.