nafi_qlikview
Partner - Contributor II

Load data from QVD more efficiently using non-unique key

Hi,

I've got the following tables:

Table 1:

TABLE1_KEY   FIELDS1
1            ...
2            ...

Table 2:

TABLE1_KEY   FIELDS2
1            ...
1            ...
1            ...
2            ...
2            ...

Table 2 is the one this question is about. It contains ~1 billion rows, and I'm having trouble processing the merge after an incremental load. Due to the data volume I can't store the real primary key (i.e. TABLE2_KEY), as I'm restricted on memory resources. The incremental database load is fairly quick, but merging the new data into the existing data is painfully slow.

Say we have extracted new data for Table 2 from the database and old data stored in a QVD.

  • Some rows that already exist in the QVD have been updated and are therefore coming through again in the new data
  • If an update occurs, all rows in Table 2 with the same TABLE1_KEY will have been updated

QVD Data:

TABLE1_KEY   FIELDS2
1            ...
1            ...
1            ...
2            ...
2            ...

New Data:

TABLE1_KEY   FIELDS2
2            ...
2            ...
3            ...
3            ...

Result should be:

TABLE1_KEY   FIELDS2   Source
1            ...       old
1            ...       old
1            ...       old
2            ...       new
2            ...       new
3            ...       new
3            ...       new

This is what I'm currently doing to create the result table:

NewData:
LOAD TABLE1_KEY
     ,TABLE1_KEY AS TABLE1_KEY_COPY
     ,FIELDS2
FROM database;

Concatenate(NewData)
LOAD TABLE1_KEY
     ,FIELDS2
FROM QVD
WHERE NOT Exists(TABLE1_KEY_COPY, TABLE1_KEY);

I have to take a copy of TABLE1_KEY and do the Exists() this way, because if I used the same field TABLE1_KEY in Exists(), QlikView would only load the first record from the QVD for each TABLE1_KEY and ignore all the others (the value is added to the lookup list as soon as the first record is loaded).

Exists(TABLE1_KEY_COPY, TABLE1_KEY) is what causes the load to become un-optimized. If I could store TABLE2_KEY and use Exists(TABLE2_KEY) it would be optimized and fast, but as mentioned before, that's not possible.
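
For anyone who wants to reproduce the single-field Exists() behaviour I described above, here is a minimal inline-data sketch (the values match the example tables; this is just an illustration, not my actual load):

NewData:
LOAD * INLINE [
TABLE1_KEY, FIELDS2
2, new
2, new
3, new
3, new
];

// With the single-field test, the first old row with key 1 passes NOT Exists(),
// which adds the value 1 to TABLE1_KEY, so the remaining rows with key 1 are dropped.
Concatenate(NewData)
LOAD * INLINE [
TABLE1_KEY, FIELDS2
1, old
1, old
1, old
2, old
2, old
]
WHERE NOT Exists(TABLE1_KEY);

Only one of the three old rows with key 1 survives here. With Exists(TABLE1_KEY_COPY, TABLE1_KEY) all three come through as required, but then the load is no longer optimized.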




Now my questions:

  1. Why is an Exists() lookup across two fields slower than (and different from) a lookup in a single field? To my mind this is just a simple hash lookup, and I can't see why checking against a differently named field would make it slower. Is this something QlikTech could potentially improve?
  2. Can you think of any better way to process the merge?
  3. The uncompressed QVD is currently 18 GB and fits well into memory. Adding the distinct TABLE2_KEY values for some reason increases memory usage massively, even though the resulting QVD size grows only by the amount I would expect (~1 billion × 8 bytes ≈ 8 GB). Can anyone explain this to me?

Any input would be much appreciated!

Thanks,

Martin

1 Reply
nafi_qlikview
Partner - Contributor II
Author

OK, here are two things I've noticed while investigating further, neither of which I can really explain:

  • In the above example, I've created a copy of TABLE1_KEY within Table 2 and then used the Exists() function on it. Even though the newly concatenated rows don't add any new values to this field (it should only be NULL for them, although I'm not sure whether NULL even becomes part of the value list in QV), the time the lookup takes increases with the number of rows in the old data. I would have expected this to be a simple hash lookup against the value list, so each comparison should take O(1) and the concatenated rows from the old data shouldn't have any impact on it.


To test this, I've instead created the copy of TABLE1_KEY in a separate table so that it doesn't have anything to do with the concatenation:


NewIDs:
LOAD TABLE1_KEY AS TABLE1_KEY_COPY
RESIDENT NewData;


This performed about 5 times faster than my previous approach.
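
For completeness, this is roughly what the whole merge looks like with the separate lookup table (FROM database / FROM QVD are placeholders as in my first post, and the DISTINCT is optional but keeps the helper table small):

NewData:
LOAD TABLE1_KEY
     ,FIELDS2
FROM database;

// Small helper table that holds only the keys which arrived in the new data.
// It shares no field name with NewData, so it doesn't get associated with it.
NewIDs:
LOAD DISTINCT TABLE1_KEY AS TABLE1_KEY_COPY
RESIDENT NewData;

// Old rows are kept only if their key did not come through in the new data.
// The two-field Exists() still makes this QVD load non-optimized, but for me
// it runs about 5 times faster than keeping the copy inside the big table.
Concatenate(NewData)
LOAD TABLE1_KEY
     ,FIELDS2
FROM QVD
WHERE NOT Exists(TABLE1_KEY_COPY, TABLE1_KEY);

DROP TABLE NewIDs;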



  • To try to understand why memory usage explodes when a unique PK field is loaded for a large table, I've loaded such a field on its own, without any other data, and checked the memory usage during the process. There are a few jumps in memory usage:
    • one from ~1 GB to ~4 GB at ~15 million unique values
    • and another one at around ~180 million unique values, where usage was at ~6 GB and then suddenly overflowed my 16 GB of RAM.


I've tried to explain this by the bit-stuffed pointers having to use an additional byte to store further values (counting twice, once for the row pointers and once for the value list), but 180 million is nowhere near a level where that should be required, and even if I assume the full 8 bytes per value, it should still fit comfortably into my memory: 8 bytes * (180,000,000 * 2) = 2.88 GB.
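
In case anyone wants to reproduce the memory test, this is essentially all I'm loading (180 million is just where the second jump happened for me; adjust the row count to taste):

// Generate rows with nothing but a unique key, then watch RAM while it runs.
KeysOnly:
LOAD RecNo() AS TABLE2_KEY
AUTOGENERATE 180000000;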



Again, if anyone could give me some advice on these two points, it would be much appreciated. With the new approach for the first point I'm happy with the performance now, but I'd still like to understand why QV behaves this way.


Thanks,

Martin