Hi,
I've got the following tables:
Table 1:
TABLE1_KEY | FIELDS1 |
---|---|
1 | ... |
2 | ... |
Table 2:
TABLE1_KEY | FIELDS2 |
---|---|
1 | ... |
1 | ... |
1 | ... |
2 | ... |
2 | ... |
Table 2 is the one this question is about. It contains ~1 billion rows, and I'm having trouble merging the data after an incremental load. Due to the data volume, I can't store the real primary key (i.e. TABLE2_KEY), as I'm restricted on memory. The incremental database load itself is fairly quick, but merging the new data into the existing data is painfully slow.
Say we have extracted new data for Table 2 from the database and old data stored in a QVD.
QVD Data:
TABLE1_KEY | FIELDS2 |
---|---|
1 | ... |
1 | ... |
1 | ... |
2 | ... |
2 | ... |
New Data:
TABLE1_KEY | FIELDS2 |
---|---|
2 | ... |
2 | ... |
3 | ... |
3 | ... |
Result should be:
TABLE1_KEY | FIELDS2 | (Source) |
---|---|---|
1 | ... | old |
1 | ... | old |
1 | ... | old |
2 | ... | new |
2 | ... | new |
3 | ... | new |
3 | ... | new |
This is what I'm currently doing to create the result table:
NewData:
LOAD TABLE1_KEY
    ,TABLE1_KEY AS TABLE1_KEY_COPY
    ,FIELDS2;
SQL SELECT TABLE1_KEY
    ,FIELDS2
FROM database;

Concatenate(NewData)
LOAD TABLE1_KEY
    ,FIELDS2
FROM QVD (qvd)
WHERE NOT Exists(TABLE1_KEY_COPY, TABLE1_KEY);
I have to take a copy of TABLE1_KEY and call Exists() this way, because if I used the same field TABLE1_KEY in Exists(), QlikView would load only the first record from the QVD for each TABLE1_KEY and then ignore all the others (the value would be added to the lookup as soon as the first record is loaded).
The two-argument Exists(TABLE1_KEY_COPY, TABLE1_KEY) is what turns the load un-optimized. If I could store TABLE2_KEY and use the single-argument Exists(TABLE2_KEY), the load would stay optimized and fast, but as mentioned above, that's not possible.
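For contrast, this is the single-argument pattern that would keep the QVD load optimized — shown only as a hypothetical sketch, since it assumes TABLE2_KEY could be held in memory, which is exactly what I can't do here (source and QVD names are placeholders as above):

NewData:
LOAD TABLE2_KEY
    ,TABLE1_KEY
    ,FIELDS2;
SQL SELECT TABLE2_KEY
    ,TABLE1_KEY
    ,FIELDS2
FROM database;

Concatenate(NewData)
LOAD TABLE2_KEY
    ,TABLE1_KEY
    ,FIELDS2
FROM QVD (qvd)
// Single-field Exists() on a field loaded as-is: the QVD load stays optimized.
WHERE NOT Exists(TABLE2_KEY);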
Now my questions:
Any input would be much appreciated!
Thanks,
Martin
OK, here are two things I've noticed while investigating further, both of which I can't really explain:
To test this, I've instead created the copy of TABLE1_KEY in a separate table so that it doesn't have anything to do with the concatenation:
NewIDs:
LOAD TABLE1_KEY AS TABLE1_KEY_COPY
RESIDENT NewData;
This performed about 5 times faster than my previous approach.
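Putting the pieces together, the faster variant looks roughly like this — a sketch based on the snippets above, with the database source and QVD path still placeholders. The DISTINCT and the DROP TABLE at the end are my additions: distinct values are all Exists() needs, and the lookup table is only required during the load.

NewData:
LOAD TABLE1_KEY
    ,FIELDS2;
SQL SELECT TABLE1_KEY
    ,FIELDS2
FROM database;

// Separate lookup table, so the copied key is no longer part of the
// concatenated result table.
NewIDs:
LOAD DISTINCT TABLE1_KEY AS TABLE1_KEY_COPY
RESIDENT NewData;

Concatenate(NewData)
LOAD TABLE1_KEY
    ,FIELDS2
FROM QVD (qvd)
WHERE NOT Exists(TABLE1_KEY_COPY, TABLE1_KEY);

// The lookup is only needed while loading; free it afterwards.
DROP TABLE NewIDs;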
I've tried to explain this by the bit-stuffed pointers having to take an extra byte to store the additional values (counted twice, once for the pointer references and once for the value list), but 180 million is nowhere near a level where this would be required, and even if I assume the full 8 bytes per value, it should still fit comfortably into my memory: 8 bytes × (180,000,000 × 2) = 2.88 GB.
Again, if anyone could give me some advice on these two points, it would be much appreciated. With the new approach for the first point I'm happy with the performance now, but I'd still like to understand why QV behaves this way.
Thanks,
Martin