Incremental load - how to filter out superseded re... - Qlik Community

crossroadsit · ‎2022-09-12

Hi folks,

I have a qvd file with consignments and their products e.g.

FROM consignments.qvd (qvd):

consignment_id	product_id	product_state
A`	1	pending
A	2	received
B	1	pending
B	3	pending
B	4	pending
C	2	received

Then I query my POS system for new/modified records. The way the POS API works means I have to pull a full replacement list of records for any consignment that has been added/modified.

e.g. RESIDENT consignments:

consignment _id	product_id	product_state
B	3	received
C	4	received
D	5	pending

The final updated table I want would look like this:

consignment _id	product_id	product_state
A	1	pending
A	2	received
B	3	received
C	4	received
D	5	pending

Note that:

Consignment A data from disk stays same (because A hadn't changed in the POS).
Consignment B data from disk is replaced entirely (including deletion) (because I had to reload all of B)
- Same is true of C but it had no deletions in that case so it looks like a little update
Consignment D is new so it has been added

To get the result above I thought I could do

Concatenate Load * FROM /consignments.qvd(qvd) WHERE NOT EXISTS(consignment_id);

But doing that means I only ever pull one product per consignment from disk (because as soon as the first product is pulled from disk, the consignment_id exists and no more products are pulled for that consignment.

Then I thought I could try making a compound e.g.

consignment_id & product_id AS key

so I could then do

Concatenate Load * FROM /consignments.qvd(qvd) WHERE NOT EXISTS(key);

...but that merges rather than replaces based on consignment_id. In the above example, consignment B would end up with three products rather than only one.

I'm sure there is a neat way to do this. Any thoughts?

marcus_sommer · ‎2022-09-13

If you could identify the new and changed records on unique values respectively combinations of fields the where not exists(Key) approach should be working. If not you need a different and more complex logic.

This could be to query an existing change-timestamp or to create one by adding now() to each record and then newer records could be identified.

Another method could be to regard the states as a numeric value and only the max. value remained maybe by an inner join load on an appropriate max-grouping. Possible are probably also mapping-solutions or ways to create an extra exists-table. In each case it will need more efforts ...

Beside this you may check if you could divide your data into YYYYMM slices and approaches like everything which is older as n month is loaded from the historic qvd's and all newer stuff is completely loaded from the data-base.

If the last is possible and fitting within the time-frames to update the data I would use such simple logic instead of building the more complex logic which was mentioned above.

- Marcus

crossroadsit · ‎2022-09-15

Thanks Marcus,

I did a bad job of describing my scenario.

Each consignment id has a "version" number associated with it. So I pull latest consignments from the API using that.
The json payload that comes back includes the child products (some of which may have been added, deleted or modified since the prior version of the record)

It turns out that my question is mostly answered by https://community.qlik.com/t5/Knowledge/WHERE-NOT-EXISTS-clause-returns-only-one-row-for-each-non/ta... which says

The solution in this case is to rename the original ID field (for example ID1) and use WHERE NOT Exists(ID1, ID); instead. With this implementation, ID and ID1 are different fields, so they have different symbol tables and is not affected by the on-the-run update of the symbol table as shown above.

The trick I'm now trying to work out is how to get that onto disk with a single "consignment_id" column. Right now I'm getting:

consignment_id: populated for all the records on disk that weren't needing to be updated (and empty for those that were updated)
consignment_id1: populated for all the newly updated records (and empty otherwise)

Perhaps I need to do yet another load to merge those columns and name them.

marcus_sommer · ‎2022-09-16

If I understand you right you could get a working solution by applying an exists-condition with two parameters. In general that's a good solution - but you will lost the optimized mode by loading these data from a qvd or resident (if there were further transformations within the load this consideration could be neglected). If this load didn't hit any essential time-frame you could remain by it.

If instead the load-times are too long it could become difficult to optimize it further. One method may be to transfer the essential parts in beforehand by creating/including the (combined) information within the previous loads. Another approach could be to implement any incremental methods or more granular layers to applying the transformation against smaller datasets or to be able to transfer the transformation to parallel time-frames.

- Marcus

Ruhulessin · ‎2022-09-16

Hi @crossroadsit,

Instead of the WHERE NOT EXIST (consignment_id) you should make a variable containing all id's. Then use that variable in a WHERE NOT MATCH clause for loading in the QVD.

Try this:

FinalTable:
LOAD *
, Date(Today(1)) AS POS_Date
;
//SOURCE = POS System
LOAD * INLINE [
consignment_id, product_id, product_state
B, 3, received
C, 4, received
D, 5, pending
];

NOCONCATENATE
ConsigmentIdList:
LOAD
CHR(39) & Concat(consignment_id, CHR(39) & ',' & CHR(39)) & CHR(39) AS ConsignmentIdList
RESIDENT FinalTable
;

LET vConsignmentIdList = Fieldvalue('ConsignmentIdList', 1);
DROP TABLE ConsigmentIdList;

CONCATENATE(FinalTable)
LOAD *
, Date(Today(1)-1) AS POS_Date
WHERE NOT MATCH(consignment_id, $(vConsignmentIdList))
;
//SOURCE = Consignments QVD
LOAD * INLINE [
consignment_id, product_id, product_state
A, 1, pending
A, 2, received
B, 1, pending
B, 3, pending
B, 4, pending
C, 2, received
];

I have never done it, so I think I can.

crossroadsit · ‎2022-09-19

These additional insights are much appreciated @marcus_sommer

Yes, I have it working with a NOT Exists condition now and I think the loss of optimized load is acceptable in this instance.

crossroadsit · ‎2022-09-19

Thank you for sharing this @Ruhulessin

I actually tried something along these lines a while ago but must have been making an error somewhere as I ended up with predictable results. I have a solution working now using WHERE NOT EXISTS so I will probably stick with it for now.

I'm interested, however, if the approach you suggest with a variable would cope with an arbitrary number of (rather long) GUIDs?

Ruhulessin · ‎2022-09-19

@crossroadsit.

I do not see why that should be a problem. How long numbers are we talking about? Depends more on the max charachters a variable can handle, but according to this topic (Variable Max Limit ) a variable can handle quite some numbers.

-Ruben

I have never done it, so I think I can.

Incremental load - how to filter out superseded records

Advanced Authoring

Client Managed

General Question