ioannagr
Creator III

Hard delete incremental load, reading from large table

Hi all.

So I've figured out a way to read my large table for the first time: I read it in parts, one per year, so I've ended up with Qvd_2011, Qvd_2012, ..., Qvd_2021, and the final table is a concatenation of them.

My question is: how do I do the hard-delete incremental load without it taking so long AGAIN?
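For reference, the first-time load was roughly a per-year loop like this (just a sketch - the 2011-2021 range, the exact field list and $(vPathQVD) are assumed):

For vY = 2011 to 2021

Init:
NoConcatenate
LOAD tab1_pk, size, datetime;
SQL SELECT tab1.`pk` "tab1_pk", tab1.`datetime`, tab2.`size` "size"
from public.tab1 tab1
INNER JOIN public.tab2 tab2
on tab2.`fk`=tab1.`pk`
where year(tab1.`datetime`)='$(vY)';

STORE Init INTO [$(vPathQVD)Qvd_$(vY).qvd] (qvd);
DROP TABLE Init;

next vY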

 

My script for incremental load is

NoConcatenate
TempYear:
load Distinct year(date(floor(datetime),'DD/MM/YYYY')) as Year;

SQL SELECT datetime
FROM public.tab1;


LET NumRows=NoOfRows('TempYear');
trace $(NumRows);
For n=1 To $(NumRows)
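// one pass per distinct year found in TempYear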

let vYear = Peek('Year',$(n)-1,'TempYear');
trace $(vYear);
TempSize:
NoConcatenate

LOAD
tab1_pk,
size,
year(date(floor(datetime),'DD/MM/YYYY')) as YearOfInterest;

SQL select
tab1.`pk` "tab1_pk",tab1.`datetime`,
tab2.`fk` "tab2_fk", tab2.`size` "size"
from
public.tab1 tab1
INNER JOIN
public.tab2 tab2
on  tab2.`fk`=tab1.`pk`
where year(tab1.`datetime`)='$(vYear)' and  tab1.`pk`>'$(MaxInitialID)';     //based on another table

Size:
LOAD
tab1_pk,
sum(size) as size
resident TempSize
group by tab1_pk;
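// each year's aggregate has the same field set, so it auto-concatenates into Size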

DROP TABLE TempSize;


// let vFileName='Size-' &'$(vYear)';
// store Image_Size into '$(vPathQVD)/$(vFileName).qvd';//into $(vFileName).qvd (qvd);

next n;
drop table TempYear;


Concatenate
Load * from [$(vPathQVD)Initial.qvd] (qvd)
WHERE NOT EXISTS(tab1_pk);
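// keep only historic rows whose tab1_pk was not reloaded as new data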


Inner Join
Load tab1_pk as tab1_pk;
SQL SELECT tab1.`pk` "tab1_pk" from
public.tab1 tab1
INNER JOIN
public.tab2 tab2
on tab1.`pk`=tab2.`fk`;

How do I handle the red part (the final Inner Join above) so that it doesn't take AGES to load? In fact it had been running for at least 7 days and was still going... And how do I do the year split in there as well?
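For example, would splitting the key fetch by year, roughly like this, be a sensible direction (just a sketch - the 2011-2021 range comes from my initial load, and Size is the table everything above ends up in)?

For vY = 2011 to 2021

Keys:
LOAD tab1_pk;
SQL SELECT tab1.`pk` "tab1_pk" from
public.tab1 tab1
INNER JOIN
public.tab2 tab2
on tab1.`pk`=tab2.`fk`
where year(tab1.`datetime`)='$(vY)';
// identical field sets auto-concatenate, so every year lands in the same Keys table

next vY

Inner Join (Size)
LOAD Distinct tab1_pk Resident Keys;  // Distinct avoids duplicating rows if a pk matches several fk rows

DROP TABLE Keys;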

 

Many many thanks in advance!

7 Replies
marcus_sommer

It's not really clear to me what you want to do, and I'm not sure the red part isn't simply superfluous, because above it the initial data are already filtered by a where exists(). If further filtering is needed, it shouldn't be done with an inner join (at least not on 1 billion records) but with an additional load and a where exists() - and here I would suggest considering combining both filter fields so that a single where exists() is enough.
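For example, one way to read this suggestion (just a sketch with assumed field and table names, not an exact recipe): fetch the current key list once and apply both conditions in a single pass over the historic qvd, so the later inner join isn't needed at all:

CurrentKeys:
LOAD tab1_pk as DbKey;
SQL SELECT tab1.`pk` "tab1_pk"
from public.tab1 tab1
INNER JOIN public.tab2 tab2
on tab1.`pk`=tab2.`fk`;

Concatenate (Size)
LOAD * FROM [$(vPathQVD)Initial.qvd] (qvd)
WHERE NOT EXISTS(tab1_pk)        // key not already reloaded as new data
AND EXISTS(DbKey, tab1_pk);      // key still present in the database

DROP TABLE CurrentKeys;

This pass over the qvd is no longer optimized, but it avoids joining against a billion-row table.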

I can't estimate how long your SQL parts run, but an optimized load from such a qvd shouldn't take ages - I would expect a run time significantly below one hour, provided your environment has gigabit storage/network and enough RAM to load all the data without swapping to virtual RAM.

- Marcus

ioannagr
Creator III
Author

Hi @marcus_sommer, thanks for your time.

I'm following the incremental load approach for hard deletes of data from Qlik Sense Help (see here for more: https://help.qlik.com/en-US/sense/February2021/Subsystems/Hub/Content/Sense_Hub/LoadData/use-QVD-fil... ). In this routine there is a final inner join like the one in my red part, to "delete" the removed rows from the qvd. When I last checked yesterday it did fine up to the concatenation, in 2 hours at most, but when the inner join in red came up it never finished - it's still running. That is not useful for the business, as they need the new data in, of course, less than 24 hours. Also, this is a table that gets thousands of new entries per day, so it will only grow beyond 1 billion rows over time.

I'm lost as to what I should do - can I keep the inner join but call this SQL query in parts?

Inner Join
Load tab1_pk as tab1_pk;
SQL SELECT tab1.`pk` "tab1_pk" from
public.tab1 tab1
INNER JOIN
public.tab2 tab2
on tab1.`pk`=tab2.`fk`;

marcus_sommer

Regarding the code snippet from the Qlik help: I think it's really wrong, because the inner join applies the same filter that the where exists() before it already does. Why should it be done twice? That means you could just remove it (or comment it out within the first trials).

Beside this, the inner join is not the preferred way to filter data within incremental logic, because this feature is much heavier from a performance point of view than the where exists() approach. With small data it's a very practical method, but not with huge datasets.
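As an illustration, the deleted keys could be removed with an exists() filter instead of the inner join - roughly like this (a sketch; the table and field names are taken from the script above):

DbKeys:
LOAD tab1_pk as StillInDb;
SQL SELECT tab1.`pk` "tab1_pk"
from public.tab1 tab1
INNER JOIN public.tab2 tab2
on tab1.`pk`=tab2.`fk`;

Kept:
NoConcatenate
LOAD * Resident Size
WHERE EXISTS(StillInDb, tab1_pk);   // keep only keys that still exist in the database

DROP TABLES Size, DbKeys;
RENAME TABLE Kept TO Size;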

- Marcus

ioannagr
Creator III
Author

Pardon me, but this inner join is what removes the data that was deleted in the database (see here as well, in the last pages, the no-deletion-flag process: https://community.qlik.com/t5/Documents/Incremental-Load-Scenarios-pdf/ta-p/1492988).

I've seen this routine work for actual deletion on small data sets (small as in <= 50 million rows).

I'm interested in your point of view though, @marcus_sommer 

marcus_sommer

For me it's not correct - at least not with the provided code snippet. Maybe they had several versions of the logic and didn't remove all the attempts from the final release of the suggestion, and/or the snippet is incomplete or was malformed by simplifying it down to only the essential steps.

Beside this, it's not clear to me what you want to do. The first inner join within the SQL makes sense, to filter the data there so that not all remaining records are transferred (in regard to the further where clauses). But the second inner join, directly below the concatenate statement, tries to join the subsequent SQL result against the big concatenated table, and that doesn't make sense.

- Marcus

ioannagr
Creator III
Author

The first inner join is between the SQL tables. I wanted to gather the information from tab1 through to tab2, where the field size lives, so I chose to do it via the SQL select rather than joining the huge tables in Qlik, for better performance (one is a billion rows and the other some millions). The filtering in that first join is there to retrieve all the new data for the hard-delete incremental load.

The second inner join is part of the routine, that's why it is there.

What would you do? Would you stop at the concatenation? @marcus_sommer

marcus_sommer

Because Qlik doesn't execute SQL itself - it only passes it through the driver to the database and receives the results back the same way - you can't mix Qlik data and database data, at least not on a table level, only via variables, which can be used to build small where clauses but not much more.

This means it's mostly not possible to apply a perfect incremental approach, meaning getting the database to return only the current data. That would only be possible if you could write the Qlik data directly into your database, or store them as a CSV which the database imports before executing the real query and applies there, for example, with an inner join, and/or if you transfer the essential parts of the incremental logic to the database itself - meaning the database itself keeps track of which records have already been transferred and uses this information to output only newer records.

But this doesn't mean that incremental logic isn't possible at all - for example, by always returning the last n days (maybe a month), which means there may be an overlap between historic and current data, which is then removed by the where exists().
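For example (a sketch; the 30-day window, the field list and the date syntax of your database are assumptions):

Fresh:
LOAD tab1_pk, size, datetime;
SQL SELECT tab1.`pk` "tab1_pk", tab1.`datetime`, tab2.`size` "size"
from public.tab1 tab1
INNER JOIN public.tab2 tab2
on tab2.`fk`=tab1.`pk`
where tab1.`datetime` >= current_date - interval '30 days';

Concatenate (Fresh)
LOAD * FROM [$(vPathQVD)Initial.qvd] (qvd)
WHERE NOT EXISTS(tab1_pk);   // historic rows whose key was reloaded in the fresh window are skipped

STORE Fresh INTO [$(vPathQVD)Initial.qvd] (qvd);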

It may sound a bit like child's play, but it is often quite helpful to simplify the use case in a dummy app and reduce the amount of data - in your case maybe to 1% - which should speed up all your trials quite significantly and really lets you play with the various measures and log and compare the record counts as well as the run times.
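For example (a sketch; the row count and the sample file name are assumptions):

Sample:
First 10000000   // roughly 1% of a 1-billion-row table
LOAD * FROM [$(vPathQVD)Initial.qvd] (qvd);

STORE Sample INTO [$(vPathQVD)Initial_sample.qvd] (qvd);
DROP TABLE Sample;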

- Marcus