JeromeAIBI
Contributor II

Get a full line (all columns for 1 row) based on the index number

Hi (sorry if my question is stupid; I come from the big data / C# / AI domain, so I'm new to Qlik).

I have something like 500 000 records in a table myData with 20 fields.

For each record (line) I have to check the content of each field to apply some validation rules.
I tried using PEEK but it's very slow. 

//myData is already loaded from a QVD

LET vNbOfLine = NoOfRows('myData');

FOR i = 0 to $(vNbOfLine) - 1

    LET v1 = Peek('field1', $(i), 'myData');
    // my code to check the v1 value (based also on the values of other fields)

    LET v2 = Peek('field2', $(i), 'myData');
    // my code to check the v2 value (based also on the values of other fields)

    // etc... up to v20...

NEXT i

I'm looking for a way to get the whole line directly, or anything else that makes this really faster.

Thanks for your help.

 

 

Jerome 

 


8 Replies
marcus_sommer

I think it should be significantly faster if you do all your checks directly within a load like:

t: load if(Field1 = 'x' and Field2 = 'y', 'z', Field1) as Field1, ... from Source;

Depending on the number and kind of checks and adjustments you want, there may be further measures possible to simplify and speed up such an approach.

With a bit more elaboration on what should be checked/adjusted, you may get more hints on how to improve your task.

- Marcus 
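A minimal sketch of this "checks inside the load" idea - the field names, rules and QVD path below are invented for illustration only; each record gets its rule flags while it streams through the load, so no script loop is needed afterwards:

myDataChecked:
LOAD *,
     If(Len(Trim(field1)) = 0, 1, 0)          as Rule1_Failed,   // empty value
     If(IsNum(field2) and field2 >= 0, 0, 1)  as Rule2_Failed,   // must be a non-negative number
     If(field3 = 'X' and field4 <> 'Y', 1, 0) as Rule3_Failed    // cross-field consistency
FROM [lib://Data/myData.qvd] (qvd);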

JeromeAIBI
Contributor II
Author

Thanks 🙂 but I tried that.
I have something like 30 to 50 million records a day, applying 90 rules (some of which call and cross-check data from other tables)...


marcus_sommer

I think you need to group which rule(s) should be applied to which fields - maybe in a kind of matrix. It's important to see, for each check, which fields from which sources are involved and what should be checked (value vs. NULL and/or numeric vs. string and/or len() and/or whether a value exists / does not exist anywhere else and/or whether certain values respectively parts of them need to be replaced with ... and/or values are invalid because of ... or whatever). If that's known, there might be other, more advanced possibilities than querying it within multiple nested if-loops.

Further, very important - at least with large datasets - is the differentiation between records and field-values. Are you sure that everything needs to be validated at the record level?

- Marcus 
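A minimal sketch of that field-value differentiation (the field name Customer and the 8-character length rule are pure assumptions) - the check runs once per distinct value instead of once per record:

// Customer must already be loaded, e.g. as part of myData
FieldCheck:
LOAD FieldValue('Customer', IterNo())                            as Customer,
     If(Len(Trim(FieldValue('Customer', IterNo()))) <> 8, 1, 0)  as Customer_Invalid
AUTOGENERATE 1
WHILE IterNo() <= FieldValueCount('Customer');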

JeromeAIBI
Contributor II
Author

Thanks a lot for your accurate remarks.
Unfortunately yes... I grouped the rules and built a matrix. Even if I'm a "beginner" with Qlik, I used to work a lot with data (Elasticsearch) and parser coding. So is it more a question of Qlik good practice? Are there many rules engines written in Qlik? Am I using Qlik for work that Qlik is not meant for?

marcus_sommer

Until now it's quite unclear to me what you mean by rules. Could you elaborate on this a bit more - what should be checked/flagged and/or adjusted, and why? How does the table structure look - normal tables or crosstables?

Good practice - and with larger datasets a must - is to load only field-values respectively to check against field-values. Qlik stores the distinct field-values of each field within its own symbol table. Running through the symbol table is therefore much faster than running through big data tables. Whether it's fully/partly applicable in your case is unclear, because on this level there is no relation to other fields.

Another very common measure to unify field-values and/or to check them against other values and/or to replace them is the use of mappings - which are quite fast and which can also be nested and/or applied to concatenated strings (which can be split again with subfield()).

Conversely, no-gos are outer loops through tables like in your example above and also the nested if-loops - both can never be fast.

- Marcus 
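A minimal sketch of the mapping idea (the mapping values and field names are invented) - ApplyMap() unifies or flags values without any if()-chain:

CountryMap:
MAPPING LOAD * INLINE [
raw, clean
DE, Germany
FR, France
];

myDataClean:
LOAD *,
     ApplyMap('CountryMap', Country, '<invalid>') as CountryClean   // default value marks unknown entries
RESIDENT myData;

DROP TABLE myData;   // avoid a synthetic key between the old and the new table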

JeromeAIBI
Contributor II
Author

A rule is a condition (possibly a complicated one) applied to different fields, covering some technical aspects and some business aspects, as we do with a BRM (Business Rules Management) system.
The result for each record is a matrix which contains, for each rule, whether the rule passed or not.
You made me happy because I already did what you suggest.

"Conversely, no-gos are outer loops through tables like in your example above and also the nested if-loops - both can never be fast." Yes indeed, I discovered that. When I did this kind of stuff in C#/ABAP/Java, I did it like this: DataSource->DataSet->Loop 🙂 using multithreading a lot. But in Qlik it is different :-).

To clarify the situation, I'll run a test with the Qlik approach (as you said) and with multithreaded C# code on a big quantity of data.


Viele Danke. Sie haben mir helfen. ("Many thanks, you have helped me" - I hope my German is not too wrong.)

marcus_sommer (Accepted Solution)

One addition: some preparation/pre-calculation might already be done within the earlier steps in which the QVD was created. That means applying one or another conversion/formatting, replacing NULLs or similar values, string-concatenation of several fields to be able to make validations on the field-value level within the following steps, using exists() to check for certain values, rangesum/avg/count() over n fields or similar, applying mapsubstring() which could cover many scenarios in which regex would be used in other tools, ...

Such things might speed up the following transformation steps quite significantly without necessarily adding much run-time during the creation, because it depends on the biggest bottleneck within your environment. For example, if the source is rather slow, the computation of those preparations may not cause much delay.

Beside this you should consider implementing an incremental approach, so that you have a bigger initial run-time only once and afterwards load only the new/changed records. Also, dividing the task into multiple sub-tasks may be useful if they can run in parallel and/or be distributed across various time frames.

- Marcus
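A minimal sketch of the incremental idea (the key field RecordID and the file names are assumptions) - only the new/changed records run through the rules, while the already validated history is just appended:

Validated:
LOAD * FROM [lib://Data/myData_today.qvd] (qvd);        // today's new/changed records

// ... apply the validation rules to this table here ...

CONCATENATE (Validated)
LOAD * FROM [lib://Data/myData_validated.qvd] (qvd)      // history that already carries the rule flags
WHERE not Exists(RecordID);                               // skip records that were reloaded today

STORE Validated INTO [lib://Data/myData_validated.qvd] (qvd);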

JeromeAIBI
Contributor II
Author

I did split some data into different QVDs (I create something like a hundred QVDs per day; they are destroyed after a while) to make my application ODAG-friendly... For Exists() I did not think about that. Thanks, I'll try it.
Thanks for your advice.