buchberger_a
Contributor III

Qlik Sense: Drop Fields in Script under a condition

Hi,

I'm looking for a way to drop columns under a certain condition in the load script.
My data: 6 million rows x 200 columns with lots of Null() and '' (empty string) values.
Condition: Drop those columns where at least 1000 values are not null/empty.

My idea was to count the non-null/non-empty values per field, but this method is super slow. I need something with high performance.
For i = 1 to NoOfFields('Table')
    Let vFieldName = FieldName($(i), 'Table');
    tmp:                                                  // results auto-concatenate into tmp
    Load
        '$(vFieldName)' as Field,
        -Sum(Len([$(vFieldName)]) > 0) as LenField        // true equals -1 in the script, hence the leading minus
    Resident Table;
Next

Creating an if statement for each column by hand is not an option. It should be dynamic and reasonably fast.
Hopefully someone can help me.

Cheers
Alex

 

1 Solution

Accepted Solutions
marcus_sommer

View solution in original post

7 Replies
treysmithdev
Partner Ambassador

This should be much faster. 

Note that this only removes fields that are entirely NULL. If you need to remove blank fields, or fields containing only spaces, you will need additional logic, ideally with FieldValue().

 

For t = 0 to NoOfTables()-1                             // TableName() is 0-based

	Let vTable = TableName($(t));

	Let vFieldCount = Num(NoOfFields('$(vTable)'),'#,##0');

	Trace -------------------------- $(vTable): $(vFieldCount) Fields;

	For f = $(vFieldCount) to 1 Step -1                 // FieldName() is 1-based

		Let vField = FieldName($(f),'$(vTable)');

		If FieldValueCount('$(vField)') < 1 Then

			Trace -------------------------- Dropping $(vField)...;

			Drop Field [$(vField)];

		EndIf;

		vField=;

	Next f

	vTable=;vFieldCount=;

Next t;

t=;

 

Blog: WhereClause   Twitter: @treysmithdev
treysmithdev
Partner Ambassador

Here is a solution accounting for blank fields as well:

For t = 0 to NoOfTables()-1                             // TableName() is 0-based

	Let vTable = TableName($(t));

	Let vFieldCount = Num(NoOfFields('$(vTable)'),'#,##0');

	Trace -------------------------- $(vTable): $(vFieldCount) Fields;

	For f = $(vFieldCount) to 1 Step -1                 // FieldName() is 1-based

		Let vField = FieldName($(f),'$(vTable)');

		If FieldValueCount('$(vField)') < 1 Then

			Trace -------------------------- Dropping Null Field: $(vField)...;

			Drop Field [$(vField)];

		ElseIf FieldValueCount('$(vField)') = 1 and Len(Trim(FieldValue('$(vField)',0))) = 0 Then

			Trace -------------------------- Dropping Blank Field: $(vField)...;

			Drop Field [$(vField)];

		EndIF;

		vField=;

	Next f

	vTable=;vFieldCount=;

Next t;

t=;
Blog: WhereClause   Twitter: @treysmithdev
buchberger_a
Contributor III
Author

Hey, thank you very much for your solutions, treysmithdev. They are pretty nice. There is just a small mistake in your second code:

ElseIf FieldValueCount('$(vField)') = 1 and Len(Trim(FieldValue('$(vField)',0))) = 0 Then

It should be FieldValue('$(vField)',1), because FieldValue() is 1-based; otherwise the condition is always true, even if there is one real distinct value in the field.

Unfortunately my condition of fewer than 1000 values in a field is not covered by this code, so I'll keep looking for a good solution.
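
For reference, this is roughly how the check inside the field loop reads with that correction applied (a sketch using the same variables as in the code above):

If FieldValueCount('$(vField)') < 1 Then
    // no stored values at all: the field contains only NULL
    Drop Field [$(vField)];
ElseIf FieldValueCount('$(vField)') = 1 and Len(Trim(FieldValue('$(vField)', 1))) = 0 Then
    // exactly one stored value and it is blank; FieldValue() element numbers start at 1
    Drop Field [$(vField)];
EndIf;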

 

marcus_sommer

I'm not sure I understand your task - you want to remove the columns with a lot of data and keep only those that have less data? And should this happen with regard to the distinct field values or with regard to the number of records in the table?

- Marcus

buchberger_a
Contributor III
Author

Hi, I'm sorry for the misunderstanding. I want to drop a field/column if it contains fewer than 1000 values.
Distinct values are a problem: if a field only has 0s and 1s, the distinct count would be 2, so I would not know how many non-null values there actually are.
So I need to take the count of each field, something like:

 

For t = 0 to NoOfTables()-1                          // TableName() is 0-based
	Let vTable = TableName($(t));
	Let vFieldCount = Num(NoOfFields('$(vTable)'),'#,##0');

	For f = $(vFieldCount) to 1 Step -1              // FieldName() is 1-based
		Let vField = FieldName($(f),'$(vTable)');
		If Count($(vField)) < 1000 Then              // Count() excludes Null() anyway - but note this is
			Drop Field $(vField);                    // pseudocode: an aggregation like Count() cannot be
		EndIF;                                       // evaluated directly in a script If condition

		vField=;
	Next f
	vTable=;vFieldCount=;
Next t;
t=;
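
A minimal sketch of how that per-field count could actually be computed in the script (assuming the data table is named Table, the field to check is F1, and the threshold is 1000): the aggregation happens in a small temporary LOAD whose single result row is read back with Peek().

// Sketch: count the non-empty values of one field via an aggregation LOAD
// (table name Table and field name F1 are placeholders).
FieldCheck:
Load
    Count(If(Len(Trim([F1])) > 0, 1)) as NonEmptyCount
Resident Table;

Let vNonEmpty = Peek('NonEmptyCount', 0, 'FieldCheck');
Drop Table FieldCheck;

If $(vNonEmpty) < 1000 Then
    Drop Field [F1];    // fewer than 1000 real values -> drop the column
End If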

 

 

marcus_sommer

As you already noticed, it can't simply be solved with the field functions, because only the distinct values are stored - at least not with such a single-step approach for the density check (meaning the pure check, not the nested loop that runs through all tables and fetches all fields).

Nevertheless I think your task could be sped up. For example, I could imagine that the following would be faster:

check: load [$(vFieldName)] resident YourTable where len(trim([$(vFieldName)])) > 0;
if noofrows('check') < $(vThreshold) then
    drop fields [$(vFieldName)];
end if

drop tables check;

Probably even better might be to do the check with exists() - especially if you could apply it with a single parameter, like:

check: load [$(vFieldName)] resident YourTable where exists([$(vFieldName)]);
if noofrows('check') < $(vThreshold) then
    drop fields [$(vFieldName)];
end if

drop tables check;

But it requires additional NULL handling - probably with the NULL variables or a NULL mapping - to convert the empty field values into NULL (this approach might be beneficial in general, meaning an explicit decision about which kinds of NULL should be in the data, respectively whether to convert NULL into "real" values like '<Null>'). Further, working with exists() means that the order of the load statements within the script can have an impact.
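
A minimal sketch of such NULL handling, under the assumption that the data table is called Table and F1 is one of its fields (both directions are shown; NullAsValue only affects subsequent loads):

// Turn empty/blank strings into real NULLs so that Exists() no longer sees them as values
// (Table and F1 are placeholder names; in practice the other fields would be listed too):
CleanTable:
NoConcatenate
Load
    If(Len(Trim(F1)) = 0, Null(), F1) as F1
Resident Table;
Drop Table Table;

// Or the opposite direction: make NULL visible as a "real" value in later loads
Set NullValue = '<Null>';
NullAsValue F1;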

Besides this, I suggest also considering the general possibility of applying the main part of the checks already in the original load, maybe with something like this:

table: load F1, rangesum(peek('F1_check'), -(len(trim(F1))>0)) as F1_check, F2, rangesum(peek('F2_check'), -(len(trim(F2))>0)) as F2_check, …

Of course it will make your original load heavier in effort and run time, but for the overall task time it might look different (personally I don't think that it would be better, but it shouldn't be discarded too easily).
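
A slightly fuller sketch of this running-counter idea (field names F1/F2 and the QVD path are placeholders): each *_check field accumulates the number of non-empty values seen so far, so the last record of the table holds the per-field totals.

// Running counters built during the original load:
table:
Load
    F1, RangeSum(Peek('F1_check'), -(Len(Trim(F1)) > 0)) as F1_check,
    F2, RangeSum(Peek('F2_check'), -(Len(Trim(F2)) > 0)) as F2_check
From [lib://Data/source.qvd] (qvd);

// The totals can then be read from the last record and used for the drop decision:
Let vF1NonEmpty = Peek('F1_check', -1, 'table');
If $(vF1NonEmpty) < 1000 Then
    Drop Fields F1, F1_check;
End If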

Another approach besides the above-mentioned measures would be to store the tables into separate QVD or QVW files and then load the XML header from them. The number of records and the distinct value count per field can easily be read from the XML - but unfortunately I don't know how to read and/or calculate the density, although it must be possible, because tools like Q-Eye show this information. Maybe someone could share here how it's done.
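
A minimal sketch of reading such a QVD header (the file path is a placeholder; NoOfSymbols is the number of distinct values stored per field, so - like the field functions - it does not give the non-null record count):

// Read the XML header of a stored QVD without loading the data itself:
QvdFields:
Load
    FieldName,
    NoOfSymbols                       // distinct values stored for the field
From [lib://Data/MyTable.qvd]
(XmlSimple, table is [QvdTableHeader/Fields/QvdFieldHeader]);

QvdTable:
Load
    NoOfRecords                       // total number of records in the QVD
From [lib://Data/MyTable.qvd]
(XmlSimple, table is [QvdTableHeader]);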

Before all this, I suggest rethinking your whole approach - NULL isn't stored in any way and empty values don't need much space - so is there really a significant benefit in file size and calculation times from removing these columns respectively this data? Especially since you don't remove only NULL and empty values: there will also be a number of other values in the fields below the threshold - are they really unimportant?

Going on from my last suggestion and from parts of your description, I assume that your tables are crosstables containing various flag and/or sensor fields or something similar - so it might be useful to transform the table into a stream table structure. Of course, a crosstable with 6 M records and 200 fields is a heavy transformation, but afterwards NULL / empty values can easily be removed - and in most cases such stream structures have a lot of handling and performance benefits in the subsequent reports (even if the file size might increase through the bigger pointers). Therefore take a look here: The-Crosstable-Load.
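
A minimal sketch of that transformation, assuming the wide table is called Table, has a key field KeyField, and data fields F1, F2, ... (names are placeholders):

// CrossTable turns the data columns into Attribute/Value pairs;
// one qualifier field (KeyField) is kept as-is.
Stream:
CrossTable (Attribute, Value, 1)
Load KeyField, F1, F2        // in practice all 200 data fields
Resident Table;

// Keep only real values - NULL and empty/blank entries are filtered out here.
StreamClean:
NoConcatenate
Load KeyField, Attribute, Value
Resident Stream
Where Len(Trim(Value)) > 0;

Drop Tables Stream, Table;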

- Marcus

buchberger_a
Contributor III
Author

Thanks for the ideas,

but in the end I chose to do it with the R extension. As I had problems with loading all columns into R at once, I decided to do it column-wise. If someone is interested, this is roughly my solution:

// Loop over the fields of the data table (here called 'Table') from last to first,
// so that dropping a field does not shift the numbers of the fields still to be checked.
For h = NoOfFields('Table') to 1 step -1
	Let vField = FieldName($(h), 'Table');

	// Hand the single column to R via the SSE extension and count its non-empty values
	Tmp:
	Load *
	Extension R.ScriptEval('
		dat <- q;
		colSums(!(is.na(dat) | dat == "" | dat == "NA"))
	', Table{ $(vField) });

	Let vDrop = Peek('Field1', -1, 'Tmp');   // the count returned by R

	If Num($(vDrop)) < 1000 Then
		Drop Field [$(vField)];
	EndIF;

	vDrop = ;
	Drop Table Tmp;
Next