Smooth values based on neighbouring bins

prayner — Tue, 19 Nov 2024 16:19:07 GMT

I have a gridded map table produced by the GeoOperations binning function. However, some of the values in the map might be outliers relative to neighbours, so I have written a smoothing function in Qlik load editor to replace any outliers with the average value of the neighbouring bins. However, I have to run this function for hundreds of thousands of bins, multiplied by each dimension (product, date). How can I optimise the load script? It would be good to avoid the `for each` but I am not sure how.

(image illustrates how some values appear higher or lower than neighbouring bins. They look like dark and light spots. using a smoothing function, we can eliminate these outliers).

binning_table_t2:
load *,
SubField(bin_id, '|', 2) as x,
SubField(bin_id, '|', 3) as y
resident binning_table_t
;
drop table binning_table_t;

[t_distinct_bin_ids]:
load distinct bin_id as distinct_bin_ids resident binning_table_t2;

// Create a temporary table with neighbor BinIds
neighbour_check:
LOAD * INLINE [
tmp_field
];

for each vBinId in FieldValueList('distinct_bin_ids')

//Identify each surrounding neighbour for a bin id which has format 'R|X|Y'

LET vX=SubField('$(vBinId)', '|', 2);
LET vY=SubField('$(vBinId)', '|', 3);
LET vYPlus1 = $(vY) + 1;
LET vYMinus1 = $(vY) - 1;
LET vXMinus1 = $(vX) - 1;
LET vXPlus1 = $(vX) + 1;

[binning_table_t2_bin_id]:
NoConcatenate
load date, product, avg_index_value resident binning_table_t2
WHERE
((x=$(vX) and y=$(vYPlus1)) or
(x=$(vX) and y=$(vYMinus1)) or
(x=$(vXMinus1) and y=$(vY)) or
(x=$(vXPlus1) and y=$(vY)) or
(x=$(vXMinus1) and y=$(vYPlus1)) or //top left
(x=$(vXPlus1) and y=$(vYPlus1)) or //top right
(x=$(vXMinus1) and y=$(vYMinus1)) or //bottom left
(x=$(vXPlus1) and y=$(vYMinus1))) //bottom right
;

trace $(vBinId);

// Join the temporary table with your main data table
[neighbour_check]:
Concatenate(neighbour_check)
LOAD '$(vBinId)' as bin_id, product, date, avg(avg_index_value) as neighbouring_avg_index_value
resident binning_table_t2_bin_id
group by product, date
;

drop table binning_table_t2_bin_id;

next vBinId

Sample data :

[binning_table_t]:

load * inline [
bin_id , avg_index_value, date, product
R|114|838 , 0.60129568 , 30/10/2024 , apple
R|114|838 , 0.60129568 , 12/11/2024 , apple
R|114|839 , 0.47632898 , 12/11/2024 , apple
R|114|839 , 0.66667490883028 , 30/10/2024 , apple
R|114|843 , 0.55975765889839 , 12/11/2024 , apple
R|114|843 , 0.78789984277157 , 30/10/2024 , apple
R|114|844 , 0.67144721805877 , 12/11/2024 , apple
R|114|844 , 0.83291251451518 , 30/10/2024 , apple
R|114|846 , 0.5460232 , 12/11/2024 , apple
R|114|846 , 0.7981770082408 , 30/10/2024 , apple
R|114|847 , 0.63275596 , 12/11/2024 , apple
R|114|847 , 0.83393975481718 , 30/10/2024 , apple
R|115|838 , 0.50620004195143 , 12/11/2024 , apple
R|115|838 , 0.61589052 , 30/10/2024 , apple
R|115|839 , 0.54598076193291 , 12/11/2024 , apple
R|115|839 , 0.73205414 , 30/10/2024 , apple
R|115|842 , 0.70264421 , 12/11/2024 , apple
R|115|842 , 0.95325742113421 , 30/10/2024 , apple
R|115|843 , 0.65527531952943 , 12/11/2024 , apple
R|115|843 , 0.90651484 , 30/10/2024 , apple
R|115|844 , 0.61245206101525 , 12/11/2024 , apple
R|115|844 , 0.83911441747261 , 30/10/2024 , apple
R|115|846 , 0.75535312 , 12/11/2024 , apple
R|115|846 , 0.88929316987812 , 30/10/2024 , apple
R|115|847 , 0.6306660543884 , 12/11/2024 , apple
R|115|847 , 0.82988369243856 , 30/10/2024 , apple
]
;

Re: Smooth values based on neighbouring bins

marcus_sommer — Tue, 19 Nov 2024 17:10:04 GMT

Such approach must be slow because in each loop iteration is a table-load initialized which reads a larger resident-load with a heavy where-clause.

Personally I would try a different way by accumulating the relevant neighbours with the help of interrecord-functions within an appropriate sorted resident load. Here a simplified example:

load Key, Value, rangeavg(Value, previous(Value), peek('Value', -2)) as ValueAVG
resident Source order by Key;

This means the essential part is the right ordering against one or several fields. Depending on the scenario it might be necessary to embed the accumulation-part within n (nested) if-loops to ensure that the interrecord-logic is restricted to the wanted areas and/or to run it twice - forwards and backwards - and/or to create n different avg-fields and/or to calculate the offset from the current value against the avg and/or to fetch outliers within further range-min/max and/or further grouping/flagging-fields or similar stuff.

topic Re: Smooth values based on neighbouring bins in App Development

Smooth values based on neighbouring bins

Re: Smooth values based on neighbouring bins