Solved: Re: Inconsistent Cluster Filtering with KMeans usi... - Qlik Community

gileswalker

Hi - thanks for clicking on my post. I hope you can help?

(I'm using Qlik May'23, Client Managed)

I am building an app, and have used KMeans2D for the first time to cluster data. I have used the out-the-box function from the insights advisor to get started.

My app uses drop down variable inputs to select both measures and dimensions I have set up, to make the user experience very fluid and dynamic, allowing them to change measures and dimensions to create reporting.

The KMeans clustering, used in a scatterplot, is driven by these user selections. Both my chart, and the KMeans AGGR formula use the same user driven inputs from 2 user chosen measures, plus also 1 user chosen dimension to generate (1) the chart and (2) the coloured clustering, and it works fine. The clustering formula used to colour the clusters is this:

=aggr(KMeans2D($(vClusterQty),[🛠️ DIY Measure 12],[🛠️ DIY Measure 13], 'zscore')+1, $(vDimension) )

The resulting chart displays data as expected and the KMeans clusters are shown with colour, and the legend shows those colours along with the cluster number as expected:

If you click on a cluster from the chart Legend, the data will reduce to the clustered selection. Perfect.

However, to expand capability I have tried to get clever and creative with this solution, and I've literally hit a wall and need some help please.

I believe that where real value comes in, is by allowing dynamic extensions of that single dimension, by allowing a second dimension to be used as well, and combining them. What that does is enriches the single first dimension by joining it to the second one, therefore creating a kind of hybrid dimension. It naturally extends and changes the data outcomes in the chart, allowing you to see enriched clustering, ie additional dimensions are taken into consideration by Qlik and allows a cluster to recalibrate as the 2nd dimension is selected.

2 things are set up to facilitate this dimension 'extension':

(1) my chart dimension is dynamic, ie when 2 dimensions are selected by the user, it will combine the 2 dimensions together to create the hybrid dimension value, vDim1 & '+' & vDim2

(2) I have set the AGGR dimension for KMeans clustering to do much the same, ie adding the 2nd dimension to the AGGR's grouping dimension. To clarify the vDimension from the above long AGGR formula has a formula behind it to combine dimensions if more than 1 is chosen. But the main difference is that to make that AGGR formula work, the dimensions must be separated by commas ie vDimension is vDim1 & ',' & vDim2. It won't cluster correctly if this is not done. When this is done using the comma to separate the dimensions, the chart seems to respond correctly, clustering the data much like I expected.

BUT the big issue then comes after the chart has rendered, specifically after you select a cluster from the chart legend. The results you get after selection aren't what I expect or need. Here is an example of what happens.

In the chart below I've initially clicked on Cluster 10, and before I confirm that choice, Qlik highlights in the chart what represents cluster 10 (top left):

The act of selecting the cluster value natively applies filters against the 2 dimensions used, individually . The 2 filters get displayed in the selections bar like this:

When I confirm my selection, this is the resulting output Qlik gives:

The blue data points in the bright green circle are the 'cluster 10' items I wanted, but the act of selecting the cluster value from the chart legend has natively applied the 2 individual/independent filters, meaning additional green data points in the red rectangle are also showing, but they were never part of 'cluster 10'. This is 'not technically wrong', as those data points do comply with how the filters have worked based on the coding, but we wanted Cluster 10 results only, and the above outcome does not provide that. We got "Cluster 10 +"......!

Interestingly, the above behaviour Qlik shows us is in direct contrast to using the lasso in the scatterplot to select a cluster. When you use the lasso and select the actual coloured dots in the chart that are part of a cluster, the filtered result is perfect, seemingly because you are choosing the hybrid dimension values directly, not the cluster number. Further, in doing it that way, the filter shown at the top of the screen does not display the 2 filtered dimensions, it shows 1 filter which is derived from the chart dimension:

Despite the lasso being a great option, it isn't the solution we need. This is partly as my clusters are very large, but more importantly we need to be able to use the cluster number to grasp the cluster entirely and quickly, as having to use a lasso to get to a cluster is rather limiting if we want to use the Cluster number elsewhere in our app.

From subsequent trial and error work to determine route cause, I have come to the conclusion that the issue might be in the AGGR grouping dimension. Researching AGGR function on Qlik Help page, seems to indicate that you can't use a calculated dimension with AGGR, so I can't combine 2 dimensions dynamically inside the formula.

To prove a point that this was the issue, I did do a trial where I manually created the exact hybrid dimension I wanted in the Load script, and then replaced the vDimension with it, then selected the same 2 individual dimensions from my drop down menus.......and then selected random resulting clusters, and sure enough everything filtered perfectly from the cluster selections. The manually created hybrid field and the 2 selected dimensions were in perfect alignment and everything behaved correctly.

With my knowledge and capabilities, it seems that the means to construct the hybrid dimension field on the fly from selectable variables can't then create the necessary combined dimension for the AGGR dimension to use, and allow precise filtering using the resulting cluster.

It seems that this dimension 'enrichment' process needs a means to synthetically create a hybrid dimension on the fly (using vDim1 and vDim 2 etc) that the AGGR can leverage to create a consistent set of outcomes once data is selected.

(I also have a 3rd selectable dimension which I want to use like vDim1 & '+' & vDim2 & '+' & vDim3)

Disclaimer: I have tried to be factual and descriptive here - I realise that Qlik is most likely operating in a manner governed by my coding and set up. Clearly I need to know more to overcome what appears to be a constraint, albeit from my perspective an unexpected constraint.

I have done a lot of solution searching but no luck. If anyone has any advice, I'd be very grateful.

Regards Giles

gileswalker

Update 19/07/24:

So after more researching on this issue, I am certain the crux of the problem is linked to the lack of calculation ability of the dimension in the AGGR formula.

Extending my prior testing (mentioned above), using manually created additional hybrid dimensions (ie [Dim1] & [Dim2] & [Dim3] as [Dim123] etc), and with a tweak to my variables, I got 3 extra permutations of hybrid dimensions functioning as required in the KMeans chart clustering. My variables passed the concatenated hybrid dimension to the AGGR dimension, and the necessary hybrid field was retrieved from the load script. So the proof of concept was a 'pass'.

So to proceed with this idea, my list of 17 selectable dimension fields that could be used to create permutations needed to be added in the load script.

However that presents its own issues....

Without labouring the maths too heavily, as I have 3 sets of selectable dimensions linked to my variables, there are LOADS of field permutations that could theoretically be chosen to provide the enrichment I talked about above.

But I continued......loading the necessary dim combinations (ie [Dim1] & [Dim2] & [Dim3] as [Dim123] etc) into the load script, required over 4000 additional fields (thanks to ChatGPT to create the field combination syntax). From there the data load was extended hideously (+25 minutes), and the app once opened, it could barely function when looking at the KMeans chart. I had feared as much.

So with only my testing to go on (maybe my way of doing this was poor), I have stopped the effort required to progress this. My conclusion is that the technical exercise passed proof of concept, but could not be created in reality in a manner that was executable with the resources within Qlik on premise.

I still strongly believe this idea of data enrichment within KMeans2D is totally valid, but my way didn't deliver what was required.

View solution in original post

gileswalker

@MWiciok I saw a comment against one of @hic 's posts that you made (https://community.qlik.com/t5/Design/Calculated-Dimensions/ba-p/1472813/page/3#comments).

I think you were maybe referring to what I am seeing in my post above? I appreciate any further comments or insights. Thanks

gileswalker

Update 19/07/24:

So after more researching on this issue, I am certain the crux of the problem is linked to the lack of calculation ability of the dimension in the AGGR formula.

Extending my prior testing (mentioned above), using manually created additional hybrid dimensions (ie [Dim1] & [Dim2] & [Dim3] as [Dim123] etc), and with a tweak to my variables, I got 3 extra permutations of hybrid dimensions functioning as required in the KMeans chart clustering. My variables passed the concatenated hybrid dimension to the AGGR dimension, and the necessary hybrid field was retrieved from the load script. So the proof of concept was a 'pass'.

So to proceed with this idea, my list of 17 selectable dimension fields that could be used to create permutations needed to be added in the load script.

However that presents its own issues....

Without labouring the maths too heavily, as I have 3 sets of selectable dimensions linked to my variables, there are LOADS of field permutations that could theoretically be chosen to provide the enrichment I talked about above.

But I continued......loading the necessary dim combinations (ie [Dim1] & [Dim2] & [Dim3] as [Dim123] etc) into the load script, required over 4000 additional fields (thanks to ChatGPT to create the field combination syntax). From there the data load was extended hideously (+25 minutes), and the app once opened, it could barely function when looking at the KMeans chart. I had feared as much.

So with only my testing to go on (maybe my way of doing this was poor), I have stopped the effort required to progress this. My conclusion is that the technical exercise passed proof of concept, but could not be created in reality in a manner that was executable with the resources within Qlik on premise.

I still strongly believe this idea of data enrichment within KMeans2D is totally valid, but my way didn't deliver what was required.

Inconsistent Cluster Filtering with KMeans using AGGR and Dynamic Dimensions

Client Managed

dimension

expression

Qlik Sense

Variables

Visualization