k-Means clustering using Qlik's Nebula.js - Qlik Community

Dipankar_Mazumdar · ‎2021-08-26

In the past few years, Qlik Sense has introduced a solid range of advanced analytics capabilities that compliments the Data Analytics platform. This includes using techniques such as Machine Learning, Natural Language Processing, etc. to help analysts/scientists explore data in a better way, get insights into any hidden patterns & take necessary actions. Bringing these methods together with the Data Analytics platform is often termed ‘Augmented Intelligence’. For a more detailed description of the what’s & why’s, please refer to this link.

Problem:
Consider the following scenario. An analyst needs to explore geographical data for a variety of neighborhoods in Toronto to help the city’s crime department set up some hotspots to monitor the neighborhoods and analyze the criminal activities in an effective way. Segregating the neighborhoods into five different clusters based on some kind of similarities between them would be the first step.

Qlik Sense client approach:

A solution to this from the Qlik Sense client perspective would be to use the KMeans2D chart function that applies k-Means clustering internally and calculates the cluster_id for each of the neighborhoods. The results can then be visualized in the form of a Scatter plot with latitude on the X-axis, longitude on the Y-axis, and bubbles representing each neighborhood’s ID. Here’s a snippet that shows the data configuration.

To visually explain the five clusters, the chart can be ‘colored by dimension’ based on the cluster_id’s. Please note that we use color by dimension (and use a dimension expression) and not color by expression. Here’s where we define our dimension expression.

Expression:

=pick(aggr(KMeans2D(vDistClusters,only(Lat),only(Long)),FID)+1, 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4','Cluster 5')

Embedded Analytics approach:

Now, consider that you are a full-stack Qlik developer and you leverage the open-sourced Nebula.js library to build your analytics portal. What would be an easy way for you to apply k-Means clustering to this dataset without relying on any 3rd-party libraries?

Solution:

Technically there are two ways to achieve this using Nebula.js.

Develop the visualization in Qlik Sense client & embed it to your page.
Develop a visualization on the fly & use the kMeans2D chart function.

The first point is fairly simple & straightforward. You can create your Scatter plot in the Qlik Sense client and just call the <object_id> using Nebula’s render( ) function like this.

n.render({
  element,
  id: '<ObjectID>',
});

In this blog post, we will keep our focus on Point 2. i.e. we will develop a visualization on the fly and then apply the clustering algorithm using the chart function to achieve our goal.

Implementation:

Before we start developing the visualization, let’s recap one of the amazing things about Nebula.js. It is a collection of JavaScript libraries, visualizations, and CLIs that helps developers build and integrate visualizations on top of Qlik's Associative Engine. Essentially it serves as a wrapper on top of the Engine allowing us to use expressions as we can inside the Qlik Sense client.

Since Nebula sits on top of the Engine, we also have direct access to the data (dimensions & measures) from the Qlik Sense app that is connected to our Nebula app. This makes building an existing or custom visualization using Nebula easier & quicker from a data structure perspective. It also goes without saying that Qlik Sense ‘associations’ works seamlessly when using Nebula.

Now, let’s quickly build an on-the-fly scatter plot and use the kMeans2D chart function to color each point/bubble on our plot.

Step 1: Define the qAttributeDimensions for the ‘dimension expression’.

qAttributeDimensions: [
          {
            qDef:
              "=pick(aggr(KMeans2D(vDistClusters, only(Lat), only(Long)), FID)+1, 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5')",
            qAttribute: true,
            id: "colorByAlternative",
            label: "Cluster id"
          }
        ]
      }
    ],

Since we would like to use color by dimension as a coloring technique in our scatter plot, we have to explicitly specify it using id:"colorByAlternative" in the qDef as shown in the above code. An alternative to this would be id: "colorByExpression" if we wanted to color by expression.

Also, we use a dimension expression (not just a dimension field) in the qDef property of qAttributeDimensions. Therefore, the dimension expression has to be defined inside the qDef property (as shown in the code).

Step 2: Define the properties.

Next, we define the required properties in the root of the object. An important property here is the color. We tell Nebula to not color our chart automatically first (code below) so we can override using our expression’s color. Also, we set have to set the right color mode inside the color. In our case, it is “byDimension”.

The other crucial property within color is byDimDef. It is used to configure the settings for coloring the chart by dimension. byDimDef consists of three properties -

type - either ‘expression’ | ‘libraryItem’
key - if it is a ‘libraryItem’, then the libraryId | dimension expression if using an ‘expression’
label - Label displayed for coloring (in the legend). Can be string | expression

The code for defining the properties is below.

  properties: {
      title: "k-Means clustering",
      color: {
        auto: false,
        mode: "byDimension",
        byDimDef: {
          type: "expression",
          key:
            "=pick(aggr(KMeans2D(vDistClusters, only(Lat), only(Long)), FID)+1, 'Cluster 1', 'Cluster 2', 'Cluster 3', 'Cluster 4', 'Cluster 5')",
          label: "Cluster id"
        }
      }
    }

That is pretty much it! To elucidate the clustering method, we also embed a scatter plot to our Nebula app with non-clustered data.

Let us take a look at the two plots -

Using the KMeans2D function the neighborhoods have now been segregated into five different clusters based on their similarity. The color by cluster helps us in interpreting the five groups distinctly.

Before we end this post, let’s take a look at the clustering function once and understand the parameters.

KMeans2D(num_clusters, coordinate_1, coordinate_2 [, norm])

num_clusters - implies the no. of cluster we would like to have. Typically this value is calculated using the elbow curve or silhouette analysis methods. Read more here.
coordinate_1, coordinate_2 - indicates the columns used by the clustering algorithm. These are both aggregations.
norm - This is an optional normalization method applied to datasets before k-Means clustering. Possible values are - (0/‘none’ - no normalization, 1/ ‘zscore’ -zscore normalization, 2/‘minmax’ - min-max normalization).

In our current Nebula application, we play around with the num_clusters parameter to understand the differences such as neighborhood overlaps, etc. We embed three action-buttons in our dashboard and the final result can be seen below.

The complete project can be found on either my Github repo or this Glitch.

Let me know if you have any interesting ideas to apply clustering or any advanced analytical methods using Nebula.js.

~Dipankar