3 Replies Latest reply: Nov 14, 2017 3:18 AM by Simone Trabattoni

    Qlik AAI: Clustering in R, kmeans

    Simone Trabattoni

      Hi community,

      I'm experimenting with R and Qlik Sense Desktop, using this excellent example, the kmeans app provided by deh. The original app and the data are at the link below; a modified app is also attached to my post.

       

      https://community.qlik.com/docs/DOC-18787#comment-63411

       

       

       

      The app uses the famous iris data to perform a kmeans clustering of the observations in Qlik Sense. Each observation has four continuous quantitative variables and one qualitative variable, the species of the iris flower. The result is a nice scatterplot with a pair of the quantitative variables on the axes, the observations as dots, and the cluster as colour.

       

      My goal is to have the species of the iris (3 species) as the dots instead of the observations, obviously with a reasonable number of clusters (1). As you can tell, my goal is not a meaningful analysis but a test of the system, to see how flexible it is.

       

      So I simply put the species in the dimension panel in Qlik, and the result is an error, something like "the client has not a valid argument".

       

      Looking at SSEtoRServe:

      clust.PNG

       

      I decided to reproduce the problem in R.

      Working on the data directly in R, without using the observation as a measure (as is done in Qlik Sense), gives the same result as the app. Here is my code:


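      # Load the iris data and take a first look
      Iris <- read.csv('yourpath\\Iris.csv', sep = ',')
      nrow(Iris)
      ncol(Iris)
      head(Iris)

      # Cluster on the four numeric columns (2 to 5) and keep the cluster assignments
      clusters <- kmeans(Iris[, 2:5], 3, nstart = 20)$cluster
      head(Iris[, 2:5])
      Iris2 <- cbind(Iris, clusters)

      library(ggplot2)

      # Scatterplot of the petal measures, coloured by cluster and labelled by observation
      p <- ggplot(Iris2, aes(petal.length, petal.width))
      p + geom_point(aes(colour = factor(clusters)), size = 3) + geom_text(aes(label = observation), size = 3)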

       

      clusteR.PNG

       

      And that's great!

      So I tried to reproduce the error I get, using the species as the dimension (I cleared the memory etc. in RStudio first):

       

       

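      data <- read.table("yourpath\\Iris.csv", sep = ",", header = TRUE)
      head(data)
      # Keep the four measures and the species (columns 2 to 6)
      data <- data[, 2:6]
      head(data)
      library(plyr)

      # grouping: one row per species, with the mean of each measure
      data <- ddply(data,
                    ~iris.species,
                    summarise,
                    sepal.length = mean(sepal.length),
                    sepal.width = mean(sepal.width),
                    petal.length = mean(petal.length),
                    petal.width = mean(petal.width))

      # it does not work (iris.species is still in the data frame)
      kmeans(data, 1)

      # it works! (only the four numeric columns)
      kmeans(data[, 2:5], 1)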

      So R also passes the iris species to kmeans if you do not remove it explicitly.
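
      The same behaviour can be sketched without the CSV. This is only an illustration, and it assumes R's built-in iris dataset, whose column names (for example Species) differ from those in Iris.csv. The species column is set aside as a label, only the numeric columns go to kmeans, and the label is attached again afterwards:

```r
# Sketch using R's built-in iris dataset (an assumption: its column
# names differ from the ones in Iris.csv used in this post).
agg <- aggregate(. ~ Species, data = iris, FUN = mean)  # one row per species

# Pass only the numeric columns to kmeans; Species is kept aside as a label.
num_cols <- sapply(agg, is.numeric)
set.seed(1)  # kmeans chooses its starting centre at random
fit <- kmeans(agg[, num_cols], centers = 1)

# Attach the cluster assignment back to the species label.
result <- data.frame(Species = agg$Species, cluster = fit$cluster)
print(result)
```

      Selecting the columns with is.numeric rather than by position avoids depending on where the species column ends up after grouping.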

       

      The question is: how can I make it work with the species as the dimension, i.e. a non-numeric dimension? I cannot understand why it does not work in Qlik Sense.

      Thanks in advance.

       

      I've attached the modified app (I also modified the ER, nothing major).

       

      EDIT: Added the app with the aggr()