Unlock a world of possibilities! Login now and discover the exclusive benefits awaiting you.
Hi Qlik Community,
I have quite a complex task that I need to complete. I have been asked to calculate the KS (Kolmogorov-Smirnov) statistic using the Qlikview tool. For those of you unfamiliar with it, the KS statistic is a measure of separation between two distributions. It is often used in the financial industry to measure the effectiveness of a particular score at separating risk. Read more about it on Wikipedia here: Kolmogorov–Smirnov test - Wikipedia, the free encyclopedia
Basically, I am trying to calculate the maximum value of the difference between two cumulative density functions. I am trying to store this value as a variable. As you can see in the example below, I can calculate the CDFs themselves using the “full accumulation” checkbox and plotting the Score (or a binned score, using round), and
1. For the “bads” CDF, the number of bads in a bin over the total number of bads in the entire population:
Sum( {<Booked={1}>} If(Bad=1,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=1,1,0))
2. For the “goods” CDF, the number of goods in a bin over the total number of goods in the entire population
Sum( {<Booked={1}>} If(Bad=0,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=0,1,0))
You can see here that I’m only capturing booked accounts (Booked=1), as those are the only accounts which can truly be good or bad.
I can also calculate the different between the two distributions over the score by plotting the difference of the above functions, again with the “full accumulation” box checked.
Sum( {<Booked={1}>} If(Bad=1,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>} If(Bad=0,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=0,1,0))
Now, I’d like to find the maximum value of that expression over ALL of the scores to calculate KS, so I use the amazing Aggr() function that qlikview has, and create a new variable called “vKS” that has the definition:
Max(Aggr(
Sum( {<Booked={1}>} If(Bad=1,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>} If(Bad=0,1,0))/Sum( {<Booked={1}>} TOTAL If(Bad=0,1,0))
,Score))
However: There is no “Full Accumulation” checkbox anymore, so it will calculate only the max difference in the PDF! I cannot think of a workaround for this. I would be most grateful for any suggestions.
Thank you and happy holidays.
Jesse
Any thoughts on this? I have been trying to solve this for a couple more hours to no avail...
Hello,
Using the rank and quantiles function seems easier:
$ (f): the field / function concerned
in dimension, the function, here $ (f)
in expressions the distribution functions:
$ f (): quantile = rank (- $ (f), 4) / count (total $ (f))
the second distribution function to compare
for example: Fy = normdist (avg ($ (f)), mu, sigma) for a normal distribution
the gap = fabs (Fy - quantile)
the max gap =
max (total aggr (fabs (normdist (avg ($ (f)), mu, sigma)
- rank (- $ (f), 4) / count (total $ (f))), $ (f)))
on this example same result as R (ks.test) at the 4th decimal place
cdt
Note: Message edited by Community Moderator to include English translation as a courtesy.
- - - - - -
Hello,
En utilisant la fonction rank et les quantiles cela semble plus simple :
$(f) : le champ/la fonction concernée
en dimension, la fonction , ici $(f)
en expressions les fonctions de répartition :
de $f() : quantile = rank(-$(f),4) / count(total $(f))
de la deuxième fonction de répartition à comparer
par ex : Fy = normdist(avg( $(f)),mu,sigma) pour une loi normale
l'écart = fabs(Fy - quantile)
l'écart max =
max(total aggr( fabs(normdist(avg( $(f)),mu,sigma)
- rank(-$(f),4) / count(total $(f))) ,$(f) ))
sur cet exemple même résultat que R (ks.test) à la 4ème décimale
cdt