Skip to main content
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Calculating KS Statistic (or, Full Accumulation in Variables)

Hi Qlik Community,

I have quite a complex task that I need to complete. I have been asked to calculate the KS (Kolmogorov-Smirnov) statistic using the Qlikview tool. For those of you unfamiliar with it, the KS statistic is a measure of separation between two distributions. It is often used in the financial industry to measure the effectiveness of a particular score at separating risk. Read more about it on Wikipedia here: Kolmogorov–Smirnov test - Wikipedia, the free encyclopedia

Basically, I am trying to calculate the maximum value of the difference between two cumulative density functions. I am trying to store this value as a variable. As you can see in the example below, I can calculate the CDFs themselves using the “full accumulation” checkbox and plotting the Score (or a binned score, using round), and

1. For the “bads” CDF, the number of bads in a bin over the total number of bads in the entire population:

Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))

2. For the “goods” CDF, the number of goods in a bin over the total number of goods in the entire population

Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))

You can see here that I’m only capturing booked accounts (Booked=1), as those are the only accounts which can truly be good or bad.

I can also calculate the different between the two distributions over the score by plotting the difference of the above functions, again with the “full accumulation” box checked.

Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))

Now, I’d like to find the maximum value of that expression over ALL of the scores to calculate KS, so I use the amazing Aggr() function that qlikview has, and create a new variable called “vKS” that has the definition:

Max(Aggr(
Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))
,
Score))

However: There is no “Full Accumulation” checkbox anymore, so it will calculate only the max difference in the PDF! I cannot think of a workaround for this. I would be most grateful for any suggestions.

Thank you and happy holidays.

Jesse

2 Replies
Not applicable
Author

Any thoughts on this? I have been trying to solve this for a couple more hours to no avail...

ogautier62
Specialist II
Specialist II

Hello,


Using the rank and quantiles function seems easier:

$ (f): the field / function concerned


in dimension, the function, here $ (f)

in expressions the distribution functions:

$ f (): quantile = rank (- $ (f), 4) / count (total $ (f))

the second distribution function to compare

for example: Fy = normdist (avg ($ (f)), mu, sigma) for a normal distribution



the gap = fabs (Fy - quantile)

the max gap =

max (total aggr (fabs (normdist (avg ($ (f)), mu, sigma)

   - rank (- $ (f), 4) / count (total $ (f))), $ (f)))


on this example same result as R (ks.test) at the 4th decimal place

cdt



Note: Message edited by Community Moderator to include English translation as a courtesy.


- - - - - -

Hello,

En utilisant la fonction rank et les quantiles cela semble plus simple :

$(f) : le champ/la fonction concernée

en dimension, la fonction , ici $(f)

en expressions les fonctions de répartition :

de $f() :      quantile =  rank(-$(f),4) / count(total $(f))

de la deuxième fonction de répartition à comparer

par ex : Fy = normdist(avg( $(f)),mu,sigma) pour une loi normale


l'écart = fabs(Fy - quantile)

l'écart max =

max(total aggr( fabs(normdist(avg( $(f)),mu,sigma)
  - rank(-$(f),4) / count(total $(f)))   ,$(f) ))


sur cet exemple même résultat que R (ks.test) à la 4ème décimale


cdt