Do not input private or sensitive data. View Qlik Privacy & Cookie Policy.
Skip to main content

Announcements
Qlik Open Lakehouse is Now Generally Available! Discover the key highlights and partner resources here.
cancel
Showing results for 
Search instead for 
Did you mean: 
Not applicable

Calculating KS Statistic (or, Full Accumulation in Variables)

Hi Qlik Community,

I have quite a complex task that I need to complete. I have been asked to calculate the KS (Kolmogorov-Smirnov) statistic using the Qlikview tool. For those of you unfamiliar with it, the KS statistic is a measure of separation between two distributions. It is often used in the financial industry to measure the effectiveness of a particular score at separating risk. Read more about it on Wikipedia here: Kolmogorov–Smirnov test - Wikipedia, the free encyclopedia

Basically, I am trying to calculate the maximum value of the difference between two cumulative density functions. I am trying to store this value as a variable. As you can see in the example below, I can calculate the CDFs themselves using the “full accumulation” checkbox and plotting the Score (or a binned score, using round), and

1. For the “bads” CDF, the number of bads in a bin over the total number of bads in the entire population:

Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))

2. For the “goods” CDF, the number of goods in a bin over the total number of goods in the entire population

Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))

You can see here that I’m only capturing booked accounts (Booked=1), as those are the only accounts which can truly be good or bad.

I can also calculate the different between the two distributions over the score by plotting the difference of the above functions, again with the “full accumulation” box checked.

Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))

Now, I’d like to find the maximum value of that expression over ALL of the scores to calculate KS, so I use the amazing Aggr() function that qlikview has, and create a new variable called “vKS” that has the definition:

Max(Aggr(
Sum( {<Booked={1}>}  If(Bad=1,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=1,1,0))
-
Sum( {<Booked={1}>}  If(Bad=0,1,0))/Sum( {<Booked={1}>}  TOTAL If(Bad=0,1,0))
,
Score))

However: There is no “Full Accumulation” checkbox anymore, so it will calculate only the max difference in the PDF! I cannot think of a workaround for this. I would be most grateful for any suggestions.

Thank you and happy holidays.

Jesse

2 Replies
Not applicable
Author

Any thoughts on this? I have been trying to solve this for a couple more hours to no avail...

ogautier62
Specialist II
Specialist II

Hello,


Using the rank and quantiles function seems easier:

$ (f): the field / function concerned


in dimension, the function, here $ (f)

in expressions the distribution functions:

$ f (): quantile = rank (- $ (f), 4) / count (total $ (f))

the second distribution function to compare

for example: Fy = normdist (avg ($ (f)), mu, sigma) for a normal distribution



the gap = fabs (Fy - quantile)

the max gap =

max (total aggr (fabs (normdist (avg ($ (f)), mu, sigma)

   - rank (- $ (f), 4) / count (total $ (f))), $ (f)))


on this example same result as R (ks.test) at the 4th decimal place

cdt



Note: Message edited by Community Moderator to include English translation as a courtesy.


- - - - - -

Hello,

En utilisant la fonction rank et les quantiles cela semble plus simple :

$(f) : le champ/la fonction concernée

en dimension, la fonction , ici $(f)

en expressions les fonctions de répartition :

de $f() :      quantile =  rank(-$(f),4) / count(total $(f))

de la deuxième fonction de répartition à comparer

par ex : Fy = normdist(avg( $(f)),mu,sigma) pour une loi normale


l'écart = fabs(Fy - quantile)

l'écart max =

max(total aggr( fabs(normdist(avg( $(f)),mu,sigma)
  - rank(-$(f),4) / count(total $(f)))   ,$(f) ))


sur cet exemple même résultat que R (ks.test) à la 4ème décimale


cdt