Running Regressions - Qlik Community

Report Inappropriate Content · ‎2009-09-03

Hi all,

I have a large set of data and I want to see which variables are highly correlated and what the R2 value is. Is there a good clean way to get QlikView to run through all the permutations of correlations on my columns and spit out the top pairings and their respective R2 values?

My understanding would be to run a loop on each of the variables, can I get some help with that code?

Thanks.

Sameer

Oleg_Troyansky · ‎2009-09-03

I'm not so strong in statistics, so I can't quite offer a solution. Let me just mention that running loops in the script over a large data set is a very slow exercise, I would recommend to look for other ways...

Oleg

Oleg Troyansky
Call me for expert support, performance tuning, advanced training and more!

johnw · ‎2009-09-04

It's been a long time since I've had statistics myself, but I can read Wikipedia and the QlikView help text and pretend to know what I'm talking about. 😉

QlikView has a function for the correlation coefficient, so if we're talking about a linear regression, just square it and you should be good. So you're looking at sqr(correl(field1,field2)), I believe. As for looping through the fields, I'd probably just build the expression in a chart rather than doing it in the script.

I'm having multiple problems, though.

First, I'd like to build the field list directly from $Field. I tried this:

[Fields]:
LOAD fieldvalue('$Field',iterno()) as field1
AUTOGENERATE 1
WHILE len(fieldvalue('$Field',iterno()))
;

But I get nothing. If I replace $Field with one of my fields, I DO get all the values, so I'm thinking that $Field isn't established until the end of the script?

That's probably not critical, as surely we could just list out all the fields we want to correlate, as with the method above, we'd have to list out the fields we DON'T want to correlate, so it's kind of all the same.

But I'm also having a hard time converting back and forth between data and metadata. I want to have two columns, field1 and field2. Then I simply want to do sqr(correl(field1,field2)). But of course I don't want to correlate field1 and field2. I want to correlate the Fields which are the VALUES in those fields for that row on the table.

Now, if I actually SELECT single values for field1 and field2, then I can get it with sqr(correl($(=field1),$(=field2))). But this doesn't work for building a table, as with multiple values, $(=field1) and $(=field2) evaluate to null, not to the specific row on the table.

Anyway, here's what I have so far if someone wants to take a stab at solving one or the other of those problems. I may or may not be able to keep working on it.

johnw · ‎2009-09-04

OK, here's an ugly answer. I figure it's acceptable since there are no lists of combinations of fields to edit, only lists of fields, so the complexity is O(N) instead of O(N^2). I figure that will do. I'd like a better answer, but maybe it's enough to get you up and running.

[Data]:
LOAD
ceil(rand()*10) as A
,ceil(rand()*10) as B
,ceil(rand()*10) as C
,ceil(rand()*10) as D
,ceil(rand()*10) as E
AUTOGENERATE 10
;
[Fields]:
LOAD * INLINE [
field1
A
B
C
D
E
];
LEFT JOIN ([Fields])
LOAD field1 as field2
RESIDENT [Fields]
;
INNER JOIN ([Fields])
LOAD *
RESIDENT [Fields]
WHERE field1 < field2
;

=sqr(correl(pick(match(field1,'A','B','C','D','E'),A,B,C,D,E)
,pick(match(field2,'A','B','C','D','E'),A,B,C,D,E)))